2024-06-19 13:37:48,540 INFO [train.py:1096] (1/2) Training started
2024-06-19 13:37:48,540 INFO [train.py:1106] (1/2) Device: cuda:1
2024-06-19 13:37:48,548 INFO [train.py:1118] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '8f976a1e1407e330e2a233d68f81b1eb5269fdaa', 'k2-git-date': 'Thu Jun 6 02:13:08 2024', 'lhotse-version': '1.24.0.dev+git.4d57d53d.dirty', 'torch-version': '2.3.1+cu121', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.9', 'icefall-git-branch': 'feature/ksponspeech_zipformer', 'icefall-git-sha1': '7dda45c9-dirty', 'icefall-git-date': 'Tue Jun 18 16:40:30 2024', 'icefall-path': '/home/ubuntu/icefall', 'k2-path': '/home/ubuntu/miniforge3/envs/lhotse/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/home/ubuntu/lhotse/lhotse/__init__.py', 'hostname': 'gpu-1', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 23456, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_5000/bpe.model', 'base_lr': 0.035, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 550, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 24, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 5000}
2024-06-19 13:37:48,548 INFO [train.py:1120] (1/2) About to create model
2024-06-19 13:37:49,289 INFO [train.py:1124] (1/2) Number of model parameters: 74778511
2024-06-19 13:37:49,373 INFO [train.py:1139] (1/2) Using DDP
2024-06-19 13:37:51,964 INFO [asr_datamodule.py:391] (1/2) About to get train cuts.
2024-06-19 13:37:52,022 INFO [asr_datamodule.py:215] (1/2) Enable MUSAN
2024-06-19 13:37:52,022 INFO [asr_datamodule.py:216] (1/2) About to get Musan cuts
2024-06-19 13:37:53,873 INFO [asr_datamodule.py:240] (1/2) Enable SpecAugment
2024-06-19 13:37:53,874 INFO [asr_datamodule.py:241] (1/2) Time warp factor: 80
2024-06-19 13:37:53,874 INFO [asr_datamodule.py:251] (1/2) Num frame mask: 10
2024-06-19 13:37:53,874 INFO [asr_datamodule.py:264] (1/2) About to create train dataset
2024-06-19 13:37:53,874 INFO [asr_datamodule.py:291] (1/2) Using DynamicBucketingSampler.
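Note: the sampler settings logged above ('max_duration': 550, 'num_buckets': 30, 'shuffle': True, 'drop_last': True, 'num_workers': 24) correspond roughly to the lhotse setup sketched below. This is a minimal sketch, not the actual asr_datamodule.py code, and the manifest filename is a hypothetical placeholder.

    from torch.utils.data import DataLoader
    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

    # Cuts with precomputed fbank features ('input_strategy': 'PrecomputedFeatures');
    # the path is a placeholder for whatever manifest_dir actually contains.
    train_cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")

    # Groups cuts of similar duration into 30 buckets and draws batches holding
    # up to ~550 seconds of audio each, so padding waste stays low.
    sampler = DynamicBucketingSampler(
        train_cuts,
        max_duration=550.0,
        num_buckets=30,
        shuffle=True,
        drop_last=True,
    )

    dataset = K2SpeechRecognitionDataset(return_cuts=True)
    # batch_size=None because the sampler already yields whole batches of cuts.
    train_dl = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=24)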
2024-06-19 13:37:54,404 INFO [asr_datamodule.py:308] (1/2) About to create train dataloader
2024-06-19 13:37:54,404 INFO [asr_datamodule.py:398] (1/2) About to get dev cuts
2024-06-19 13:37:54,422 INFO [asr_datamodule.py:339] (1/2) About to create dev dataset
2024-06-19 13:37:54,552 INFO [asr_datamodule.py:356] (1/2) About to create dev dataloader
2024-06-19 13:37:54,553 INFO [train.py:1330] (1/2) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2024-06-19 13:46:44,037 INFO [scaling.py:1023] (1/2) Whitening: name=None, num_groups=1, num_channels=192, metric=39.06 vs. limit=7.5
2024-06-19 13:46:44,113 INFO [train.py:1358] (1/2) Maximum memory allocated so far is 15006MB
2024-06-19 13:46:46,242 INFO [train.py:1358] (1/2) Maximum memory allocated so far is 15218MB
2024-06-19 13:46:51,240 INFO [train.py:1358] (1/2) Maximum memory allocated so far is 15218MB
2024-06-19 13:46:54,097 INFO [train.py:1358] (1/2) Maximum memory allocated so far is 15218MB
2024-06-19 13:47:12,136 INFO [train.py:1358] (1/2) Maximum memory allocated so far is 15218MB
2024-06-19 13:47:15,468 INFO [train.py:1358] (1/2) Maximum memory allocated so far is 15218MB
2024-06-19 13:49:24,855 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.12 vs. limit=7.5
2024-06-19 13:49:25,116 INFO [train.py:1028] (1/2) Epoch 1, batch 0, loss[loss=9.819, simple_loss=8.911, pruned_loss=9.068, over 12833.00 frames. ], tot_loss[loss=9.819, simple_loss=8.911, pruned_loss=9.068, over 12833.00 frames. ], batch size: 36, lr: 1.75e-02, grad_scale: 1.0
2024-06-19 13:49:25,117 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-19 13:49:40,383 INFO [train.py:1060] (1/2) Epoch 1, validation: loss=9.767, simple_loss=8.861, pruned_loss=9.037, over 351949.00 frames.
2024-06-19 13:49:40,384 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 15218MB
2024-06-19 13:49:42,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=0.0, ans=0.1
2024-06-19 13:49:42,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=0.0, ans=0.5
2024-06-19 13:49:42,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. limit=5.0
2024-06-19 13:49:43,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=3.0
2024-06-19 13:49:49,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=18.333333333333332, ans=0.24981666666666666
2024-06-19 13:49:49,865 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.798e+03 2.880e+03 3.055e+03 3.457e+03 3.724e+03, threshold=1.222e+04, percent-clipped=0.0
2024-06-19 13:49:50,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=28.85 vs. limit=7.506875
2024-06-19 13:49:50,943 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=5.004583333333334
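Note: the "Maximum memory allocated" lines above report the CUDA high-water mark while the sanity check pushes worst-case batches through the model before real training starts. A sketch of the idea using standard PyTorch APIs (model, compute_loss and the batch iterable are placeholders, not the train.py code):

    import torch

    def scan_for_oom(model, compute_loss, batches, device="cuda:1"):
        # Run backward on the most pessimistic batches once, up front,
        # so an OOM surfaces immediately instead of hours into epoch 1.
        for batch in batches:
            loss = compute_loss(model, batch)
            loss.backward()
            model.zero_grad(set_to_none=True)
            peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
            print(f"Maximum memory allocated so far is {peak_mb}MB")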
2024-06-19 13:49:57,330 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=316.08 vs. limit=7.51375
2024-06-19 13:49:57,672 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.515e+02 2.290e+03 2.974e+03 3.166e+03 3.853e+03, threshold=1.190e+04, percent-clipped=0.0
2024-06-19 13:50:02,070 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=253.13 vs. limit=7.5275
2024-06-19 13:50:05,402 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=293.24 vs. limit=7.520625
2024-06-19 13:50:07,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=493.02 vs. limit=7.520625
2024-06-19 13:50:16,793 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 5.515e+02 1.101e+03 3.024e+03 7.840e+03, threshold=4.406e+03, percent-clipped=0.0
2024-06-19 13:50:17,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=342.01 vs. limit=7.5275
2024-06-19 13:50:17,368 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=315.79 vs. limit=7.5275
2024-06-19 13:50:19,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=171.99 vs. limit=5.036666666666667
2024-06-19 13:50:21,027 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-19 13:50:22,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=12.28 vs. limit=3.011
2024-06-19 13:50:23,409 INFO [train.py:1028] (1/2) Epoch 1, batch 50, loss[loss=1.677, simple_loss=1.556, pruned_loss=1.144, over 12647.00 frames. ], tot_loss[loss=4.257, simple_loss=3.967, pruned_loss=2.906, over 574098.92 frames. ], batch size: 29, lr: 1.93e-02, grad_scale: 0.5
2024-06-19 13:50:26,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=296.42 vs. limit=5.045833333333333
2024-06-19 13:50:28,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=91.66666666666667, ans=0.1965625
2024-06-19 13:50:30,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=250.50 vs. limit=7.56875
2024-06-19 13:50:31,682 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=373.02 vs. limit=7.5825
2024-06-19 13:50:32,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=110.0, ans=7.54125
2024-06-19 13:50:35,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=306.67 vs. limit=7.54125
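Note: the frequent "Whitening: ... metric=X vs. limit=Y" lines compare an activation statistic against a scheduled limit; the module only intervenes when the metric exceeds the limit. One way such a whiteness metric can be computed is sketched below: with eigenvalues lambda_i of the per-group channel covariance, metric = d * sum(lambda_i^2) / (sum(lambda_i))^2, which is 1.0 for perfectly white (isotropic) activations and grows when a few directions dominate. This illustrates the idea only; the exact formula in scaling.py may differ.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels) activations from one module.
        num_frames, num_channels = x.shape
        d = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, d).transpose(0, 1)  # (groups, frames, d)
        x = x - x.mean(dim=1, keepdim=True)
        cov = x.transpose(1, 2) @ x / num_frames                  # (groups, d, d)
        # For symmetric cov: (cov * cov).sum() is the sum of squared eigenvalues,
        # and the trace is the sum of eigenvalues.
        eig_sq_sum = (cov * cov).sum(dim=(1, 2))
        trace = cov.diagonal(dim1=1, dim2=2).sum(dim=1)
        metric = d * eig_sq_sum / trace.clamp(min=1e-20) ** 2
        return metric.mean().item()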
2024-06-19 13:50:40,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=128.33333333333334, ans=0.493984375
2024-06-19 13:50:43,628 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.93 vs. limit=3.01925
2024-06-19 13:50:48,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.67 vs. limit=3.01925
2024-06-19 13:50:49,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=146.66666666666666, ans=7.555
2024-06-19 13:50:57,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=165.0, ans=0.09896875000000001
2024-06-19 13:50:57,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=165.0, ans=0.492265625
2024-06-19 13:51:05,860 INFO [train.py:1028] (1/2) Epoch 1, batch 100, loss[loss=1.055, simple_loss=0.9278, pruned_loss=1.034, over 13320.00 frames. ], tot_loss[loss=2.455, simple_loss=2.266, pruned_loss=1.812, over 1016667.49 frames. ], batch size: 46, lr: 2.10e-02, grad_scale: 1.0
2024-06-19 13:51:07,463 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.229e+01 6.855e+01 4.562e+02 9.632e+02 7.840e+03, threshold=9.124e+02, percent-clipped=0.0
2024-06-19 13:51:08,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=260.39 vs. limit=5.091666666666667
2024-06-19 13:51:08,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=107.56 vs. limit=7.56875
2024-06-19 13:51:11,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=183.33333333333334, ans=0.2396875
2024-06-19 13:51:12,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=155.09 vs. limit=7.56875
2024-06-19 13:51:13,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=201.66666666666666, ans=0.1924375
2024-06-19 13:51:23,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=300.95 vs. limit=7.5825
2024-06-19 13:51:28,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=117.27 vs. limit=7.665
2024-06-19 13:51:32,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=26.18 vs. limit=7.5825
2024-06-19 13:51:32,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=65.23 vs. limit=7.665
2024-06-19 13:51:32,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=3.033
2024-06-19 13:51:36,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=379.27 vs. limit=7.589375
2024-06-19 13:51:37,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238.33333333333334, ans=0.29761666666666664
2024-06-19 13:51:43,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=73.42 vs. limit=7.59625
2024-06-19 13:51:43,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=63.30 vs. limit=5.128333333333333
2024-06-19 13:51:49,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=185.31 vs. limit=7.59625
2024-06-19 13:51:51,330 INFO [train.py:1028] (1/2) Epoch 1, batch 150, loss[loss=0.9148, simple_loss=0.792, pruned_loss=0.9089, over 12768.00 frames. ], tot_loss[loss=1.823, simple_loss=1.663, pruned_loss=1.44, over 1364119.90 frames. ], batch size: 29, lr: 2.28e-02, grad_scale: 1.0
2024-06-19 13:51:51,684 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=22.89 vs. limit=7.603125
2024-06-19 13:51:52,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=18.77 vs. limit=7.603125
2024-06-19 13:51:59,618 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=147.65 vs. limit=7.61
2024-06-19 13:52:00,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=13.23 vs. limit=4.117333333333334
2024-06-19 13:52:05,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=121.78 vs. limit=7.61
2024-06-19 13:52:08,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=135.22 vs. limit=7.616875
2024-06-19 13:52:09,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=311.6666666666667, ans=0.18831250000000002
2024-06-19 13:52:16,900 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=446.77 vs. limit=7.62375
2024-06-19 13:52:19,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=330.0, ans=0.48453125
2024-06-19 13:52:21,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=4.132
2024-06-19 13:52:24,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=391.33 vs. limit=7.630625
2024-06-19 13:52:25,919 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=13.13 vs. limit=4.139333333333333
2024-06-19 13:52:26,130 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=40.19 vs. limit=7.630625
2024-06-19 13:52:26,785 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=3.05225
2024-06-19 13:52:28,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.10 vs. limit=5.174166666666666
2024-06-19 13:52:30,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=63.59 vs. limit=7.630625
2024-06-19 13:52:32,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=366.6666666666667, ans=0.4828125
2024-06-19 13:52:33,325 INFO [train.py:1028] (1/2) Epoch 1, batch 200, loss[loss=0.9247, simple_loss=0.7978, pruned_loss=0.8765, over 12701.00 frames. ], tot_loss[loss=1.505, simple_loss=1.356, pruned_loss=1.25, over 1635143.98 frames. ], batch size: 203, lr: 2.45e-02, grad_scale: 2.0
2024-06-19 13:52:35,219 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.236e+01 3.699e+01 4.197e+01 4.868e+01 7.672e+01, threshold=8.394e+01, percent-clipped=0.0
2024-06-19 13:52:42,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=366.6666666666667, ans=0.09770833333333334
2024-06-19 13:52:44,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=45.85 vs. limit=5.183333333333334
2024-06-19 13:52:46,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=15.22 vs. limit=7.644375
2024-06-19 13:52:48,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=385.0, ans=0.048796875
2024-06-19 13:52:53,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=55.94 vs. limit=7.65125
2024-06-19 13:52:57,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.10 vs. limit=7.8025
2024-06-19 13:53:06,180 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=12.80 vs. limit=4.168666666666667
2024-06-19 13:53:09,530 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=15.27 vs. limit=4.168666666666667
2024-06-19 13:53:13,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=440.0, ans=0.2066
2024-06-19 13:53:18,955 INFO [train.py:1028] (1/2) Epoch 1, batch 250, loss[loss=0.764, simple_loss=0.6508, pruned_loss=0.7236, over 13046.00 frames. ], tot_loss[loss=1.306, simple_loss=1.164, pruned_loss=1.126, over 1847149.90 frames. ], batch size: 144, lr: 2.63e-02, grad_scale: 2.0
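Note: each "loss[...]" entry reports the two terms of the pruned RNN-T objective plus their weighted sum. With the config values above (prune_range=5, lm_scale=0.25, am_scale=0.0, simple_loss_scale=0.5), the combination is roughly as sketched below using k2's pruned RNN-T API; this omits the warmup-dependent reweighting icefall applies, and joiner is a placeholder for the model's joiner network.

    import k2

    def transducer_loss(am, lm, joiner, symbols, boundary, prune_range=5):
        # am: (B, T, C) encoder output; lm: (B, S+1, C) decoder output.
        # 1) A cheap "simple" loss over the full T x S lattice of am+lm sums.
        simple_loss, (px_grad, py_grad) = k2.rnnt_loss_smoothed(
            lm=lm, am=am, symbols=symbols, termination_symbol=0,
            lm_only_scale=0.25, am_only_scale=0.0,   # lm_scale / am_scale above
            boundary=boundary, return_grad=True,
        )
        # 2) Its gradients pick a band of prune_range symbol positions per frame...
        ranges = k2.get_rnnt_prune_ranges(
            px_grad=px_grad, py_grad=py_grad, boundary=boundary, s_range=prune_range,
        )
        # 3) ...and the expensive joiner is evaluated only inside that band.
        am_pruned, lm_pruned = k2.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)
        logits = joiner(am_pruned, lm_pruned)
        pruned_loss = k2.rnnt_loss_pruned(
            logits=logits, symbols=symbols, ranges=ranges,
            termination_symbol=0, boundary=boundary,
        )
        return 0.5 * simple_loss + pruned_loss       # simple_loss_scale = 0.5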
2024-06-19 13:53:22,003 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=7.84375
2024-06-19 13:53:23,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=4.183333333333334
2024-06-19 13:53:25,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=458.3333333333333, ans=0.7545833333333334
2024-06-19 13:53:32,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=32.51 vs. limit=7.8575
2024-06-19 13:53:33,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=476.6666666666667, ans=0.47765625
2024-06-19 13:53:35,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=495.0, ans=0.08886250000000001
2024-06-19 13:53:35,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=495.0, ans=0.882675
2024-06-19 13:53:35,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=169.52 vs. limit=7.685625
2024-06-19 13:53:38,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=495.0, ans=0.08886250000000001
2024-06-19 13:53:39,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.55 vs. limit=7.87125
2024-06-19 13:53:51,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=513.3333333333334, ans=0.7551333333333333
2024-06-19 13:53:52,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=531.6666666666666, ans=0.18006250000000001
2024-06-19 13:53:52,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=182.20 vs. limit=7.699375
2024-06-19 13:53:57,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=531.6666666666666, ans=4.212666666666666
2024-06-19 13:53:58,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=531.6666666666666, ans=0.5
2024-06-19 14:54:03,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531.6666666666666, ans=0.2946833333333333
2024-06-19 13:54:05,541 INFO [train.py:1028] (1/2) Epoch 1, batch 300, loss[loss=0.7947, simple_loss=0.6665, pruned_loss=0.7574, over 13174.00 frames. ], tot_loss[loss=1.175, simple_loss=1.034, pruned_loss=1.039, over 2010404.65 frames. ], batch size: 112, lr: 2.80e-02, grad_scale: 4.0
2024-06-19 13:54:07,283 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.314e+01 4.826e+01 6.437e+01 1.085e+02 3.879e+02, threshold=1.287e+02, percent-clipped=36.0
2024-06-19 13:54:09,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=550.0, ans=0.08762500000000001
2024-06-19 13:54:09,476 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.34 vs. limit=7.9125
2024-06-19 13:54:11,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=68.49 vs. limit=7.9125
2024-06-19 13:54:12,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=375.81 vs. limit=7.70625
2024-06-19 13:54:13,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=49.46 vs. limit=7.9125
2024-06-19 13:54:16,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=7.92625
2024-06-19 13:54:17,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.24 vs. limit=7.92625
2024-06-19 13:54:19,616 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=5.153e-01
2024-06-19 13:54:19,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=106.71 vs. limit=5.284166666666667
2024-06-19 13:54:20,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=568.3333333333334, ans=7.713125
2024-06-19 13:54:26,594 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.16 vs. limit=7.94
2024-06-19 13:54:28,926 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=4.234666666666667
2024-06-19 13:54:31,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.22 vs. limit=5.15125
2024-06-19 13:54:33,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.13 vs. limit=4.242
2024-06-19 13:54:36,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=24.92 vs. limit=5.3025
2024-06-19 13:54:38,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=623.3333333333334, ans=0.47078125
2024-06-19 13:54:44,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=7.9675
2024-06-19 13:54:46,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=7.9675
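Note: the lr column in the batch summaries follows an Eden-style schedule driven by the config above (base_lr=0.035, lr_batches=7500, lr_epochs=3.5): a smooth decay in both batch count and fractional epoch, times a linear warmup. A sketch consistent with the logged values; the 500-batch warmup from 0.5x to 1.0x is inferred from lr rising from 1.75e-02 at batch 0 to 3.50e-02 around batch 500, and is an assumption.

    def eden_lr(base_lr, batch, epoch,
                lr_batches=7500.0, lr_epochs=3.5, warmup_batches=500.0):
        # Smooth polynomial decay in batch count and (fractional) epoch...
        decay = (((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
                 * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)
        # ...times a linear warmup from 0.5x to 1.0x of the base rate.
        warmup = 0.5 + 0.5 * min(batch / warmup_batches, 1.0)
        return base_lr * decay * warmup

    print(eden_lr(0.035, batch=0, epoch=0.0))     # ~1.75e-02, as logged at batch 0
    print(eden_lr(0.035, batch=600, epoch=0.05))  # ~3.49e-02, as logged at batch 600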
2024-06-19 13:54:47,524 INFO [train.py:1028] (1/2) Epoch 1, batch 350, loss[loss=0.822, simple_loss=0.6748, pruned_loss=0.8002, over 13169.00 frames. ], tot_loss[loss=1.078, simple_loss=0.9385, pruned_loss=0.9695, over 2140001.05 frames. ], batch size: 34, lr: 2.98e-02, grad_scale: 4.0
2024-06-19 13:54:47,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=7.98125
2024-06-19 13:54:54,689 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.36 vs. limit=7.740625
2024-06-19 13:54:55,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=660.0, ans=0.8769
2024-06-19 13:54:59,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=7.995
2024-06-19 13:55:01,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=206.08 vs. limit=7.7475
2024-06-19 13:55:07,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=32.58 vs. limit=7.754375
2024-06-19 13:55:11,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=678.3333333333334, ans=0.1745625
2024-06-19 13:55:11,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=678.3333333333334, ans=0.468203125
2024-06-19 13:55:16,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=696.6666666666666, ans=0.21045
2024-06-19 13:55:16,821 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.97 vs. limit=4.278666666666667
2024-06-19 13:55:18,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=696.6666666666666, ans=0.46734375
2024-06-19 13:55:25,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=7.768125
2024-06-19 13:55:28,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=715.0, ans=0.24285
2024-06-19 13:55:31,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=7.775
2024-06-19 13:55:32,094 INFO [train.py:1028] (1/2) Epoch 1, batch 400, loss[loss=0.8929, simple_loss=0.7314, pruned_loss=0.8375, over 13289.00 frames. ], tot_loss[loss=1.012, simple_loss=0.8708, pruned_loss=0.9182, over 2241190.71 frames. ], batch size: 63, lr: 3.15e-02, grad_scale: 8.0
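Note: grad_scale in those summaries is the fp16 loss scale ('use_fp16': True): halved when gradients overflow (1.0 to 0.5 at batch 50) and doubled after enough healthy steps (0.5 to 8.0 by batch 400). A minimal sketch with PyTorch's stock GradScaler; model, optimizer, compute_loss and train_dl are placeholders, and icefall's own scaler logic may differ in detail.

    import torch
    from torch.cuda.amp import GradScaler, autocast

    def train_fp16(model, optimizer, compute_loss, train_dl):
        # grad_scale starts at 1.0 as in the log; the stock growth/backoff
        # factors double it after healthy steps and halve it on overflow.
        scaler = GradScaler(init_scale=1.0, growth_factor=2.0, backoff_factor=0.5)
        for batch in train_dl:
            optimizer.zero_grad(set_to_none=True)
            with autocast():                  # forward pass in float16
                loss = compute_loss(model, batch)
            scaler.scale(loss).backward()     # scaled loss keeps fp16 grads finite
            scaler.step(optimizer)            # unscales; skips the step on inf/nan
            scaler.update()                   # grows or backs off the scale
            print("grad_scale:", scaler.get_scale())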
2024-06-19 13:55:33,578 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.392e+01 7.864e+01 9.618e+01 1.333e+02 3.416e+02, threshold=1.924e+02, percent-clipped=28.0
2024-06-19 13:55:38,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=733.3333333333334, ans=0.465625
2024-06-19 13:55:50,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=48.77 vs. limit=7.78875
2024-06-19 13:55:54,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.53 vs. limit=8.0775
2024-06-19 13:55:57,682 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.95 vs. limit=7.795625
2024-06-19 13:55:58,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=788.3333333333334, ans=0.0822625
2024-06-19 13:55:59,238 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.65 vs. limit=8.09125
2024-06-19 13:56:04,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=806.6666666666666, ans=0.8717666666666667
2024-06-19 13:56:09,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=806.6666666666666, ans=0.4621875
2024-06-19 13:56:10,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=210.29 vs. limit=7.8025
2024-06-19 13:56:10,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.91 vs. limit=7.8025
2024-06-19 13:56:12,876 INFO [train.py:1028] (1/2) Epoch 1, batch 450, loss[loss=0.8417, simple_loss=0.6858, pruned_loss=0.7696, over 13256.00 frames. ], tot_loss[loss=0.964, simple_loss=0.8212, pruned_loss=0.8777, over 2315513.66 frames. ], batch size: 67, lr: 3.33e-02, grad_scale: 8.0
2024-06-19 13:56:15,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=825.0, ans=0.871125
2024-06-19 13:56:21,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=843.3333333333334, ans=0.46046875
2024-06-19 13:56:22,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.08 vs. limit=4.3373333333333335
2024-06-19 13:56:22,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=100.34 vs. limit=5.421666666666667
2024-06-19 13:56:25,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.89 vs. limit=8.1325
2024-06-19 13:56:31,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.84 vs. limit=8.1325
2024-06-19 13:56:36,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=861.6666666666666, ans=0.459609375
2024-06-19 13:56:38,628 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=7.823125
2024-06-19 13:56:41,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.50 vs. limit=7.83
2024-06-19 13:56:42,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=880.0, ans=0.45875
2024-06-19 13:56:46,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=880.0, ans=0.45875
2024-06-19 13:56:46,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=85.46 vs. limit=7.83
2024-06-19 13:56:49,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=898.3333333333334, ans=0.457890625
2024-06-19 13:56:50,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=898.3333333333334, ans=0.07978750000000001
2024-06-19 13:56:52,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.08 vs. limit=7.836875
2024-06-19 13:56:53,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=898.3333333333334, ans=0.457890625
2024-06-19 13:56:54,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=898.3333333333334, ans=0.5
2024-06-19 13:56:56,990 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=28.86 vs. limit=7.84375
2024-06-19 13:56:57,314 INFO [train.py:1028] (1/2) Epoch 1, batch 500, loss[loss=0.7551, simple_loss=0.6157, pruned_loss=0.6653, over 13107.00 frames. ], tot_loss[loss=0.9287, simple_loss=0.783, pruned_loss=0.8451, over 2377669.31 frames. ], batch size: 121, lr: 3.50e-02, grad_scale: 8.0
2024-06-19 13:56:58,893 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.563e+01 6.632e+01 8.943e+01 1.139e+02 2.753e+02, threshold=1.789e+02, percent-clipped=4.0
2024-06-19 13:56:59,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=916.6666666666666, ans=0.047135416666666666
2024-06-19 13:57:00,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.52 vs. limit=7.84375
2024-06-19 13:57:16,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=34.67 vs. limit=7.8575
2024-06-19 13:57:19,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953.3333333333334, ans=0.29046666666666665
2024-06-19 13:57:29,297 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.62 vs. limit=7.864375
2024-06-19 13:57:38,361 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.86 vs. limit=8.25625
2024-06-19 13:57:38,759 INFO [train.py:1028] (1/2) Epoch 1, batch 550, loss[loss=0.7907, simple_loss=0.6441, pruned_loss=0.6762, over 12950.00 frames. ], tot_loss[loss=0.8998, simple_loss=0.7516, pruned_loss=0.815, over 2422907.11 frames. ], batch size: 158, lr: 3.50e-02, grad_scale: 8.0
2024-06-19 13:57:43,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1008.3333333333334, ans=0.09369791666666667
2024-06-19 13:57:47,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=23.84 vs. limit=5.504166666666666
2024-06-19 13:57:56,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=4.410666666666667
2024-06-19 13:57:57,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1026.6666666666667, ans=0.0769
2024-06-19 13:58:03,409 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=33.79 vs. limit=7.891875
2024-06-19 13:58:06,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1063.3333333333333, ans=0.28936666666666666
2024-06-19 13:58:08,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.18 vs. limit=7.89875
2024-06-19 13:58:09,404 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.423e+01
2024-06-19 13:58:16,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.86 vs. limit=7.905625
2024-06-19 13:58:16,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1081.6666666666667, ans=0.0756625
2024-06-19 13:58:20,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=35.16 vs. limit=7.905625
2024-06-19 13:58:22,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1100.0, ans=0.15875
2024-06-19 13:58:22,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=7.9125
2024-06-19 13:58:23,112 INFO [train.py:1028] (1/2) Epoch 1, batch 600, loss[loss=0.7636, simple_loss=0.6118, pruned_loss=0.6579, over 13013.00 frames. ], tot_loss[loss=0.8805, simple_loss=0.7283, pruned_loss=0.7927, over 2460408.32 frames. ], batch size: 144, lr: 3.49e-02, grad_scale: 8.0
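Note: the "ScheduledFloat: name=..., batch_count=..., ans=..." lines show hyperparameters (dropout and skip probabilities, balancer targets, whitening limits) that are interpolated in the global batch count, so regularization is strong early in training and relaxes later. Below is a minimal reimplementation of the concept, not the scaling.py class; the breakpoints in the example are made up.

    class ScheduledFloatSketch:
        """A float that is piecewise-linear in the global batch count."""

        def __init__(self, *points):
            # points: (batch_count, value) pairs, e.g. (0.0, 0.2), (4000.0, 0.02)
            self.points = sorted(points)
            self.batch_count = 0.0

        def value(self) -> float:
            pts = self.points
            if self.batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if self.batch_count <= x1:
                    t = (self.batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)
            return pts[-1][1]

    # Example: a skip rate that decays from 0.2 to 0.02 over 4000 batches.
    skip_rate = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.02))
    skip_rate.batch_count = 91.67
    print(skip_rate.value())   # ~0.196, between the endpoints, as in the log lines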
2024-06-19 13:58:23,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1100.0, ans=0.07525
2024-06-19 13:58:24,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=39.16 vs. limit=7.9125
2024-06-19 13:58:24,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.65 vs. limit=5.55
2024-06-19 13:58:24,728 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.020e+01 7.979e+01 1.031e+02 1.409e+02 2.698e+02, threshold=2.062e+02, percent-clipped=7.0
2024-06-19 13:58:26,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1100.0, ans=0.4484375
2024-06-19 13:58:29,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1100.0, ans=0.8615
2024-06-19 13:58:30,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.82 vs. limit=8.325
2024-06-19 13:58:36,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.97 vs. limit=7.919375
2024-06-19 13:58:43,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1136.6666666666667, ans=0.44671875
2024-06-19 13:58:46,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=7.92625
2024-06-19 13:58:50,140 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.53 vs. limit=4.462
2024-06-19 13:58:58,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.84 vs. limit=8.38
2024-06-19 13:58:59,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1173.3333333333333, ans=0.156
2024-06-19 13:59:04,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.67 vs. limit=4.469333333333333
2024-06-19 13:59:05,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=3.176
2024-06-19 13:59:06,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1173.3333333333333, ans=0.445
2024-06-19 13:59:06,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.59 vs. limit=5.586666666666667
2024-06-19 13:59:07,944 INFO [train.py:1028] (1/2) Epoch 1, batch 650, loss[loss=0.827, simple_loss=0.6546, pruned_loss=0.7098, over 13221.00 frames. ], tot_loss[loss=0.8662, simple_loss=0.7096, pruned_loss=0.7745, over 2491569.87 frames. ], batch size: 59, lr: 3.49e-02, grad_scale: 8.0
2024-06-19 13:59:08,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1191.6666666666667, ans=0.8582916666666667
2024-06-19 13:59:13,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=24.48 vs. limit=7.946875
2024-06-19 13:59:14,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1191.6666666666667, ans=0.2880833333333333
2024-06-19 13:59:15,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=7.95375
2024-06-19 13:59:16,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=19.07 vs. limit=7.95375
2024-06-19 13:59:29,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=34.81 vs. limit=7.960625
2024-06-19 13:59:31,585 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.97 vs. limit=5.614166666666667
2024-06-19 13:59:36,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=7.9675
2024-06-19 13:59:37,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1246.6666666666667, ans=0.44156249999999997
2024-06-19 13:59:41,415 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=7.974375
2024-06-19 13:59:48,843 INFO [train.py:1028] (1/2) Epoch 1, batch 700, loss[loss=0.8456, simple_loss=0.6583, pruned_loss=0.7285, over 13299.00 frames. ], tot_loss[loss=0.8554, simple_loss=0.6945, pruned_loss=0.7581, over 2513880.70 frames. ], batch size: 46, lr: 3.49e-02, grad_scale: 8.0
2024-06-19 13:59:50,391 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.630e+01 9.866e+01 1.210e+02 1.738e+02 9.321e+02, threshold=2.421e+02, percent-clipped=18.0
2024-06-19 13:59:57,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=8.47625
2024-06-19 13:59:58,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.59 vs. limit=3.19525
2024-06-19 14:00:11,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1320.0, ans=0.438125
2024-06-19 14:00:11,588 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=23.14 vs. limit=7.995
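Note: the recurring "Clipping_scale=2.0, grad-norm quartiles ..." warnings summarize a window of recent gradient norms (min/25%/median/75%/max), a clipping threshold derived from them, and how many recent steps were clipped. The bookkeeping is roughly as sketched below; the exact threshold rule is internal to icefall's ScaledAdam (the logged thresholds sit at 2x-4x the median), so the scale-times-median rule here is an assumption.

    import torch

    class GradNormClipTracker:
        def __init__(self, clipping_scale=2.0, window=128):
            self.clipping_scale = clipping_scale
            self.window = window
            self.norms = []        # recent total gradient norms
            self.clipped = 0
            self.steps = 0

        def step(self, params):
            params = [p for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
            self.norms = (self.norms + [norm])[-self.window:]
            q = torch.quantile(torch.tensor(self.norms),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])).tolist()
            threshold = self.clipping_scale * q[2]   # assumed: scale * median
            self.steps += 1
            if norm > threshold:
                self.clipped += 1
                for p in params:
                    p.grad.mul_(threshold / norm)
            print(f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
                  f"{q[0]:.3e} {q[1]:.3e} {q[2]:.3e} {q[3]:.3e} {q[4]:.3e}, "
                  f"threshold={threshold:.3e}, "
                  f"percent-clipped={100.0 * self.clipped / self.steps:.1f}")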
2024-06-19 14:00:12,015 WARNING [optim.py:503] (1/2) Scaling gradients by 0.0939645916223526, model_norm_threshold=242.07273864746094
2024-06-19 14:00:12,229 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.87, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=5.750e+06, grad_sumsq=1.477e+08, orig_rms_sq=3.893e-02
2024-06-19 14:00:14,074 WARNING [optim.py:503] (1/2) Scaling gradients by 0.07021020352840424, model_norm_threshold=242.07273864746094
2024-06-19 14:00:14,253 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.83, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=9.850e+06, grad_sumsq=2.530e+08, orig_rms_sq=3.893e-02
2024-06-19 14:00:14,505 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.873e+00
2024-06-19 14:00:14,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.26 vs. limit=8.50375
2024-06-19 14:00:15,555 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=34.65 vs. limit=8.001875
2024-06-19 14:00:18,273 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=57.06 vs. limit=8.001875
2024-06-19 14:00:24,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=1338.3333333333333, ans=0.17471875
2024-06-19 14:00:33,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=22.79 vs. limit=8.00875
2024-06-19 14:00:34,297 INFO [train.py:1028] (1/2) Epoch 1, batch 750, loss[loss=0.7977, simple_loss=0.6175, pruned_loss=0.6762, over 13261.00 frames. ], tot_loss[loss=0.85, simple_loss=0.6834, pruned_loss=0.7473, over 2528557.51 frames. ], batch size: 63, lr: 3.49e-02, grad_scale: 2.0
2024-06-19 14:00:35,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=336.04 vs. limit=8.015625
2024-06-19 14:00:47,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.38 vs. limit=4.557333333333333
2024-06-19 14:00:49,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1411.6666666666667, ans=0.14706249999999998
2024-06-19 14:00:50,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411.6666666666667, ans=0.2858833333333333
2024-06-19 14:00:53,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.30 vs. limit=8.55875
2024-06-19 14:00:56,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=39.78 vs. limit=8.029375
2024-06-19 14:00:57,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1411.6666666666667, ans=0.3235416666666666
2024-06-19 14:00:57,774 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=48.92 vs. limit=8.029375
2024-06-19 14:01:04,911 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.92 vs. limit=8.03625
2024-06-19 14:01:06,705 WARNING [optim.py:503] (1/2) Scaling gradients by 0.0816488191485405, model_norm_threshold=242.07273864746094
2024-06-19 14:01:06,861 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.81, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=7.115e+06, grad_sumsq=1.943e+08, orig_rms_sq=3.662e-02
2024-06-19 14:01:07,668 WARNING [optim.py:503] (1/2) Scaling gradients by 0.0905035063624382, model_norm_threshold=242.07273864746094
2024-06-19 14:01:07,826 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.89, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=6.392e+06, grad_sumsq=1.746e+08, orig_rms_sq=3.662e-02
2024-06-19 14:01:12,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=48.22 vs. limit=8.58625
2024-06-19 14:01:13,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=54.80 vs. limit=8.043125
2024-06-19 14:01:15,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1466.6666666666667, ans=0.8486666666666667
2024-06-19 14:01:16,212 WARNING [optim.py:503] (1/2) Scaling gradients by 0.08115486800670624, model_norm_threshold=242.07273864746094
2024-06-19 14:01:16,387 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.81, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=7.232e+06, grad_sumsq=1.982e+08, orig_rms_sq=3.648e-02
2024-06-19 14:01:16,423 INFO [train.py:1028] (1/2) Epoch 1, batch 800, loss[loss=0.8657, simple_loss=0.6568, pruned_loss=0.7394, over 12943.00 frames. ], tot_loss[loss=0.8474, simple_loss=0.6751, pruned_loss=0.7379, over 2540694.49 frames. ], batch size: 36, lr: 3.49e-02, grad_scale: 4.0
2024-06-19 14:01:19,880 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.297e+01 2.238e+02 4.222e+02 7.801e+02 3.448e+03, threshold=8.443e+02, percent-clipped=71.0
2024-06-19 14:01:20,438 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.11 vs. limit=8.05
2024-06-19 14:01:21,028 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=6.523e-01
2024-06-19 14:01:24,682 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.13 vs. limit=8.056875
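Note: the "Scaling gradients by ... model_norm_threshold=..." warnings fire when the overall gradient norm, with each parameter's contribution weighted by its own RMS (exactly the grad_sumsq*orig_rms_sq product printed above), exceeds a threshold; gradients are then rescaled by threshold/norm, and the parameter holding the largest share of tot_sumsq is reported, here module.encoder_embed.conv.0.weight. A sketch of that logic; the real accounting lives inside ScaledAdam and differs in detail.

    import torch

    def rescale_large_grads(model, model_norm_threshold):
        with torch.no_grad():
            contrib = {}
            for name, p in model.named_parameters():
                if p.grad is not None:
                    grad_sumsq = (p.grad ** 2).sum().item()
                    orig_rms_sq = (p ** 2).mean().item()
                    contrib[name] = grad_sumsq * orig_rms_sq  # as printed in the log
            tot_sumsq = sum(contrib.values())
            norm = tot_sumsq ** 0.5
            if norm > model_norm_threshold:
                scale = model_norm_threshold / norm
                print(f"Scaling gradients by {scale}, "
                      f"model_norm_threshold={model_norm_threshold}")
                name, dom = max(contrib.items(), key=lambda kv: kv[1])
                print(f"Parameter dominating tot_sumsq {name} "
                      f"with proportion {dom / tot_sumsq:.2f}")
                for p in model.parameters():
                    if p.grad is not None:
                        p.grad.mul_(scale)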
2024-06-19 14:01:29,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1485.0, ans=0.0665875
2024-06-19 14:01:30,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1485.0, ans=0.430390625
2024-06-19 14:01:30,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=26.85 vs. limit=8.056875
2024-06-19 14:01:32,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=20.60 vs. limit=8.06375
2024-06-19 14:01:33,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=359.31 vs. limit=8.06375
2024-06-19 14:01:47,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.65 vs. limit=8.070625
2024-06-19 14:01:49,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1521.6666666666667, ans=0.428671875
2024-06-19 14:01:50,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1521.6666666666667, ans=0.28478333333333333
2024-06-19 14:01:51,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=20.22 vs. limit=8.070625
2024-06-19 14:01:51,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=17.27 vs. limit=8.070625
2024-06-19 14:01:54,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1540.0, ans=0.06534999999999999
2024-06-19 14:01:56,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.79 vs. limit=8.655
2024-06-19 14:01:59,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1540.0, ans=0.8461000000000001
2024-06-19 14:02:00,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=24.97 vs. limit=8.084375
2024-06-19 14:02:01,468 INFO [train.py:1028] (1/2) Epoch 1, batch 850, loss[loss=0.8171, simple_loss=0.6249, pruned_loss=0.6731, over 13178.00 frames. ], tot_loss[loss=0.8464, simple_loss=0.6677, pruned_loss=0.7308, over 2551122.01 frames. ], batch size: 95, lr: 3.49e-02, grad_scale: 4.0
2024-06-19 14:02:10,756 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.73 vs. limit=8.682500000000001
2024-06-19 14:02:15,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.52 vs. limit=5.788333333333333
2024-06-19 14:02:16,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.54 vs. limit=5.394166666666667
2024-06-19 14:02:19,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1595.0, ans=0.425234375
2024-06-19 14:02:21,323 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.33 vs. limit=8.098125
2024-06-19 14:02:26,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1613.3333333333333, ans=0.0637
2024-06-19 14:02:27,969 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=67.34 vs. limit=8.105
2024-06-19 14:02:32,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.70 vs. limit=5.806666666666667
2024-06-19 14:02:35,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=87.43 vs. limit=8.111875
2024-06-19 14:02:42,732 INFO [train.py:1028] (1/2) Epoch 1, batch 900, loss[loss=0.852, simple_loss=0.6356, pruned_loss=0.7114, over 12962.00 frames. ], tot_loss[loss=0.8472, simple_loss=0.6625, pruned_loss=0.7241, over 2556262.01 frames. ], batch size: 36, lr: 3.49e-02, grad_scale: 4.0
2024-06-19 14:02:43,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.83 vs. limit=8.11875
2024-06-19 14:02:44,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1650.0, ans=0.2835
2024-06-19 14:02:46,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.53 vs. limit=5.825
2024-06-19 14:02:46,746 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 4.980e+02 8.348e+02 1.361e+03 5.092e+03, threshold=1.670e+03, percent-clipped=48.0
2024-06-19 14:02:50,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.38 vs. limit=8.75125
2024-06-19 14:02:52,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=8.75125
2024-06-19 14:02:58,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1668.3333333333333, ans=0.421796875
2024-06-19 14:02:58,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=8.125625
2024-06-19 14:03:00,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1668.3333333333333, ans=0.1374375
2024-06-19 14:03:00,837 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=23.60 vs. limit=8.125625
2024-06-19 14:03:05,219 WARNING [optim.py:503] (1/2) Scaling gradients by 0.08229032158851624, model_norm_threshold=1669.5751953125
2024-06-19 14:03:05,392 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.91, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=3.743e+08, grad_sumsq=1.163e+10, orig_rms_sq=3.219e-02
2024-06-19 14:03:05,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1686.6666666666667, ans=0.2891666666666667
2024-06-19 14:03:21,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=5.430833333333333
2024-06-19 14:03:26,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1741.6666666666667, ans=0.8390416666666667
2024-06-19 14:03:27,393 INFO [train.py:1028] (1/2) Epoch 1, batch 950, loss[loss=0.8554, simple_loss=0.6337, pruned_loss=0.7053, over 12951.00 frames. ], tot_loss[loss=0.8492, simple_loss=0.6585, pruned_loss=0.7184, over 2559572.35 frames. ], batch size: 39, lr: 3.49e-02, grad_scale: 1.0
2024-06-19 14:03:33,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1741.6666666666667, ans=0.8390416666666667
2024-06-19 14:03:39,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1760.0, ans=0.0604
2024-06-19 14:03:48,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1778.3333333333333, ans=0.1333125
2024-06-19 14:03:49,618 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.51 vs. limit=5.889166666666666
2024-06-19 14:03:53,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1796.6666666666667, ans=0.8371166666666667
2024-06-19 14:04:00,214 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.96 vs. limit=5.9075
2024-06-19 14:04:04,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=53.46 vs. limit=8.180625
2024-06-19 14:04:08,202 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=8.86125
2024-06-19 14:04:10,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1815.0, ans=0.0591625
2024-06-19 14:04:12,038 INFO [train.py:1028] (1/2) Epoch 1, batch 1000, loss[loss=0.8451, simple_loss=0.6276, pruned_loss=0.6806, over 13020.00 frames. ], tot_loss[loss=0.8493, simple_loss=0.6535, pruned_loss=0.7105, over 2561029.39 frames. ], batch size: 48, lr: 3.48e-02, grad_scale: 2.0
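The optim.py:575 report factors each parameter's contribution to the total squared gradient norm as dominant_sumsq = grad_sumsq * orig_rms_sq, and the products can be checked directly against the values logged above. A small verification, for illustration only:

    # Checking dominant_sumsq = grad_sumsq * orig_rms_sq against the log.
    checks = [
        (1.943e+08, 3.662e-02, 7.115e+06),   # warning at 14:01:06,861
        (1.746e+08, 3.662e-02, 6.392e+06),   # warning at 14:01:07,826
        (1.163e+10, 3.219e-02, 3.743e+08),   # warning at 14:03:05,392
    ]
    for grad_sumsq, orig_rms_sq, dominant_sumsq in checks:
        assert abs(grad_sumsq * orig_rms_sq - dominant_sumsq) / dominant_sumsq < 1e-2

A proportion of 0.91 means this single tensor accounts for 91% of the squared norm: the encoder_embed front-end convolution is what keeps triggering gradient scaling this early in epoch 1.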
2024-06-19 14:04:17,718 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 5.068e+02 7.710e+02 1.208e+03 2.029e+04, threshold=1.542e+03, percent-clipped=14.0
2024-06-19 14:04:18,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1833.3333333333333, ans=0.2275
2024-06-19 14:04:18,977 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.50 vs. limit=8.875
2024-06-19 14:04:23,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.54 vs. limit=5.462916666666667
2024-06-19 14:04:24,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=23.59 vs. limit=8.194375
2024-06-19 14:04:28,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. limit=3.2805
2024-06-19 14:04:28,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1870.0, ans=0.057925000000000004
2024-06-19 14:04:32,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.12 vs. limit=8.9025
2024-06-19 14:04:33,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.44 vs. limit=8.9025
2024-06-19 14:04:33,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.88 vs. limit=8.20125
2024-06-19 14:04:34,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=36.58 vs. limit=8.9025
2024-06-19 14:04:35,496 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.343e-01
2024-06-19 14:04:39,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.25 vs. limit=5.472083333333333
2024-06-19 14:04:46,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=73.33 vs. limit=8.215
2024-06-19 14:04:48,840 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=37.22 vs. limit=8.215
2024-06-19 14:04:49,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.55 vs. limit=8.215
2024-06-19 14:04:53,364 INFO [train.py:1028] (1/2) Epoch 1, batch 1050, loss[loss=0.891, simple_loss=0.6556, pruned_loss=0.7115, over 13188.00 frames. ], tot_loss[loss=0.8547, simple_loss=0.6517, pruned_loss=0.7085, over 2563496.55 frames. ], batch size: 77, lr: 3.48e-02, grad_scale: 2.0
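Each optim.py:487 warning prints the min/25%/50%/75%/max of the gradient norms observed recently; in every entry above the threshold equals Clipping_scale times the median (e.g. 2.0 x 7.710e+02 = 1.542e+03 here, and 2.0 x 4.222e+02 ≈ 8.443e+02 at batch 800), while percent-clipped appears to be the share of batches since the previous report whose norm exceeded the then-current threshold. A hedged sketch of that bookkeeping, as an illustration rather than the icefall code:

    import torch

    def clipping_report(recent_norms: torch.Tensor,
                        prev_threshold: float,
                        clipping_scale: float = 2.0):
        # recent_norms: 1-D tensor of gradient norms since the last report
        quartiles = [torch.quantile(recent_norms, q).item()
                     for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = clipping_scale * quartiles[2]        # scale * median
        # clipped fraction is judged against the threshold that was in force
        percent_clipped = 100.0 * (recent_norms > prev_threshold).float().mean().item()
        return quartiles, threshold, percent_clipped

Tying the threshold to the running median makes clipping adaptive: as the typical gradient norm grows during this unstable stretch (8.443e+02 at batch 800, 2.678e+04 by batch 2250), the threshold grows with it instead of clipping everything.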
2024-06-19 14:04:54,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1925.0, ans=0.409765625
2024-06-19 14:04:58,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=8.221875
2024-06-19 14:04:59,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1925.0, ans=0.409765625
2024-06-19 14:05:00,162 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=46.39 vs. limit=8.221875
2024-06-19 14:05:13,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.44 vs. limit=8.97125
2024-06-19 14:05:17,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1961.6666666666667, ans=0.5
2024-06-19 14:05:18,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=101.63 vs. limit=8.235625
2024-06-19 14:05:19,808 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=46.45 vs. limit=8.985
2024-06-19 14:05:21,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1980.0, ans=0.4071875
2024-06-19 14:05:28,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1998.3333333333333, ans=0.04375520833333334
2024-06-19 14:05:37,086 INFO [train.py:1028] (1/2) Epoch 1, batch 1100, loss[loss=0.9103, simple_loss=0.6617, pruned_loss=0.7233, over 13232.00 frames. ], tot_loss[loss=0.8605, simple_loss=0.6506, pruned_loss=0.706, over 2568333.18 frames. ], batch size: 52, lr: 3.48e-02, grad_scale: 4.0
2024-06-19 14:05:41,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2016.6666666666667, ans=0.24791666666666667
2024-06-19 14:05:42,714 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 4.247e+02 6.049e+02 8.531e+02 7.194e+03, threshold=1.210e+03, percent-clipped=14.0
2024-06-19 14:05:42,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2016.6666666666667, ans=0.2798333333333333
2024-06-19 14:05:43,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=78.68 vs. limit=8.25625
2024-06-19 14:05:45,884 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.17 vs. limit=6.0175
2024-06-19 14:05:46,581 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.91 vs. limit=9.026250000000001
2024-06-19 14:05:53,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=8.27
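The scaling.py:214 lines report the current value (ans) of a ScheduledFloat: a hyperparameter that changes piecewise-linearly with batch_count, which is why quantities such as bypass_mid.scale_min drift smoothly downward (0.8487 at batch_count≈1467, 0.7537 by batch_count=4180) and the whitening limits drift upward across this section. A hedged sketch of such a schedule; the breakpoints below are illustrative, not the recipe's actual ones:

    from bisect import bisect_right

    def scheduled_float(batch_count: float,
                        points: list[tuple[float, float]]) -> float:
        # points: sorted (batch_count, value) pairs; clamp outside the range
        xs = [x for x, _ in points]
        if batch_count <= xs[0]:
            return points[0][1]
        if batch_count >= xs[-1]:
            return points[-1][1]
        i = bisect_right(xs, batch_count)
        (x0, y0), (x1, y1) = points[i - 1], points[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # e.g. a skip-rate annealed linearly from 0.1 at batch 0 to 0.0 by batch 20000:
    skip_rate = scheduled_float(1485.0, [(0.0, 0.1), (20000.0, 0.0)])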
2024-06-19 14:06:07,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=156.11 vs. limit=8.276875
2024-06-19 14:06:08,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.81 vs. limit=8.276875
2024-06-19 14:06:08,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.89 vs. limit=8.276875
2024-06-19 14:06:11,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=2090.0, ans=0.13243749999999999
2024-06-19 14:06:16,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.58 vs. limit=8.28375
2024-06-19 14:06:17,983 INFO [train.py:1028] (1/2) Epoch 1, batch 1150, loss[loss=0.929, simple_loss=0.6705, pruned_loss=0.7307, over 13298.00 frames. ], tot_loss[loss=0.8636, simple_loss=0.6481, pruned_loss=0.7012, over 2569426.76 frames. ], batch size: 52, lr: 3.48e-02, grad_scale: 1.0
2024-06-19 14:06:25,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.59 vs. limit=9.08125
2024-06-19 14:06:32,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2126.6666666666665, ans=0.27873333333333333
2024-06-19 14:06:55,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2181.6666666666665, ans=0.397734375
2024-06-19 14:06:58,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2181.6666666666665, ans=0.27818333333333334
2024-06-19 14:06:58,745 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=9.13625
2024-06-19 14:07:01,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.95 vs. limit=9.15
2024-06-19 14:07:01,570 INFO [train.py:1028] (1/2) Epoch 1, batch 1200, loss[loss=0.8317, simple_loss=0.6011, pruned_loss=0.6419, over 13167.00 frames. ], tot_loss[loss=0.8646, simple_loss=0.6444, pruned_loss=0.6944, over 2571831.17 frames. ], batch size: 77, lr: 3.48e-02, grad_scale: 2.0
2024-06-19 14:07:03,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2200.0, ans=0.043125000000000004
2024-06-19 14:07:03,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2200.0, ans=0.8230000000000001
2024-06-19 14:07:08,794 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 4.499e+02 6.061e+02 9.470e+02 6.765e+03, threshold=1.212e+03, percent-clipped=14.0
2024-06-19 14:07:12,728 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.95 vs. limit=8.331875
2024-06-19 14:07:13,614 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.24 vs. limit=9.16375
2024-06-19 14:07:14,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=9.16375
2024-06-19 14:07:24,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2236.6666666666665, ans=0.116125
2024-06-19 14:07:24,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.66 vs. limit=5.559166666666666
2024-06-19 14:07:25,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2255.0, ans=0.0492625
2024-06-19 14:07:33,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2255.0, ans=0.394296875
2024-06-19 14:07:36,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=34.59 vs. limit=9.205
2024-06-19 14:07:41,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.14 vs. limit=8.3525
2024-06-19 14:07:44,621 INFO [train.py:1028] (1/2) Epoch 1, batch 1250, loss[loss=0.8195, simple_loss=0.5962, pruned_loss=0.6181, over 13227.00 frames. ], tot_loss[loss=0.8689, simple_loss=0.6428, pruned_loss=0.6906, over 2582193.11 frames. ], batch size: 112, lr: 3.48e-02, grad_scale: 2.0
2024-06-19 14:07:44,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=2291.6666666666665, ans=0.392578125
2024-06-19 14:07:46,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2291.6666666666665, ans=0.234375
2024-06-19 14:07:57,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=26.54 vs. limit=8.36625
2024-06-19 14:08:00,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2328.3333333333335, ans=0.04272395833333333
2024-06-19 14:08:05,039 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=8.373125
2024-06-19 14:08:17,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=35.52 vs. limit=8.386875
2024-06-19 14:08:19,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.30 vs. limit=6.1825
2024-06-19 14:08:20,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=38.84 vs. limit=9.27375
2024-06-19 14:08:21,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=2365.0, ans=0.77365
2024-06-19 14:08:22,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=2365.0, ans=0.11696875000000001
2024-06-19 14:08:24,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.43 vs. limit=8.39375
2024-06-19 14:08:24,342 INFO [train.py:1028] (1/2) Epoch 1, batch 1300, loss[loss=0.8813, simple_loss=0.6331, pruned_loss=0.6625, over 12733.00 frames. ], tot_loss[loss=0.8771, simple_loss=0.6429, pruned_loss=0.6912, over 2582614.45 frames. ], batch size: 176, lr: 3.47e-02, grad_scale: 2.0
2024-06-19 14:08:25,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.42 vs. limit=8.39375
2024-06-19 14:08:27,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=110.57 vs. limit=8.39375
2024-06-19 14:08:29,564 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.99 vs. limit=8.39375
2024-06-19 14:08:32,507 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 4.415e+02 5.926e+02 9.033e+02 6.266e+03, threshold=1.185e+03, percent-clipped=14.0
2024-06-19 14:08:36,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=2401.6666666666665, ans=8.400625
2024-06-19 14:08:38,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=2401.6666666666665, ans=0.387421875
2024-06-19 14:08:39,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2401.6666666666665, ans=0.2759833333333333
2024-06-19 14:08:43,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=2420.0, ans=0.2363
2024-06-19 14:08:43,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=25.96 vs. limit=8.4075
2024-06-19 14:08:47,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=85.28 vs. limit=8.414375
2024-06-19 14:08:48,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.76 vs. limit=6.219166666666666
2024-06-19 14:08:49,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.43 vs. limit=9.32875
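Each train.py:1028 line pairs the current batch's loss (over that batch's frames) with tot_loss over an accumulated frame count, here around 2.58e+06 frames. tot_loss behaves as a frame-weighted running average (icefall keeps it in a windowed metrics tracker), so a large batch like the 12733-frame one above moves it far more than a 22-frame-per-utterance tail batch. A hedged sketch of the bookkeeping, for illustration only:

    class LossTracker:
        """Frame-weighted running average behind "tot_loss[... over N frames]"."""
        def __init__(self) -> None:
            self.loss_sum = 0.0    # sum over batches of loss * frames
            self.frames = 0.0      # total frames accumulated

        def update(self, batch_loss: float, batch_frames: float) -> float:
            self.loss_sum += batch_loss * batch_frames
            self.frames += batch_frames
            return self.loss_sum / self.frames   # the reported tot_loss

With ~2.58e+06 frames already pooled, one 12733-frame batch shifts tot_loss by well under one percent, which is why the tot_loss column changes so much more smoothly than the per-batch loss column.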
2024-06-19 14:08:50,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2438.3333333333335, ans=0.10856249999999999
2024-06-19 14:08:57,109 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=4.816e-02
2024-06-19 14:08:57,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=9.95 vs. limit=9.32875
2024-06-19 14:08:58,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2456.6666666666665, ans=6.535416666666666
2024-06-19 14:08:59,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=9.73 vs. limit=9.3425
2024-06-19 14:09:07,270 INFO [train.py:1028] (1/2) Epoch 1, batch 1350, loss[loss=0.979, simple_loss=0.6724, pruned_loss=0.7539, over 13165.00 frames. ], tot_loss[loss=0.8915, simple_loss=0.6451, pruned_loss=0.6988, over 2585118.86 frames. ], batch size: 59, lr: 3.47e-02, grad_scale: 1.0
2024-06-19 14:09:12,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2475.0, ans=0.383984375
2024-06-19 14:09:14,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2493.3333333333335, ans=0.383125
2024-06-19 14:09:14,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=26.10 vs. limit=8.435
2024-06-19 14:09:33,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2530.0, ans=0.38140625
2024-06-19 14:09:43,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=5.019333333333334
2024-06-19 14:09:45,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=2548.3333333333335, ans=0.8108083333333334
2024-06-19 14:09:47,390 INFO [train.py:1028] (1/2) Epoch 1, batch 1400, loss[loss=1.009, simple_loss=0.6868, pruned_loss=0.7704, over 12784.00 frames. ], tot_loss[loss=0.8965, simple_loss=0.6432, pruned_loss=0.6968, over 2586813.36 frames. ], batch size: 26, lr: 3.47e-02, grad_scale: 2.0
2024-06-19 14:09:47,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2566.6666666666665, ans=0.2743333333333333
2024-06-19 14:09:51,181 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.10 vs. limit=9.425
2024-06-19 14:09:53,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2566.6666666666665, ans=0.10375
2024-06-19 14:09:58,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2585.0, ans=0.0418375
2024-06-19 14:09:59,094 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 5.050e+02 8.148e+02 1.171e+03 6.097e+03, threshold=1.630e+03, percent-clipped=24.0
2024-06-19 14:09:59,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.06 vs. limit=9.43875
2024-06-19 14:10:03,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=2585.0, ans=0.10459375000000001
2024-06-19 14:10:04,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=8.469375
2024-06-19 14:10:04,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2585.0, ans=0.378828125
2024-06-19 14:10:05,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=31.55 vs. limit=9.4525
2024-06-19 14:10:09,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.66 vs. limit=9.4525
2024-06-19 14:10:17,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2621.6666666666665, ans=0.1016875
2024-06-19 14:10:25,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2640.0, ans=0.8076000000000001
2024-06-19 14:10:26,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.50 vs. limit=8.49
2024-06-19 14:10:28,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2640.0, ans=0.37625
2024-06-19 14:10:30,442 INFO [train.py:1028] (1/2) Epoch 1, batch 1450, loss[loss=0.8753, simple_loss=0.6156, pruned_loss=0.6415, over 13073.00 frames. ], tot_loss[loss=0.9023, simple_loss=0.6425, pruned_loss=0.6949, over 2587042.49 frames. ], batch size: 121, lr: 3.47e-02, grad_scale: 2.0
2024-06-19 14:10:31,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=8.496875
2024-06-19 14:10:31,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=2658.3333333333335, ans=8.496875
2024-06-19 14:10:33,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2658.3333333333335, ans=9.49375
2024-06-19 14:10:36,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=39.41 vs. limit=8.496875
2024-06-19 14:10:38,144 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=50.33 vs. limit=8.50375
2024-06-19 14:10:52,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=2695.0, ans=0.805675
2024-06-19 14:11:10,130 INFO [train.py:1028] (1/2) Epoch 1, batch 1500, loss[loss=0.9411, simple_loss=0.6433, pruned_loss=0.6954, over 13207.00 frames. ], tot_loss[loss=0.9101, simple_loss=0.6424, pruned_loss=0.6952, over 2589772.59 frames. ], batch size: 83, lr: 3.47e-02, grad_scale: 2.0
2024-06-19 14:11:14,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2750.0, ans=0.15625
2024-06-19 14:11:16,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=85.93 vs. limit=8.53125
2024-06-19 14:11:16,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2750.0, ans=0.37109375
2024-06-19 14:11:23,521 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.511e+02 9.620e+02 1.404e+03 2.479e+03 1.351e+04, threshold=2.808e+03, percent-clipped=42.0
2024-06-19 14:11:28,341 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=5.107333333333333
2024-06-19 14:11:31,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=5.1146666666666665
2024-06-19 14:11:33,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2786.6666666666665, ans=0.369375
2024-06-19 14:11:41,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.23 vs. limit=9.60375
2024-06-19 14:11:44,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.59 vs. limit=8.551874999999999
2024-06-19 14:11:54,190 INFO [train.py:1028] (1/2) Epoch 1, batch 1550, loss[loss=0.9149, simple_loss=0.6261, pruned_loss=0.6662, over 12978.00 frames. ], tot_loss[loss=0.9174, simple_loss=0.642, pruned_loss=0.6952, over 2584194.51 frames. ], batch size: 102, lr: 3.46e-02, grad_scale: 1.0
2024-06-19 14:12:01,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=2860.0, ans=0.22139999999999999
2024-06-19 14:12:08,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2860.0, ans=0.09275
2024-06-19 14:12:15,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=8.579375
2024-06-19 14:12:16,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2878.3333333333335, ans=0.03523749999999999
2024-06-19 14:12:17,798 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.69 vs. limit=8.579375
2024-06-19 14:12:21,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2896.6666666666665, ans=0.36421875000000004
2024-06-19 14:12:31,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2915.0, ans=0.0344125
2024-06-19 14:12:37,443 INFO [train.py:1028] (1/2) Epoch 1, batch 1600, loss[loss=1.015, simple_loss=0.6773, pruned_loss=0.7424, over 13171.00 frames. ], tot_loss[loss=0.9273, simple_loss=0.6436, pruned_loss=0.6969, over 2579346.22 frames. ], batch size: 77, lr: 3.46e-02, grad_scale: 1.0
2024-06-19 14:12:37,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2933.3333333333335, ans=0.27066666666666667
2024-06-19 14:12:41,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.45 vs. limit=9.7
2024-06-19 14:12:48,325 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.621e+02 1.632e+03 2.277e+03 3.244e+03 1.287e+04, threshold=4.555e+03, percent-clipped=37.0
2024-06-19 14:12:49,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=38.40 vs. limit=9.713750000000001
2024-06-19 14:12:51,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.95 vs. limit=8.606875
2024-06-19 14:12:56,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=2970.0, ans=0.0829375
2024-06-19 14:13:03,555 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=9.92 vs. limit=9.74125
2024-06-19 14:13:08,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=3006.6666666666665, ans=6.879166666666666
2024-06-19 14:13:08,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=3006.6666666666665, ans=0.7947666666666667
2024-06-19 14:13:12,315 WARNING [optim.py:503] (1/2) Scaling gradients by 0.07586484402418137, model_norm_threshold=4554.50439453125
2024-06-19 14:13:12,515 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.53, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.904e+09, grad_sumsq=9.839e+10, orig_rms_sq=1.935e-02
2024-06-19 14:13:13,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=104.16 vs. limit=8.6275
2024-06-19 14:13:16,597 INFO [train.py:1028] (1/2) Epoch 1, batch 1650, loss[loss=0.9696, simple_loss=0.6487, pruned_loss=0.6983, over 13158.00 frames. ], tot_loss[loss=0.9345, simple_loss=0.6439, pruned_loss=0.6962, over 2575577.94 frames. ], batch size: 95, lr: 3.46e-02, grad_scale: 0.25
2024-06-19 14:13:27,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=3043.3333333333335, ans=0.35734374999999996
2024-06-19 14:13:28,071 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=61.85 vs. limit=8.64125
2024-06-19 14:13:30,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=3043.3333333333335, ans=0.08587499999999998
2024-06-19 14:13:33,341 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.09 vs. limit=9.79625
2024-06-19 14:13:50,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=3098.3333333333335, ans=0.09899494936611666
2024-06-19 14:13:51,819 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.14 vs. limit=9.82375
2024-06-19 14:13:54,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3098.3333333333335, ans=0.2690166666666667
2024-06-19 14:13:59,672 INFO [train.py:1028] (1/2) Epoch 1, batch 1700, loss[loss=1.005, simple_loss=0.6583, pruned_loss=0.7246, over 12417.00 frames. ], tot_loss[loss=0.944, simple_loss=0.6448, pruned_loss=0.6977, over 2580714.91 frames. ], batch size: 25, lr: 3.46e-02, grad_scale: 0.5
2024-06-19 14:14:04,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=3116.6666666666665, ans=0.35390625
2024-06-19 14:14:11,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.20 vs. limit=6.5675
2024-06-19 14:14:12,628 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.901e+02 1.889e+03 2.542e+03 4.756e+03 6.003e+04, threshold=5.083e+03, percent-clipped=27.0
2024-06-19 14:14:14,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.01 vs. limit=8.675625
2024-06-19 14:14:16,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.92 vs. limit=8.682500000000001
2024-06-19 14:14:16,358 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=58.80 vs. limit=8.682500000000001
2024-06-19 14:14:17,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3153.3333333333335, ans=0.26846666666666663
2024-06-19 14:14:19,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=92.51 vs. limit=8.682500000000001
2024-06-19 14:14:31,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.63 vs. limit=8.69625
2024-06-19 14:14:36,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.42 vs. limit=9.8925
2024-06-19 14:14:43,000 INFO [train.py:1028] (1/2) Epoch 1, batch 1750, loss[loss=1.069, simple_loss=0.6768, pruned_loss=0.7752, over 12293.00 frames. ], tot_loss[loss=0.9527, simple_loss=0.6454, pruned_loss=0.6985, over 2581097.22 frames. ], batch size: 22, lr: 3.45e-02, grad_scale: 0.5
2024-06-19 14:14:47,779 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.14 vs. limit=9.90625
2024-06-19 14:14:52,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3226.6666666666665, ans=0.2677333333333333
2024-06-19 14:14:57,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.90 vs. limit=6.613333333333333
2024-06-19 14:14:57,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=3226.6666666666665, ans=0.07983333333333334
2024-06-19 14:15:05,136 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.81 vs. limit=9.93375
2024-06-19 14:15:05,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=3245.0, ans=0.039859375
2024-06-19 14:15:07,727 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.12 vs. limit=5.815833333333334
2024-06-19 14:15:08,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=3263.3333333333335, ans=0.07762499999999999
2024-06-19 14:15:12,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=3263.3333333333335, ans=0.07762499999999999
2024-06-19 14:15:16,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=3281.6666666666665, ans=0.346171875
2024-06-19 14:15:17,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=3281.6666666666665, ans=0.06540625
2024-06-19 14:15:23,375 INFO [train.py:1028] (1/2) Epoch 1, batch 1800, loss[loss=0.9939, simple_loss=0.6485, pruned_loss=0.7001, over 13206.00 frames. ], tot_loss[loss=0.9595, simple_loss=0.6449, pruned_loss=0.6978, over 2581716.34 frames. ], batch size: 67, lr: 3.45e-02, grad_scale: 1.0
2024-06-19 14:15:31,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=8.744375
2024-06-19 14:15:35,181 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.94 vs. limit=9.98875
2024-06-19 14:15:37,532 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.089e+02 2.109e+03 3.076e+03 4.523e+03 2.012e+04, threshold=6.152e+03, percent-clipped=20.0
2024-06-19 14:15:48,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=3355.0, ans=0.34273437500000004
2024-06-19 14:15:52,855 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=31.39 vs. limit=8.758125
2024-06-19 14:15:57,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=3373.3333333333335, ans=0.7819333333333334
2024-06-19 14:16:03,798 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.90 vs. limit=8.765
2024-06-19 14:16:05,055 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=7.119e-02
2024-06-19 14:16:06,434 INFO [train.py:1028] (1/2) Epoch 1, batch 1850, loss[loss=0.9346, simple_loss=0.6003, pruned_loss=0.6561, over 13218.00 frames. ], tot_loss[loss=0.9667, simple_loss=0.6445, pruned_loss=0.6976, over 2583327.33 frames. ], batch size: 83, lr: 3.45e-02, grad_scale: 0.25
2024-06-19 14:16:14,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=3410.0, ans=0.072125
2024-06-19 14:16:19,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=3410.0, ans=0.7806500000000001
2024-06-19 14:16:21,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=3428.3333333333335, ans=0.339296875
2024-06-19 14:16:28,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=3428.3333333333335, ans=0.339296875
2024-06-19 14:16:34,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=3446.6666666666665, ans=0.022449999999999998
2024-06-19 14:16:34,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.06 vs. limit=8.7925
2024-06-19 14:16:35,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.04 vs. limit=10.085
2024-06-19 14:16:39,653 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=2.97 vs. limit=5.386
2024-06-19 14:16:40,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3465.0, ans=0.26535
2024-06-19 14:16:42,835 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.26 vs. limit=8.799375
2024-06-19 14:16:43,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.09 vs. limit=10.098749999999999
2024-06-19 14:16:46,317 INFO [train.py:1028] (1/2) Epoch 1, batch 1900, loss[loss=0.9779, simple_loss=0.6327, pruned_loss=0.676, over 13160.00 frames. ], tot_loss[loss=0.9718, simple_loss=0.6433, pruned_loss=0.6955, over 2586281.29 frames. ], batch size: 95, lr: 3.45e-02, grad_scale: 0.5
2024-06-19 14:16:48,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=3483.3333333333335, ans=0.33671875
2024-06-19 14:16:51,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=24.60 vs. limit=8.80625
2024-06-19 14:16:58,983 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.20 vs. limit=8.813125
2024-06-19 14:16:59,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.04 vs. limit=10.12625
2024-06-19 14:17:05,168 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.833e+02 4.394e+03 6.622e+03 1.141e+04 5.665e+04, threshold=1.324e+04, percent-clipped=53.0
2024-06-19 14:17:08,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=3520.0, ans=0.0208
2024-06-19 14:17:09,483 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.26 vs. limit=8.82
2024-06-19 14:17:13,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=3538.3333333333335, ans=0.33414062499999997
2024-06-19 14:17:14,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=3538.3333333333335, ans=0.06731249999999997
2024-06-19 14:17:17,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=28.09 vs. limit=8.826875
2024-06-19 14:17:18,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=34.55 vs. limit=10.15375
2024-06-19 14:17:20,323 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=44.82 vs. limit=10.15375
2024-06-19 14:17:29,184 INFO [train.py:1028] (1/2) Epoch 1, batch 1950, loss[loss=1.063, simple_loss=0.6679, pruned_loss=0.737, over 13279.00 frames. ], tot_loss[loss=0.978, simple_loss=0.643, pruned_loss=0.6941, over 2592450.84 frames. ], batch size: 52, lr: 3.44e-02, grad_scale: 0.25
2024-06-19 14:17:30,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=3575.0, ans=0.332421875
2024-06-19 14:17:33,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=25.98 vs. limit=10.18125
2024-06-19 14:17:36,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. limit=8.840625
2024-06-19 14:17:41,319 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.22 vs. limit=8.8475
2024-06-19 14:17:44,461 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=79.54 vs. limit=8.8475
2024-06-19 14:17:46,741 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=16.75 vs. limit=8.854375000000001
2024-06-19 14:17:49,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.21 vs. limit=8.854375000000001
2024-06-19 14:17:50,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.91 vs. limit=8.854375000000001
2024-06-19 14:17:51,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=3611.6666666666665, ans=0.330703125
2024-06-19 14:18:09,468 INFO [train.py:1028] (1/2) Epoch 1, batch 2000, loss[loss=1.13, simple_loss=0.6982, pruned_loss=0.7813, over 12809.00 frames. ], tot_loss[loss=0.9873, simple_loss=0.645, pruned_loss=0.6949, over 2588720.45 frames. ], batch size: 23, lr: 3.44e-02, grad_scale: 0.25
2024-06-19 14:18:13,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3666.6666666666665, ans=0.2633333333333333
2024-06-19 14:18:15,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.32 vs. limit=5.916666666666667
2024-06-19 14:18:16,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.47 vs. limit=10.25
2024-06-19 14:18:17,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=39.02 vs. limit=6.833333333333333
2024-06-19 14:18:18,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=3666.6666666666665, ans=0.328125
2024-06-19 14:18:20,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=3685.0, ans=0.327265625
2024-06-19 14:18:20,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=3685.0, ans=0.327265625
2024-06-19 14:18:22,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=3685.0, ans=7.303125
2024-06-19 14:18:24,456 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.79 vs. limit=10.26375
2024-06-19 14:18:27,841 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.017e+03 7.127e+03 1.010e+04 1.722e+04 6.991e+04, threshold=2.019e+04, percent-clipped=40.0
2024-06-19 14:18:28,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=3703.3333333333335, ans=0.32640625
2024-06-19 14:18:33,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3703.3333333333335, ans=0.2629666666666667
2024-06-19 14:18:38,888 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.10 vs. limit=8.895624999999999
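The grad_scale value on the train.py:1028 lines is the fp16 loss scale (use_fp16 is enabled for this run): it is multiplied up after stretches of overflow-free steps and multiplied down when an overflow is detected, which is why it rises to 4.0 near batch 800 and then collapses to 0.25 here and 0.125 by batch 2050 while gradients are volatile. A hedged sketch of such dynamic loss scaling; the constants are illustrative, and PyTorch's torch.cuda.amp.GradScaler implements the production version:

    class DynamicLossScale:
        def __init__(self, scale: float = 1.0, growth: float = 2.0,
                     backoff: float = 0.5, growth_interval: int = 2000):
            self.scale = scale
            self.growth, self.backoff = growth, backoff
            self.growth_interval = growth_interval
            self._good_steps = 0

        def update(self, found_inf: bool) -> None:
            if found_inf:                 # overflow: shrink scale, skip the step
                self.scale *= self.backoff
                self._good_steps = 0
            else:                         # stable: grow occasionally
                self._good_steps += 1
                if self._good_steps % self.growth_interval == 0:
                    self.scale *= self.growth

A falling grad_scale is itself a diagnostic: repeated backoffs in this stretch of epoch 1 line up with the huge grad-norm maxima (e.g. 6.991e+04 in the quartile report above).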
2024-06-19 14:18:44,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=3740.0, ans=0.076625
2024-06-19 14:18:44,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=3740.0, ans=0.7874
2024-06-19 14:18:46,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=3740.0, ans=0.015850000000000003
2024-06-19 14:18:46,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.93 vs. limit=6.87
2024-06-19 14:18:51,870 INFO [train.py:1028] (1/2) Epoch 1, batch 2050, loss[loss=1.053, simple_loss=0.6359, pruned_loss=0.7351, over 12959.00 frames. ], tot_loss[loss=0.9966, simple_loss=0.6464, pruned_loss=0.6969, over 2583753.50 frames. ], batch size: 30, lr: 3.44e-02, grad_scale: 0.125
2024-06-19 14:18:51,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=3758.3333333333335, ans=0.05906249999999999
2024-06-19 14:18:54,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=3758.3333333333335, ans=0.323828125
2024-06-19 14:19:07,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=3795.0, ans=0.26205
2024-06-19 14:19:07,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.73 vs. limit=10.34625
2024-06-19 14:19:18,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=3813.3333333333335, ans=0.32125000000000004
2024-06-19 14:19:32,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=3831.6666666666665, ans=8.936875
2024-06-19 14:19:34,400 INFO [train.py:1028] (1/2) Epoch 1, batch 2100, loss[loss=0.9971, simple_loss=0.6292, pruned_loss=0.6825, over 13268.00 frames. ], tot_loss[loss=1.006, simple_loss=0.648, pruned_loss=0.7002, over 2586305.79 frames. ], batch size: 59, lr: 3.43e-02, grad_scale: 0.125
2024-06-19 14:19:39,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=8.94375
2024-06-19 14:19:41,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=3868.3333333333335, ans=0.318671875
2024-06-19 14:19:52,293 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.170e+03 5.486e+03 9.618e+03 1.340e+04 1.020e+05, threshold=1.924e+04, percent-clipped=11.0
2024-06-19 14:20:05,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.47 vs. limit=8.964375
2024-06-19 14:20:11,168 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=10.442499999999999
2024-06-19 14:20:11,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=3923.3333333333335, ans=0.7626833333333334
2024-06-19 14:20:14,579 INFO [train.py:1028] (1/2) Epoch 1, batch 2150, loss[loss=1.052, simple_loss=0.6449, pruned_loss=0.7294, over 13219.00 frames. ], tot_loss[loss=1.012, simple_loss=0.6488, pruned_loss=0.7022, over 2589028.18 frames. ], batch size: 52, lr: 3.43e-02, grad_scale: 0.125
2024-06-19 14:20:15,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.77 vs. limit=10.45625
2024-06-19 14:20:19,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=8.978125
2024-06-19 14:20:19,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3941.6666666666665, ans=0.26058333333333333
2024-06-19 14:20:21,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=3960.0, ans=0.31437499999999996
2024-06-19 14:20:23,912 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.27 vs. limit=8.985
2024-06-19 14:20:32,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.72 vs. limit=10.48375
2024-06-19 14:20:35,473 WARNING [optim.py:503] (1/2) Scaling gradients by 0.09887780249118805, model_norm_threshold=19236.69140625
2024-06-19 14:20:35,649 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.65, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=2.449e+10, grad_sumsq=1.444e+12, orig_rms_sq=1.696e-02
2024-06-19 14:20:39,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=29.40 vs. limit=8.99875
2024-06-19 14:20:50,121 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=43.47 vs. limit=10.51125
2024-06-19 14:20:53,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=4015.0, ans=0.07
2024-06-19 14:20:55,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=4015.0, ans=10.51125
2024-06-19 14:20:56,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=4033.3333333333335, ans=0.20966666666666667
2024-06-19 14:20:57,918 INFO [train.py:1028] (1/2) Epoch 1, batch 2200, loss[loss=1.019, simple_loss=0.6462, pruned_loss=0.6961, over 13213.00 frames. ], tot_loss[loss=1.016, simple_loss=0.6497, pruned_loss=0.7022, over 2588052.90 frames. ], batch size: 83, lr: 3.43e-02, grad_scale: 0.125
2024-06-19 14:20:58,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.07 vs. limit=9.0125
2024-06-19 14:21:00,891 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=22.92 vs. limit=9.0125
2024-06-19 14:21:01,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=5.613333333333333
2024-06-19 14:21:02,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=39.96 vs. limit=9.0125
2024-06-19 14:21:02,371 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=90.31 vs. limit=9.0125
2024-06-19 14:21:12,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.25 vs. limit=10.53875
2024-06-19 14:21:15,952 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.117e+03 1.038e+04 1.339e+04 2.109e+04 1.946e+05, threshold=2.678e+04, percent-clipped=26.0
2024-06-19 14:21:19,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=4070.0, ans=0.0
2024-06-19 14:21:22,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.37 vs. limit=10.56625
2024-06-19 14:21:23,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=4088.3333333333335, ans=0.0
2024-06-19 14:21:29,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=4106.666666666667, ans=9.040000000000001
2024-06-19 14:21:37,522 INFO [train.py:1028] (1/2) Epoch 1, batch 2250, loss[loss=0.9988, simple_loss=0.6276, pruned_loss=0.685, over 13314.00 frames. ], tot_loss[loss=1.017, simple_loss=0.6487, pruned_loss=0.7013, over 2588255.59 frames. ], batch size: 63, lr: 3.43e-02, grad_scale: 0.125
2024-06-19 14:21:46,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=4143.333333333333, ans=0.009968840579710146
2024-06-19 14:21:51,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=63.30 vs. limit=10.6075
2024-06-19 14:22:04,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4180.0, ans=0.3040625
2024-06-19 14:22:06,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=4180.0, ans=0.7537
2024-06-19 14:22:09,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=9.067499999999999
2024-06-19 14:22:19,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4216.666666666667, ans=0.30234375
2024-06-19 14:22:19,666 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=9.08125
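
The Whitening entries compare a per-module statistic ("metric") against a scheduled limit. A plausible reading, sketched under the assumption that the metric measures how far the feature covariance C is from a multiple of the identity via d * trace(C^2) / trace(C)^2 (exactly 1.0 for perfectly white features): a value like metric=90.31 vs. limit=9.0125 then flags activations whose covariance is dominated by a few directions. The formula and demo below are an assumption, not verified scaling.py code:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations for one group
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]   # channel covariance, symmetric
    d = cov.shape[0]
    # For symmetric C, (C * C).sum() == trace(C @ C).
    return float(d * (cov * cov).sum() / (cov.diag().sum() ** 2))

x = torch.randn(1000, 192)           # roughly white: metric near 1
print(whitening_metric(x))
print(whitening_metric(x * torch.tensor([10.0] + [1.0] * 191)))  # skewed: much larger
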
2024-06-19 14:22:19,949 INFO [train.py:1028] (1/2) Epoch 1, batch 2300, loss[loss=1.097, simple_loss=0.6695, pruned_loss=0.7626, over 12974.00 frames. ], tot_loss[loss=1.024, simple_loss=0.6515, pruned_loss=0.7046, over 2582233.75 frames. ], batch size: 33, lr: 3.42e-02, grad_scale: 0.125
2024-06-19 14:22:31,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.14 vs. limit=10.67625
2024-06-19 14:22:34,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=9.088125
2024-06-19 14:22:35,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4253.333333333333, ans=0.2574666666666667
2024-06-19 14:22:38,937 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.177e+03 1.387e+04 2.044e+04 3.046e+04 1.525e+05, threshold=4.088e+04, percent-clipped=32.0
2024-06-19 14:22:43,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=4271.666666666667, ans=0.264075
2024-06-19 14:22:49,934 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=9.101875
2024-06-19 14:22:50,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=4271.666666666667, ans=0.29976562500000004
2024-06-19 14:22:52,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=4290.0, ans=0.04879166666666667
2024-06-19 14:22:54,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.00 vs. limit=9.10875
2024-06-19 14:22:56,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.49 vs. limit=10.7175
2024-06-19 14:22:59,675 INFO [train.py:1028] (1/2) Epoch 1, batch 2350, loss[loss=1.083, simple_loss=0.678, pruned_loss=0.7444, over 13230.00 frames. ], tot_loss[loss=1.024, simple_loss=0.65, pruned_loss=0.7039, over 2585255.46 frames. ], batch size: 67, lr: 3.42e-02, grad_scale: 0.125
2024-06-19 14:23:00,173 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=14.57 vs. limit=6.077083333333333
2024-06-19 14:23:04,298 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.46 vs. limit=10.73125
2024-06-19 14:23:10,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=4326.666666666667, ans=0.2971875
2024-06-19 14:23:15,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=29.43 vs. limit=9.1225
2024-06-19 14:23:23,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.93 vs. limit=9.129375
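
Each Clipping_scale warning above prints five quantiles (min, 25%, median, 75%, max) of recently observed gradient norms, plus the active threshold and the fraction of recent batches clipped. In this stretch the threshold tracks roughly clipping_scale times the median (e.g. 2.0 * 2.044e+04 ≈ 4.088e+04 in the warning above), which suggests a policy like the following sketch; the window size and the exact rule are assumptions, not the verified optim.py logic:

from collections import deque
import statistics

class GradNormClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent global grad norms
        self.clipped = 0

    def scale_for(self, grad_norm: float) -> float:
        self.norms.append(grad_norm)
        threshold = self.clipping_scale * statistics.median(self.norms)
        if grad_norm > threshold:
            self.clipped += 1
            return threshold / grad_norm    # multiply all grads by this
        return 1.0

clipper = GradNormClipper(clipping_scale=2.0)
for norm in [3.0e3, 3.2e3, 2.9e3, 1.2e4]:
    print(clipper.scale_for(norm))          # 1.0 until a norm exceeds 2x median
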
2024-06-19 14:23:27,350 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.15 vs. limit=10.772499999999999
2024-06-19 14:23:31,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=4363.333333333333, ans=0.035
2024-06-19 14:23:32,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=39.84 vs. limit=10.772499999999999
2024-06-19 14:23:34,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=24.52 vs. limit=9.143125
2024-06-19 14:23:38,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=4381.666666666667, ans=0.294609375
2024-06-19 14:23:41,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.95 vs. limit=9.143125
2024-06-19 14:23:43,054 INFO [train.py:1028] (1/2) Epoch 1, batch 2400, loss[loss=1.042, simple_loss=0.6463, pruned_loss=0.7185, over 13302.00 frames. ], tot_loss[loss=1.02, simple_loss=0.6478, pruned_loss=0.7006, over 2588226.13 frames. ], batch size: 46, lr: 3.42e-02, grad_scale: 0.25
2024-06-19 14:23:45,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.99 vs. limit=10.8
2024-06-19 14:23:53,795 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=35.82 vs. limit=9.156875
2024-06-19 14:23:56,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=4418.333333333333, ans=0.29289062499999996
2024-06-19 14:24:05,695 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.426e+03 1.668e+04 2.149e+04 3.437e+04 1.491e+05, threshold=4.299e+04, percent-clipped=21.0
2024-06-19 14:24:10,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=4455.0, ans=10.84125
2024-06-19 14:24:16,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=4455.0, ans=0.25545
2024-06-19 14:24:17,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=4473.333333333333, ans=0.2903125
2024-06-19 14:24:24,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=4473.333333333333, ans=0.2903125
2024-06-19 14:24:25,471 INFO [train.py:1028] (1/2) Epoch 1, batch 2450, loss[loss=0.9453, simple_loss=0.5858, pruned_loss=0.6524, over 13285.00 frames. ], tot_loss[loss=1.017, simple_loss=0.6467, pruned_loss=0.6967, over 2583916.16 frames. ], batch size: 63, lr: 3.41e-02, grad_scale: 0.0625
2024-06-19 14:24:25,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=4491.666666666667, ans=0.009893115942028985
2024-06-19 14:24:26,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=4491.666666666667, ans=0.289453125
2024-06-19 14:24:27,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=4491.666666666667, ans=0.289453125
2024-06-19 14:24:28,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4491.666666666667, ans=0.25508333333333333
2024-06-19 14:24:37,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.53 vs. limit=9.19125
2024-06-19 14:24:40,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=4528.333333333333, ans=0.04779861111111112
2024-06-19 14:24:49,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4546.666666666667, ans=0.286875
2024-06-19 14:24:50,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=10.91
2024-06-19 14:24:51,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.80 vs. limit=10.91
2024-06-19 14:25:04,604 INFO [train.py:1028] (1/2) Epoch 1, batch 2500, loss[loss=0.9183, simple_loss=0.5948, pruned_loss=0.6209, over 13180.00 frames. ], tot_loss[loss=1.013, simple_loss=0.6443, pruned_loss=0.6937, over 2587569.60 frames. ], batch size: 83, lr: 3.41e-02, grad_scale: 0.125
2024-06-19 14:25:20,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=4601.666666666667, ans=9.225625
2024-06-19 14:25:25,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4620.0, ans=0.2538
2024-06-19 14:25:28,259 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.430e+03 1.773e+04 2.264e+04 3.120e+04 1.795e+05, threshold=4.529e+04, percent-clipped=14.0
2024-06-19 14:25:28,645 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=3.693
2024-06-19 14:25:31,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=9.239374999999999
2024-06-19 14:25:31,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=9.239374999999999
2024-06-19 14:25:41,407 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=9.24625
2024-06-19 14:25:43,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=10.9925
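
The grad_scale field in the batch lines above moves between values like 0.25, 0.125 and 0.0625 because mixed-precision training (use_fp16) keeps a dynamic loss scale: it is cut back when a step produces inf/NaN gradients and grown again after a run of clean steps. A generic AMP-style sketch of that policy; the recipe's exact backoff/growth rules and constants are an assumption:

class DynamicGradScale:
    def __init__(self, init_scale: float = 1.0, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale *= 0.5           # back off immediately on overflow
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                self.scale *= 2.0       # cautiously grow back after stability
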
2024-06-19 14:25:47,223 INFO [train.py:1028] (1/2) Epoch 1, batch 2550, loss[loss=1.124, simple_loss=0.6965, pruned_loss=0.776, over 12730.00 frames. ], tot_loss[loss=1.01, simple_loss=0.6416, pruned_loss=0.6907, over 2586302.23 frames. ], batch size: 22, lr: 3.41e-02, grad_scale: 0.125
2024-06-19 14:25:52,841 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.65 vs. limit=7.3375
2024-06-19 14:26:00,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.85 vs. limit=11.02
2024-06-19 14:26:08,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=4711.666666666667, ans=0.27914062500000003
2024-06-19 14:26:14,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=4730.0, ans=0.27828125000000004
2024-06-19 14:26:29,080 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.22 vs. limit=9.2875
2024-06-19 14:26:29,507 INFO [train.py:1028] (1/2) Epoch 1, batch 2600, loss[loss=1.05, simple_loss=0.6505, pruned_loss=0.7246, over 13248.00 frames. ], tot_loss[loss=1.007, simple_loss=0.6395, pruned_loss=0.6888, over 2585603.90 frames. ], batch size: 52, lr: 3.40e-02, grad_scale: 0.125
2024-06-19 14:26:36,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=4785.0, ans=0.275703125
2024-06-19 14:26:38,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=4785.0, ans=0.275703125
2024-06-19 14:26:42,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=4785.0, ans=0.79785
2024-06-19 14:26:47,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=9.30125
2024-06-19 14:26:49,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=11.1025
2024-06-19 14:26:52,204 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.214e+04 2.319e+04 2.865e+04 4.018e+04 1.762e+05, threshold=5.729e+04, percent-clipped=20.0
2024-06-19 14:26:52,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=4821.666666666667, ans=0.009821376811594203
2024-06-19 14:26:53,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=4821.666666666667, ans=0.273984375
2024-06-19 14:27:08,996 INFO [train.py:1028] (1/2) Epoch 1, batch 2650, loss[loss=0.9264, simple_loss=0.5979, pruned_loss=0.6274, over 13027.00 frames. ], tot_loss[loss=1.003, simple_loss=0.6365, pruned_loss=0.6856, over 2586195.44 frames. ], batch size: 144, lr: 3.40e-02, grad_scale: 0.03125
2024-06-19 14:27:10,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.12 vs. limit=7.429166666666666
2024-06-19 14:27:11,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=4858.333333333333, ans=0.272265625
2024-06-19 14:27:24,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4895.0, ans=0.27054687499999996
2024-06-19 14:27:33,600 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.61 vs. limit=11.17125
2024-06-19 14:27:35,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=4913.333333333333, ans=0.7280333333333333
2024-06-19 14:27:36,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=4913.333333333333, ans=0.7280333333333333
2024-06-19 14:27:37,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4913.333333333333, ans=0.2508666666666667
2024-06-19 14:27:37,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=4913.333333333333, ans=0.26968749999999997
2024-06-19 14:27:47,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=3.73975
2024-06-19 14:27:50,482 INFO [train.py:1028] (1/2) Epoch 1, batch 2700, loss[loss=0.9077, simple_loss=0.5773, pruned_loss=0.619, over 13248.00 frames. ], tot_loss[loss=0.9963, simple_loss=0.6322, pruned_loss=0.6811, over 2583447.30 frames. ], batch size: 89, lr: 3.39e-02, grad_scale: 0.0625
2024-06-19 14:27:52,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=4950.0, ans=0.04604166666666667
2024-06-19 14:27:53,159 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.96 vs. limit=11.2125
2024-06-19 14:27:53,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=4950.0, ans=0.07
2024-06-19 14:28:07,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=4986.666666666667, ans=0.0
2024-06-19 14:28:08,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.62 vs. limit=11.24
2024-06-19 14:28:08,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.67 vs. limit=7.493333333333334
2024-06-19 14:28:08,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=4986.666666666667, ans=0.025
2024-06-19 14:28:13,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=9.370000000000001
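
The lr field decays smoothly across these lines (3.44e-02 around batch 2050 down to 3.39e-02 here) rather than in steps, which is consistent with an Eden-style schedule where the learning rate is the base rate damped by a power of the batch index. A sketch under that assumption (epoch-dependent damping omitted, since this is early in epoch 1); for suitable base_lr and lr_batches it reproduces the logged values to within rounding:

def eden_lr(base_lr: float, batch: int, lr_batches: float) -> float:
    # Assumed Eden-style batch damping: smooth ~batch^-0.5 decay at large batch,
    # flat near batch 0.
    return base_lr * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
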
2024-06-19 14:28:13,285 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.068e+03 1.255e+04 1.591e+04 2.053e+04 1.587e+05, threshold=3.182e+04, percent-clipped=5.0
2024-06-19 14:28:17,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.24 vs. limit=11.25375
2024-06-19 14:28:19,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=5005.0, ans=0.265390625
2024-06-19 14:28:32,963 INFO [train.py:1028] (1/2) Epoch 1, batch 2750, loss[loss=0.978, simple_loss=0.6167, pruned_loss=0.6697, over 13241.00 frames. ], tot_loss[loss=0.9906, simple_loss=0.6281, pruned_loss=0.6772, over 2579837.38 frames. ], batch size: 43, lr: 3.39e-02, grad_scale: 0.0625
2024-06-19 14:28:37,249 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.63 vs. limit=7.520833333333334
2024-06-19 14:28:39,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.45 vs. limit=11.28125
2024-06-19 14:28:41,865 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.79 vs. limit=11.295
2024-06-19 14:28:42,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=5060.0, ans=0.068375
2024-06-19 14:28:43,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=5060.0, ans=0.2494
2024-06-19 14:28:44,090 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=9.3975
2024-06-19 14:28:46,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=5060.0, ans=0.2628125
2024-06-19 14:29:00,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=5096.666666666667, ans=0.26109375
2024-06-19 14:29:01,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5096.666666666667, ans=0.24903333333333333
2024-06-19 14:29:12,210 INFO [train.py:1028] (1/2) Epoch 1, batch 2800, loss[loss=0.9142, simple_loss=0.6273, pruned_loss=0.6006, over 10779.00 frames. ], tot_loss[loss=0.9854, simple_loss=0.6255, pruned_loss=0.6732, over 2577162.45 frames. ], batch size: 303, lr: 3.39e-02, grad_scale: 0.125
2024-06-19 14:29:17,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=5133.333333333333, ans=9.425
2024-06-19 14:29:21,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=5151.666666666667, ans=0.277275
2024-06-19 14:29:23,726 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=9.431875
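
Each train.py:1028 line reports three numbers per batch: simple_loss (the cheap transducer loss used to estimate alignments), pruned_loss (the full RNN-T loss evaluated only inside the pruned lattice), and loss, their weighted combination. A sketch of the combination, assuming the usual pruned-transducer weighting with a fixed down-weight on the simple term; the scale value here is an assumption, and the logged tot_loss values are running averages, so they match this only approximately:

def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    # Assumed weighting: e.g. 0.5 * 0.6281 + 0.6772 ~= 0.99, roughly the
    # tot_loss shown for batch 2750 above.
    return simple_loss_scale * simple_loss + pruned_loss
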
2024-06-19 14:29:24,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=5151.666666666667, ans=0.258515625
2024-06-19 14:29:25,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=16.29 vs. limit=9.431875
2024-06-19 14:29:27,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=5170.0, ans=0.009745652173913044
2024-06-19 14:29:28,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=5170.0, ans=0.25765625000000003
2024-06-19 14:29:28,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.04 vs. limit=9.43875
2024-06-19 14:29:37,676 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.341e+03 1.417e+04 1.784e+04 2.288e+04 9.876e+04, threshold=3.569e+04, percent-clipped=12.0
2024-06-19 14:29:42,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.30 vs. limit=9.445625
2024-06-19 14:29:46,575 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.75 vs. limit=11.405
2024-06-19 14:29:47,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.18 vs. limit=11.405
2024-06-19 14:29:52,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=5206.666666666667, ans=0.25593750000000004
2024-06-19 14:29:53,828 INFO [train.py:1028] (1/2) Epoch 1, batch 2850, loss[loss=0.971, simple_loss=0.599, pruned_loss=0.6715, over 13309.00 frames. ], tot_loss[loss=0.9809, simple_loss=0.6235, pruned_loss=0.6696, over 2576425.89 frames. ], batch size: 49, lr: 3.38e-02, grad_scale: 0.125
2024-06-19 14:29:54,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=5225.0, ans=0.255078125
2024-06-19 14:30:03,683 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.717e+00
2024-06-19 14:30:05,645 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.46 vs. limit=9.46625
2024-06-19 14:30:05,745 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.04 vs. limit=9.46625
2024-06-19 14:30:06,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=5243.333333333333, ans=0.25421875
2024-06-19 14:30:09,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=5261.666666666667, ans=0.253359375
2024-06-19 14:30:12,977 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.10 vs. limit=9.473125
2024-06-19 14:30:35,535 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.93 vs. limit=9.49375
2024-06-19 14:30:35,783 INFO [train.py:1028] (1/2) Epoch 1, batch 2900, loss[loss=0.973, simple_loss=0.634, pruned_loss=0.656, over 13138.00 frames. ], tot_loss[loss=0.9737, simple_loss=0.6197, pruned_loss=0.6642, over 2584845.28 frames. ], batch size: 55, lr: 3.38e-02, grad_scale: 0.125
2024-06-19 14:30:37,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=5316.666666666667, ans=0.25078125
2024-06-19 14:30:42,741 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.17 vs. limit=11.4875
2024-06-19 14:30:47,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=5335.0, ans=0.24992187500000002
2024-06-19 14:30:50,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=5335.0, ans=3.80025
2024-06-19 14:30:50,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5335.0, ans=0.24664999999999998
2024-06-19 14:30:52,997 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=32.26 vs. limit=7.676666666666666
2024-06-19 14:30:55,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=5353.333333333333, ans=0.24906250000000002
2024-06-19 14:30:55,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=5353.333333333333, ans=0.24906250000000002
2024-06-19 14:31:00,633 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.046e+03 2.074e+04 2.789e+04 4.836e+04 2.596e+05, threshold=5.579e+04, percent-clipped=33.0
2024-06-19 14:31:00,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=5371.666666666667, ans=0.044284722222222225
2024-06-19 14:31:13,613 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=8.923e-02
2024-06-19 14:31:15,770 INFO [train.py:1028] (1/2) Epoch 1, batch 2950, loss[loss=0.9214, simple_loss=0.5831, pruned_loss=0.6299, over 13299.00 frames. ], tot_loss[loss=0.9755, simple_loss=0.6201, pruned_loss=0.6657, over 2580240.20 frames. ], batch size: 43, lr: 3.38e-02, grad_scale: 0.0625
2024-06-19 14:31:25,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=5426.666666666667, ans=0.044055555555555556
2024-06-19 14:31:27,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=5426.666666666667, ans=0.044055555555555556
2024-06-19 14:31:31,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.50 vs. limit=9.535
2024-06-19 14:31:32,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=9.541875000000001
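
Note that tot_loss above is not the current batch's loss: it is a running average weighted by the frame counts printed after each value (the "over 2,5xx,xxx frames." part), which is why it moves slowly while the per-batch loss jumps around. A small sketch of frames-weighted tracking consistent with that display; this is illustrative, not train.py's actual metrics code:

class FrameWeightedLoss:
    def __init__(self):
        self.loss_sum = 0.0   # sum of per-frame losses
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float):
        # batch_loss is an average over batch_frames frames.
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
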
2024-06-19 14:31:34,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=5445.0, ans=0.244765625
2024-06-19 14:31:35,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=5445.0, ans=0.009685869565217392
2024-06-19 14:31:37,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=5445.0, ans=0.043979166666666666
2024-06-19 14:31:43,302 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.00 vs. limit=9.54875
2024-06-19 14:31:44,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=9.54875
2024-06-19 14:31:44,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=5463.333333333333, ans=0.09899494936611666
2024-06-19 14:31:46,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=5463.333333333333, ans=0.24390625
2024-06-19 14:31:50,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=5481.666666666667, ans=0.0
2024-06-19 14:31:52,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.75 vs. limit=11.61125
2024-06-19 14:31:59,947 INFO [train.py:1028] (1/2) Epoch 1, batch 3000, loss[loss=1.069, simple_loss=0.677, pruned_loss=0.7304, over 13232.00 frames. ], tot_loss[loss=0.9714, simple_loss=0.6175, pruned_loss=0.6628, over 2578413.11 frames. ], batch size: 59, lr: 3.37e-02, grad_scale: 0.125
2024-06-19 14:31:59,947 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-19 14:32:07,986 INFO [train.py:1060] (1/2) Epoch 1, validation: loss=1.03, simple_loss=0.6516, pruned_loss=0.704, over 351949.00 frames.
2024-06-19 14:32:07,988 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 16658MB
2024-06-19 14:32:08,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=5500.0, ans=0.2421875
2024-06-19 14:32:17,648 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.67 vs. limit=11.63875
2024-06-19 14:32:23,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.00 vs. limit=9.57625
2024-06-19 14:32:28,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=5536.666666666667, ans=0.2446333333333333
2024-06-19 14:32:31,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.79 vs. limit=9.583124999999999
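
At batch 3000 the trainer pauses to compute a validation loss over the dev set (the "Computing validation loss" / "Epoch 1, validation:" pair above), using the same frames-weighted averaging as the training display. A generic sketch of that interleaving, assuming a standard eval pass; the function names and the loss_fn contract here are illustrative:

import torch

def compute_validation_loss(model, valid_loader, loss_fn):
    # loss_fn returns (summed loss over frames, number of frames) per batch.
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            batch_loss, batch_frames = loss_fn(model, batch)
            loss_sum += float(batch_loss)
            frames += float(batch_frames)
    model.train()
    return loss_sum / max(frames, 1.0)
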
2024-06-19 14:32:32,362 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.151e+04 1.896e+04 2.552e+04 3.548e+04 1.643e+05, threshold=5.104e+04, percent-clipped=9.0
2024-06-19 14:32:32,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=5555.0, ans=0.239609375
2024-06-19 14:32:33,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.66 vs. limit=7.7775
2024-06-19 14:32:43,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.40 vs. limit=9.59
2024-06-19 14:32:50,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=5591.666666666667, ans=0.043368055555555556
2024-06-19 14:32:50,955 INFO [train.py:1028] (1/2) Epoch 1, batch 3050, loss[loss=0.9257, simple_loss=0.5823, pruned_loss=0.6345, over 13320.00 frames. ], tot_loss[loss=0.9667, simple_loss=0.616, pruned_loss=0.6589, over 2578822.42 frames. ], batch size: 46, lr: 3.37e-02, grad_scale: 0.125
2024-06-19 14:32:51,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=5591.666666666667, ans=0.025
2024-06-19 14:32:52,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=6.236666666666666
2024-06-19 14:32:59,857 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.94 vs. limit=7.805
2024-06-19 14:33:01,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=5610.0, ans=0.23703125000000003
2024-06-19 14:33:06,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5628.333333333333, ans=0.24371666666666666
2024-06-19 14:33:12,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=5628.333333333333, ans=0.009646014492753624
2024-06-19 14:33:27,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=32.61 vs. limit=11.748750000000001
2024-06-19 14:33:27,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.45 vs. limit=9.624375
2024-06-19 14:33:28,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=5665.0, ans=0.009638043478260869
2024-06-19 14:33:28,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=32.22 vs. limit=9.624375
2024-06-19 14:33:29,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5683.333333333333, ans=0.23359375
2024-06-19 14:33:29,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=52.01 vs. limit=9.63125
2024-06-19 14:33:29,993 INFO [train.py:1028] (1/2) Epoch 1, batch 3100, loss[loss=0.9218, simple_loss=0.6044, pruned_loss=0.6196, over 13038.00 frames. ], tot_loss[loss=0.963, simple_loss=0.6141, pruned_loss=0.656, over 2579610.20 frames. ], batch size: 144, lr: 3.36e-02, grad_scale: 0.125
2024-06-19 14:33:31,512 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.43 vs. limit=9.63125
2024-06-19 14:33:35,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=5683.333333333333, ans=0.23359375
2024-06-19 14:33:37,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=5701.666666666667, ans=0.2
2024-06-19 14:33:40,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5701.666666666667, ans=0.24298333333333333
2024-06-19 14:33:46,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=5720.0, ans=9.645
2024-06-19 14:33:53,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=5738.333333333333, ans=0.009622101449275363
2024-06-19 14:33:55,512 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.730e+03 1.916e+04 2.485e+04 3.355e+04 2.110e+05, threshold=4.970e+04, percent-clipped=10.0
2024-06-19 14:34:07,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=67.68 vs. limit=9.65875
2024-06-19 14:34:08,635 INFO [train.py:1028] (1/2) Epoch 1, batch 3150, loss[loss=0.9067, simple_loss=0.5979, pruned_loss=0.6077, over 12972.00 frames. ], tot_loss[loss=0.9622, simple_loss=0.6129, pruned_loss=0.6559, over 2580924.40 frames. ], batch size: 158, lr: 3.36e-02, grad_scale: 0.0625
2024-06-19 14:34:09,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=34.89 vs. limit=9.665625
2024-06-19 14:34:14,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=87.60 vs. limit=9.665625
2024-06-19 14:34:24,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=5793.333333333333, ans=0.22843750000000002
2024-06-19 14:34:35,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=5830.0, ans=0.009602173913043478
2024-06-19 14:34:37,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=5830.0, ans=0.2417
2024-06-19 14:34:46,958 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.13 vs. limit=11.88625
2024-06-19 14:34:47,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=5848.333333333333, ans=0.22585937499999997
2024-06-19 14:34:50,749 INFO [train.py:1028] (1/2) Epoch 1, batch 3200, loss[loss=0.9554, simple_loss=0.5938, pruned_loss=0.6585, over 13122.00 frames. ], tot_loss[loss=0.9595, simple_loss=0.611, pruned_loss=0.654, over 2582720.74 frames. ], batch size: 55, lr: 3.36e-02, grad_scale: 0.0625
2024-06-19 14:34:51,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.28 vs. limit=6.466666666666667
2024-06-19 14:34:56,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.17 vs. limit=9.7
2024-06-19 14:34:57,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=5885.0, ans=0.0
2024-06-19 14:35:16,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.80 vs. limit=6.368666666666667
2024-06-19 14:35:16,658 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.35 vs. limit=9.720625
2024-06-19 14:35:17,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=5921.666666666667, ans=0.222421875
2024-06-19 14:35:17,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=5921.666666666667, ans=0.222421875
2024-06-19 14:35:20,186 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.236e+03 1.850e+04 2.496e+04 3.320e+04 2.574e+05, threshold=4.993e+04, percent-clipped=12.0
2024-06-19 14:35:21,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=5921.666666666667, ans=0.222421875
2024-06-19 14:35:21,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=20.30 vs. limit=9.720625
2024-06-19 14:35:22,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=5921.666666666667, ans=0.222421875
2024-06-19 14:35:30,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=5940.0, ans=9.7275
2024-06-19 14:35:32,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=5.191666666666666
2024-06-19 14:35:32,924 INFO [train.py:1028] (1/2) Epoch 1, batch 3250, loss[loss=0.9505, simple_loss=0.5976, pruned_loss=0.6517, over 13207.00 frames. ], tot_loss[loss=0.9568, simple_loss=0.6097, pruned_loss=0.6521, over 2587136.75 frames. ], batch size: 72, lr: 3.35e-02, grad_scale: 0.0625
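
The batch size field in these lines swings widely (from 22 up to 303 nearby) because a duration-bucketing sampler packs each batch to a roughly constant total audio duration rather than a fixed count: many short cuts or few long cuts per batch. A toy sketch of duration-capped packing; this is not lhotse's DynamicBucketingSampler API, just the idea:

def pack_by_duration(durations, max_seconds: float = 600.0):
    # durations: per-cut lengths in seconds; yields lists of cut indices.
    batch, total = [], 0.0
    for idx, dur in enumerate(durations):
        if batch and total + dur > max_seconds:
            yield batch
            batch, total = [], 0.0
        batch.append(idx)
        total += dur
    if batch:
        yield batch
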
], batch size: 72, lr: 3.35e-02, grad_scale: 0.0625 2024-06-19 14:35:43,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=5976.666666666667, ans=0.009570289855072463 2024-06-19 14:35:46,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5976.666666666667, ans=0.24023333333333333 2024-06-19 14:35:48,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=5976.666666666667, ans=0.21984375 2024-06-19 14:35:53,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=5995.0, ans=0.218984375 2024-06-19 14:35:54,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=11.99625 2024-06-19 14:36:03,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.95 vs. limit=9.754999999999999 2024-06-19 14:36:07,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.92 vs. limit=8.015833333333333 2024-06-19 14:36:12,768 INFO [train.py:1028] (1/2) Epoch 1, batch 3300, loss[loss=0.9108, simple_loss=0.597, pruned_loss=0.6123, over 12738.00 frames. ], tot_loss[loss=0.9564, simple_loss=0.6082, pruned_loss=0.6524, over 2583935.52 frames. ], batch size: 176, lr: 3.35e-02, grad_scale: 0.125 2024-06-19 14:36:20,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=6068.333333333333, ans=0.00955036231884058 2024-06-19 14:36:32,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6086.666666666667, ans=0.2391333333333333 2024-06-19 14:36:32,610 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=12.065000000000001 2024-06-19 14:36:37,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=6086.666666666667, ans=9.7825 2024-06-19 14:36:40,179 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.87 vs. limit=8.0525 2024-06-19 14:36:42,071 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.598e+03 1.394e+04 2.186e+04 4.132e+04 2.733e+05, threshold=4.373e+04, percent-clipped=17.0 2024-06-19 14:36:43,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=6105.0, ans=0.21382812499999998 2024-06-19 14:36:46,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=6123.333333333333, ans=0.6856833333333334 2024-06-19 14:36:54,211 INFO [train.py:1028] (1/2) Epoch 1, batch 3350, loss[loss=0.9109, simple_loss=0.6031, pruned_loss=0.6094, over 12974.00 frames. ], tot_loss[loss=0.9489, simple_loss=0.6052, pruned_loss=0.6463, over 2578468.13 frames. 
], batch size: 158, lr: 3.34e-02, grad_scale: 0.125 2024-06-19 14:36:57,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=6.456666666666667 2024-06-19 14:36:59,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=6141.666666666667, ans=0.009534420289855072 2024-06-19 14:37:00,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=3.92125 2024-06-19 14:37:00,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=6141.666666666667, ans=0.212109375 2024-06-19 14:37:01,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=6160.0, ans=0.041 2024-06-19 14:37:01,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.55 vs. limit=9.81 2024-06-19 14:37:05,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6160.0, ans=0.2384 2024-06-19 14:37:14,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.48 vs. limit=9.816875 2024-06-19 14:37:14,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten.whitening_limit, batch_count=6178.333333333333, ans=12.13375 2024-06-19 14:37:16,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=6178.333333333333, ans=0.009526449275362319 2024-06-19 14:37:17,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=6196.666666666667, ans=0.6831166666666667 2024-06-19 14:37:35,294 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.84 vs. limit=9.830625 2024-06-19 14:37:36,688 INFO [train.py:1028] (1/2) Epoch 1, batch 3400, loss[loss=1.044, simple_loss=0.6302, pruned_loss=0.7286, over 12363.00 frames. ], tot_loss[loss=0.9473, simple_loss=0.6045, pruned_loss=0.6451, over 2575850.06 frames. ], batch size: 22, lr: 3.34e-02, grad_scale: 0.0625 2024-06-19 14:37:38,857 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.68 vs. limit=9.8375 2024-06-19 14:37:40,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=29.24 vs. limit=9.8375 2024-06-19 14:37:41,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.57 vs. limit=12.175 2024-06-19 14:37:42,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.65 vs. 
limit=12.175 2024-06-19 14:37:53,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=6270.0, ans=0.20609375000000002 2024-06-19 14:38:04,331 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.282e+03 2.316e+04 2.958e+04 4.175e+04 3.129e+05, threshold=5.917e+04, percent-clipped=22.0 2024-06-19 14:38:12,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.49 vs. limit=9.865 2024-06-19 14:38:15,182 INFO [train.py:1028] (1/2) Epoch 1, batch 3450, loss[loss=0.8858, simple_loss=0.5881, pruned_loss=0.5917, over 12685.00 frames. ], tot_loss[loss=0.944, simple_loss=0.6022, pruned_loss=0.6429, over 2577043.59 frames. ], batch size: 176, lr: 3.34e-02, grad_scale: 0.0625 2024-06-19 14:38:17,145 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.56 vs. limit=6.58125 2024-06-19 14:38:20,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.29 vs. limit=12.24375 2024-06-19 14:38:20,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.48 vs. limit=12.24375 2024-06-19 14:38:32,313 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.62 vs. limit=12.27125 2024-06-19 14:38:36,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=6361.666666666667, ans=0.04015972222222222 2024-06-19 14:38:47,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=6380.0, ans=0.009482608695652175 2024-06-19 14:38:49,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=6398.333333333333, ans=0.030005208333333335 2024-06-19 14:38:51,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=6398.333333333333, ans=0.025 2024-06-19 14:38:51,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=6398.333333333333, ans=0.6760583333333334 2024-06-19 14:38:56,639 INFO [train.py:1028] (1/2) Epoch 1, batch 3500, loss[loss=0.9783, simple_loss=0.6009, pruned_loss=0.6779, over 12884.00 frames. ], tot_loss[loss=0.9439, simple_loss=0.6021, pruned_loss=0.6428, over 2576250.36 frames. ], batch size: 33, lr: 3.33e-02, grad_scale: 0.125 2024-06-19 14:39:06,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=6435.0, ans=0.03985416666666667 2024-06-19 14:39:08,677 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.00 vs. limit=12.32625 2024-06-19 14:39:10,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.10 vs. limit=12.32625 2024-06-19 14:39:12,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.98 vs. 
limit=12.34 2024-06-19 14:39:20,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.39 vs. limit=9.926874999999999 2024-06-19 14:39:20,884 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.83 vs. limit=9.926874999999999 2024-06-19 14:39:25,187 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.867e+03 1.621e+04 1.966e+04 2.572e+04 1.665e+05, threshold=3.932e+04, percent-clipped=4.0 2024-06-19 14:39:26,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=16.63 vs. limit=9.926874999999999 2024-06-19 14:39:26,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=12.35375 2024-06-19 14:39:29,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.86 vs. limit=9.93375 2024-06-19 14:39:30,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=6.596 2024-06-19 14:39:32,468 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=17.86 vs. limit=9.93375 2024-06-19 14:39:33,177 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=23.10 vs. limit=9.93375 2024-06-19 14:39:35,232 INFO [train.py:1028] (1/2) Epoch 1, batch 3550, loss[loss=0.8921, simple_loss=0.5751, pruned_loss=0.6046, over 13129.00 frames. ], tot_loss[loss=0.9417, simple_loss=0.6012, pruned_loss=0.6411, over 2578327.53 frames. ], batch size: 95, lr: 3.33e-02, grad_scale: 0.0625 2024-06-19 14:39:50,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=6.610666666666667 2024-06-19 14:39:53,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=6.618 2024-06-19 14:39:56,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=6545.0, ans=0.029546875000000004 2024-06-19 14:40:01,779 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.49 vs. limit=12.4225 2024-06-19 14:40:02,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=88.11 vs. limit=9.96125 2024-06-19 14:40:09,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=6581.666666666667, ans=0.19148437499999998 2024-06-19 14:40:10,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=6581.666666666667, ans=0.19148437499999998 2024-06-19 14:40:11,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.61 vs. 
2024-06-19 14:40:17,098 INFO [train.py:1028] (1/2) Epoch 1, batch 3600, loss[loss=0.8841, simple_loss=0.5603, pruned_loss=0.604, over 13242.00 frames. ], tot_loss[loss=0.9358, simple_loss=0.5987, pruned_loss=0.6364, over 2580626.47 frames. ], batch size: 49, lr: 3.32e-02, grad_scale: 0.125
2024-06-19 14:40:20,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=6600.0, ans=0.009434782608695652
2024-06-19 14:40:21,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=6600.0, ans=0.0
2024-06-19 14:40:22,934 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=18.26 vs. limit=9.975
2024-06-19 14:40:25,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=23.25 vs. limit=9.981875
2024-06-19 14:40:31,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=6618.333333333333, ans=0.6683583333333334
2024-06-19 14:40:34,071 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=9.98875
2024-06-19 14:40:40,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=6655.0, ans=0.188046875
2024-06-19 14:40:42,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.97 vs. limit=9.995625
2024-06-19 14:40:46,427 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.863e+03 1.410e+04 1.710e+04 2.163e+04 1.056e+05, threshold=3.419e+04, percent-clipped=4.0
2024-06-19 14:40:46,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.91 vs. limit=9.995625
2024-06-19 14:40:47,545 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.36 vs. limit=10.0025
2024-06-19 14:40:49,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=6673.333333333333, ans=0.03886111111111112
2024-06-19 14:40:51,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=6673.333333333333, ans=0.1871875
2024-06-19 14:40:52,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6673.333333333333, ans=0.23326666666666668
2024-06-19 14:40:55,539 INFO [train.py:1028] (1/2) Epoch 1, batch 3650, loss[loss=0.9425, simple_loss=0.6044, pruned_loss=0.6403, over 13145.00 frames. ], tot_loss[loss=0.9379, simple_loss=0.5991, pruned_loss=0.6384, over 2578743.00 frames. ], batch size: 103, lr: 3.32e-02, grad_scale: 0.0625
2024-06-19 14:40:58,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=25.85 vs. limit=10.009375
2024-06-19 14:41:01,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=6691.666666666667, ans=0.186328125
2024-06-19 14:41:01,648 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=10.009375
2024-06-19 14:41:03,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=6710.0, ans=0.0580625
2024-06-19 14:41:04,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=6710.0, ans=10.01625
2024-06-19 14:41:05,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.95 vs. limit=10.01625
2024-06-19 14:41:15,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=6728.333333333333, ans=0.03863194444444445
2024-06-19 14:41:19,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=6728.333333333333, ans=0.6645083333333334
2024-06-19 14:41:19,522 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=12.27 vs. limit=6.682083333333333
2024-06-19 14:41:21,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=10.023125
2024-06-19 14:41:22,713 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.11 vs. limit=10.03
2024-06-19 14:41:30,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=6765.0, ans=0.23235
2024-06-19 14:41:33,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=20.81 vs. limit=10.036875
2024-06-19 14:41:37,655 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=18.24 vs. limit=10.04375
2024-06-19 14:41:38,000 INFO [train.py:1028] (1/2) Epoch 1, batch 3700, loss[loss=0.8794, simple_loss=0.5645, pruned_loss=0.5971, over 13228.00 frames. ], tot_loss[loss=0.934, simple_loss=0.5966, pruned_loss=0.6357, over 2583540.59 frames. ], batch size: 72, lr: 3.31e-02, grad_scale: 0.125
2024-06-19 14:41:40,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.69 vs. limit=10.04375
2024-06-19 14:41:45,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.65 vs. limit=6.7004166666666665
2024-06-19 14:41:49,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=6801.666666666667, ans=0.18117187499999998
2024-06-19 14:41:50,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=6801.666666666667, ans=0.02874479166666667
2024-06-19 14:41:53,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=6820.0, ans=0.009386956521739131
2024-06-19 14:41:57,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=6820.0, ans=0.04949747468305833
2024-06-19 14:41:58,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=6820.0, ans=0.009386956521739131
2024-06-19 14:41:59,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=6820.0, ans=0.1803125
2024-06-19 14:42:11,657 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.024e+03 1.385e+04 1.736e+04 2.252e+04 1.767e+05, threshold=3.473e+04, percent-clipped=14.0
2024-06-19 14:42:14,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=6856.666666666667, ans=0.1
2024-06-19 14:42:14,530 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=12.6425
2024-06-19 14:42:15,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=6856.666666666667, ans=0.05714583333333333
2024-06-19 14:42:18,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.27 vs. limit=10.07125
2024-06-19 14:42:20,014 INFO [train.py:1028] (1/2) Epoch 1, batch 3750, loss[loss=0.9649, simple_loss=0.5885, pruned_loss=0.6707, over 12458.00 frames. ], tot_loss[loss=0.9292, simple_loss=0.5944, pruned_loss=0.6321, over 2586189.17 frames. ], batch size: 22, lr: 3.31e-02, grad_scale: 0.0625
2024-06-19 14:42:23,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=6875.0, ans=0.009375
2024-06-19 14:42:29,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=6893.333333333333, ans=0.6587333333333334
2024-06-19 14:42:30,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=6893.333333333333, ans=0.176875
2024-06-19 14:42:32,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=6893.333333333333, ans=0.176875
2024-06-19 14:42:37,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6911.666666666667, ans=0.23088333333333333
2024-06-19 14:42:47,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=6930.0, ans=0.17515625
2024-06-19 14:42:51,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=6948.333333333333, ans=0.17429687500000002
2024-06-19 14:42:58,006 INFO [train.py:1028] (1/2) Epoch 1, batch 3800, loss[loss=0.9002, simple_loss=0.5762, pruned_loss=0.612, over 13216.00 frames. ], tot_loss[loss=0.9308, simple_loss=0.5948, pruned_loss=0.6334, over 2584672.78 frames. ], batch size: 83, lr: 3.31e-02, grad_scale: 0.125
2024-06-19 14:43:00,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=6966.666666666667, ans=0.025
2024-06-19 14:43:01,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=6966.666666666667, ans=0.23033333333333333
2024-06-19 14:43:06,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=6985.0, ans=0.0375625
2024-06-19 14:43:06,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=53.78 vs. limit=10.119375
2024-06-19 14:43:06,897 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.84 vs. limit=4.04775
2024-06-19 14:43:12,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.54 vs. limit=12.752500000000001
2024-06-19 14:43:18,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=7003.333333333333, ans=0.6548833333333334
2024-06-19 14:43:22,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=12.76625
2024-06-19 14:43:23,165 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-19 14:43:23,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=7021.666666666667, ans=0.6542416666666666
2024-06-19 14:43:25,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=29.71 vs. limit=12.76625
2024-06-19 14:43:31,787 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.430e+03 1.220e+04 1.669e+04 2.499e+04 1.319e+05, threshold=3.339e+04, percent-clipped=11.0
2024-06-19 14:43:33,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=7040.0, ans=10.14
2024-06-19 14:43:39,572 INFO [train.py:1028] (1/2) Epoch 1, batch 3850, loss[loss=0.8045, simple_loss=0.5335, pruned_loss=0.5377, over 13074.00 frames. ], tot_loss[loss=0.9293, simple_loss=0.594, pruned_loss=0.6323, over 2585131.44 frames. ], batch size: 144, lr: 3.30e-02, grad_scale: 0.0625
2024-06-19 14:43:55,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=7095.0, ans=0.167421875
2024-06-19 14:43:55,979 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=212.23 vs. limit=10.160625
2024-06-19 14:43:58,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=7095.0, ans=0.009327173913043479
2024-06-19 14:43:59,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.26 vs. limit=10.160625
2024-06-19 14:43:59,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=7095.0, ans=0.167421875
2024-06-19 14:44:09,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=21.15 vs. limit=10.174375
2024-06-19 14:44:09,850 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.00 vs. limit=10.174375
2024-06-19 14:44:11,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.51 vs. limit=10.174375
2024-06-19 14:44:12,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=7131.666666666667, ans=12.848749999999999
2024-06-19 14:44:13,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=7131.666666666667, ans=0.03695138888888889
2024-06-19 14:44:14,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.66 vs. limit=12.848749999999999
2024-06-19 14:44:16,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=7150.0, ans=0.07
2024-06-19 14:44:16,956 INFO [train.py:1028] (1/2) Epoch 1, batch 3900, loss[loss=0.8768, simple_loss=0.561, pruned_loss=0.5963, over 13239.00 frames. ], tot_loss[loss=0.9323, simple_loss=0.5954, pruned_loss=0.6346, over 2587925.15 frames. ], batch size: 83, lr: 3.30e-02, grad_scale: 0.125
2024-06-19 14:44:18,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.91 vs. limit=6.7875
2024-06-19 14:44:19,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.88 vs. limit=6.7875
2024-06-19 14:44:19,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.08 vs. limit=12.8625
2024-06-19 14:44:22,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=7150.0, ans=0.16484375
2024-06-19 14:44:26,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=7168.333333333333, ans=0.6491083333333334
2024-06-19 14:44:40,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.95 vs. limit=12.89
2024-06-19 14:44:44,638 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=78.58 vs. limit=10.201875
2024-06-19 14:44:45,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=38.49 vs. limit=12.903749999999999
2024-06-19 14:44:48,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.27 vs. limit=12.903749999999999
2024-06-19 14:44:51,383 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.119e+03 1.522e+04 1.834e+04 2.422e+04 1.135e+05, threshold=3.668e+04, percent-clipped=10.0
2024-06-19 14:44:54,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=7223.333333333333, ans=0.16140624999999997
2024-06-19 14:44:58,973 INFO [train.py:1028] (1/2) Epoch 1, batch 3950, loss[loss=0.8152, simple_loss=0.5301, pruned_loss=0.5501, over 13112.00 frames. ], tot_loss[loss=0.9273, simple_loss=0.5922, pruned_loss=0.6312, over 2588894.83 frames. ], batch size: 132, lr: 3.29e-02, grad_scale: 0.125
2024-06-19 14:45:00,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=27.33 vs. limit=10.215625
2024-06-19 14:45:01,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=88.87 vs. limit=12.93125
2024-06-19 14:45:03,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=7241.666666666667, ans=0.16054687499999998
2024-06-19 14:45:16,191 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.45 vs. limit=6.819583333333333
2024-06-19 14:45:16,344 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.44 vs. limit=10.229375000000001
2024-06-19 14:45:16,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=7278.333333333333, ans=0.158828125
2024-06-19 14:45:17,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.64 vs. limit=8.639166666666666
2024-06-19 14:45:17,946 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.91 vs. limit=6.819583333333333
2024-06-19 14:45:21,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=28.92 vs. limit=10.23625
2024-06-19 14:45:23,626 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=1.084e-02
2024-06-19 14:45:26,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=7296.666666666667, ans=0.15796875
2024-06-19 14:45:26,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.38 vs. limit=12.9725
2024-06-19 14:45:28,002 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.31 vs. limit=8.648333333333333
2024-06-19 14:45:29,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7315.0, ans=0.22685
2024-06-19 14:45:43,043 INFO [train.py:1028] (1/2) Epoch 1, batch 4000, loss[loss=0.9299, simple_loss=0.578, pruned_loss=0.6409, over 12952.00 frames. ], tot_loss[loss=0.9234, simple_loss=0.5903, pruned_loss=0.6283, over 2583222.76 frames. ], batch size: 39, lr: 3.29e-02, grad_scale: 0.25
2024-06-19 14:45:49,605 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=52.14 vs. limit=13.0
2024-06-19 14:45:55,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.51 vs. limit=10.256875
2024-06-19 14:46:05,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=7370.0, ans=0.053937500000000006
2024-06-19 14:46:07,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=7370.0, ans=0.64205
2024-06-19 14:46:10,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=7388.333333333333, ans=0.035881944444444445
2024-06-19 14:46:12,932 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.56 vs. limit=13.04125
2024-06-19 14:46:18,766 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.068e+04 1.900e+04 2.371e+04 3.165e+04 1.518e+05, threshold=4.741e+04, percent-clipped=16.0
2024-06-19 14:46:21,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7406.666666666667, ans=0.15281250000000002
2024-06-19 14:46:24,598 INFO [train.py:1028] (1/2) Epoch 1, batch 4050, loss[loss=0.8715, simple_loss=0.5945, pruned_loss=0.5742, over 10997.00 frames. ], tot_loss[loss=0.9213, simple_loss=0.5892, pruned_loss=0.6267, over 2580837.56 frames. ], batch size: 304, lr: 3.28e-02, grad_scale: 0.0625
2024-06-19 14:46:31,976 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=43.76 vs. limit=10.29125
2024-06-19 14:46:34,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=7443.333333333333, ans=0.15109375000000003
2024-06-19 14:46:34,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7443.333333333333, ans=0.15109375000000003
2024-06-19 14:46:38,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7443.333333333333, ans=0.22556666666666667
2024-06-19 14:46:43,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.27 vs. limit=10.298125
2024-06-19 14:46:45,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.69 vs. limit=8.730833333333333
2024-06-19 14:46:46,901 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=37.79 vs. limit=8.74
2024-06-19 14:46:53,995 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=6.992
2024-06-19 14:46:55,659 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. limit=4.122
2024-06-19 14:47:04,915 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.57 vs. limit=10.311875
2024-06-19 14:47:05,071 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=6.874583333333334
2024-06-19 14:47:06,046 INFO [train.py:1028] (1/2) Epoch 1, batch 4100, loss[loss=0.9172, simple_loss=0.5905, pruned_loss=0.6219, over 13016.00 frames. ], tot_loss[loss=0.921, simple_loss=0.5893, pruned_loss=0.6264, over 2576332.78 frames. ], batch size: 102, lr: 3.28e-02, grad_scale: 0.125
2024-06-19 14:47:10,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=7516.666666666667, ans=0.035347222222222224
2024-06-19 14:47:13,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=7535.0, ans=0.035270833333333335
2024-06-19 14:47:13,328 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.78 vs. limit=10.325625
2024-06-19 14:47:15,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=7535.0, ans=0.146796875
2024-06-19 14:47:16,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=7535.0, ans=0.09899494936611666
2024-06-19 14:47:20,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=7535.0, ans=0.009231521739130435
2024-06-19 14:47:27,858 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=33.50 vs. limit=10.3325
2024-06-19 14:47:28,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=7553.333333333333, ans=0.0
2024-06-19 14:47:28,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=7553.333333333333, ans=0.6356333333333334
2024-06-19 14:47:34,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=13.17875
2024-06-19 14:47:35,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=7571.666666666667, ans=0.145078125
2024-06-19 14:47:42,728 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.49 vs. limit=10.34625
2024-06-19 14:47:43,046 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.462e+03 1.364e+04 2.076e+04 3.257e+04 1.638e+05, threshold=4.152e+04, percent-clipped=12.0
2024-06-19 14:47:50,560 INFO [train.py:1028] (1/2) Epoch 1, batch 4150, loss[loss=0.9035, simple_loss=0.5737, pruned_loss=0.6166, over 13083.00 frames. ], tot_loss[loss=0.9223, simple_loss=0.5894, pruned_loss=0.6276, over 2576554.92 frames. ], batch size: 55, lr: 3.27e-02, grad_scale: 0.125
2024-06-19 14:47:53,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=7608.333333333333, ans=0.314125
2024-06-19 14:48:05,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.14 vs. limit=10.36
2024-06-19 14:48:29,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=7681.666666666667, ans=0.6311416666666667
2024-06-19 14:48:36,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=7681.666666666667, ans=0.13992187499999997
2024-06-19 14:48:40,970 INFO [train.py:1028] (1/2) Epoch 1, batch 4200, loss[loss=0.8317, simple_loss=0.5437, pruned_loss=0.5599, over 13064.00 frames. ], tot_loss[loss=0.9195, simple_loss=0.5879, pruned_loss=0.6255, over 2579373.79 frames. ], batch size: 102, lr: 3.27e-02, grad_scale: 0.0625
2024-06-19 14:48:45,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=7700.0, ans=0.13906249999999998
2024-06-19 14:48:49,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.79 vs. limit=4.15775
2024-06-19 14:48:54,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=7718.333333333333, ans=0.13820312499999998
2024-06-19 14:48:59,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=7736.666666666667, ans=0.00918768115942029
2024-06-19 14:49:01,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.23 vs. limit=10.401250000000001
2024-06-19 14:49:02,014 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.476e-01
2024-06-19 14:49:12,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.96 vs. limit=10.415
2024-06-19 14:49:16,007 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.831e+03 1.227e+04 1.611e+04 2.697e+04 3.867e+05, threshold=3.223e+04, percent-clipped=11.0
2024-06-19 14:49:17,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=7773.333333333333, ans=0.03427777777777778
2024-06-19 14:49:18,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=38.69 vs. limit=8.886666666666667
2024-06-19 14:49:19,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=7773.333333333333, ans=0.135625
2024-06-19 14:49:20,511 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.73 vs. limit=10.421875
2024-06-19 14:49:20,803 INFO [train.py:1028] (1/2) Epoch 1, batch 4250, loss[loss=0.8545, simple_loss=0.5431, pruned_loss=0.583, over 13291.00 frames. ], tot_loss[loss=0.9163, simple_loss=0.5866, pruned_loss=0.623, over 2583224.48 frames. ], batch size: 46, lr: 3.26e-02, grad_scale: 0.0625
2024-06-19 14:49:29,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.82 vs. limit=4.16875
2024-06-19 14:49:45,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.29 vs. limit=10.435625
2024-06-19 14:49:47,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=7846.666666666667, ans=0.1321875
2024-06-19 14:49:51,492 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=35.67 vs. limit=13.385
2024-06-19 14:49:53,201 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=13.385
2024-06-19 14:50:02,745 INFO [train.py:1028] (1/2) Epoch 1, batch 4300, loss[loss=0.9126, simple_loss=0.5764, pruned_loss=0.6244, over 13184.00 frames. ], tot_loss[loss=0.915, simple_loss=0.5859, pruned_loss=0.6221, over 2582647.20 frames. ], batch size: 59, lr: 3.26e-02, grad_scale: 0.0625
2024-06-19 14:50:05,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.43 vs. limit=13.4125
2024-06-19 14:50:06,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=7883.333333333333, ans=0.6240833333333333
2024-06-19 14:50:06,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=7883.333333333333, ans=0.09899494936611666
2024-06-19 14:50:06,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=20.40 vs. limit=10.45625
2024-06-19 14:50:09,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7901.666666666667, ans=0.2209833333333333
2024-06-19 14:50:22,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=7920.0, ans=0.2208
2024-06-19 14:50:26,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.63 vs. limit=10.476875
2024-06-19 14:50:28,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=7938.333333333333, ans=0.009143840579710145
2024-06-19 14:50:32,226 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.74 vs. limit=13.45375
2024-06-19 14:50:32,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=7938.333333333333, ans=0.12789062499999998
2024-06-19 14:50:37,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=7956.666666666667, ans=0.12703124999999998
2024-06-19 14:50:38,637 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.302e+03 1.350e+04 1.658e+04 2.610e+04 2.186e+05, threshold=3.317e+04, percent-clipped=21.0
2024-06-19 14:50:42,937 INFO [train.py:1028] (1/2) Epoch 1, batch 4350, loss[loss=0.9539, simple_loss=0.6155, pruned_loss=0.6461, over 13215.00 frames. ], tot_loss[loss=0.9104, simple_loss=0.584, pruned_loss=0.6184, over 2586846.65 frames. ], batch size: 59, lr: 3.26e-02, grad_scale: 0.0625
2024-06-19 14:50:44,308 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.23 vs. limit=13.48125
2024-06-19 14:50:50,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.26 vs. limit=13.48125
2024-06-19 14:50:59,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.77 vs. limit=6.998333333333333
2024-06-19 14:51:04,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=7993.333333333333, ans=0.1253125
2024-06-19 14:51:08,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=8011.666666666667, ans=0.6195916666666667
2024-06-19 14:51:16,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.07 vs. limit=10.51125
2024-06-19 14:51:17,888 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.45 vs. limit=10.51125
2024-06-19 14:51:18,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=8030.0, ans=0.61895
2024-06-19 14:51:29,764 INFO [train.py:1028] (1/2) Epoch 1, batch 4400, loss[loss=0.86, simple_loss=0.5519, pruned_loss=0.584, over 13251.00 frames. ], tot_loss[loss=0.9084, simple_loss=0.5828, pruned_loss=0.617, over 2585418.47 frames. ], batch size: 83, lr: 3.25e-02, grad_scale: 0.125
2024-06-19 14:51:42,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=8085.0, ans=0.617025
2024-06-19 14:52:03,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.17 vs. limit=13.591249999999999
2024-06-19 14:52:04,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=62.27 vs. limit=10.545625
2024-06-19 14:52:16,351 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.629e+03 1.189e+04 1.635e+04 2.249e+04 9.656e+04, threshold=3.270e+04, percent-clipped=5.0
2024-06-19 14:52:21,369 INFO [train.py:1028] (1/2) Epoch 1, batch 4450, loss[loss=0.9049, simple_loss=0.5635, pruned_loss=0.6232, over 12815.00 frames. ], tot_loss[loss=0.91, simple_loss=0.5836, pruned_loss=0.6182, over 2580087.34 frames. ], batch size: 33, lr: 3.25e-02, grad_scale: 0.125
2024-06-19 14:52:22,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.14 vs. limit=7.263333333333334
2024-06-19 14:52:24,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.18 vs. limit=13.61875
2024-06-19 14:52:39,160 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.32 vs. limit=10.573125000000001
2024-06-19 14:52:40,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.95 vs. limit=7.04875
2024-06-19 14:52:44,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=8195.0, ans=0.025
2024-06-19 14:52:51,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.33 vs. limit=7.285333333333334
2024-06-19 14:52:51,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=8213.333333333334, ans=0.009084057971014492
2024-06-19 14:52:51,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=8213.333333333334, ans=0.21786666666666665
2024-06-19 14:52:54,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=8231.666666666666, ans=0.125
2024-06-19 14:53:00,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=8231.666666666666, ans=0.0
2024-06-19 14:53:03,267 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.42 vs. limit=10.586875
2024-06-19 14:53:04,413 INFO [train.py:1028] (1/2) Epoch 1, batch 4500, loss[loss=0.8314, simple_loss=0.5255, pruned_loss=0.5686, over 13289.00 frames. ], tot_loss[loss=0.9095, simple_loss=0.583, pruned_loss=0.618, over 2585241.08 frames. ], batch size: 89, lr: 3.24e-02, grad_scale: 0.125
2024-06-19 14:53:05,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=8250.0, ans=0.125
2024-06-19 14:53:05,958 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=10.59375
2024-06-19 14:53:10,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=8250.0, ans=0.0
2024-06-19 14:53:15,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=8268.333333333334, ans=0.025
2024-06-19 14:53:28,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. limit=4.243
2024-06-19 14:53:28,483 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.54 vs. limit=10.6075
2024-06-19 14:53:31,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.12 vs. limit=10.614374999999999
2024-06-19 14:53:39,437 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=10.614374999999999
2024-06-19 14:53:45,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8323.333333333334, ans=0.21676666666666666
2024-06-19 14:53:46,548 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.264e+03 1.316e+04 1.672e+04 2.176e+04 6.204e+04, threshold=3.345e+04, percent-clipped=10.0
2024-06-19 14:53:47,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=10.62125
2024-06-19 14:53:49,837 INFO [train.py:1028] (1/2) Epoch 1, batch 4550, loss[loss=0.947, simple_loss=0.5968, pruned_loss=0.6486, over 13270.00 frames. ], tot_loss[loss=0.9116, simple_loss=0.5841, pruned_loss=0.6196, over 2589617.27 frames. ], batch size: 52, lr: 3.24e-02, grad_scale: 0.125
2024-06-19 14:53:52,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8341.666666666666, ans=0.21658333333333335
2024-06-19 14:53:56,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=8341.666666666666, ans=0.125
2024-06-19 14:54:01,897 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=10.635
2024-06-19 14:54:03,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=8360.0, ans=0.0
2024-06-19 14:54:07,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=10.641875
2024-06-19 14:54:13,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=8396.666666666666, ans=0.125
2024-06-19 14:54:17,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.32 vs. limit=10.64875
2024-06-19 14:54:31,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.91 vs. limit=10.655625
2024-06-19 14:54:33,796 INFO [train.py:1028] (1/2) Epoch 1, batch 4600, loss[loss=0.9099, simple_loss=0.5964, pruned_loss=0.6117, over 12577.00 frames. ], tot_loss[loss=0.9107, simple_loss=0.5829, pruned_loss=0.6192, over 2585287.98 frames. ], batch size: 202, lr: 3.23e-02, grad_scale: 0.25
2024-06-19 14:54:36,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=48.36 vs. limit=10.662500000000001
2024-06-19 14:54:36,671 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=25.23 vs. limit=10.662500000000001
2024-06-19 14:54:44,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=8451.666666666666, ans=0.125
2024-06-19 14:54:47,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.65 vs. limit=10.669374999999999
2024-06-19 14:54:50,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=8470.0, ans=0.125
2024-06-19 14:54:55,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.29 vs. limit=10.67625
2024-06-19 14:54:56,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=15.25 vs. limit=10.67625
2024-06-19 14:55:00,034 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.95 vs. limit=13.86625
2024-06-19 14:55:00,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.12 vs. limit=10.683125
2024-06-19 14:55:01,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=8488.333333333334, ans=0.00902427536231884
2024-06-19 14:55:10,336 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.269e+03 1.447e+04 1.887e+04 2.301e+04 1.060e+05, threshold=3.774e+04, percent-clipped=7.0
2024-06-19 14:55:13,240 INFO [train.py:1028] (1/2) Epoch 1, batch 4650, loss[loss=0.8744, simple_loss=0.5708, pruned_loss=0.5891, over 13128.00 frames. ], tot_loss[loss=0.9087, simple_loss=0.582, pruned_loss=0.6178, over 2588232.09 frames. ], batch size: 132, lr: 3.23e-02, grad_scale: 0.25
2024-06-19 14:55:19,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=8525.0, ans=0.125
2024-06-19 14:55:23,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=8543.333333333334, ans=0.031069444444444445
2024-06-19 14:55:36,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=8561.666666666666, ans=0.025
2024-06-19 14:55:43,276 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.25 vs. limit=13.934999999999999
2024-06-19 14:55:53,383 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.76 vs. limit=13.94875
2024-06-19 14:55:54,349 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=45.04 vs. limit=10.724375
2024-06-19 14:55:56,865 INFO [train.py:1028] (1/2) Epoch 1, batch 4700, loss[loss=0.9549, simple_loss=0.5923, pruned_loss=0.6587, over 12358.00 frames. ], tot_loss[loss=0.9071, simple_loss=0.5811, pruned_loss=0.6165, over 2585698.69 frames. ], batch size: 25, lr: 3.22e-02, grad_scale: 0.25
2024-06-19 14:56:06,641 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=1.614e-02
2024-06-19 14:56:06,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.54 vs. limit=13.962499999999999
2024-06-19 14:56:17,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8635.0, ans=0.21365
2024-06-19 14:56:26,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=8671.666666666666, ans=0.125
2024-06-19 14:56:28,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=8671.666666666666, ans=0.025
2024-06-19 14:56:34,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=8690.0, ans=0.125
2024-06-19 14:56:38,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=8690.0, ans=0.030458333333333337
2024-06-19 14:56:42,698 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.233e+03 1.564e+04 2.027e+04 2.703e+04 1.479e+05, threshold=4.054e+04, percent-clipped=10.0
2024-06-19 14:56:42,732 INFO [train.py:1028] (1/2) Epoch 1, batch 4750, loss[loss=0.8095, simple_loss=0.5466, pruned_loss=0.5362, over 12597.00 frames. ], tot_loss[loss=0.9036, simple_loss=0.5802, pruned_loss=0.6135, over 2582542.08 frames. ], batch size: 202, lr: 3.22e-02, grad_scale: 0.03125
2024-06-19 14:56:42,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=8708.333333333334, ans=0.125
2024-06-19 14:56:48,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=8708.333333333334, ans=0.125
2024-06-19 14:57:01,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=8726.666666666666, ans=0.21273333333333333
2024-06-19 14:57:03,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8726.666666666666, ans=0.21273333333333333
2024-06-19 14:57:04,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=5.745333333333333
2024-06-19 14:57:08,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=8745.0, ans=0.008968478260869566
2024-06-19 14:57:08,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=8745.0, ans=0.030229166666666668
2024-06-19 14:57:09,197 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.13 vs. limit=14.05875
2024-06-19 14:57:16,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=8763.333333333334, ans=0.8376333333333333
2024-06-19 14:57:23,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=8763.333333333334, ans=0.125
2024-06-19 14:57:26,410 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=10.793125
2024-06-19 14:57:33,288 INFO [train.py:1028] (1/2) Epoch 1, batch 4800, loss[loss=0.8777, simple_loss=0.5562, pruned_loss=0.5997, over 13222.00 frames. ], tot_loss[loss=0.9023, simple_loss=0.5795, pruned_loss=0.6126, over 2578036.95 frames. ], batch size: 63, lr: 3.21e-02, grad_scale: 0.0625
2024-06-19 14:57:40,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=8818.333333333334, ans=0.008952536231884058
2024-06-19 14:57:50,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=8836.666666666666, ans=0.0
2024-06-19 14:57:51,080 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=30.90 vs. limit=10.81375
2024-06-19 14:57:55,067 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.11 vs. limit=9.418333333333333
2024-06-19 14:58:14,996 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.507e+03 1.145e+04 1.437e+04 1.934e+04 1.429e+05, threshold=2.875e+04, percent-clipped=5.0
2024-06-19 14:58:15,027 INFO [train.py:1028] (1/2) Epoch 1, batch 4850, loss[loss=0.9004, simple_loss=0.5752, pruned_loss=0.6128, over 13230.00 frames. ], tot_loss[loss=0.9027, simple_loss=0.5796, pruned_loss=0.6129, over 2575468.10 frames. ], batch size: 89, lr: 3.21e-02, grad_scale: 0.0625
2024-06-19 14:58:27,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=8910.0, ans=0.125
2024-06-19 14:58:27,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=8910.0, ans=0.02954166666666667
2024-06-19 14:58:43,528 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=28.53 vs. limit=10.848125
2024-06-19 14:58:55,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.05 vs. limit=10.861875
2024-06-19 14:59:04,813 INFO [train.py:1028] (1/2) Epoch 1, batch 4900, loss[loss=0.9873, simple_loss=0.6203, pruned_loss=0.6771, over 13183.00 frames. ], tot_loss[loss=0.9044, simple_loss=0.5797, pruned_loss=0.6145, over 2575748.14 frames. ], batch size: 59, lr: 3.20e-02, grad_scale: 0.125
2024-06-19 14:59:08,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.89 vs. limit=14.2375
2024-06-19 14:59:13,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.20 vs. limit=10.875625
2024-06-19 14:59:17,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=9001.666666666666, ans=0.5849416666666667
2024-06-19 14:59:18,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=9001.666666666666, ans=0.0
2024-06-19 14:59:21,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9020.0, ans=0.2098
2024-06-19 14:59:31,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9020.0, ans=0.125
2024-06-19 14:59:32,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.03 vs. limit=14.265
2024-06-19 14:59:32,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.26 vs. limit=10.8825
2024-06-19 14:59:35,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.89 vs. limit=14.265
2024-06-19 14:59:38,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=33.53 vs. limit=10.889375
2024-06-19 14:59:42,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=9038.333333333334, ans=0.125
2024-06-19 14:59:50,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9056.666666666666, ans=0.125
2024-06-19 14:59:50,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=9056.666666666666, ans=0.125
2024-06-19 14:59:50,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.33 vs. limit=14.2925
2024-06-19 14:59:51,085 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.072e-01
2024-06-19 14:59:51,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=9056.666666666666, ans=0.025
2024-06-19 14:59:53,193 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.526e+03 8.688e+03 1.131e+04 1.534e+04 7.408e+04, threshold=2.262e+04, percent-clipped=5.0
2024-06-19 14:59:53,227 INFO [train.py:1028] (1/2) Epoch 1, batch 4950, loss[loss=0.8144, simple_loss=0.5516, pruned_loss=0.5386, over 11002.00 frames. ], tot_loss[loss=0.9008, simple_loss=0.5786, pruned_loss=0.6115, over 2570387.52 frames. ], batch size: 303, lr: 3.20e-02, grad_scale: 0.125
2024-06-19 14:59:54,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.08 vs. limit=14.30625
2024-06-19 14:59:54,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.44 vs. limit=10.903125
limit=10.903125 2024-06-19 14:59:59,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.31 vs. limit=10.903125 2024-06-19 15:00:02,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=31.01 vs. limit=14.32 2024-06-19 15:00:12,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=9111.666666666666, ans=0.028701388888888894 2024-06-19 15:00:13,990 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=10.916875000000001 2024-06-19 15:00:27,880 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.57 vs. limit=14.36125 2024-06-19 15:00:33,833 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.55 vs. limit=10.9375 2024-06-19 15:00:34,342 INFO [train.py:1028] (1/2) Epoch 1, batch 5000, loss[loss=0.9177, simple_loss=0.5945, pruned_loss=0.6204, over 13156.00 frames. ], tot_loss[loss=0.9001, simple_loss=0.578, pruned_loss=0.6111, over 2574308.74 frames. ], batch size: 95, lr: 3.19e-02, grad_scale: 0.125 2024-06-19 15:00:49,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.61 vs. limit=7.296250000000001 2024-06-19 15:00:52,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=25.23 vs. limit=10.95125 2024-06-19 15:00:53,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.22 vs. limit=9.601666666666667 2024-06-19 15:01:00,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9203.333333333334, ans=0.125 2024-06-19 15:01:10,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=9221.666666666666, ans=10.958124999999999 2024-06-19 15:01:17,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=14.21 vs. limit=7.3100000000000005 2024-06-19 15:01:18,041 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.95 vs. limit=14.43 2024-06-19 15:01:18,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.81 vs. limit=10.965 2024-06-19 15:01:24,618 INFO [train.py:1028] (1/2) Epoch 1, batch 5050, loss[loss=0.9403, simple_loss=0.5915, pruned_loss=0.6446, over 12912.00 frames. ], tot_loss[loss=0.9025, simple_loss=0.5785, pruned_loss=0.6132, over 2572982.61 frames. 
], batch size: 36, lr: 3.19e-02, grad_scale: 0.0625 2024-06-19 15:01:26,233 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.044e+03 1.222e+04 1.607e+04 2.151e+04 1.008e+05, threshold=3.214e+04, percent-clipped=21.0 2024-06-19 15:01:26,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=9258.333333333334, ans=0.02809027777777778 2024-06-19 15:01:27,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=105.94 vs. limit=10.971875 2024-06-19 15:01:31,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=9258.333333333334, ans=0.09899494936611666 2024-06-19 15:01:37,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9276.666666666666, ans=0.20723333333333332 2024-06-19 15:01:39,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=9276.666666666666, ans=0.125 2024-06-19 15:01:39,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=9276.666666666666, ans=0.028013888888888894 2024-06-19 15:01:46,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=36.24 vs. limit=14.471250000000001 2024-06-19 15:01:49,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=9313.333333333334, ans=7.328333333333333 2024-06-19 15:01:50,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=9313.333333333334, ans=0.125 2024-06-19 15:01:51,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=9313.333333333334, ans=0.125 2024-06-19 15:01:56,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=9313.333333333334, ans=0.125 2024-06-19 15:01:57,433 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=25.99 vs. limit=10.9925 2024-06-19 15:02:07,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.61 vs. limit=7.332916666666666 2024-06-19 15:02:09,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=9331.666666666666, ans=0.008840942028985508 2024-06-19 15:02:12,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.94 vs. limit=7.332916666666666 2024-06-19 15:02:13,352 INFO [train.py:1028] (1/2) Epoch 1, batch 5100, loss[loss=1.003, simple_loss=0.6208, pruned_loss=0.6926, over 13287.00 frames. ], tot_loss[loss=0.9018, simple_loss=0.5784, pruned_loss=0.6126, over 2571039.78 frames. ], batch size: 40, lr: 3.18e-02, grad_scale: 0.125 2024-06-19 15:02:17,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.08 vs. 
limit=11.00625 2024-06-19 15:02:17,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=9350.0, ans=7.3375 2024-06-19 15:02:23,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=9368.333333333334, ans=0.5721083333333334 2024-06-19 15:02:27,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=22.54 vs. limit=9.684166666666666 2024-06-19 15:02:32,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=9386.666666666666, ans=0.125 2024-06-19 15:02:34,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.31 vs. limit=11.02 2024-06-19 15:02:35,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=9386.666666666666, ans=0.125 2024-06-19 15:02:36,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.99 vs. limit=11.02 2024-06-19 15:02:36,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.20 vs. limit=4.4079999999999995 2024-06-19 15:02:39,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9405.0, ans=0.20595 2024-06-19 15:02:40,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=9405.0, ans=0.02 2024-06-19 15:02:40,989 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=24.14 vs. limit=11.026875 2024-06-19 15:02:45,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=9423.333333333334, ans=0.125 2024-06-19 15:02:50,915 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=32.19 vs. limit=11.03375 2024-06-19 15:02:53,561 INFO [train.py:1028] (1/2) Epoch 1, batch 5150, loss[loss=0.7908, simple_loss=0.5201, pruned_loss=0.5307, over 13083.00 frames. ], tot_loss[loss=0.9013, simple_loss=0.5786, pruned_loss=0.612, over 2573236.77 frames. ], batch size: 132, lr: 3.18e-02, grad_scale: 0.0625 2024-06-19 15:02:56,003 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.970e+03 1.167e+04 1.854e+04 2.650e+04 1.531e+05, threshold=3.709e+04, percent-clipped=18.0 2024-06-19 15:03:02,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.83 vs. limit=11.0475 2024-06-19 15:03:05,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=9460.0, ans=0.2054 2024-06-19 15:03:06,515 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.79 vs. 
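limit=11.0475

The balancer entries among the ScheduledFloat lines (balancer1.prob, balancer2.prob, hidden_balancer.prob, min_abs/max_abs and so on) belong to Balancer modules, which keep per-channel activation statistics inside a target range, roughly the fraction of positive values and the mean absolute value of each channel, and apply a correction in the backward pass only with the scheduled probability prob. A rough sketch of the statistics involved (the function name and the mechanics below are illustrative assumptions, not the scaling.py implementation):

    import torch

    def balancer_stats(x: torch.Tensor, channel_dim: int = -1):
        # Per-channel statistics that a Balancer constrains: the proportion
        # of positive activations (kept between min_positive/max_positive)
        # and the mean absolute value (kept between min_abs/max_abs, e.g.
        # the balancer2.min_abs ans=0.363625 logged further down).
        dims = tuple(d for d in range(x.ndim) if d != channel_dim % x.ndim)
        proportion_positive = (x > 0.0).float().mean(dim=dims)
        mean_abs = x.abs().mean(dim=dims)
        return proportion_positive, mean_abs

    p, m = balancer_stats(torch.randn(100, 384))  # two (384,) tensors
    # With probability `prob`, channels drifting outside the configured
    # ranges receive a small corrective gradient during backprop.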
2024-06-19 15:03:16,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9478.333333333334, ans=0.20521666666666666 2024-06-19 15:03:24,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9496.666666666666, ans=0.20503333333333335 2024-06-19 15:03:26,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=9496.666666666666, ans=0.125 2024-06-19 15:03:27,102 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=3.277e+02 2024-06-19 15:03:27,417 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=22.22 vs. limit=11.06125 2024-06-19 15:03:40,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.32 vs. limit=11.068125 2024-06-19 15:03:42,986 INFO [train.py:1028] (1/2) Epoch 1, batch 5200, loss[loss=0.8809, simple_loss=0.5609, pruned_loss=0.6005, over 13173.00 frames. ], tot_loss[loss=0.9024, simple_loss=0.5785, pruned_loss=0.6131, over 2576554.76 frames. ], batch size: 95, lr: 3.17e-02, grad_scale: 0.125 2024-06-19 15:03:44,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.21 vs. limit=11.075 2024-06-19 15:03:47,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=9533.333333333334, ans=0.025 2024-06-19 15:03:47,932 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=39.92 vs. limit=11.075 2024-06-19 15:03:49,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.09 vs. limit=11.075 2024-06-19 15:03:50,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=23.79 vs. limit=11.081875 2024-06-19 15:04:12,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9588.333333333334, ans=0.20411666666666667 2024-06-19 15:04:15,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.31 vs. limit=11.095625 2024-06-19 15:04:19,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9606.666666666666, ans=0.125 2024-06-19 15:04:23,511 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.35 vs. limit=14.705 2024-06-19 15:04:25,533 INFO [train.py:1028] (1/2) Epoch 1, batch 5250, loss[loss=0.8871, simple_loss=0.5648, pruned_loss=0.6047, over 13233.00 frames. ], tot_loss[loss=0.9016, simple_loss=0.5773, pruned_loss=0.6129, over 2571029.10 frames.
], batch size: 52, lr: 3.17e-02, grad_scale: 0.0625 2024-06-19 15:04:28,979 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.544e+03 1.334e+04 1.737e+04 2.229e+04 1.798e+05, threshold=3.474e+04, percent-clipped=8.0 2024-06-19 15:04:30,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9625.0, ans=0.20375 2024-06-19 15:04:47,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=9661.666666666666, ans=0.125 2024-06-19 15:04:56,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=9680.0, ans=0.125 2024-06-19 15:04:59,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=40.17 vs. limit=14.76 2024-06-19 15:05:08,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.00 vs. limit=7.4245833333333335 2024-06-19 15:05:10,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=11.136875 2024-06-19 15:05:12,965 INFO [train.py:1028] (1/2) Epoch 1, batch 5300, loss[loss=0.821, simple_loss=0.5383, pruned_loss=0.5519, over 13014.00 frames. ], tot_loss[loss=0.9026, simple_loss=0.5779, pruned_loss=0.6137, over 2567645.24 frames. ], batch size: 144, lr: 3.16e-02, grad_scale: 0.125 2024-06-19 15:05:17,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=9716.666666666666, ans=0.125 2024-06-19 15:05:19,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=28.15 vs. limit=9.858333333333333 2024-06-19 15:05:19,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=22.87 vs. limit=11.14375 2024-06-19 15:05:25,516 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.77 vs. limit=14.80125 2024-06-19 15:05:31,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=9753.333333333334, ans=0.00874927536231884 2024-06-19 15:05:39,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.07 vs. 
limit=11.164375 2024-06-19 15:05:45,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=9771.666666666666, ans=0.125 2024-06-19 15:05:46,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9790.0, ans=0.2021 2024-06-19 15:05:49,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=9790.0, ans=0.125 2024-06-19 15:05:52,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=9790.0, ans=0.008741304347826087 2024-06-19 15:05:54,290 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=11.17125 2024-06-19 15:05:56,297 INFO [train.py:1028] (1/2) Epoch 1, batch 5350, loss[loss=0.8625, simple_loss=0.5207, pruned_loss=0.6022, over 11518.00 frames. ], tot_loss[loss=0.9008, simple_loss=0.5771, pruned_loss=0.6122, over 2574725.47 frames. ], batch size: 16, lr: 3.16e-02, grad_scale: 0.03125 2024-06-19 15:06:01,531 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.631e+03 1.383e+04 1.863e+04 2.866e+04 2.751e+05, threshold=3.726e+04, percent-clipped=14.0 2024-06-19 15:06:03,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9826.666666666666, ans=0.20173333333333332 2024-06-19 15:06:11,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=9826.666666666666, ans=0.125 2024-06-19 15:06:17,747 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=9.729e-02 2024-06-19 15:06:17,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=9845.0, ans=0.025645833333333336 2024-06-19 15:06:19,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=9845.0, ans=0.125 2024-06-19 15:06:24,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=9845.0, ans=0.125 2024-06-19 15:06:27,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=9863.333333333334, ans=0.025 2024-06-19 15:06:32,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=119.11 vs. limit=14.8975 2024-06-19 15:06:35,737 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.30 vs. limit=4.4822500000000005 2024-06-19 15:06:42,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=31.88 vs. limit=14.925 2024-06-19 15:06:43,462 INFO [train.py:1028] (1/2) Epoch 1, batch 5400, loss[loss=0.8705, simple_loss=0.5837, pruned_loss=0.5787, over 12238.00 frames. ], tot_loss[loss=0.8968, simple_loss=0.5758, pruned_loss=0.6089, over 2567294.22 frames. 
], batch size: 241, lr: 3.15e-02, grad_scale: 0.0625 2024-06-19 15:06:59,818 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=26.56 vs. limit=11.219375 2024-06-19 15:06:59,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.78 vs. limit=7.479583333333334 2024-06-19 15:07:00,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=11.219375 2024-06-19 15:07:01,442 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=11.219375 2024-06-19 15:07:02,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=9918.333333333334, ans=0.025340277777777778 2024-06-19 15:07:10,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9936.666666666666, ans=0.20063333333333333 2024-06-19 15:07:14,191 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=26.21 vs. limit=9.9775 2024-06-19 15:07:16,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.48 vs. limit=11.233125 2024-06-19 15:07:21,537 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.12 vs. limit=11.233125 2024-06-19 15:07:23,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=11.24 2024-06-19 15:07:23,248 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.60 vs. limit=14.98 2024-06-19 15:07:27,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=34.16 vs. limit=11.24 2024-06-19 15:07:28,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=9973.333333333334, ans=0.00870144927536232 2024-06-19 15:07:29,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9973.333333333334, ans=0.20026666666666665 2024-06-19 15:07:32,066 INFO [train.py:1028] (1/2) Epoch 1, batch 5450, loss[loss=0.9527, simple_loss=0.5842, pruned_loss=0.6606, over 12390.00 frames. ], tot_loss[loss=0.8985, simple_loss=0.5764, pruned_loss=0.6103, over 2570772.08 frames. 
], batch size: 25, lr: 3.15e-02, grad_scale: 0.0625 2024-06-19 15:07:34,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=9991.666666666666, ans=0.1 2024-06-19 15:07:37,614 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.314e+03 9.296e+03 1.443e+04 1.969e+04 7.796e+04, threshold=2.886e+04, percent-clipped=3.0 2024-06-19 15:07:44,595 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=39.79 vs. limit=11.25375 2024-06-19 15:07:51,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=10028.333333333334, ans=0.125 2024-06-19 15:08:03,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=10046.666666666666, ans=0.025 2024-06-19 15:08:04,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=10046.666666666666, ans=0.125 2024-06-19 15:08:07,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=10065.0, ans=0.125 2024-06-19 15:08:17,053 INFO [train.py:1028] (1/2) Epoch 1, batch 5500, loss[loss=0.8915, simple_loss=0.5992, pruned_loss=0.5919, over 12098.00 frames. ], tot_loss[loss=0.8995, simple_loss=0.577, pruned_loss=0.611, over 2562923.25 frames. ], batch size: 240, lr: 3.14e-02, grad_scale: 0.125 2024-06-19 15:08:18,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=11.28125 2024-06-19 15:08:18,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=26.01 vs. limit=11.28125 2024-06-19 15:08:19,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=10083.333333333334, ans=0.09899494936611666 2024-06-19 15:08:37,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10120.0, ans=0.0 2024-06-19 15:08:47,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=10138.333333333334, ans=0.07 2024-06-19 15:09:05,241 INFO [train.py:1028] (1/2) Epoch 1, batch 5550, loss[loss=0.9039, simple_loss=0.5605, pruned_loss=0.6236, over 13302.00 frames. ], tot_loss[loss=0.9017, simple_loss=0.5774, pruned_loss=0.613, over 2566898.04 frames. ], batch size: 43, lr: 3.14e-02, grad_scale: 0.125 2024-06-19 15:09:10,041 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.765e+03 8.437e+03 1.109e+04 1.602e+04 8.377e+04, threshold=2.218e+04, percent-clipped=4.0 2024-06-19 15:09:11,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=10175.0, ans=0.008657608695652174 2024-06-19 15:09:16,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=10193.333333333334, ans=0.125 2024-06-19 15:09:19,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=46.24 vs. 
limit=11.3225 2024-06-19 15:09:24,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.64 vs. limit=15.158750000000001 2024-06-19 15:09:28,987 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=31.71 vs. limit=15.158750000000001 2024-06-19 15:09:34,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=10230.0, ans=0.125 2024-06-19 15:09:39,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.32 vs. limit=11.33625 2024-06-19 15:09:47,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10248.333333333334, ans=0.19751666666666667 2024-06-19 15:09:50,891 INFO [train.py:1028] (1/2) Epoch 1, batch 5600, loss[loss=0.8625, simple_loss=0.5599, pruned_loss=0.5826, over 13258.00 frames. ], tot_loss[loss=0.8984, simple_loss=0.5756, pruned_loss=0.6105, over 2568684.87 frames. ], batch size: 89, lr: 3.13e-02, grad_scale: 0.25 2024-06-19 15:09:52,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=11.35 2024-06-19 15:09:53,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=10266.666666666666, ans=0.5406666666666667 2024-06-19 15:10:03,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=10285.0, ans=0.125 2024-06-19 15:10:08,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.42 vs. limit=11.36375 2024-06-19 15:10:12,211 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.14 vs. limit=7.575833333333334 2024-06-19 15:10:29,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=3.97 vs. limit=10.17 2024-06-19 15:10:32,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10358.333333333334, ans=0.19641666666666668 2024-06-19 15:10:33,533 INFO [train.py:1028] (1/2) Epoch 1, batch 5650, loss[loss=0.8841, simple_loss=0.5922, pruned_loss=0.588, over 12549.00 frames. ], tot_loss[loss=0.9012, simple_loss=0.5763, pruned_loss=0.613, over 2574816.38 frames. ], batch size: 202, lr: 3.13e-02, grad_scale: 0.0625 2024-06-19 15:10:34,909 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.55 vs. limit=15.26875 2024-06-19 15:10:35,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.35 vs. 
limit=11.384375 2024-06-19 15:10:38,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=10358.333333333334, ans=0.125 2024-06-19 15:10:39,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=10358.333333333334, ans=0.04949747468305833 2024-06-19 15:10:40,212 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.543e+03 1.199e+04 1.630e+04 2.347e+04 1.398e+05, threshold=3.261e+04, percent-clipped=27.0 2024-06-19 15:10:42,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=10376.666666666666, ans=0.02343055555555556 2024-06-19 15:10:43,572 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.575e+02 2024-06-19 15:10:45,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=10376.666666666666, ans=8.150666666666666 2024-06-19 15:10:49,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=10395.0, ans=0.0 2024-06-19 15:10:52,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=41.37 vs. limit=8.158000000000001 2024-06-19 15:10:53,022 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.94 vs. limit=15.29625 2024-06-19 15:10:56,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=10413.333333333334, ans=0.19586666666666666 2024-06-19 15:11:01,706 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.40 vs. limit=15.309999999999999 2024-06-19 15:11:11,343 INFO [train.py:1028] (1/2) Epoch 1, batch 5700, loss[loss=0.9619, simple_loss=0.6213, pruned_loss=0.6513, over 13282.00 frames. ], tot_loss[loss=0.8959, simple_loss=0.5737, pruned_loss=0.609, over 2578297.52 frames. ], batch size: 63, lr: 3.12e-02, grad_scale: 0.125 2024-06-19 15:11:14,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=28.34 vs. limit=11.41875 2024-06-19 15:11:26,319 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=1.040e-02 2024-06-19 15:11:31,387 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.79 vs. limit=10.243333333333332 2024-06-19 15:11:35,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.04 vs. limit=5.0 2024-06-19 15:11:36,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.64 vs. limit=11.439375 2024-06-19 15:11:36,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.64 vs. 
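limit=7.626250000000001

Each scaling.py:214 line is a ScheduledFloat reporting its current value: dropout rates, skip probabilities and balancer bounds in this model are piecewise-linear functions of the global batch count rather than constants. The feed_forward1.out_proj.dropout_p entries are consistent with a ramp from 0.3 at batch 0 to 0.1 at batch 20000 (batch_count=10413.33 above gives ans=0.19587, and 0.3 - 0.2 * 10413.33 / 20000 = 0.19587). A minimal sketch of that interpolation, assuming linear segments between breakpoints and not the full class from scaling.py:

    class ScheduledFloat:
        """Float hyperparameter defined by piecewise-linear interpolation
        over the global batch count (a simplified sketch; the real class
        also supports defaults and arithmetic on schedules)."""

        def __init__(self, *points):
            self.points = sorted(points)   # (batch_count, value) pairs
            self.batch_count = 0.0

        def __float__(self):
            pts = self.points
            if self.batch_count <= pts[0][0]:
                return float(pts[0][1])
            if self.batch_count >= pts[-1][0]:
                return float(pts[-1][1])
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= self.batch_count <= x1:
                    t = (self.batch_count - x0) / (x1 - x0)
                    return float(y0 + t * (y1 - y0))

    dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
    dropout_p.batch_count = 10413.333333333334
    print(float(dropout_p))  # ~0.19587, matching the ans= value above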
2024-06-19 15:11:45,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10523.333333333334, ans=0.19476666666666664 2024-06-19 15:11:47,633 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.67 vs. limit=11.44625 2024-06-19 15:11:50,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=10523.333333333334, ans=0.025 2024-06-19 15:11:51,658 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.72 vs. limit=10.270833333333332 2024-06-19 15:11:52,025 INFO [train.py:1028] (1/2) Epoch 1, batch 5750, loss[loss=0.8674, simple_loss=0.5782, pruned_loss=0.5783, over 12742.00 frames. ], tot_loss[loss=0.8956, simple_loss=0.5748, pruned_loss=0.6082, over 2579707.30 frames. ], batch size: 176, lr: 3.12e-02, grad_scale: 0.0625 2024-06-19 15:11:58,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=10541.666666666666, ans=0.125 2024-06-19 15:12:05,626 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.521e+03 1.315e+04 1.752e+04 2.484e+04 1.271e+05, threshold=3.503e+04, percent-clipped=9.0 2024-06-19 15:12:07,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=10560.0, ans=0.008573913043478262 2024-06-19 15:12:15,080 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=30.11 vs. limit=11.466875 2024-06-19 15:12:20,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=10578.333333333334, ans=0.07 2024-06-19 15:12:33,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=10615.0, ans=0.00856195652173913 2024-06-19 15:12:35,016 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.27 vs. limit=11.480625 2024-06-19 15:12:38,468 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=23.14 vs. limit=11.480625 2024-06-19 15:12:41,601 INFO [train.py:1028] (1/2) Epoch 1, batch 5800, loss[loss=0.8639, simple_loss=0.5789, pruned_loss=0.5744, over 12712.00 frames. ], tot_loss[loss=0.8952, simple_loss=0.5762, pruned_loss=0.6071, over 2579713.52 frames. ], batch size: 176, lr: 3.11e-02, grad_scale: 0.125 2024-06-19 15:12:42,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=10633.333333333334, ans=0.09899494936611666 2024-06-19 15:12:44,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=10633.333333333334, ans=0.125 2024-06-19 15:12:50,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.39 vs.
limit=15.48875 2024-06-19 15:13:07,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.41 vs. limit=15.51625 2024-06-19 15:13:23,922 INFO [train.py:1028] (1/2) Epoch 1, batch 5850, loss[loss=0.9016, simple_loss=0.6006, pruned_loss=0.6013, over 12564.00 frames. ], tot_loss[loss=0.9011, simple_loss=0.5797, pruned_loss=0.6112, over 2577597.94 frames. ], batch size: 202, lr: 3.11e-02, grad_scale: 0.0625 2024-06-19 15:13:25,207 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.91 vs. limit=15.54375 2024-06-19 15:13:25,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=10725.0, ans=0.125 2024-06-19 15:13:29,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=10725.0, ans=0.0 2024-06-19 15:13:31,595 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.146e+03 1.160e+04 1.714e+04 2.777e+04 1.818e+05, threshold=3.429e+04, percent-clipped=10.0 2024-06-19 15:13:43,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.12 vs. limit=15.5575 2024-06-19 15:13:47,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10761.666666666666, ans=0.19238333333333335 2024-06-19 15:13:56,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=10780.0, ans=0.125 2024-06-19 15:14:06,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=25.72 vs. limit=11.549375000000001 2024-06-19 15:14:09,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=10798.333333333334, ans=0.125 2024-06-19 15:14:11,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=11.55625 2024-06-19 15:14:11,450 INFO [train.py:1028] (1/2) Epoch 1, batch 5900, loss[loss=0.8611, simple_loss=0.5638, pruned_loss=0.5792, over 13053.00 frames. ], tot_loss[loss=0.9061, simple_loss=0.583, pruned_loss=0.6145, over 2577380.99 frames. ], batch size: 121, lr: 3.10e-02, grad_scale: 0.125 2024-06-19 15:14:19,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=10835.0, ans=0.125 2024-06-19 15:14:22,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=10835.0, ans=0.125 2024-06-19 15:14:26,721 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 15:14:37,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. 
limit=11.57 2024-06-19 15:14:42,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=10871.666666666666, ans=0.025 2024-06-19 15:14:50,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.74 vs. limit=11.576875000000001 2024-06-19 15:14:55,445 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=8.356 2024-06-19 15:15:01,140 INFO [train.py:1028] (1/2) Epoch 1, batch 5950, loss[loss=0.8497, simple_loss=0.5622, pruned_loss=0.5685, over 13111.00 frames. ], tot_loss[loss=0.9094, simple_loss=0.5856, pruned_loss=0.6166, over 2582495.78 frames. ], batch size: 121, lr: 3.10e-02, grad_scale: 0.125 2024-06-19 15:15:01,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=10908.333333333334, ans=0.363625 2024-06-19 15:15:06,774 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.77 vs. limit=11.590625 2024-06-19 15:15:10,674 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.148e+03 1.226e+04 1.519e+04 1.920e+04 4.746e+04, threshold=3.037e+04, percent-clipped=5.0 2024-06-19 15:15:11,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.23 vs. limit=8.370666666666667 2024-06-19 15:15:12,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.77 vs. limit=15.695 2024-06-19 15:15:14,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=11.5975 2024-06-19 15:15:15,635 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.937e+00 2024-06-19 15:15:20,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=11.604375000000001 2024-06-19 15:15:30,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.82 vs. limit=7.740833333333334 2024-06-19 15:15:34,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=10963.333333333334, ans=0.125 2024-06-19 15:15:44,285 INFO [train.py:1028] (1/2) Epoch 1, batch 6000, loss[loss=0.934, simple_loss=0.6232, pruned_loss=0.6223, over 12306.00 frames. ], tot_loss[loss=0.9132, simple_loss=0.5883, pruned_loss=0.619, over 2575466.12 frames. ], batch size: 241, lr: 3.09e-02, grad_scale: 0.25 2024-06-19 15:15:44,286 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 15:15:54,156 INFO [train.py:1060] (1/2) Epoch 1, validation: loss=0.9935, simple_loss=0.6369, pruned_loss=0.6751, over 351949.00 frames. 
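The loss triples printed by train.py fit together: throughout this section the reported loss equals 0.5 * simple_loss + pruned_loss, e.g. the validation entry just above gives 0.5 * 0.6369 + 0.6751 = 0.9935. This is consistent with the pruned-RNN-T recipe, where a cheap "simple" joiner loss is kept at a reduced weight alongside the full pruned loss once warm-up is over; a sketch of the combination (per-frame averages assumed, as in the log lines):

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        # The simple (linear-joiner) RNN-T loss is down-weighted and added
        # to the pruned RNN-T loss; both are per-frame averages here.
        return simple_loss_scale * simple_loss + pruned_loss

    # Validation entry logged at 15:15:54:
    assert abs(combined_loss(0.6369, 0.6751) - 0.9935) < 5e-4

The grad_scale printed on the same lines is the mixed-precision gradient scaler's current scale, not a loss term; its halving and doubling between 0.03125 and 0.25 in this section reflects early-training overflow handling.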
2024-06-19 15:15:54,156 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 16946MB 2024-06-19 15:15:56,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=11000.0, ans=0.5150000000000001 2024-06-19 15:16:02,242 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=8.4 2024-06-19 15:16:04,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=27.62 vs. limit=11.631875 2024-06-19 15:16:14,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=11036.666666666666, ans=0.125 2024-06-19 15:16:15,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=11036.666666666666, ans=0.125 2024-06-19 15:16:20,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=11055.0, ans=0.125 2024-06-19 15:16:31,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=11055.0, ans=0.125 2024-06-19 15:16:43,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=11073.333333333334, ans=0.125 2024-06-19 15:16:44,895 INFO [train.py:1028] (1/2) Epoch 1, batch 6050, loss[loss=0.9427, simple_loss=0.5964, pruned_loss=0.6445, over 12970.00 frames. ], tot_loss[loss=0.9186, simple_loss=0.5912, pruned_loss=0.623, over 2577951.80 frames. ], batch size: 39, lr: 3.09e-02, grad_scale: 0.125 2024-06-19 15:16:53,445 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.967e+03 1.562e+04 2.095e+04 3.031e+04 1.028e+05, threshold=4.190e+04, percent-clipped=23.0 2024-06-19 15:16:55,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=25.28 vs. limit=11.66625 2024-06-19 15:16:55,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=11110.0, ans=0.020375000000000004 2024-06-19 15:17:00,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=11128.333333333334, ans=0.125 2024-06-19 15:17:02,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.91 vs. limit=15.84625 2024-06-19 15:17:08,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=11128.333333333334, ans=0.07 2024-06-19 15:17:22,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.08 vs. limit=11.68 2024-06-19 15:17:32,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=11165.0, ans=0.020145833333333335 2024-06-19 15:17:33,821 INFO [train.py:1028] (1/2) Epoch 1, batch 6100, loss[loss=0.8707, simple_loss=0.5767, pruned_loss=0.5823, over 13115.00 frames. ], tot_loss[loss=0.9227, simple_loss=0.5942, pruned_loss=0.6256, over 2581179.09 frames. 
], batch size: 121, lr: 3.08e-02, grad_scale: 0.25 2024-06-19 15:17:35,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=11183.333333333334, ans=0.125 2024-06-19 15:17:35,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.70 vs. limit=11.69375 2024-06-19 15:17:37,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=11183.333333333334, ans=0.5085833333333334 2024-06-19 15:17:38,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=11183.333333333334, ans=0.020069444444444442 2024-06-19 15:17:44,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=11201.666666666666, ans=0.01999305555555556 2024-06-19 15:17:45,961 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.35 vs. limit=8.480666666666666 2024-06-19 15:17:54,498 WARNING [optim.py:503] (1/2) Scaling gradients by 0.07710374146699905, model_norm_threshold=41897.69140625 2024-06-19 15:17:54,680 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.4.weight with proportion 0.44, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.304e+11, grad_sumsq=7.881e+10, orig_rms_sq=1.654e+00 2024-06-19 15:17:55,144 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.50 vs. limit=15.915 2024-06-19 15:17:59,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.18 vs. limit=10.61 2024-06-19 15:18:03,034 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=8.495333333333335 2024-06-19 15:18:05,305 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.17 vs. limit=15.92875 2024-06-19 15:18:14,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=11256.666666666666, ans=0.5060166666666668 2024-06-19 15:18:15,383 INFO [train.py:1028] (1/2) Epoch 1, batch 6150, loss[loss=0.898, simple_loss=0.6128, pruned_loss=0.5916, over 10870.00 frames. ], tot_loss[loss=0.93, simple_loss=0.5984, pruned_loss=0.6308, over 2579379.26 frames. ], batch size: 304, lr: 3.08e-02, grad_scale: 0.03125 2024-06-19 15:18:17,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=11275.0, ans=0.125 2024-06-19 15:18:18,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.82 vs. 
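limit=7.81875

The optim.py warnings come in two kinds. The periodic optim.py:487 lines report the min/25%/50%/75%/max of recently seen gradient norms, and the clipping threshold is consistently Clipping_scale times the median (for the 15:18:28 entry just below, 2.0 * 1.990e+04 = 3.980e+04). The rarer optim.py:503/575 pair (15:17:54 just above) fires when the current norm exceeds the threshold: gradients are scaled down by threshold / norm, and the parameter contributing most to the squared norm is named. A sketch of that bookkeeping, hedged from the ScaledAdam optimizer these recipes presumably use (window size and names are assumptions):

    import torch

    def clip_scale(grad_norm: float, recent_norms: torch.Tensor,
                   clipping_scale: float = 2.0) -> float:
        # Threshold = clipping_scale * median of recent gradient norms,
        # i.e. the middle value of the five logged "quartiles".
        threshold = clipping_scale * recent_norms.median().item()
        # Returns the factor applied to the gradients this step; values
        # below 1.0 correspond to "Scaling gradients by ..." warnings
        # and count toward the percent-clipped statistic.
        return min(1.0, threshold / (grad_norm + 1.0e-20))

    # 15:18:28 below: quartiles 4.390e+03 1.395e+04 1.990e+04 2.832e+04
    # 5.434e+05 give threshold 3.980e+04, so the largest norm in that
    # window would be scaled by roughly 0.073.
    norms = torch.tensor([4.390e+03, 1.395e+04, 1.990e+04, 2.832e+04, 5.434e+05])
    print(clip_scale(5.434e+05, norms))  # ~0.0732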
2024-06-19 15:18:20,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=11275.0, ans=0.125 2024-06-19 15:18:25,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=11293.333333333334, ans=0.3694 2024-06-19 15:18:27,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.36 vs. limit=15.97 2024-06-19 15:18:28,101 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.390e+03 1.395e+04 1.990e+04 2.832e+04 5.434e+05, threshold=3.980e+04, percent-clipped=9.0 2024-06-19 15:18:30,304 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.31 vs. limit=4.694 2024-06-19 15:18:32,833 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=8.524666666666667 2024-06-19 15:18:34,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=27.87 vs. limit=11.741875 2024-06-19 15:18:49,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=11330.0, ans=0.125 2024-06-19 15:18:51,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=61.67 vs. limit=11.755625 2024-06-19 15:19:07,376 INFO [train.py:1028] (1/2) Epoch 1, batch 6200, loss[loss=1.023, simple_loss=0.6742, pruned_loss=0.6862, over 13257.00 frames. ], tot_loss[loss=0.9381, simple_loss=0.6028, pruned_loss=0.6367, over 2576501.53 frames. ], batch size: 89, lr: 3.07e-02, grad_scale: 0.0625 2024-06-19 15:19:11,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=34.54 vs. limit=16.025 2024-06-19 15:19:17,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=11385.0, ans=0.125 2024-06-19 15:19:26,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=11403.333333333334, ans=0.019152777777777776 2024-06-19 15:19:36,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=11421.666666666666, ans=10.0 2024-06-19 15:19:58,235 INFO [train.py:1028] (1/2) Epoch 1, batch 6250, loss[loss=0.9979, simple_loss=0.6413, pruned_loss=0.6772, over 13255.00 frames. ], tot_loss[loss=0.9431, simple_loss=0.6053, pruned_loss=0.6405, over 2570749.42 frames. ], batch size: 83, lr: 3.07e-02, grad_scale: 0.0625 2024-06-19 15:19:59,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.89 vs. limit=5.0 2024-06-19 15:20:02,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.53 vs.
limit=16.09375
2024-06-19 15:20:02,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=11458.333333333334, ans=0.0
2024-06-19 15:20:06,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=11476.666666666666, ans=0.125
2024-06-19 15:20:09,864 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.563e+03 7.199e+03 9.948e+03 1.234e+04 7.441e+04, threshold=1.990e+04, percent-clipped=1.0
2024-06-19 15:20:17,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=11495.0, ans=0.125
2024-06-19 15:20:18,108 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=11.810625
2024-06-19 15:20:27,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=11513.333333333334, ans=0.4970333333333334
2024-06-19 15:20:31,589 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=11.824375
2024-06-19 15:20:32,370 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.55 vs. limit=11.824375
2024-06-19 15:20:34,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=11531.666666666666, ans=0.125
2024-06-19 15:20:40,890 INFO [train.py:1028] (1/2) Epoch 1, batch 6300, loss[loss=0.9824, simple_loss=0.597, pruned_loss=0.6839, over 11620.00 frames. ], tot_loss[loss=0.947, simple_loss=0.6077, pruned_loss=0.6431, over 2566081.39 frames. ], batch size: 16, lr: 3.06e-02, grad_scale: 0.125
2024-06-19 15:20:46,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=11.83125
2024-06-19 15:20:56,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=39.47 vs. limit=11.838125
2024-06-19 15:20:58,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.55 vs. limit=10.793333333333333
2024-06-19 15:21:08,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=11605.0, ans=0.18395
2024-06-19 15:21:09,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=11605.0, ans=0.125
2024-06-19 15:21:12,779 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.12 vs. limit=11.851875
2024-06-19 15:21:14,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.67 vs. limit=11.851875
2024-06-19 15:21:24,898 INFO [train.py:1028] (1/2) Epoch 1, batch 6350, loss[loss=0.9945, simple_loss=0.6645, pruned_loss=0.6622, over 12469.00 frames. ], tot_loss[loss=0.9559, simple_loss=0.612, pruned_loss=0.6499, over 2574465.49 frames. ], batch size: 202, lr: 3.06e-02, grad_scale: 0.125
2024-06-19 15:21:34,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11660.0, ans=0.1834
2024-06-19 15:21:35,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=11660.0, ans=0.125
2024-06-19 15:21:35,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=11660.0, ans=0.008334782608695652
2024-06-19 15:21:36,502 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.593e+03 9.663e+03 1.217e+04 1.640e+04 5.662e+04, threshold=2.434e+04, percent-clipped=17.0
2024-06-19 15:21:37,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.85 vs. limit=11.872499999999999
2024-06-19 15:21:40,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.03 vs. limit=11.879375
2024-06-19 15:21:42,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.57 vs. limit=16.25875
2024-06-19 15:21:43,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.16 vs. limit=11.879375
2024-06-19 15:21:50,840 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=32.79 vs. limit=11.879375
2024-06-19 15:21:54,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=11696.666666666666, ans=0.008326811594202899
2024-06-19 15:21:55,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=11696.666666666666, ans=0.008326811594202899
2024-06-19 15:21:57,282 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=29.01 vs. limit=10.848333333333333
2024-06-19 15:22:07,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=11.893125000000001
2024-06-19 15:22:12,046 INFO [train.py:1028] (1/2) Epoch 1, batch 6400, loss[loss=0.9671, simple_loss=0.6235, pruned_loss=0.6553, over 13225.00 frames. ], tot_loss[loss=0.9639, simple_loss=0.6164, pruned_loss=0.6557, over 2576078.86 frames. ], batch size: 67, lr: 3.05e-02, grad_scale: 0.25
2024-06-19 15:22:20,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=11751.666666666666, ans=0.125
2024-06-19 15:22:37,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=11770.0, ans=0.0
2024-06-19 15:22:52,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=11806.666666666666, ans=0.125
2024-06-19 15:22:59,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=26.85 vs. limit=16.355
2024-06-19 15:23:00,986 INFO [train.py:1028] (1/2) Epoch 1, batch 6450, loss[loss=0.9322, simple_loss=0.6254, pruned_loss=0.6195, over 12547.00 frames. ], tot_loss[loss=0.97, simple_loss=0.6205, pruned_loss=0.6597, over 2581727.44 frames. ], batch size: 202, lr: 3.05e-02, grad_scale: 0.25
2024-06-19 15:23:06,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.64 vs. limit=10.9125
2024-06-19 15:23:07,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.77 vs. limit=16.36875
2024-06-19 15:23:08,838 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=42.93 vs. limit=16.3825
2024-06-19 15:23:09,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11843.333333333334, ans=0.18156666666666665
2024-06-19 15:23:12,650 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.158e+03 1.193e+04 1.508e+04 2.229e+04 6.080e+04, threshold=3.015e+04, percent-clipped=19.0
2024-06-19 15:23:30,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=11898.333333333334, ans=0.4835583333333333
2024-06-19 15:23:36,946 INFO [train.py:1028] (1/2) Epoch 1, batch 6500, loss[loss=0.9322, simple_loss=0.6341, pruned_loss=0.6151, over 10723.00 frames. ], tot_loss[loss=0.9753, simple_loss=0.6239, pruned_loss=0.6633, over 2584810.16 frames. ], batch size: 303, lr: 3.04e-02, grad_scale: 0.25
2024-06-19 15:23:44,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.87 vs. limit=7.979166666666666
2024-06-19 15:23:45,226 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.07 vs. limit=7.983750000000001
2024-06-19 15:24:04,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=27.59 vs. limit=11.989374999999999
2024-06-19 15:24:09,109 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.01 vs. limit=8.788666666666668
2024-06-19 15:24:11,792 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.604e+03
2024-06-19 15:24:19,015 INFO [train.py:1028] (1/2) Epoch 1, batch 6550, loss[loss=1.029, simple_loss=0.6338, pruned_loss=0.7126, over 12544.00 frames. ], tot_loss[loss=0.9763, simple_loss=0.6243, pruned_loss=0.6641, over 2587132.35 frames. ], batch size: 22, lr: 3.04e-02, grad_scale: 0.125
2024-06-19 15:24:27,226 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=38.96 vs. limit=12.01
2024-06-19 15:24:38,492 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.576e+03 1.680e+04 2.097e+04 2.799e+04 9.664e+04, threshold=4.195e+04, percent-clipped=17.0
2024-06-19 15:24:41,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=23.08 vs. limit=12.016874999999999
2024-06-19 15:24:46,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=12045.0, ans=0.00825108695652174
2024-06-19 15:24:58,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=12081.666666666666, ans=0.125
2024-06-19 15:25:00,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=12081.666666666666, ans=0.125
2024-06-19 15:25:06,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=12100.0, ans=0.05
2024-06-19 15:25:07,735 INFO [train.py:1028] (1/2) Epoch 1, batch 6600, loss[loss=0.947, simple_loss=0.6048, pruned_loss=0.6445, over 13219.00 frames. ], tot_loss[loss=0.9782, simple_loss=0.6255, pruned_loss=0.6654, over 2589414.03 frames. ], batch size: 72, lr: 3.03e-02, grad_scale: 0.25
2024-06-19 15:25:39,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=12155.0, ans=0.0
2024-06-19 15:25:42,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=12155.0, ans=0.01602083333333334
2024-06-19 15:25:49,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.64 vs. limit=16.630000000000003
2024-06-19 15:25:50,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=12173.333333333334, ans=0.125
2024-06-19 15:25:51,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.52 vs. limit=8.869333333333334
2024-06-19 15:25:52,788 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=8.650e-02
2024-06-19 15:25:54,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=12191.666666666666, ans=0.025
2024-06-19 15:25:55,420 INFO [train.py:1028] (1/2) Epoch 1, batch 6650, loss[loss=0.9911, simple_loss=0.6512, pruned_loss=0.6655, over 12905.00 frames. ], tot_loss[loss=0.9843, simple_loss=0.6292, pruned_loss=0.6697, over 2583393.80 frames. ], batch size: 158, lr: 3.03e-02, grad_scale: 0.125
2024-06-19 15:25:56,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12191.666666666666, ans=0.17808333333333334
2024-06-19 15:25:56,655 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.96 vs. limit=12.071875
2024-06-19 15:26:05,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=12210.0, ans=0.125
2024-06-19 15:26:05,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=12210.0, ans=0.125
2024-06-19 15:26:06,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.97 vs. limit=12.07875
2024-06-19 15:26:08,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=12.07875
2024-06-19 15:26:10,352 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.596e+03 1.318e+04 1.703e+04 2.370e+04 1.907e+05, threshold=3.406e+04, percent-clipped=8.0
2024-06-19 15:26:22,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.99 vs. limit=12.092500000000001
2024-06-19 15:26:23,531 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.97 vs. limit=4.837
2024-06-19 15:26:25,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=12246.666666666666, ans=0.07
2024-06-19 15:26:39,841 INFO [train.py:1028] (1/2) Epoch 1, batch 6700, loss[loss=0.9706, simple_loss=0.6455, pruned_loss=0.6478, over 12832.00 frames. ], tot_loss[loss=0.9873, simple_loss=0.6315, pruned_loss=0.6716, over 2583290.29 frames. ], batch size: 177, lr: 3.02e-02, grad_scale: 0.25
2024-06-19 15:26:41,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.93 vs. limit=4.8425
2024-06-19 15:26:42,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=12283.333333333334, ans=16.7125
2024-06-19 15:27:08,876 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-19 15:27:12,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=12338.333333333334, ans=0.015256944444444441
2024-06-19 15:27:12,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=12338.333333333334, ans=0.015256944444444441
2024-06-19 15:27:14,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=12338.333333333334, ans=0.46815833333333334
2024-06-19 15:27:18,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=12356.666666666666, ans=0.015180555555555558
2024-06-19 15:27:28,228 INFO [train.py:1028] (1/2) Epoch 1, batch 6750, loss[loss=1.018, simple_loss=0.6821, pruned_loss=0.6767, over 12135.00 frames. ], tot_loss[loss=0.9911, simple_loss=0.6341, pruned_loss=0.6741, over 2577907.64 frames. ], batch size: 241, lr: 3.02e-02, grad_scale: 0.125
2024-06-19 15:27:28,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.35 vs. limit=11.1875
2024-06-19 15:27:33,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=12375.0, ans=0.125
2024-06-19 15:27:41,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.22 vs. limit=16.795
2024-06-19 15:27:42,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=12393.333333333334, ans=0.125
2024-06-19 15:27:43,386 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.471e+03 1.488e+04 2.069e+04 2.751e+04 1.024e+05, threshold=4.138e+04, percent-clipped=14.0
2024-06-19 15:27:54,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=12411.666666666666, ans=0.125
2024-06-19 15:27:56,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=12411.666666666666, ans=0.014951388888888896
2024-06-19 15:27:59,472 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.958e+03
2024-06-19 15:28:06,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=12430.0, ans=0.0
2024-06-19 15:28:07,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=12448.333333333334, ans=0.025
2024-06-19 15:28:12,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.34 vs. limit=16.83625
2024-06-19 15:28:15,954 INFO [train.py:1028] (1/2) Epoch 1, batch 6800, loss[loss=0.9911, simple_loss=0.6333, pruned_loss=0.6745, over 13172.00 frames. ], tot_loss[loss=0.9944, simple_loss=0.6364, pruned_loss=0.6762, over 2579931.93 frames. ], batch size: 67, lr: 3.01e-02, grad_scale: 0.25
2024-06-19 15:28:19,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=12466.666666666666, ans=0.025
2024-06-19 15:28:29,998 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.08 vs. limit=12.181875
2024-06-19 15:28:30,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=12485.0, ans=0.17514999999999997
2024-06-19 15:28:37,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=12503.333333333334, ans=0.07
2024-06-19 15:28:39,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=12503.333333333334, ans=0.01456944444444444
2024-06-19 15:28:51,056 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.57 vs. limit=4.881
2024-06-19 15:28:53,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=12540.0, ans=0.125
2024-06-19 15:29:00,170 INFO [train.py:1028] (1/2) Epoch 1, batch 6850, loss[loss=1.099, simple_loss=0.6974, pruned_loss=0.7499, over 13289.00 frames. ], tot_loss[loss=0.9996, simple_loss=0.6387, pruned_loss=0.6802, over 2583409.69 frames. ], batch size: 63, lr: 3.01e-02, grad_scale: 0.0625
2024-06-19 15:29:17,665 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.85 vs. limit=16.94625
2024-06-19 15:29:18,022 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.490e+03 1.554e+04 2.261e+04 3.516e+04 1.635e+05, threshold=4.522e+04, percent-clipped=18.0
2024-06-19 15:29:20,390 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.46 vs. limit=12.223125
2024-06-19 15:29:28,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=12613.333333333334, ans=0.125
2024-06-19 15:29:30,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.25 vs. limit=8.153333333333334
2024-06-19 15:29:35,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=12631.666666666666, ans=0.008123550724637682
2024-06-19 15:29:36,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=12631.666666666666, ans=0.125
2024-06-19 15:29:36,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=12631.666666666666, ans=0.125
2024-06-19 15:29:41,025 INFO [train.py:1028] (1/2) Epoch 1, batch 6900, loss[loss=1.035, simple_loss=0.6536, pruned_loss=0.708, over 13292.00 frames. ], tot_loss[loss=1.002, simple_loss=0.6407, pruned_loss=0.6816, over 2585126.16 frames. ], batch size: 49, lr: 3.00e-02, grad_scale: 0.125
2024-06-19 15:29:45,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=12650.0, ans=0.125
2024-06-19 15:29:45,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=12650.0, ans=0.125
2024-06-19 15:29:56,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=12668.333333333334, ans=0.01388194444444444
2024-06-19 15:29:58,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. limit=4.90025
2024-06-19 15:30:00,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=12668.333333333334, ans=0.01388194444444444
2024-06-19 15:30:04,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.04 vs. limit=8.171666666666667
2024-06-19 15:30:17,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=12705.0, ans=12.264375000000001
2024-06-19 15:30:18,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=12705.0, ans=0.013729166666666667
2024-06-19 15:30:26,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=12723.333333333334, ans=0.45468333333333333
2024-06-19 15:30:27,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.11 vs. limit=12.27125
2024-06-19 15:30:29,997 INFO [train.py:1028] (1/2) Epoch 1, batch 6950, loss[loss=0.9322, simple_loss=0.5739, pruned_loss=0.6452, over 11902.00 frames. ], tot_loss[loss=1.004, simple_loss=0.6415, pruned_loss=0.6831, over 2579373.15 frames. ], batch size: 17, lr: 3.00e-02, grad_scale: 0.125
2024-06-19 15:30:34,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=12741.666666666666, ans=0.013576388888888895
2024-06-19 15:30:37,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=27.62 vs. limit=17.05625
2024-06-19 15:30:50,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.78 vs. limit=17.083750000000002
2024-06-19 15:30:50,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.05 vs. limit=12.291875000000001
2024-06-19 15:30:50,984 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.023e+03 8.847e+03 1.233e+04 1.711e+04 6.541e+04, threshold=2.466e+04, percent-clipped=2.0
2024-06-19 15:31:09,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12815.0, ans=0.17185
2024-06-19 15:31:12,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=12815.0, ans=0.013270833333333336
2024-06-19 15:31:16,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=12815.0, ans=0.125
2024-06-19 15:31:17,758 INFO [train.py:1028] (1/2) Epoch 1, batch 7000, loss[loss=1.007, simple_loss=0.668, pruned_loss=0.673, over 12946.00 frames. ], tot_loss[loss=1.004, simple_loss=0.6412, pruned_loss=0.6834, over 2575080.52 frames. ], batch size: 158, lr: 2.99e-02, grad_scale: 0.25
2024-06-19 15:31:22,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12833.333333333334, ans=0.17166666666666666
2024-06-19 15:31:26,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=12851.666666666666, ans=0.008075724637681159
2024-06-19 15:31:26,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=18.92 vs. limit=12.319375
2024-06-19 15:31:36,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=32.58 vs. limit=17.1525
2024-06-19 15:31:37,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12870.0, ans=0.17129999999999998
2024-06-19 15:31:42,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=34.63 vs. limit=17.1525
2024-06-19 15:31:51,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=21.18 vs. limit=12.333124999999999
2024-06-19 15:31:54,717 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=20.03 vs. limit=12.34
2024-06-19 15:32:02,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=12925.0, ans=0.447625
2024-06-19 15:32:02,795 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=39.83 vs. limit=11.4625
2024-06-19 15:32:03,226 INFO [train.py:1028] (1/2) Epoch 1, batch 7050, loss[loss=1.02, simple_loss=0.6669, pruned_loss=0.6864, over 12775.00 frames. ], tot_loss[loss=1.01, simple_loss=0.6445, pruned_loss=0.6879, over 2582128.72 frames. ], batch size: 176, lr: 2.99e-02, grad_scale: 0.25
2024-06-19 15:32:10,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.27 vs. limit=8.23125
2024-06-19 15:32:14,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.26 vs. limit=8.235833333333334
2024-06-19 15:32:20,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=12961.666666666666, ans=0.125
2024-06-19 15:32:21,872 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.414e+03 1.230e+04 1.674e+04 2.295e+04 8.542e+04, threshold=3.349e+04, percent-clipped=17.0
2024-06-19 15:32:24,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=33.51 vs. limit=12.360624999999999
2024-06-19 15:32:27,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=12961.666666666666, ans=0.17038333333333333
2024-06-19 15:32:27,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.70 vs. limit=8.240416666666667
2024-06-19 15:32:38,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=12980.0, ans=0.44570000000000004
2024-06-19 15:32:38,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=12980.0, ans=0.125
2024-06-19 15:32:40,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=12.374375
2024-06-19 15:32:43,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.73 vs. limit=17.24875
2024-06-19 15:32:48,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=13016.666666666666, ans=0.025
2024-06-19 15:32:49,133 INFO [train.py:1028] (1/2) Epoch 1, batch 7100, loss[loss=1.054, simple_loss=0.691, pruned_loss=0.7082, over 13180.00 frames. ], tot_loss[loss=1.01, simple_loss=0.6457, pruned_loss=0.6873, over 2574672.60 frames. ], batch size: 112, lr: 2.98e-02, grad_scale: 0.25
2024-06-19 15:32:52,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=13016.666666666666, ans=0.44441666666666674
2024-06-19 15:32:56,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.98 vs. limit=4.9525
2024-06-19 15:32:57,311 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.013e+01
2024-06-19 15:33:05,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=9.214
2024-06-19 15:33:22,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13071.666666666666, ans=0.16928333333333334
2024-06-19 15:33:23,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=13071.666666666666, ans=0.0
2024-06-19 15:33:23,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=21.93 vs. limit=12.401875
2024-06-19 15:33:27,408 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.92 vs. limit=12.401875
2024-06-19 15:33:28,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13071.666666666666, ans=0.16928333333333334
2024-06-19 15:33:29,414 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.90 vs. limit=12.401875
2024-06-19 15:33:40,291 INFO [train.py:1028] (1/2) Epoch 1, batch 7150, loss[loss=1.07, simple_loss=0.7164, pruned_loss=0.7117, over 12530.00 frames. ], tot_loss[loss=1.013, simple_loss=0.6467, pruned_loss=0.6899, over 2572299.88 frames. ], batch size: 202, lr: 2.98e-02, grad_scale: 0.25
2024-06-19 15:33:41,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=13108.333333333334, ans=0.012048611111111107
2024-06-19 15:33:45,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.52 vs. limit=12.415625
2024-06-19 15:33:57,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=38.65 vs. limit=17.35875
2024-06-19 15:33:57,866 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.179e+03 1.152e+04 1.599e+04 2.117e+04 7.402e+04, threshold=3.198e+04, percent-clipped=9.0
2024-06-19 15:34:10,764 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=28.46 vs. limit=17.372500000000002
2024-06-19 15:34:23,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.38 vs. limit=17.4
2024-06-19 15:34:23,982 INFO [train.py:1028] (1/2) Epoch 1, batch 7200, loss[loss=1.037, simple_loss=0.6725, pruned_loss=0.701, over 13201.00 frames. ], tot_loss[loss=1.019, simple_loss=0.6508, pruned_loss=0.6939, over 2577969.58 frames. ], batch size: 112, lr: 2.97e-02, grad_scale: 0.125
2024-06-19 15:34:25,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=13200.0, ans=0.438
2024-06-19 15:34:28,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=13200.0, ans=0.16799999999999998
2024-06-19 15:34:33,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=13218.333333333334, ans=0.07
2024-06-19 15:34:41,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=13236.666666666666, ans=0.011513888888888893
2024-06-19 15:34:45,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=13236.666666666666, ans=0.007992028985507247
2024-06-19 15:34:48,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=13236.666666666666, ans=0.4367166666666667
2024-06-19 15:34:55,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13255.0, ans=0.16745
2024-06-19 15:34:58,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=13273.333333333334, ans=0.0
2024-06-19 15:34:59,969 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=31.01 vs. limit=17.455
2024-06-19 15:35:01,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=13273.333333333334, ans=0.3991
2024-06-19 15:35:06,750 INFO [train.py:1028] (1/2) Epoch 1, batch 7250, loss[loss=1.025, simple_loss=0.6466, pruned_loss=0.7017, over 12946.00 frames. ], tot_loss[loss=1.021, simple_loss=0.6519, pruned_loss=0.6954, over 2578743.40 frames. ], batch size: 36, lr: 2.97e-02, grad_scale: 0.125
2024-06-19 15:35:07,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.13 vs. limit=4.99375
2024-06-19 15:35:09,586 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.613e+01
2024-06-19 15:35:09,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=13291.666666666666, ans=0.125
2024-06-19 15:35:11,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=13291.666666666666, ans=0.011284722222222224
2024-06-19 15:35:11,221 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.879e+01
2024-06-19 15:35:16,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13310.0, ans=0.0
2024-06-19 15:35:19,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=13310.0, ans=0.1669
2024-06-19 15:35:26,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=13310.0, ans=0.125
2024-06-19 15:35:26,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=13310.0, ans=0.125
2024-06-19 15:35:28,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.97 vs. limit=17.4825
2024-06-19 15:35:31,928 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=25.56 vs. limit=12.498125
2024-06-19 15:35:33,244 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.443e+03 1.599e+04 1.986e+04 2.820e+04 1.891e+05, threshold=3.971e+04, percent-clipped=22.0
2024-06-19 15:35:34,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=13328.333333333334, ans=0.399925
2024-06-19 15:35:40,330 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.03 vs. limit=5.002
2024-06-19 15:35:42,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=13346.666666666666, ans=0.025
2024-06-19 15:35:48,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.57 vs. limit=17.52375
2024-06-19 15:35:53,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=13365.0, ans=0.125
2024-06-19 15:35:57,873 INFO [train.py:1028] (1/2) Epoch 1, batch 7300, loss[loss=1.049, simple_loss=0.6559, pruned_loss=0.7207, over 12919.00 frames. ], tot_loss[loss=1.024, simple_loss=0.6544, pruned_loss=0.6967, over 2579045.63 frames. ], batch size: 36, lr: 2.96e-02, grad_scale: 0.25
2024-06-19 15:35:59,384 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=24.18 vs. limit=12.51875
2024-06-19 15:36:00,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=13383.333333333334, ans=0.125
2024-06-19 15:36:05,748 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.52 vs. limit=12.51875
2024-06-19 15:36:07,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.59 vs. limit=8.345833333333333
2024-06-19 15:36:15,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.94 vs. limit=5.01025
2024-06-19 15:36:17,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.72 vs. limit=8.355
2024-06-19 15:36:24,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.52 vs. limit=12.532499999999999
2024-06-19 15:36:31,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.05 vs. limit=17.57875
2024-06-19 15:36:33,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=12.539375
2024-06-19 15:36:37,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.05 vs. limit=12.54625
2024-06-19 15:36:40,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13456.666666666666, ans=0.16543333333333335
2024-06-19 15:36:42,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=13456.666666666666, ans=0.010597222222222223
2024-06-19 15:36:43,981 INFO [train.py:1028] (1/2) Epoch 1, batch 7350, loss[loss=1.104, simple_loss=0.6998, pruned_loss=0.7544, over 13267.00 frames. ], tot_loss[loss=1.023, simple_loss=0.6546, pruned_loss=0.6957, over 2579333.21 frames. ], batch size: 46, lr: 2.96e-02, grad_scale: 0.125
2024-06-19 15:36:53,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.37 vs. limit=17.619999999999997
2024-06-19 15:37:02,635 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.463e+03 1.230e+04 1.765e+04 2.410e+04 1.110e+05, threshold=3.530e+04, percent-clipped=10.0
2024-06-19 15:37:10,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=13530.0, ans=0.007928260869565218
2024-06-19 15:37:17,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=13548.333333333334, ans=12.580625000000001
2024-06-19 15:37:22,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13548.333333333334, ans=0.16451666666666667
2024-06-19 15:37:25,759 INFO [train.py:1028] (1/2) Epoch 1, batch 7400, loss[loss=1.052, simple_loss=0.6745, pruned_loss=0.7147, over 13224.00 frames. ], tot_loss[loss=1.022, simple_loss=0.6544, pruned_loss=0.6945, over 2585111.24 frames. ], batch size: 63, lr: 2.95e-02, grad_scale: 0.25
2024-06-19 15:37:30,969 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=4.490e-02
2024-06-19 15:37:43,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13603.333333333334, ans=0.16396666666666665
2024-06-19 15:37:57,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=13621.666666666666, ans=10.0
2024-06-19 15:38:13,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=13640.0, ans=0.125
2024-06-19 15:38:16,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=34.21 vs. limit=12.621875
2024-06-19 15:38:16,466 INFO [train.py:1028] (1/2) Epoch 1, batch 7450, loss[loss=1.034, simple_loss=0.6403, pruned_loss=0.7137, over 12800.00 frames. ], tot_loss[loss=1.024, simple_loss=0.655, pruned_loss=0.6961, over 2579734.48 frames. ], batch size: 29, lr: 2.95e-02, grad_scale: 0.25
2024-06-19 15:38:26,655 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.49 vs. limit=8.419166666666666
2024-06-19 15:38:34,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=28.90 vs. limit=17.771250000000002
2024-06-19 15:38:39,057 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.142e+03 1.233e+04 1.526e+04 1.898e+04 6.914e+04, threshold=3.052e+04, percent-clipped=5.0
2024-06-19 15:38:41,497 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=12.635625000000001
2024-06-19 15:38:48,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=38.14 vs. limit=12.635625000000001
2024-06-19 15:38:52,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=36.49 vs. limit=12.6425
2024-06-19 15:38:54,381 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.74 vs. limit=12.6425
2024-06-19 15:38:56,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.30 vs. limit=17.785
2024-06-19 15:39:06,939 INFO [train.py:1028] (1/2) Epoch 1, batch 7500, loss[loss=0.9721, simple_loss=0.6583, pruned_loss=0.643, over 10574.00 frames. ], tot_loss[loss=1.028, simple_loss=0.6578, pruned_loss=0.6995, over 2577473.42 frames. ], batch size: 304, lr: 2.94e-02, grad_scale: 0.25
2024-06-19 15:39:07,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=13750.0, ans=0.125
2024-06-19 15:39:11,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.25 vs. limit=11.875
2024-06-19 15:39:11,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=13750.0, ans=0.025
2024-06-19 15:39:13,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=13750.0, ans=0.125
2024-06-19 15:39:14,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=9.507333333333333
2024-06-19 15:39:15,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=13768.333333333334, ans=0.4181083333333333
2024-06-19 15:39:25,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=13786.666666666666, ans=0.4174666666666667
2024-06-19 15:39:32,438 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=22.32 vs. limit=12.676874999999999
2024-06-19 15:39:40,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.95 vs. limit=8.455833333333334
2024-06-19 15:39:42,316 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.01 vs. limit=5.0735
2024-06-19 15:39:47,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=13823.333333333334, ans=0.125
2024-06-19 15:39:49,270 INFO [train.py:1028] (1/2) Epoch 1, batch 7550, loss[loss=0.9744, simple_loss=0.6432, pruned_loss=0.6528, over 12959.00 frames. ], tot_loss[loss=1.026, simple_loss=0.6577, pruned_loss=0.6975, over 2577958.01 frames. ], batch size: 158, lr: 2.94e-02, grad_scale: 0.125
2024-06-19 15:39:53,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.87 vs. limit=8.460416666666667
2024-06-19 15:39:56,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=22.39 vs. limit=11.920833333333333
2024-06-19 15:40:01,502 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=5.024e-03
2024-06-19 15:40:05,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=13860.0, ans=0.125
2024-06-19 15:40:10,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.31 vs. limit=17.908749999999998
2024-06-19 15:40:11,781 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.004e+03 8.419e+03 1.080e+04 1.610e+04 7.338e+04, threshold=2.161e+04, percent-clipped=5.0
2024-06-19 15:40:14,476 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=16.62 vs. limit=8.474166666666667
2024-06-19 15:40:23,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=13915.0, ans=0.125
2024-06-19 15:40:26,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=13915.0, ans=0.125
2024-06-19 15:40:31,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.99 vs. limit=17.93625
2024-06-19 15:40:32,642 INFO [train.py:1028] (1/2) Epoch 1, batch 7600, loss[loss=1.054, simple_loss=0.6631, pruned_loss=0.7228, over 13233.00 frames. ], tot_loss[loss=1.029, simple_loss=0.6588, pruned_loss=0.6994, over 2576172.25 frames. ], batch size: 83, lr: 2.93e-02, grad_scale: 0.25
2024-06-19 15:40:34,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.72 vs. limit=8.483333333333334
2024-06-19 15:40:51,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=13951.666666666666, ans=0.125
2024-06-19 15:40:54,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=15.45 vs. limit=9.580666666666666
2024-06-19 15:40:54,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=13951.666666666666, ans=0.04949747468305833
2024-06-19 15:40:56,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=13970.0, ans=0.007832608695652174
2024-06-19 15:40:56,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.40 vs. limit=12.73875
2024-06-19 15:41:01,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=13970.0, ans=10.0
2024-06-19 15:41:05,740 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=29.03 vs. limit=12.745625
2024-06-19 15:41:07,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=13988.333333333334, ans=0.007828623188405798
2024-06-19 15:41:08,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=13988.333333333334, ans=0.125
2024-06-19 15:41:25,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=14006.666666666666, ans=0.125
2024-06-19 15:41:27,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=14025.0, ans=18.01875
2024-06-19 15:41:27,802 INFO [train.py:1028] (1/2) Epoch 1, batch 7650, loss[loss=1.053, simple_loss=0.6551, pruned_loss=0.7255, over 12920.00 frames. ], tot_loss[loss=1.032, simple_loss=0.6605, pruned_loss=0.7018, over 2572564.70 frames. ], batch size: 33, lr: 2.93e-02, grad_scale: 0.25
2024-06-19 15:41:28,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.89 vs. limit=18.01875
2024-06-19 15:41:29,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=14025.0, ans=0.125
2024-06-19 15:41:35,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=14025.0, ans=0.025
2024-06-19 15:41:39,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=14043.333333333334, ans=0.008152777777777773
2024-06-19 15:41:43,821 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=25.86 vs. limit=12.76625
2024-06-19 15:41:46,840 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=23.54 vs. limit=12.773125
2024-06-19 15:41:51,884 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.008e+03 8.799e+03 1.196e+04 1.710e+04 8.040e+04, threshold=2.393e+04, percent-clipped=12.0
2024-06-19 15:41:55,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14080.0, ans=0.1592
2024-06-19 15:41:58,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=14080.0, ans=0.04949747468305833
2024-06-19 15:42:13,042 INFO [train.py:1028] (1/2) Epoch 1, batch 7700, loss[loss=1.082, simple_loss=0.6958, pruned_loss=0.7345, over 13241.00 frames. ], tot_loss[loss=1.033, simple_loss=0.6615, pruned_loss=0.7023, over 2569041.10 frames. ], batch size: 63, lr: 2.92e-02, grad_scale: 0.5
2024-06-19 15:42:27,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.78 vs. limit=12.800625
2024-06-19 15:42:39,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=14171.666666666666, ans=0.125
2024-06-19 15:42:49,645 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.35 vs. limit=12.82125
2024-06-19 15:42:52,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=20.81 vs. limit=12.82125
2024-06-19 15:42:53,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14190.0, ans=0.15810000000000002
2024-06-19 15:42:55,709 INFO [train.py:1028] (1/2) Epoch 1, batch 7750, loss[loss=1.053, simple_loss=0.6789, pruned_loss=0.7134, over 13212.00 frames. ], tot_loss[loss=1.031, simple_loss=0.6623, pruned_loss=0.7002, over 2573570.16 frames. ], batch size: 72, lr: 2.92e-02, grad_scale: 0.25
2024-06-19 15:43:02,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=14208.333333333334, ans=0.125
2024-06-19 15:43:07,658 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=22.53 vs. limit=12.835
2024-06-19 15:43:22,973 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.90 vs. limit=18.18375
2024-06-19 15:43:25,850 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.996e+03 1.334e+04 1.762e+04 2.386e+04 7.682e+04, threshold=3.523e+04, percent-clipped=23.0
2024-06-19 15:43:33,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=14263.333333333334, ans=0.00723611111111111
2024-06-19 15:43:37,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=14281.666666666666, ans=0.00715972222222222
2024-06-19 15:43:48,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=14281.666666666666, ans=0.00715972222222222
2024-06-19 15:43:49,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=14281.666666666666, ans=10.0
2024-06-19 15:43:52,047 INFO [train.py:1028] (1/2) Epoch 1, batch 7800, loss[loss=1.03, simple_loss=0.6678, pruned_loss=0.6956, over 13135.00 frames. ], tot_loss[loss=1.035, simple_loss=0.6639, pruned_loss=0.7027, over 2577637.56 frames. ], batch size: 95, lr: 2.91e-02, grad_scale: 0.25
2024-06-19 15:43:53,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=14300.0, ans=0.125
2024-06-19 15:43:55,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=14300.0, ans=0.125
2024-06-19 15:44:01,468 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.99 vs. limit=18.23875
2024-06-19 15:44:10,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=14336.666666666666, ans=0.3982166666666668
2024-06-19 15:44:23,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=14355.0, ans=0.125
2024-06-19 15:44:25,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=14373.333333333334, ans=0.025
2024-06-19 15:44:26,088 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=12.89
2024-06-19 15:44:36,050 INFO [train.py:1028] (1/2) Epoch 1, batch 7850, loss[loss=0.9365, simple_loss=0.5733, pruned_loss=0.6499, over 11568.00 frames. ], tot_loss[loss=1.036, simple_loss=0.6646, pruned_loss=0.7041, over 2572308.47 frames. ], batch size: 16, lr: 2.91e-02, grad_scale: 0.125
2024-06-19 15:44:41,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=32.69 vs. limit=12.896875
2024-06-19 15:44:42,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=14391.666666666666, ans=0.007740942028985507
2024-06-19 15:44:42,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=14391.666666666666, ans=0.025
2024-06-19 15:44:44,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.27 vs. limit=18.307499999999997
2024-06-19 15:44:47,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=14410.0, ans=0.007736956521739131
2024-06-19 15:44:57,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=14428.333333333334, ans=0.006548611111111109
2024-06-19 15:44:57,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=26.09 vs. limit=12.910625
2024-06-19 15:44:59,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=14428.333333333334, ans=0.3950083333333333
2024-06-19 15:45:01,346 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.179e+03 9.341e+03 1.250e+04 1.829e+04 9.521e+04, threshold=2.499e+04, percent-clipped=9.0
2024-06-19 15:45:06,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.58 vs. limit=18.335
2024-06-19 15:45:12,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=14465.0, ans=0.007725
2024-06-19 15:45:13,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=14465.0, ans=0.04949747468305833
2024-06-19 15:45:20,049 INFO [train.py:1028] (1/2) Epoch 1, batch 7900, loss[loss=1.031, simple_loss=0.6561, pruned_loss=0.7033, over 13171.00 frames. ], tot_loss[loss=1.035, simple_loss=0.6644, pruned_loss=0.703, over 2571145.89 frames. ], batch size: 77, lr: 2.90e-02, grad_scale: 0.25
2024-06-19 15:45:20,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=14483.333333333334, ans=0.125
2024-06-19 15:45:27,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.78 vs. limit=5.172499999999999
2024-06-19 15:45:28,779 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.90 vs. limit=12.938125
2024-06-19 15:45:29,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.48 vs. limit=18.37625
2024-06-19 15:45:41,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.00 vs. limit=12.945
2024-06-19 15:45:47,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=14538.333333333334, ans=0.007709057971014493
2024-06-19 15:45:48,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14538.333333333334, ans=0.15461666666666665
2024-06-19 15:46:07,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=14556.666666666666, ans=0.125
2024-06-19 15:46:11,426 INFO [train.py:1028] (1/2) Epoch 1, batch 7950, loss[loss=0.9698, simple_loss=0.6582, pruned_loss=0.6407, over 10553.00 frames. ], tot_loss[loss=1.036, simple_loss=0.665, pruned_loss=0.7032, over 2574408.70 frames. ], batch size: 303, lr: 2.90e-02, grad_scale: 0.125
2024-06-19 15:46:23,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=14575.0, ans=0.125
2024-06-19 15:46:29,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.98 vs. limit=8.648333333333333
2024-06-19 15:46:33,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=14611.666666666666, ans=0.125
2024-06-19 15:46:43,114 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.329e+03 1.257e+04 1.902e+04 2.657e+04 1.034e+05, threshold=3.803e+04, percent-clipped=27.0
2024-06-19 15:46:47,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=14630.0, ans=0.125
2024-06-19 15:46:52,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=14648.333333333334, ans=0.125
2024-06-19 15:46:52,492 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.58 vs. limit=12.993125
2024-06-19 15:47:05,798 INFO [train.py:1028] (1/2) Epoch 1, batch 8000, loss[loss=1.053, simple_loss=0.6556, pruned_loss=0.725, over 13096.00 frames. ], tot_loss[loss=1.04, simple_loss=0.6663, pruned_loss=0.7069, over 2573301.13 frames. ], batch size: 30, lr: 2.89e-02, grad_scale: 0.125
2024-06-19 15:47:17,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=14685.0, ans=0.005479166666666667
2024-06-19 15:47:18,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=14685.0, ans=0.38602500000000006
2024-06-19 15:47:26,172 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.57 vs. limit=13.01375
2024-06-19 15:47:28,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=14703.333333333334, ans=0.125
2024-06-19 15:47:36,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.40 vs. limit=18.541249999999998
2024-06-19 15:47:39,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=36.29 vs. limit=13.0275
2024-06-19 15:47:48,030 INFO [train.py:1028] (1/2) Epoch 1, batch 8050, loss[loss=0.9658, simple_loss=0.6255, pruned_loss=0.6531, over 13197.00 frames. ], tot_loss[loss=1.039, simple_loss=0.6661, pruned_loss=0.7063, over 2572720.76 frames. ], batch size: 83, lr: 2.89e-02, grad_scale: 0.125
2024-06-19 15:47:48,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=14758.333333333334, ans=0.007661231884057971
2024-06-19 15:47:54,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=14758.333333333334, ans=0.38345833333333335
2024-06-19 15:47:58,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=17.41 vs. limit=12.388333333333332
2024-06-19 15:48:02,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.29 vs. limit=5.2165
2024-06-19 15:48:04,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=14795.0, ans=0.125
2024-06-19 15:48:05,236 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.38 vs. limit=8.69875
2024-06-19 15:48:15,979 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.894e+03 1.004e+04 1.467e+04 1.959e+04 6.956e+04, threshold=2.934e+04, percent-clipped=6.0
2024-06-19 15:48:24,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=37.21 vs. limit=18.62375
2024-06-19 15:48:29,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.64 vs. limit=13.061875
2024-06-19 15:48:30,715 INFO [train.py:1028] (1/2) Epoch 1, batch 8100, loss[loss=1.039, simple_loss=0.6732, pruned_loss=0.702, over 13158.00 frames. ], tot_loss[loss=1.041, simple_loss=0.6681, pruned_loss=0.7073, over 2576842.32 frames. ], batch size: 112, lr: 2.88e-02, grad_scale: 0.125
2024-06-19 15:48:39,926 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.74 vs. limit=13.06875
2024-06-19 15:48:48,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.92 vs. limit=13.075624999999999
2024-06-19 15:48:53,600 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.14 vs. limit=13.075624999999999
2024-06-19 15:48:56,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. limit=13.0825
2024-06-19 15:49:00,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=14886.666666666666, ans=0.125
2024-06-19 15:49:20,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=14923.333333333334, ans=0.007625362318840579
2024-06-19 15:49:26,600 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=42.53 vs. limit=13.096250000000001
2024-06-19 15:49:27,852 INFO [train.py:1028] (1/2) Epoch 1, batch 8150, loss[loss=0.9916, simple_loss=0.6422, pruned_loss=0.6705, over 13071.00 frames. ], tot_loss[loss=1.046, simple_loss=0.6693, pruned_loss=0.7113, over 2580433.37 frames. ], batch size: 121, lr: 2.88e-02, grad_scale: 0.0625
2024-06-19 15:49:47,071 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.55 vs.
limit=5.2467500000000005 2024-06-19 15:49:47,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=14978.333333333334, ans=0.125 2024-06-19 15:49:56,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=65.26 vs. limit=13.123750000000001 2024-06-19 15:49:56,323 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.247e+03 8.756e+03 1.130e+04 1.576e+04 8.630e+04, threshold=2.259e+04, percent-clipped=4.0 2024-06-19 15:49:56,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=14996.666666666666, ans=0.125 2024-06-19 15:50:12,059 INFO [train.py:1028] (1/2) Epoch 1, batch 8200, loss[loss=1.032, simple_loss=0.6718, pruned_loss=0.6957, over 13191.00 frames. ], tot_loss[loss=1.047, simple_loss=0.6702, pruned_loss=0.7118, over 2583830.16 frames. ], batch size: 112, lr: 2.88e-02, grad_scale: 0.125 2024-06-19 15:50:12,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=15033.333333333334, ans=0.125 2024-06-19 15:50:14,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=15033.333333333334, ans=0.125 2024-06-19 15:50:16,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=15033.333333333334, ans=0.125 2024-06-19 15:50:17,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=15033.333333333334, ans=0.125 2024-06-19 15:50:20,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=15051.666666666666, ans=0.3731916666666667 2024-06-19 15:50:34,378 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.24 vs. limit=12.535 2024-06-19 15:50:48,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=15106.666666666666, ans=0.37126666666666674 2024-06-19 15:50:51,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=15106.666666666666, ans=0.37126666666666674 2024-06-19 15:50:51,726 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.96 vs. limit=8.776666666666667 2024-06-19 15:50:51,881 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.00 vs. 
limit=8.776666666666667 2024-06-19 15:50:53,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=15106.666666666666, ans=0.125 2024-06-19 15:50:53,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=15106.666666666666, ans=0.125 2024-06-19 15:50:55,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=15106.666666666666, ans=0.007585507246376811 2024-06-19 15:50:55,253 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=26.60 vs. limit=18.83 2024-06-19 15:50:55,313 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.61 vs. limit=13.165 2024-06-19 15:50:56,552 INFO [train.py:1028] (1/2) Epoch 1, batch 8250, loss[loss=1.021, simple_loss=0.6558, pruned_loss=0.6926, over 13241.00 frames. ], tot_loss[loss=1.047, simple_loss=0.6704, pruned_loss=0.7118, over 2583553.52 frames. ], batch size: 52, lr: 2.87e-02, grad_scale: 0.125 2024-06-19 15:51:12,149 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.43 vs. limit=8.785833333333333 2024-06-19 15:51:12,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=15161.666666666666, ans=0.125 2024-06-19 15:51:29,249 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.859e+03 1.031e+04 1.279e+04 1.778e+04 6.785e+04, threshold=2.558e+04, percent-clipped=11.0 2024-06-19 15:51:33,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=76.04 vs. limit=13.192499999999999 2024-06-19 15:51:33,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=15180.0, ans=0.025 2024-06-19 15:51:42,062 INFO [train.py:1028] (1/2) Epoch 1, batch 8300, loss[loss=1.012, simple_loss=0.6541, pruned_loss=0.6845, over 13183.00 frames. ], tot_loss[loss=1.043, simple_loss=0.6685, pruned_loss=0.7089, over 2581117.00 frames. ], batch size: 103, lr: 2.87e-02, grad_scale: 0.25 2024-06-19 15:51:44,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=15216.666666666666, ans=0.125 2024-06-19 15:51:46,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.87 vs. limit=18.9125 2024-06-19 15:51:46,068 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.34 vs. limit=8.804166666666667 2024-06-19 15:52:02,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.48 vs. limit=13.213125 2024-06-19 15:52:07,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.77 vs. 
limit=18.939999999999998 2024-06-19 15:52:25,973 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.02 vs. limit=18.9675 2024-06-19 15:52:27,652 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.46 vs. limit=8.8225 2024-06-19 15:52:30,324 INFO [train.py:1028] (1/2) Epoch 1, batch 8350, loss[loss=0.9896, simple_loss=0.6536, pruned_loss=0.6628, over 13178.00 frames. ], tot_loss[loss=1.045, simple_loss=0.6687, pruned_loss=0.7105, over 2580687.93 frames. ], batch size: 112, lr: 2.86e-02, grad_scale: 0.125 2024-06-19 15:52:34,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.34 vs. limit=12.654166666666667 2024-06-19 15:52:45,041 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=18.75 vs. limit=13.247499999999999 2024-06-19 15:52:46,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=15326.666666666666, ans=0.02 2024-06-19 15:52:58,622 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.861e+03 1.233e+04 1.610e+04 2.127e+04 1.379e+05, threshold=3.220e+04, percent-clipped=19.0 2024-06-19 15:53:00,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=15363.333333333334, ans=0.3622833333333333 2024-06-19 15:53:04,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.78 vs. limit=19.036250000000003 2024-06-19 15:53:11,709 INFO [train.py:1028] (1/2) Epoch 1, batch 8400, loss[loss=1.123, simple_loss=0.7004, pruned_loss=0.7731, over 12989.00 frames. ], tot_loss[loss=1.045, simple_loss=0.6688, pruned_loss=0.7108, over 2579039.51 frames. ], batch size: 39, lr: 2.86e-02, grad_scale: 0.25 2024-06-19 15:53:17,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.89 vs. limit=13.275 2024-06-19 15:53:25,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15418.333333333334, ans=0.14581666666666665 2024-06-19 15:53:26,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=39.32 vs. limit=13.281875 2024-06-19 15:53:39,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=15455.0, ans=0.125 2024-06-19 15:53:40,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.34 vs. 
limit=13.295625000000001 2024-06-19 15:53:42,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=15455.0, ans=0.125 2024-06-19 15:53:42,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=15455.0, ans=0.125 2024-06-19 15:53:57,081 INFO [train.py:1028] (1/2) Epoch 1, batch 8450, loss[loss=0.9988, simple_loss=0.6522, pruned_loss=0.6727, over 13189.00 frames. ], tot_loss[loss=1.047, simple_loss=0.6708, pruned_loss=0.7121, over 2580924.46 frames. ], batch size: 112, lr: 2.85e-02, grad_scale: 0.125 2024-06-19 15:53:58,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=18.06 vs. limit=13.309375 2024-06-19 15:54:01,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=15491.666666666666, ans=0.0021180555555555536 2024-06-19 15:54:12,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15510.0, ans=0.1449 2024-06-19 15:54:25,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=15528.333333333334, ans=0.07 2024-06-19 15:54:32,689 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=40.81 vs. limit=13.33 2024-06-19 15:54:35,740 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.624e+03 1.005e+04 1.413e+04 1.774e+04 6.103e+04, threshold=2.826e+04, percent-clipped=13.0 2024-06-19 15:54:45,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.90 vs. limit=19.17375 2024-06-19 15:54:48,423 INFO [train.py:1028] (1/2) Epoch 1, batch 8500, loss[loss=1.098, simple_loss=0.6808, pruned_loss=0.7581, over 12607.00 frames. ], tot_loss[loss=1.05, simple_loss=0.6726, pruned_loss=0.714, over 2578723.08 frames. ], batch size: 29, lr: 2.85e-02, grad_scale: 0.125 2024-06-19 15:55:10,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.97 vs. limit=19.215 2024-06-19 15:55:30,711 INFO [train.py:1028] (1/2) Epoch 1, batch 8550, loss[loss=1.097, simple_loss=0.6764, pruned_loss=0.7588, over 12494.00 frames. ], tot_loss[loss=1.052, simple_loss=0.6728, pruned_loss=0.7161, over 2575414.33 frames. 
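
Note on simple_loss vs. pruned_loss: each batch entry splits the transducer objective into these two terms, and the printed totals are consistent with loss = 0.5 * simple_loss + pruned_loss (e.g. batch 8550 above: 0.5 * 0.6728 + 0.7161 = 1.0525, matching tot_loss 1.052), i.e. the run is past warm-up and the simple (linear-boundary) loss is down-weighted by a scale of 0.5. A sketch of the usual pruned-transducer weighting; the warm-up ramp below is the conventional recipe, treated here as an assumption rather than the exact train.py code:

def combine_losses(simple_loss, pruned_loss, batch_idx_train, warm_step,
                   simple_loss_scale=0.5):
    # Early on, the simple loss carries most of the weight; after warm_step
    # batches the pruned loss dominates and the simple loss remains only as
    # a regularizer with weight simple_loss_scale.
    if batch_idx_train >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        frac = batch_idx_train / warm_step
        s = 1.0 - frac * (1.0 - simple_loss_scale)
        p = 0.1 + 0.9 * frac
    return s * simple_loss + p * pruned_loss

# Reproduces the batch 8550 total (warm_step value illustrative):
print(combine_losses(0.6728, 0.7161, batch_idx_train=8550, warm_step=2000))
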
], batch size: 22, lr: 2.84e-02, grad_scale: 0.125 2024-06-19 15:55:32,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=15675.0, ans=0.351375 2024-06-19 15:55:35,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=15675.0, ans=0.125 2024-06-19 15:55:39,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=15693.333333333334, ans=0.4354 2024-06-19 15:56:01,527 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.252e+03 1.303e+04 1.733e+04 2.383e+04 9.526e+04, threshold=3.465e+04, percent-clipped=13.0 2024-06-19 15:56:04,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.04 vs. limit=8.937083333333334 2024-06-19 15:56:08,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=15748.333333333334, ans=0.025 2024-06-19 15:56:08,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=15748.333333333334, ans=13.405625 2024-06-19 15:56:13,350 INFO [train.py:1028] (1/2) Epoch 1, batch 8600, loss[loss=1.016, simple_loss=0.6564, pruned_loss=0.6874, over 13042.00 frames. ], tot_loss[loss=1.053, simple_loss=0.6728, pruned_loss=0.7166, over 2574535.30 frames. ], batch size: 121, lr: 2.84e-02, grad_scale: 0.25 2024-06-19 15:56:22,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.35 vs. limit=12.8925 2024-06-19 15:56:23,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=15785.0, ans=0.0 2024-06-19 15:56:28,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=15803.333333333334, ans=0.025 2024-06-19 15:56:28,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=15803.333333333334, ans=0.125 2024-06-19 15:56:57,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=13.440000000000001 2024-06-19 15:56:59,376 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.79 vs. limit=12.929166666666667 2024-06-19 15:56:59,800 INFO [train.py:1028] (1/2) Epoch 1, batch 8650, loss[loss=0.9473, simple_loss=0.6154, pruned_loss=0.6396, over 13104.00 frames. ], tot_loss[loss=1.054, simple_loss=0.6731, pruned_loss=0.7173, over 2576992.02 frames. ], batch size: 103, lr: 2.83e-02, grad_scale: 0.0625 2024-06-19 15:57:06,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.06 vs. 
limit=13.446875 2024-06-19 15:57:19,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=15895.0, ans=0.438425 2024-06-19 15:57:31,959 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=1.983e-01 2024-06-19 15:57:32,469 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.893e+03 1.161e+04 1.602e+04 2.552e+04 2.716e+05, threshold=3.204e+04, percent-clipped=15.0 2024-06-19 15:57:41,669 INFO [train.py:1028] (1/2) Epoch 1, batch 8700, loss[loss=1.169, simple_loss=0.7456, pruned_loss=0.7966, over 13155.00 frames. ], tot_loss[loss=1.05, simple_loss=0.6725, pruned_loss=0.714, over 2572672.93 frames. ], batch size: 59, lr: 2.83e-02, grad_scale: 0.125 2024-06-19 15:57:45,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=15950.0, ans=0.34175 2024-06-19 15:58:02,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=15986.666666666666, ans=0.035 2024-06-19 15:58:14,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16023.333333333334, ans=0.13976666666666665 2024-06-19 15:58:22,740 INFO [train.py:1028] (1/2) Epoch 1, batch 8750, loss[loss=1.062, simple_loss=0.6923, pruned_loss=0.7154, over 13108.00 frames. ], tot_loss[loss=1.049, simple_loss=0.6721, pruned_loss=0.7126, over 2567601.56 frames. ], batch size: 121, lr: 2.82e-02, grad_scale: 0.0625 2024-06-19 15:58:27,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.26 vs. limit=19.53125 2024-06-19 15:58:28,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=16041.666666666666, ans=0.13958333333333334 2024-06-19 15:58:30,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=39.77 vs. limit=19.53125 2024-06-19 15:58:32,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.93 vs. limit=19.545 2024-06-19 15:58:43,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.17 vs. limit=13.529375 2024-06-19 15:58:43,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.40 vs. limit=19.55875 2024-06-19 15:58:51,495 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=28.65 vs. 
limit=13.536249999999999 2024-06-19 15:59:01,935 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.830e+03 1.054e+04 1.419e+04 1.760e+04 1.152e+05, threshold=2.839e+04, percent-clipped=3.0 2024-06-19 15:59:02,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=16115.0, ans=0.007366304347826087 2024-06-19 15:59:02,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=16115.0, ans=0.13885 2024-06-19 15:59:07,575 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.08 vs. limit=13.543125 2024-06-19 15:59:08,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=16115.0, ans=0.025 2024-06-19 15:59:11,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=16133.333333333334, ans=0.125 2024-06-19 15:59:12,455 INFO [train.py:1028] (1/2) Epoch 1, batch 8800, loss[loss=1.125, simple_loss=0.7154, pruned_loss=0.7675, over 13302.00 frames. ], tot_loss[loss=1.048, simple_loss=0.6722, pruned_loss=0.7117, over 2572585.39 frames. ], batch size: 72, lr: 2.82e-02, grad_scale: 0.125 2024-06-19 15:59:12,988 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.19 vs. limit=5.42 2024-06-19 15:59:19,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=16151.666666666666, ans=0.007358333333333333 2024-06-19 15:59:20,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16151.666666666666, ans=0.13848333333333335 2024-06-19 15:59:27,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16151.666666666666, ans=0.13848333333333335 2024-06-19 15:59:28,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.65 vs. limit=5.422750000000001 2024-06-19 15:59:37,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.45 vs. limit=13.563749999999999 2024-06-19 15:59:42,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.63 vs. limit=13.570625 2024-06-19 15:59:49,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=16206.666666666666, ans=0.33276666666666677 2024-06-19 15:59:50,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.66 vs. limit=13.5775 2024-06-19 15:59:51,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=16206.666666666666, ans=0.125 2024-06-19 15:59:57,984 INFO [train.py:1028] (1/2) Epoch 1, batch 8850, loss[loss=1.001, simple_loss=0.6663, pruned_loss=0.6676, over 12540.00 frames. 
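
Note on the recurring optim.py:487 warnings: they print five quantiles (min/25%/50%/75%/max) of recently observed gradient norms, and the clipping threshold tracks the median — in the entry just above, threshold=2.839e+04 ≈ 2.0 × 1.419e+04, i.e. Clipping_scale times the median — while percent-clipped is the share of recent batches whose norm exceeded the threshold. A simplified sketch of that scheme (the real optimizer refreshes the quantiles only periodically):

import torch

class QuartileClipper:
    def __init__(self, clipping_scale=2.0, history=128):
        self.clipping_scale = clipping_scale
        self.history = history           # how many recent norms to keep
        self.norms = []
        self.threshold = float("inf")    # no clipping until first refresh

    def refresh(self):
        q = torch.quantile(torch.tensor(self.norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        self.threshold = self.clipping_scale * q[2].item()  # scale * median
        return q                         # the five numbers the log prints

    def clip_scale(self, tot_norm):
        self.norms = (self.norms + [tot_norm])[-self.history:]
        # Factor to multiply gradients by; < 1.0 counts as "clipped".
        return min(1.0, self.threshold / (tot_norm + 1e-20))
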
], tot_loss[loss=1.046, simple_loss=0.6716, pruned_loss=0.7106, over 2561016.64 frames. ], batch size: 202, lr: 2.81e-02, grad_scale: 0.03125 2024-06-19 16:00:02,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=16225.0, ans=0.125 2024-06-19 16:00:06,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16243.333333333334, ans=0.13756666666666667 2024-06-19 16:00:09,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=16243.333333333334, ans=0.125 2024-06-19 16:00:16,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=16261.666666666666, ans=0.125 2024-06-19 16:00:22,081 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=1.048e-02 2024-06-19 16:00:22,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=16280.0, ans=0.0872 2024-06-19 16:00:28,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. limit=13.605 2024-06-19 16:00:32,245 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.916e+03 1.185e+04 1.706e+04 3.037e+04 2.228e+05, threshold=3.412e+04, percent-clipped=29.0 2024-06-19 16:00:34,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=16298.333333333334, ans=0.0 2024-06-19 16:00:39,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=16316.666666666666, ans=0.0 2024-06-19 16:00:40,641 INFO [train.py:1028] (1/2) Epoch 1, batch 8900, loss[loss=1.11, simple_loss=0.6849, pruned_loss=0.7679, over 12857.00 frames. ], tot_loss[loss=1.047, simple_loss=0.6722, pruned_loss=0.7112, over 2559881.76 frames. ], batch size: 33, lr: 2.81e-02, grad_scale: 0.0625 2024-06-19 16:00:42,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=16316.666666666666, ans=0.07 2024-06-19 16:00:48,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=16316.666666666666, ans=0.9131666666666667 2024-06-19 16:00:55,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.65 vs. limit=13.625625 2024-06-19 16:01:16,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=16390.0, ans=0.125 2024-06-19 16:01:26,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=16390.0, ans=0.007306521739130435 2024-06-19 16:01:26,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=16408.333333333332, ans=0.007302536231884059 2024-06-19 16:01:27,474 INFO [train.py:1028] (1/2) Epoch 1, batch 8950, loss[loss=1.033, simple_loss=0.6791, pruned_loss=0.6936, over 12581.00 frames. ], tot_loss[loss=1.053, simple_loss=0.6737, pruned_loss=0.7164, over 2562048.43 frames. 
], batch size: 202, lr: 2.80e-02, grad_scale: 0.0625 2024-06-19 16:01:30,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=16408.333333333332, ans=0.0 2024-06-19 16:01:48,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=92.29 vs. limit=13.666875000000001 2024-06-19 16:01:53,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=16463.333333333332, ans=0.125 2024-06-19 16:02:01,480 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.38 vs. limit=13.673749999999998 2024-06-19 16:02:10,224 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.428e+03 6.767e+03 9.739e+03 1.176e+04 2.110e+04, threshold=1.948e+04, percent-clipped=0.0 2024-06-19 16:02:10,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=16481.666666666668, ans=0.125 2024-06-19 16:02:12,727 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=25.51 vs. limit=13.680625000000001 2024-06-19 16:02:16,785 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=21.68 vs. limit=13.680625000000001 2024-06-19 16:02:18,141 INFO [train.py:1028] (1/2) Epoch 1, batch 9000, loss[loss=1.065, simple_loss=0.6708, pruned_loss=0.7299, over 13245.00 frames. ], tot_loss[loss=1.059, simple_loss=0.676, pruned_loss=0.721, over 2567745.77 frames. ], batch size: 46, lr: 2.80e-02, grad_scale: 0.125 2024-06-19 16:02:18,143 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 16:02:26,803 INFO [train.py:1060] (1/2) Epoch 1, validation: loss=0.9889, simple_loss=0.6323, pruned_loss=0.6727, over 351949.00 frames. 2024-06-19 16:02:26,803 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 16946MB 2024-06-19 16:02:29,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=16500.0, ans=10.0 2024-06-19 16:02:33,434 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=75.33 vs. limit=13.6875 2024-06-19 16:02:34,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=79.26 vs. limit=13.694374999999999 2024-06-19 16:02:38,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=16518.333333333332, ans=0.125 2024-06-19 16:02:44,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.27 vs. limit=13.701250000000002 2024-06-19 16:02:48,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=40.19 vs. 
limit=19.916249999999998 2024-06-19 16:02:50,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16555.0, ans=0.13445000000000001 2024-06-19 16:02:54,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=16555.0, ans=0.125 2024-06-19 16:02:57,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=16573.333333333332, ans=0.125 2024-06-19 16:02:59,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=16573.333333333332, ans=0.125 2024-06-19 16:03:04,185 INFO [train.py:1028] (1/2) Epoch 1, batch 9050, loss[loss=1.096, simple_loss=0.6698, pruned_loss=0.7615, over 11864.00 frames. ], tot_loss[loss=1.061, simple_loss=0.6771, pruned_loss=0.7228, over 2568183.48 frames. ], batch size: 18, lr: 2.80e-02, grad_scale: 0.125 2024-06-19 16:03:07,100 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=9.29 vs. limit=10.636666666666667 2024-06-19 16:03:07,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.08 vs. limit=5.48875 2024-06-19 16:03:08,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=16591.666666666668, ans=0.125 2024-06-19 16:03:11,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=16610.0, ans=0.07 2024-06-19 16:03:12,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=16610.0, ans=0.125 2024-06-19 16:03:15,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.18 vs. limit=13.72875 2024-06-19 16:03:18,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.07 vs. limit=13.305 2024-06-19 16:03:21,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16628.333333333332, ans=0.13371666666666668 2024-06-19 16:03:29,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=16646.666666666668, ans=0.0 2024-06-19 16:03:33,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=16646.666666666668, ans=0.4497 2024-06-19 16:03:35,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=71.01 vs. limit=13.749375 2024-06-19 16:03:36,082 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.640e+03 6.648e+03 9.842e+03 1.562e+04 7.535e+04, threshold=1.968e+04, percent-clipped=14.0 2024-06-19 16:03:37,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=16665.0, ans=0.31672500000000003 2024-06-19 16:03:42,676 INFO [train.py:1028] (1/2) Epoch 1, batch 9100, loss[loss=1.136, simple_loss=0.7164, pruned_loss=0.7781, over 13269.00 frames. 
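
Note on the "Computing validation loss" / "validation: loss=..." pair a few entries back: it is the periodic dev-set pass. The model is switched to eval mode, the whole dev set is scored with the same frame-normalized losses (here 351949.00 frames), and training resumes. A hedged sketch of that loop, with illustrative names rather than the exact train.py code:

import torch

def compute_validation_loss(model, valid_dl, compute_loss):
    # Score the entire dev set once; losses are sums, normalized at the end.
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss_sum, num_frames = compute_loss(model, batch)
            tot_loss += float(loss_sum)
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames   # e.g. the loss=0.9889 over 351949 frames above
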
], tot_loss[loss=1.064, simple_loss=0.6761, pruned_loss=0.7257, over 2569768.99 frames. ], batch size: 72, lr: 2.79e-02, grad_scale: 0.25 2024-06-19 16:03:44,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.15 vs. limit=9.170833333333333 2024-06-19 16:03:46,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=16683.333333333332, ans=0.125 2024-06-19 16:03:50,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=16701.666666666668, ans=0.0 2024-06-19 16:03:53,922 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=5.50525 2024-06-19 16:03:56,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=5.50525 2024-06-19 16:04:00,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=16720.0, ans=0.0 2024-06-19 16:04:02,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=16720.0, ans=0.025 2024-06-19 16:04:11,238 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.43 vs. limit=20.05375 2024-06-19 16:04:22,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.00 vs. limit=10.71 2024-06-19 16:04:22,742 INFO [train.py:1028] (1/2) Epoch 1, batch 9150, loss[loss=1.038, simple_loss=0.6488, pruned_loss=0.7137, over 13153.00 frames. ], tot_loss[loss=1.065, simple_loss=0.6763, pruned_loss=0.7267, over 2570556.80 frames. ], batch size: 77, lr: 2.79e-02, grad_scale: 0.0625 2024-06-19 16:04:28,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=16775.0, ans=0.125 2024-06-19 16:04:33,755 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.85 vs. 
limit=13.7975 2024-06-19 16:04:49,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=16830.0, ans=0.125 2024-06-19 16:04:49,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=16830.0, ans=0.125 2024-06-19 16:04:50,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=16830.0, ans=13.811250000000001 2024-06-19 16:04:58,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=16848.333333333332, ans=0.125 2024-06-19 16:04:59,148 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.461e+03 1.191e+04 1.599e+04 3.022e+04 1.422e+05, threshold=3.198e+04, percent-clipped=43.0 2024-06-19 16:05:03,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=16848.333333333332, ans=0.007206884057971015 2024-06-19 16:05:04,974 INFO [train.py:1028] (1/2) Epoch 1, batch 9200, loss[loss=1.129, simple_loss=0.6967, pruned_loss=0.7806, over 12960.00 frames. ], tot_loss[loss=1.068, simple_loss=0.6766, pruned_loss=0.7293, over 2573879.15 frames. ], batch size: 36, lr: 2.78e-02, grad_scale: 0.125 2024-06-19 16:05:08,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=16866.666666666668, ans=0.0 2024-06-19 16:05:08,977 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=25.85 vs. limit=20.150000000000002 2024-06-19 16:05:10,420 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=80.39 vs. limit=20.150000000000002 2024-06-19 16:05:11,894 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.91 vs. limit=9.221250000000001 2024-06-19 16:05:12,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.50 vs. limit=20.16375 2024-06-19 16:05:22,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=32.01 vs. limit=20.1775 2024-06-19 16:05:23,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=16903.333333333332, ans=0.09899494936611666 2024-06-19 16:05:25,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=16903.333333333332, ans=0.125 2024-06-19 16:05:42,677 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.54 vs. limit=9.235 2024-06-19 16:05:47,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=16940.0, ans=13.8525 2024-06-19 16:05:49,392 INFO [train.py:1028] (1/2) Epoch 1, batch 9250, loss[loss=1.059, simple_loss=0.6673, pruned_loss=0.7249, over 13217.00 frames. ], tot_loss[loss=1.072, simple_loss=0.678, pruned_loss=0.7331, over 2576461.11 frames. 
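
Note on the scaling.py:1023 Whitening entries: they fire when a layer's output covariance drifts too far from isotropic. The metric equals 1.0 for a perfectly "white" (identity-proportional) covariance and grows with eigenvalue spread, and a penalty gradient is applied whenever it exceeds the limit, which is itself scheduled (see the out_whiten.whitening_limit entry above). A sketch of the metric under that definition; treat the exact normalization as an assumption:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (num_frames, num_channels). Returns >= 1.0, with equality iff the
    # per-group covariance is a multiple of the identity.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x.transpose(0, 1)                      # (groups, frames, chans/group)
    x = x - x.mean(dim=1, keepdim=True)
    covar = torch.matmul(x.transpose(1, 2), x) / num_frames
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
    mean_diag_sq = torch.matmul(covar, covar).diagonal(dim1=1, dim2=2).mean()
    return mean_diag_sq / (mean_diag ** 2 + 1e-20)

Under this definition the metric is mean(lambda^2) / mean(lambda)^2 over the covariance eigenvalues, so a value like the metric=92.29 logged earlier signals a few directions dominating that layer's output.
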
], batch size: 67, lr: 2.78e-02, grad_scale: 0.125 2024-06-19 16:05:55,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.62 vs. limit=13.859375 2024-06-19 16:05:56,071 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=22.32 vs. limit=13.859375 2024-06-19 16:06:02,093 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.00 vs. limit=13.86625 2024-06-19 16:06:09,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.80 vs. limit=9.248750000000001 2024-06-19 16:06:13,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=17013.333333333332, ans=0.125 2024-06-19 16:06:28,632 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.331e+03 7.856e+03 9.272e+03 1.375e+04 4.263e+04, threshold=1.854e+04, percent-clipped=4.0 2024-06-19 16:06:34,415 INFO [train.py:1028] (1/2) Epoch 1, batch 9300, loss[loss=1.033, simple_loss=0.6449, pruned_loss=0.7105, over 12931.00 frames. ], tot_loss[loss=1.071, simple_loss=0.6765, pruned_loss=0.7328, over 2572760.20 frames. ], batch size: 39, lr: 2.77e-02, grad_scale: 0.25 2024-06-19 16:06:37,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=17050.0, ans=0.125 2024-06-19 16:06:43,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17068.333333333332, ans=0.1293166666666667 2024-06-19 16:06:49,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17086.666666666668, ans=0.12913333333333332 2024-06-19 16:06:52,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17086.666666666668, ans=0.12913333333333332 2024-06-19 16:06:57,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=17105.0, ans=0.025 2024-06-19 16:07:05,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=31.85 vs. limit=13.92125 2024-06-19 16:07:12,906 INFO [train.py:1028] (1/2) Epoch 1, batch 9350, loss[loss=1.067, simple_loss=0.6542, pruned_loss=0.7394, over 12784.00 frames. ], tot_loss[loss=1.072, simple_loss=0.6773, pruned_loss=0.7331, over 2568462.15 frames. ], batch size: 22, lr: 2.77e-02, grad_scale: 0.25 2024-06-19 16:07:17,998 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.07 vs. 
limit=13.928125000000001 2024-06-19 16:07:21,896 WARNING [optim.py:503] (1/2) Scaling gradients by 0.08550368994474411, model_norm_threshold=18544.259765625 2024-06-19 16:07:22,083 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.4.weight with proportion 0.38, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.794e+10, grad_sumsq=5.411e+09, orig_rms_sq=3.316e+00 2024-06-19 16:07:23,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=17160.0, ans=0.125 2024-06-19 16:07:25,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=38.70 vs. limit=13.934999999999999 2024-06-19 16:07:26,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.81 vs. limit=20.38375 2024-06-19 16:07:28,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=17178.333333333332, ans=0.125 2024-06-19 16:07:29,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=17178.333333333332, ans=0.007135144927536232 2024-06-19 16:07:31,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=17178.333333333332, ans=0.125 2024-06-19 16:07:33,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=17196.666666666668, ans=13.94875 2024-06-19 16:07:34,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=17196.666666666668, ans=0.9219666666666666 2024-06-19 16:07:37,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=17196.666666666668, ans=0.0 2024-06-19 16:07:38,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=17196.666666666668, ans=0.125 2024-06-19 16:07:42,970 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=20.43 vs. limit=13.955625000000001 2024-06-19 16:07:43,788 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=22.02 vs. limit=13.955625000000001 2024-06-19 16:07:45,657 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.189e+03 1.189e+04 1.502e+04 2.131e+04 2.169e+05, threshold=3.004e+04, percent-clipped=36.0 2024-06-19 16:07:46,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=17215.0, ans=0.29747500000000004 2024-06-19 16:07:49,843 INFO [train.py:1028] (1/2) Epoch 1, batch 9400, loss[loss=1.138, simple_loss=0.7118, pruned_loss=0.7823, over 13288.00 frames. ], tot_loss[loss=1.071, simple_loss=0.6777, pruned_loss=0.7321, over 2569083.44 frames. 
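
Note on the optim.py:503/575 pair above: this is the whole-model safeguard. When the total gradient norm exceeds model_norm_threshold, every gradient is multiplied by threshold/total_norm, and the parameter contributing the largest share of the squared norm is named. The numbers are self-consistent: 18544.26 / 0.085504 ≈ 2.17e+05 is the total norm, its square ≈ 4.70e+10 is tot_sumsq, and 1.794e+10 / 4.70e+10 ≈ 0.38 matches the reported proportion. A sketch of that bookkeeping (simplified; per-parameter terms are weighted by parameter RMS, as in the log's grad_sumsq*orig_rms_sq):

import torch

def clip_whole_model(named_grads, named_rms, model_norm_threshold):
    # named_grads: {name: grad tensor}; named_rms: {name: parameter rms (float)}.
    sumsqs = {name: float((g ** 2).sum()) * named_rms[name] ** 2
              for name, g in named_grads.items()}   # grad_sumsq * orig_rms_sq
    tot_sumsq = sum(sumsqs.values())
    tot_norm = tot_sumsq ** 0.5
    if tot_norm > model_norm_threshold:
        scale = model_norm_threshold / tot_norm
        name, dom = max(sumsqs.items(), key=lambda kv: kv[1])
        print(f"Scaling gradients by {scale}, "
              f"model_norm_threshold={model_norm_threshold}")
        print(f"Parameter dominating tot_sumsq {name} "
              f"with proportion {dom / tot_sumsq:.2f}")
        for g in named_grads.values():
            g.mul_(scale)
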
], batch size: 52, lr: 2.76e-02, grad_scale: 0.125 2024-06-19 16:07:59,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=17251.666666666668, ans=0.007119202898550725 2024-06-19 16:08:00,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=17251.666666666668, ans=0.007119202898550725 2024-06-19 16:08:01,846 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.00 vs. limit=13.969375 2024-06-19 16:08:09,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.16 vs. limit=20.4525 2024-06-19 16:08:11,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=17270.0, ans=0.125 2024-06-19 16:08:21,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17288.333333333332, ans=0.12711666666666668 2024-06-19 16:08:27,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.83 vs. limit=13.990000000000002 2024-06-19 16:08:29,395 INFO [train.py:1028] (1/2) Epoch 1, batch 9450, loss[loss=1.041, simple_loss=0.6435, pruned_loss=0.7193, over 12545.00 frames. ], tot_loss[loss=1.069, simple_loss=0.6778, pruned_loss=0.7303, over 2568455.14 frames. ], batch size: 22, lr: 2.76e-02, grad_scale: 0.125 2024-06-19 16:08:42,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=17361.666666666668, ans=0.125 2024-06-19 16:08:44,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=17361.666666666668, ans=0.125 2024-06-19 16:08:44,429 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=36.86 vs. limit=14.010625000000001 2024-06-19 16:08:49,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17361.666666666668, ans=0.125 2024-06-19 16:09:01,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17398.333333333332, ans=0.1260166666666667 2024-06-19 16:09:03,339 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.630e+03 1.501e+04 1.785e+04 2.284e+04 7.471e+04, threshold=3.570e+04, percent-clipped=8.0 2024-06-19 16:09:10,763 INFO [train.py:1028] (1/2) Epoch 1, batch 9500, loss[loss=1.096, simple_loss=0.6756, pruned_loss=0.7578, over 13249.00 frames. ], tot_loss[loss=1.068, simple_loss=0.677, pruned_loss=0.7293, over 2577673.20 frames. ], batch size: 43, lr: 2.76e-02, grad_scale: 0.125 2024-06-19 16:09:14,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=17416.666666666668, ans=0.0 2024-06-19 16:09:22,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=19.71 vs. 
limit=14.038125 2024-06-19 16:09:25,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=17453.333333333332, ans=0.125 2024-06-19 16:09:36,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=17471.666666666668, ans=0.07528333333333331 2024-06-19 16:09:39,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.00 vs. limit=5.6235 2024-06-19 16:09:40,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17490.0, ans=0.12510000000000002 2024-06-19 16:09:44,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=17490.0, ans=0.125 2024-06-19 16:09:46,773 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.340e+01 2024-06-19 16:09:48,225 INFO [train.py:1028] (1/2) Epoch 1, batch 9550, loss[loss=1.038, simple_loss=0.6446, pruned_loss=0.7158, over 12978.00 frames. ], tot_loss[loss=1.069, simple_loss=0.6775, pruned_loss=0.7301, over 2573994.88 frames. ], batch size: 39, lr: 2.75e-02, grad_scale: 0.0625 2024-06-19 16:09:48,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17508.333333333332, ans=0.125 2024-06-19 16:09:59,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=17526.666666666668, ans=0.0 2024-06-19 16:10:09,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17545.0, ans=0.12455000000000002 2024-06-19 16:10:11,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=17545.0, ans=0.0 2024-06-19 16:10:14,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=17545.0, ans=0.125 2024-06-19 16:10:17,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=54.98 vs. limit=20.6725 2024-06-19 16:10:21,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.33 vs. limit=14.08625 2024-06-19 16:10:28,883 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.303e+03 1.002e+04 1.452e+04 2.832e+04 1.209e+05, threshold=2.904e+04, percent-clipped=14.0 2024-06-19 16:10:29,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=17581.666666666668, ans=0.125 2024-06-19 16:10:31,193 INFO [train.py:1028] (1/2) Epoch 1, batch 9600, loss[loss=1.044, simple_loss=0.6791, pruned_loss=0.7048, over 10572.00 frames. ], tot_loss[loss=1.067, simple_loss=0.6769, pruned_loss=0.729, over 2572587.14 frames. 
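
Note on the grad_scale value printed in each batch line (0.25, 0.125, 0.0625, 0.03125 at points in this log): it is the loss scale of fp16 mixed-precision training. It is halved whenever an overflow produces inf/nan gradients and grows back slowly after runs of successful steps, which is why it keeps jumping between small powers of two here. The standard PyTorch pattern (the torch.cuda.amp calls are the real API; the surrounding loop is schematic):

import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)

def train_step(model, batch, optimizer, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=True):
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads, skips the step on inf/nan
    scaler.update()                 # halves the scale on overflow, else grows it
    return scaler.get_scale()       # the grad_scale printed in the log
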
], batch size: 304, lr: 2.75e-02, grad_scale: 0.125 2024-06-19 16:10:34,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17600.0, ans=0.124 2024-06-19 16:10:35,558 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=14.1 2024-06-19 16:10:54,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=122.59 vs. limit=14.120625 2024-06-19 16:10:57,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=19.45 vs. limit=14.120625 2024-06-19 16:11:00,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.02 vs. limit=20.74125 2024-06-19 16:11:03,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=17673.333333333332, ans=0.007027536231884058 2024-06-19 16:11:05,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=17673.333333333332, ans=0.125 2024-06-19 16:11:10,010 INFO [train.py:1028] (1/2) Epoch 1, batch 9650, loss[loss=0.9585, simple_loss=0.6135, pruned_loss=0.6518, over 13105.00 frames. ], tot_loss[loss=1.065, simple_loss=0.6756, pruned_loss=0.7273, over 2562696.32 frames. ], batch size: 132, lr: 2.74e-02, grad_scale: 0.125 2024-06-19 16:11:10,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=31.44 vs. limit=14.134375 2024-06-19 16:11:11,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17691.666666666668, ans=0.12308333333333332 2024-06-19 16:11:12,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=17691.666666666668, ans=0.0 2024-06-19 16:11:15,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=17691.666666666668, ans=0.46537500000000004 2024-06-19 16:11:22,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17710.0, ans=0.12290000000000001 2024-06-19 16:11:27,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=17728.333333333332, ans=0.125 2024-06-19 16:11:30,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=17728.333333333332, ans=0.27950833333333347 2024-06-19 16:11:37,103 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. 
limit=5.662 2024-06-19 16:11:43,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=17765.0, ans=0.27822500000000006 2024-06-19 16:11:45,768 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.317e+03 5.779e+03 6.944e+03 9.194e+03 6.573e+04, threshold=1.389e+04, percent-clipped=2.0 2024-06-19 16:11:47,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=17783.333333333332, ans=0.2775833333333334 2024-06-19 16:11:47,775 INFO [train.py:1028] (1/2) Epoch 1, batch 9700, loss[loss=0.9831, simple_loss=0.6301, pruned_loss=0.668, over 12995.00 frames. ], tot_loss[loss=1.062, simple_loss=0.6735, pruned_loss=0.7249, over 2557534.06 frames. ], batch size: 144, lr: 2.74e-02, grad_scale: 0.25 2024-06-19 16:11:49,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17783.333333333332, ans=0.12216666666666667 2024-06-19 16:11:50,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=14.16875 2024-06-19 16:11:51,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=14.16875 2024-06-19 16:11:52,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=17783.333333333332, ans=0.025 2024-06-19 16:11:56,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=33.99 vs. limit=14.175625 2024-06-19 16:12:00,380 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=11.120666666666668 2024-06-19 16:12:04,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=31.47 vs. limit=20.865000000000002 2024-06-19 16:12:06,309 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.02 vs. limit=20.865000000000002 2024-06-19 16:12:08,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=17820.0, ans=0.125 2024-06-19 16:12:14,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=17838.333333333332, ans=0.006991666666666667 2024-06-19 16:12:20,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=17856.666666666668, ans=0.0 2024-06-19 16:12:20,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=17856.666666666668, ans=0.006987681159420289 2024-06-19 16:12:24,304 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.53 vs. limit=14.19625 2024-06-19 16:12:28,662 INFO [train.py:1028] (1/2) Epoch 1, batch 9750, loss[loss=0.9418, simple_loss=0.6073, pruned_loss=0.6382, over 13095.00 frames. ], tot_loss[loss=1.06, simple_loss=0.6721, pruned_loss=0.7236, over 2554067.92 frames. 
], batch size: 132, lr: 2.73e-02, grad_scale: 0.25 2024-06-19 16:12:29,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=17875.0, ans=0.125 2024-06-19 16:12:30,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.32 vs. limit=14.203125 2024-06-19 16:12:30,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=19.98 vs. limit=14.203125 2024-06-19 16:12:31,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=61.61 vs. limit=14.203125 2024-06-19 16:12:39,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=17893.333333333332, ans=0.0 2024-06-19 16:12:44,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.38 vs. limit=14.216875 2024-06-19 16:12:45,666 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.03 vs. limit=20.93375 2024-06-19 16:12:54,250 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=72.92 vs. limit=14.223749999999999 2024-06-19 16:13:05,898 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.560e+03 7.216e+03 9.938e+03 1.489e+04 7.523e+04, threshold=1.988e+04, percent-clipped=30.0 2024-06-19 16:13:07,531 INFO [train.py:1028] (1/2) Epoch 1, batch 9800, loss[loss=1.096, simple_loss=0.6944, pruned_loss=0.7492, over 12950.00 frames. ], tot_loss[loss=1.061, simple_loss=0.6722, pruned_loss=0.725, over 2545922.28 frames. ], batch size: 39, lr: 2.73e-02, grad_scale: 0.25 2024-06-19 16:13:10,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=17966.666666666668, ans=0.0 2024-06-19 16:13:11,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17966.666666666668, ans=0.12033333333333332 2024-06-19 16:13:12,084 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.39 vs. limit=5.695 2024-06-19 16:13:13,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.93 vs. limit=20.975 2024-06-19 16:13:14,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=52.70 vs. 
limit=14.2375 2024-06-19 16:13:18,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=17985.0, ans=0.125 2024-06-19 16:13:20,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=17985.0, ans=0.006959782608695652 2024-06-19 16:13:20,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=17985.0, ans=0.125 2024-06-19 16:13:20,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.85 vs. limit=5.69775 2024-06-19 16:13:26,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=29.90 vs. limit=14.251249999999999 2024-06-19 16:13:27,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=18003.333333333332, ans=0.006955797101449276 2024-06-19 16:13:33,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.89 vs. limit=21.01625 2024-06-19 16:13:39,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.67 vs. limit=14.265 2024-06-19 16:13:45,287 INFO [train.py:1028] (1/2) Epoch 1, batch 9850, loss[loss=1.018, simple_loss=0.6556, pruned_loss=0.6907, over 12991.00 frames. ], tot_loss[loss=1.057, simple_loss=0.6708, pruned_loss=0.722, over 2538793.00 frames. ], batch size: 102, lr: 2.72e-02, grad_scale: 0.25 2024-06-19 16:13:45,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=43.54 vs. limit=21.04375 2024-06-19 16:13:46,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=18058.333333333332, ans=0.2679583333333334 2024-06-19 16:13:48,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=31.08 vs. limit=14.271875 2024-06-19 16:13:49,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.97 vs. limit=5.70875 2024-06-19 16:13:50,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=18058.333333333332, ans=0.0 2024-06-19 16:13:53,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18076.666666666668, ans=0.1192333333333333 2024-06-19 16:13:54,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=18076.666666666668, ans=0.125 2024-06-19 16:13:55,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=52.04 vs. 
limit=14.278749999999999 2024-06-19 16:14:07,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=18113.333333333332, ans=0.0 2024-06-19 16:14:20,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=18131.666666666668, ans=0.125 2024-06-19 16:14:20,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=18131.666666666668, ans=0.0 2024-06-19 16:14:26,465 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.144e+03 8.179e+03 1.169e+04 1.634e+04 5.486e+04, threshold=2.339e+04, percent-clipped=12.0 2024-06-19 16:14:27,363 INFO [train.py:1028] (1/2) Epoch 1, batch 9900, loss[loss=1.057, simple_loss=0.6601, pruned_loss=0.7265, over 12887.00 frames. ], tot_loss[loss=1.05, simple_loss=0.6684, pruned_loss=0.7162, over 2530912.74 frames. ], batch size: 39, lr: 2.72e-02, grad_scale: 0.25 2024-06-19 16:14:37,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=18168.333333333332, ans=0.125 2024-06-19 16:14:41,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=18186.666666666668, ans=0.025 2024-06-19 16:14:43,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=18186.666666666668, ans=0.0 2024-06-19 16:15:01,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=18223.333333333332, ans=0.125 2024-06-19 16:15:03,367 INFO [train.py:1028] (1/2) Epoch 1, batch 9950, loss[loss=1.158, simple_loss=0.7085, pruned_loss=0.8038, over 12602.00 frames. ], tot_loss[loss=1.043, simple_loss=0.6652, pruned_loss=0.7105, over 2522822.83 frames. ], batch size: 29, lr: 2.72e-02, grad_scale: 0.25 2024-06-19 16:15:05,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=18241.666666666668, ans=0.2615416666666667 2024-06-19 16:15:07,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=18241.666666666668, ans=0.125 2024-06-19 16:15:10,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=18260.0, ans=0.125 2024-06-19 16:15:15,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=18260.0, ans=0.125 2024-06-19 16:15:17,136 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.74 vs. limit=14.13 2024-06-19 16:15:17,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=18260.0, ans=0.0 2024-06-19 16:15:19,840 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=5.74175 2024-06-19 16:15:23,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=23.82 vs. 
limit=14.354375000000001 2024-06-19 16:15:31,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=18296.666666666668, ans=0.006892028985507247 2024-06-19 16:15:33,197 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.05 vs. limit=21.2225 2024-06-19 16:15:34,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=18315.0, ans=0.125 2024-06-19 16:15:38,416 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=1.363e-01 2024-06-19 16:15:40,489 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.487e+03 9.253e+03 1.124e+04 1.542e+04 5.730e+04, threshold=2.249e+04, percent-clipped=7.0 2024-06-19 16:15:41,571 INFO [train.py:1028] (1/2) Epoch 1, batch 10000, loss[loss=1.028, simple_loss=0.6361, pruned_loss=0.7098, over 12458.00 frames. ], tot_loss[loss=1.042, simple_loss=0.6648, pruned_loss=0.7101, over 2486229.08 frames. ], batch size: 22, lr: 2.71e-02, grad_scale: 0.5 2024-06-19 16:15:53,088 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=5.75275 2024-06-19 16:15:54,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=18351.666666666668, ans=0.25769166666666676 2024-06-19 16:16:00,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=18370.0, ans=0.125 2024-06-19 16:16:00,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.55 vs. limit=14.38875 2024-06-19 16:16:08,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=18388.333333333332, ans=0.125 2024-06-19 16:16:10,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=18388.333333333332, ans=0.125 2024-06-19 16:16:13,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=18406.666666666668, ans=14.4025 2024-06-19 16:16:13,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=18406.666666666668, ans=0.125 2024-06-19 16:16:18,502 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.07 vs. limit=14.409375 2024-06-19 16:16:18,746 INFO [train.py:1028] (1/2) Epoch 1, batch 10050, loss[loss=1.045, simple_loss=0.6501, pruned_loss=0.7195, over 12437.00 frames. ], tot_loss[loss=1.038, simple_loss=0.6632, pruned_loss=0.7065, over 2444054.79 frames. 
], batch size: 22, lr: 2.71e-02, grad_scale: 0.125 2024-06-19 16:16:20,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=18425.0, ans=0.125 2024-06-19 16:16:25,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=18425.0, ans=0.0 2024-06-19 16:16:29,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=18443.333333333332, ans=0.2544833333333334 2024-06-19 16:16:31,754 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.60 vs. limit=9.610833333333332 2024-06-19 16:16:36,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=18461.666666666668, ans=0.125 2024-06-19 16:16:43,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.38 vs. limit=14.43 2024-06-19 16:16:45,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.65 vs. limit=9.620000000000001 2024-06-19 16:16:45,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=39.14 vs. limit=21.36 2024-06-19 16:16:47,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=28.63 vs. limit=21.36 2024-06-19 16:16:51,315 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.37 vs. limit=14.436875 2024-06-19 16:16:56,362 INFO [train.py:1028] (1/2) Epoch 1, batch 10100, loss[loss=0.9871, simple_loss=0.6057, pruned_loss=0.6842, over 11859.00 frames. ], tot_loss[loss=1.041, simple_loss=0.6616, pruned_loss=0.7099, over 2425818.86 frames. ], batch size: 17, lr: 2.70e-02, grad_scale: 0.25 2024-06-19 16:16:56,972 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.105e+03 8.445e+03 1.116e+04 1.531e+04 6.312e+04, threshold=2.232e+04, percent-clipped=15.0 2024-06-19 16:17:00,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.68 vs. limit=14.258333333333335 2024-06-19 16:17:02,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=57.48 vs. limit=14.44375 2024-06-19 16:17:03,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.63 vs. limit=21.401249999999997 2024-06-19 16:17:07,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=46.41 vs. limit=14.450624999999999 2024-06-19 16:19:49,340 INFO [train.py:1028] (1/2) Epoch 2, batch 0, loss[loss=0.986, simple_loss=0.6178, pruned_loss=0.6771, over 12928.00 frames. ], tot_loss[loss=0.986, simple_loss=0.6178, pruned_loss=0.6771, over 12928.00 frames. 
], batch size: 36, lr: 2.65e-02, grad_scale: 0.5 2024-06-19 16:19:49,340 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 16:19:56,760 INFO [train.py:1060] (1/2) Epoch 2, validation: loss=1.017, simple_loss=0.6453, pruned_loss=0.694, over 351949.00 frames. 2024-06-19 16:19:56,761 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 16946MB 2024-06-19 16:19:59,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=40.66 vs. limit=21.41225 2024-06-19 16:20:03,162 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=43.72 vs. limit=14.456125 2024-06-19 16:20:08,420 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.60 vs. limit=14.463000000000001 2024-06-19 16:20:08,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=18568.0, ans=0.125 2024-06-19 16:20:18,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=18586.333333333332, ans=0.0 2024-06-19 16:20:23,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=18604.666666666668, ans=0.006825072463768116 2024-06-19 16:20:23,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=18604.666666666668, ans=0.07 2024-06-19 16:20:26,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=14.47675 2024-06-19 16:20:34,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=18623.0, ans=0.07 2024-06-19 16:20:39,847 INFO [train.py:1028] (1/2) Epoch 2, batch 50, loss[loss=1.035, simple_loss=0.6398, pruned_loss=0.7155, over 12685.00 frames. ], tot_loss[loss=0.9795, simple_loss=0.6291, pruned_loss=0.665, over 574725.23 frames. ], batch size: 29, lr: 2.64e-02, grad_scale: 0.125 2024-06-19 16:20:42,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=18641.333333333332, ans=0.125 2024-06-19 16:20:44,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=18641.333333333332, ans=0.125 2024-06-19 16:20:49,785 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=113.16 vs. 
limit=14.497375 2024-06-19 16:20:52,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=18659.666666666668, ans=0.125 2024-06-19 16:20:54,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=18659.666666666668, ans=0.0 2024-06-19 16:20:59,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=18678.0, ans=0.006809130434782609 2024-06-19 16:21:01,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=18678.0, ans=0.006809130434782609 2024-06-19 16:21:04,694 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=54.31 vs. limit=14.511125 2024-06-19 16:21:07,020 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.452e+03 7.112e+03 9.601e+03 1.429e+04 7.351e+04, threshold=1.920e+04, percent-clipped=7.0 2024-06-19 16:21:07,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=19.43 vs. limit=14.511125 2024-06-19 16:21:09,498 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=38.31 vs. limit=21.52225 2024-06-19 16:21:21,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=18714.666666666668, ans=0.0 2024-06-19 16:21:22,938 INFO [train.py:1028] (1/2) Epoch 2, batch 100, loss[loss=1.032, simple_loss=0.6503, pruned_loss=0.7073, over 13258.00 frames. ], tot_loss[loss=0.9736, simple_loss=0.6263, pruned_loss=0.6605, over 1018197.22 frames. ], batch size: 46, lr: 2.64e-02, grad_scale: 0.25 2024-06-19 16:21:24,562 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.12 vs. limit=14.524875 2024-06-19 16:21:26,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=18733.0, ans=0.006797173913043478 2024-06-19 16:21:31,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.32 vs. limit=14.531749999999999 2024-06-19 16:21:35,090 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.63 vs. limit=9.687833333333334 2024-06-19 16:21:38,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=18769.666666666668, ans=0.125 2024-06-19 16:21:40,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18769.666666666668, ans=0.11230333333333331 2024-06-19 16:21:44,595 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.61 vs. limit=21.591 2024-06-19 16:21:48,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=11.04 vs. 
limit=9.697 2024-06-19 16:21:53,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.82 vs. limit=21.60475 2024-06-19 16:21:59,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=11.41 vs. limit=9.706166666666668 2024-06-19 16:21:59,880 INFO [train.py:1028] (1/2) Epoch 2, batch 150, loss[loss=1.012, simple_loss=0.6327, pruned_loss=0.696, over 12600.00 frames. ], tot_loss[loss=0.9798, simple_loss=0.6277, pruned_loss=0.666, over 1365781.01 frames. ], batch size: 29, lr: 2.64e-02, grad_scale: 0.25 2024-06-19 16:22:00,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=18824.666666666668, ans=0.0 2024-06-19 16:22:02,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=18824.666666666668, ans=0.24113666666666667 2024-06-19 16:22:17,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=18861.333333333332, ans=0.125 2024-06-19 16:22:17,568 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.87 vs. limit=21.646 2024-06-19 16:22:19,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=14.573 2024-06-19 16:22:19,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=18861.333333333332, ans=0.025 2024-06-19 16:22:21,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.37 vs. limit=14.573 2024-06-19 16:22:21,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=18861.333333333332, ans=0.23985333333333336 2024-06-19 16:22:22,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=18861.333333333332, ans=0.125 2024-06-19 16:22:26,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.66 vs. limit=21.659750000000003 2024-06-19 16:22:27,826 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.659e+03 6.918e+03 9.642e+03 1.244e+04 3.866e+04, threshold=1.928e+04, percent-clipped=10.0 2024-06-19 16:22:38,044 INFO [train.py:1028] (1/2) Epoch 2, batch 200, loss[loss=0.9271, simple_loss=0.6172, pruned_loss=0.6185, over 12645.00 frames. ], tot_loss[loss=0.9837, simple_loss=0.6291, pruned_loss=0.6691, over 1634815.89 frames. ], batch size: 202, lr: 2.63e-02, grad_scale: 0.25 2024-06-19 16:22:42,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=18916.333333333332, ans=0.2379283333333334 2024-06-19 16:22:44,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.21 vs. limit=21.68725 2024-06-19 16:22:47,148 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.18 vs. 
limit=21.701 2024-06-19 16:22:47,380 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.97 vs. limit=21.701 2024-06-19 16:22:49,852 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.70 vs. limit=21.701 2024-06-19 16:22:51,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18934.666666666668, ans=0.11065333333333333 2024-06-19 16:22:55,684 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.70 vs. limit=14.607375000000001 2024-06-19 16:22:55,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=34.01 vs. limit=21.714750000000002 2024-06-19 16:22:57,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=21.714750000000002 2024-06-19 16:22:57,682 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=11.11 vs. limit=14.4765 2024-06-19 16:23:04,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.52 vs. limit=14.61425 2024-06-19 16:23:07,462 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=20.01 vs. limit=14.621125 2024-06-19 16:23:11,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=18989.666666666668, ans=0.125 2024-06-19 16:23:15,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.78 vs. limit=9.751999999999999 2024-06-19 16:23:15,303 INFO [train.py:1028] (1/2) Epoch 2, batch 250, loss[loss=0.8912, simple_loss=0.5821, pruned_loss=0.6001, over 12983.00 frames. ], tot_loss[loss=0.9805, simple_loss=0.627, pruned_loss=0.667, over 1846659.52 frames. ], batch size: 144, lr: 2.63e-02, grad_scale: 0.25 2024-06-19 16:23:18,960 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.80 vs. limit=5.8512 2024-06-19 16:23:34,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=19026.333333333332, ans=0.125 2024-06-19 16:23:36,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=19026.333333333332, ans=0.125 2024-06-19 16:23:36,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=19026.333333333332, ans=0.0067334057971014495 2024-06-19 16:23:48,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.54 vs. 
limit=14.64175 2024-06-19 16:23:51,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=19063.0, ans=0.94063 2024-06-19 16:23:56,821 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.674e+03 1.038e+04 1.241e+04 1.675e+04 5.031e+04, threshold=2.482e+04, percent-clipped=14.0 2024-06-19 16:24:00,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19081.333333333332, ans=0.10918666666666668 2024-06-19 16:24:08,140 INFO [train.py:1028] (1/2) Epoch 2, batch 300, loss[loss=0.9025, simple_loss=0.5858, pruned_loss=0.6096, over 13184.00 frames. ], tot_loss[loss=0.9783, simple_loss=0.6263, pruned_loss=0.6651, over 2009682.58 frames. ], batch size: 112, lr: 2.62e-02, grad_scale: 0.5 2024-06-19 16:24:11,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.93 vs. limit=21.82475 2024-06-19 16:24:14,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.62 vs. limit=9.774916666666666 2024-06-19 16:24:17,406 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.90 vs. limit=14.66925 2024-06-19 16:24:27,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=14.676124999999999 2024-06-19 16:24:27,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=14.676124999999999 2024-06-19 16:24:34,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.48 vs. limit=21.866 2024-06-19 16:24:40,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19173.0, ans=0.10827 2024-06-19 16:24:46,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=19173.0, ans=0.0 2024-06-19 16:24:47,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.70 vs. limit=14.69675 2024-06-19 16:24:47,595 INFO [train.py:1028] (1/2) Epoch 2, batch 350, loss[loss=1.033, simple_loss=0.6357, pruned_loss=0.7153, over 12860.00 frames. ], tot_loss[loss=0.9781, simple_loss=0.6255, pruned_loss=0.6653, over 2138124.34 frames. ], batch size: 33, lr: 2.62e-02, grad_scale: 0.5 2024-06-19 16:24:58,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.51 vs. 
limit=14.703624999999999 2024-06-19 16:24:58,773 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=8.217e+01 2024-06-19 16:24:58,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=19209.666666666668, ans=0.22766166666666665 2024-06-19 16:25:09,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=113.78 vs. limit=14.7105 2024-06-19 16:25:09,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=19228.0, ans=0.125 2024-06-19 16:25:16,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=19246.333333333332, ans=0.22637833333333335 2024-06-19 16:25:18,842 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.923e+03 6.364e+03 9.500e+03 1.188e+04 3.866e+04, threshold=1.900e+04, percent-clipped=4.0 2024-06-19 16:25:20,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=19264.666666666668, ans=0.09899494936611666 2024-06-19 16:25:22,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19264.666666666668, ans=0.125 2024-06-19 16:25:23,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=19264.666666666668, ans=0.025 2024-06-19 16:25:29,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.88 vs. limit=14.6415 2024-06-19 16:25:29,880 INFO [train.py:1028] (1/2) Epoch 2, batch 400, loss[loss=0.9987, simple_loss=0.6422, pruned_loss=0.6776, over 13277.00 frames. ], tot_loss[loss=0.9774, simple_loss=0.6254, pruned_loss=0.6647, over 2238699.58 frames. ], batch size: 63, lr: 2.61e-02, grad_scale: 0.5 2024-06-19 16:25:36,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=19301.333333333332, ans=14.738 2024-06-19 16:25:42,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=19301.333333333332, ans=0.0 2024-06-19 16:25:47,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=19319.666666666668, ans=0.05 2024-06-19 16:25:54,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=19338.0, ans=0.125 2024-06-19 16:26:03,588 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=80.34 vs. limit=14.758625 2024-06-19 16:26:04,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=19356.333333333332, ans=0.025 2024-06-19 16:26:16,430 INFO [train.py:1028] (1/2) Epoch 2, batch 450, loss[loss=1.004, simple_loss=0.6441, pruned_loss=0.6822, over 13201.00 frames. ], tot_loss[loss=0.9756, simple_loss=0.6255, pruned_loss=0.6628, over 2313044.44 frames. 
], batch size: 67, lr: 2.61e-02, grad_scale: 0.25 2024-06-19 16:26:16,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=19374.666666666668, ans=0.125 2024-06-19 16:26:21,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.19 vs. limit=22.031 2024-06-19 16:26:34,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=19393.0, ans=0.22124500000000014 2024-06-19 16:26:39,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=19411.333333333332, ans=0.125 2024-06-19 16:26:41,997 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=33.32 vs. limit=14.779250000000001 2024-06-19 16:26:44,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=19429.666666666668, ans=0.006645724637681159 2024-06-19 16:26:49,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=19429.666666666668, ans=0.125 2024-06-19 16:26:51,268 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.475e+03 7.250e+03 8.612e+03 1.163e+04 2.828e+04, threshold=1.722e+04, percent-clipped=4.0 2024-06-19 16:27:00,070 INFO [train.py:1028] (1/2) Epoch 2, batch 500, loss[loss=0.8989, simple_loss=0.576, pruned_loss=0.6109, over 13080.00 frames. ], tot_loss[loss=0.9811, simple_loss=0.6271, pruned_loss=0.6676, over 2375066.33 frames. ], batch size: 121, lr: 2.61e-02, grad_scale: 0.5 2024-06-19 16:27:06,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=26.85 vs. limit=22.113500000000002 2024-06-19 16:27:10,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=19484.666666666668, ans=0.025 2024-06-19 16:27:14,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=19503.0, ans=0.006629782608695652 2024-06-19 16:27:24,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=19521.333333333332, ans=0.125 2024-06-19 16:27:25,134 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.88 vs. limit=22.141 2024-06-19 16:27:25,821 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.26 vs. limit=11.808533333333333 2024-06-19 16:27:33,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=19539.666666666668, ans=0.125 2024-06-19 16:27:37,833 INFO [train.py:1028] (1/2) Epoch 2, batch 550, loss[loss=0.9114, simple_loss=0.5938, pruned_loss=0.6145, over 12896.00 frames. ], tot_loss[loss=0.9838, simple_loss=0.6269, pruned_loss=0.6703, over 2419933.46 frames. 
], batch size: 158, lr: 2.60e-02, grad_scale: 0.25 2024-06-19 16:27:40,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19558.0, ans=0.10442000000000001 2024-06-19 16:27:44,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=14.83425 2024-06-19 16:27:51,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=19594.666666666668, ans=0.125 2024-06-19 16:27:57,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=19594.666666666668, ans=0.035 2024-06-19 16:28:06,218 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.060e+03 4.820e+03 7.996e+03 1.122e+04 3.675e+04, threshold=1.599e+04, percent-clipped=8.0 2024-06-19 16:28:09,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=14.86175 2024-06-19 16:28:12,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=19631.333333333332, ans=14.86175 2024-06-19 16:28:14,168 INFO [train.py:1028] (1/2) Epoch 2, batch 600, loss[loss=0.8968, simple_loss=0.5977, pruned_loss=0.5979, over 13064.00 frames. ], tot_loss[loss=0.9823, simple_loss=0.6258, pruned_loss=0.6694, over 2457949.99 frames. ], batch size: 144, lr: 2.60e-02, grad_scale: 0.5 2024-06-19 16:28:19,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=25.30 vs. limit=22.23725 2024-06-19 16:28:19,955 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=14.868625 2024-06-19 16:28:22,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=44.03 vs. limit=14.875499999999999 2024-06-19 16:28:29,783 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=52.39 vs. limit=22.26475 2024-06-19 16:28:33,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=50.45 vs. limit=14.882375 2024-06-19 16:28:42,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=19704.666666666668, ans=0.0 2024-06-19 16:28:42,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=19704.666666666668, ans=0.006585942028985508 2024-06-19 16:28:43,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=19.64 vs. limit=14.88925 2024-06-19 16:28:47,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.98 vs. 
limit=9.93075 2024-06-19 16:28:55,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=19723.0, ans=0.125 2024-06-19 16:28:57,475 INFO [train.py:1028] (1/2) Epoch 2, batch 650, loss[loss=1.002, simple_loss=0.638, pruned_loss=0.6829, over 13308.00 frames. ], tot_loss[loss=0.9843, simple_loss=0.6265, pruned_loss=0.6711, over 2489664.52 frames. ], batch size: 59, lr: 2.59e-02, grad_scale: 0.5 2024-06-19 16:29:02,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19741.333333333332, ans=0.10258666666666669 2024-06-19 16:29:20,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=19778.0, ans=0.125 2024-06-19 16:29:20,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=19778.0, ans=0.0 2024-06-19 16:29:21,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=19778.0, ans=0.0 2024-06-19 16:29:22,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=19796.333333333332, ans=0.0 2024-06-19 16:29:25,202 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=49.39 vs. limit=14.923625000000001 2024-06-19 16:29:29,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=19796.333333333332, ans=0.125 2024-06-19 16:29:32,109 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.048e+03 4.392e+03 5.654e+03 7.717e+03 6.039e+04, threshold=1.131e+04, percent-clipped=2.0 2024-06-19 16:29:33,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=19814.666666666668, ans=0.0 2024-06-19 16:29:34,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.22 vs. limit=14.907333333333334 2024-06-19 16:29:35,125 WARNING [optim.py:503] (1/2) Scaling gradients by 0.05225700885057449, model_norm_threshold=11308.90625 2024-06-19 16:29:35,287 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.4.weight with proportion 0.24, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.123e+10, grad_sumsq=2.701e+09, orig_rms_sq=4.158e+00 2024-06-19 16:29:39,147 INFO [train.py:1028] (1/2) Epoch 2, batch 700, loss[loss=1.01, simple_loss=0.6397, pruned_loss=0.6906, over 13209.00 frames. ], tot_loss[loss=0.9821, simple_loss=0.625, pruned_loss=0.6696, over 2512006.43 frames. ], batch size: 46, lr: 2.59e-02, grad_scale: 0.125 2024-06-19 16:29:41,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=14.937375 2024-06-19 16:29:48,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=19851.333333333332, ans=0.006554057971014493 2024-06-19 16:29:52,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.90 vs. 
limit=14.94425 2024-06-19 16:29:53,050 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.41 vs. limit=14.925666666666666 2024-06-19 16:29:54,061 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.28 vs. limit=14.94425 2024-06-19 16:30:02,602 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=6.182e-02 2024-06-19 16:30:06,625 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=14.958 2024-06-19 16:30:06,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.52 vs. limit=14.958 2024-06-19 16:30:22,360 INFO [train.py:1028] (1/2) Epoch 2, batch 750, loss[loss=1.009, simple_loss=0.6429, pruned_loss=0.6876, over 13288.00 frames. ], tot_loss[loss=0.9862, simple_loss=0.6274, pruned_loss=0.6725, over 2526889.09 frames. ], batch size: 63, lr: 2.59e-02, grad_scale: 0.125 2024-06-19 16:30:23,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=19924.666666666668, ans=0.09899494936611666 2024-06-19 16:30:41,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19961.333333333332, ans=0.125 2024-06-19 16:30:42,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.72 vs. limit=14.9855 2024-06-19 16:30:43,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19961.333333333332, ans=0.10038666666666668 2024-06-19 16:30:44,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=19961.333333333332, ans=0.125 2024-06-19 16:30:57,193 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.384e+03 5.918e+03 7.493e+03 1.144e+04 2.164e+05, threshold=1.499e+04, percent-clipped=25.0 2024-06-19 16:31:00,819 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=81.50 vs. limit=14.99925 2024-06-19 16:31:03,273 INFO [train.py:1028] (1/2) Epoch 2, batch 800, loss[loss=1, simple_loss=0.6217, pruned_loss=0.6892, over 12946.00 frames. ], tot_loss[loss=0.9888, simple_loss=0.6278, pruned_loss=0.6749, over 2539486.03 frames. ], batch size: 36, lr: 2.58e-02, grad_scale: 0.25 2024-06-19 16:31:05,225 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.62 vs. 
limit=15.0 2024-06-19 16:31:21,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20034.666666666668, ans=0.1 2024-06-19 16:31:26,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=20053.0, ans=10.0 2024-06-19 16:31:44,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=20071.333333333332, ans=0.07 2024-06-19 16:31:48,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=20089.666666666668, ans=0.125 2024-06-19 16:31:54,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=20089.666666666668, ans=0.0 2024-06-19 16:31:56,553 INFO [train.py:1028] (1/2) Epoch 2, batch 850, loss[loss=0.9803, simple_loss=0.6211, pruned_loss=0.6698, over 13127.00 frames. ], tot_loss[loss=0.99, simple_loss=0.6277, pruned_loss=0.6762, over 2549511.07 frames. ], batch size: 95, lr: 2.58e-02, grad_scale: 0.25 2024-06-19 16:31:57,065 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=41.05 vs. limit=15.0 2024-06-19 16:32:03,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=131.35 vs. limit=15.0 2024-06-19 16:32:13,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=20144.666666666668, ans=0.006490289855072464 2024-06-19 16:32:32,205 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.045e+03 4.587e+03 6.105e+03 7.839e+03 2.814e+04, threshold=1.221e+04, percent-clipped=5.0 2024-06-19 16:32:39,642 INFO [train.py:1028] (1/2) Epoch 2, batch 900, loss[loss=0.9784, simple_loss=0.6096, pruned_loss=0.6736, over 13249.00 frames. ], tot_loss[loss=0.9879, simple_loss=0.6264, pruned_loss=0.6747, over 2555155.63 frames. ], batch size: 37, lr: 2.57e-02, grad_scale: 0.5 2024-06-19 16:32:40,031 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=65.30 vs. limit=22.5 2024-06-19 16:32:44,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=20199.666666666668, ans=0.125 2024-06-19 16:32:48,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=20218.0, ans=0.025 2024-06-19 16:33:00,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.39 vs. limit=15.0 2024-06-19 16:33:04,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=20236.333333333332, ans=0.025 2024-06-19 16:33:09,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.81 vs. limit=15.0 2024-06-19 16:33:12,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=22.27 vs. 
limit=15.0 2024-06-19 16:33:22,654 INFO [train.py:1028] (1/2) Epoch 2, batch 950, loss[loss=1.036, simple_loss=0.6566, pruned_loss=0.7075, over 12956.00 frames. ], tot_loss[loss=0.9875, simple_loss=0.6277, pruned_loss=0.6737, over 2558417.06 frames. ], batch size: 39, lr: 2.57e-02, grad_scale: 0.5 2024-06-19 16:33:31,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.57 vs. limit=15.0 2024-06-19 16:33:33,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=20309.666666666668, ans=0.0 2024-06-19 16:33:35,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=42.87 vs. limit=15.0 2024-06-19 16:33:52,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.53 vs. limit=15.0 2024-06-19 16:33:59,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=20364.666666666668, ans=0.125 2024-06-19 16:34:00,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.89 vs. limit=15.0 2024-06-19 16:34:03,001 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.503e+03 5.671e+03 7.306e+03 1.007e+04 4.704e+04, threshold=1.461e+04, percent-clipped=15.0 2024-06-19 16:34:05,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=20364.666666666668, ans=0.09899494936611666 2024-06-19 16:34:09,192 INFO [train.py:1028] (1/2) Epoch 2, batch 1000, loss[loss=1.016, simple_loss=0.6465, pruned_loss=0.6923, over 13278.00 frames. ], tot_loss[loss=0.9839, simple_loss=0.6265, pruned_loss=0.6707, over 2560882.50 frames. ], batch size: 49, lr: 2.57e-02, grad_scale: 0.5 2024-06-19 16:34:09,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.65 vs. limit=15.0 2024-06-19 16:34:21,115 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=22.33 vs. limit=15.0 2024-06-19 16:34:30,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=7.94 vs. limit=12.0 2024-06-19 16:34:32,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.91 vs. 
limit=15.0 2024-06-19 16:34:33,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=20419.666666666668, ans=0.125 2024-06-19 16:34:36,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=20419.666666666668, ans=0.125 2024-06-19 16:34:40,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20438.0, ans=0.1 2024-06-19 16:34:41,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20438.0, ans=0.1 2024-06-19 16:34:48,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=20456.333333333332, ans=0.0 2024-06-19 16:34:49,444 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.954e+00 2024-06-19 16:34:50,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=20456.333333333332, ans=0.125 2024-06-19 16:34:54,017 INFO [train.py:1028] (1/2) Epoch 2, batch 1050, loss[loss=0.973, simple_loss=0.6179, pruned_loss=0.6641, over 13230.00 frames. ], tot_loss[loss=0.9863, simple_loss=0.6272, pruned_loss=0.6727, over 2564909.20 frames. ], batch size: 77, lr: 2.56e-02, grad_scale: 0.5 2024-06-19 16:35:03,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=20493.0, ans=0.2 2024-06-19 16:35:10,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.52 vs. limit=15.0 2024-06-19 16:35:17,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.32 vs. limit=15.0 2024-06-19 16:35:28,186 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+03 3.650e+03 4.678e+03 6.104e+03 1.817e+04, threshold=9.355e+03, percent-clipped=1.0 2024-06-19 16:35:31,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.04 vs. limit=15.0 2024-06-19 16:35:33,626 INFO [train.py:1028] (1/2) Epoch 2, batch 1100, loss[loss=0.9481, simple_loss=0.6031, pruned_loss=0.6466, over 13296.00 frames. ], tot_loss[loss=0.9897, simple_loss=0.6282, pruned_loss=0.6755, over 2570360.43 frames. ], batch size: 52, lr: 2.56e-02, grad_scale: 1.0 2024-06-19 16:35:35,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=20566.333333333332, ans=0.125 2024-06-19 16:36:19,482 INFO [train.py:1028] (1/2) Epoch 2, batch 1150, loss[loss=1.028, simple_loss=0.6453, pruned_loss=0.705, over 13225.00 frames. ], tot_loss[loss=0.9922, simple_loss=0.6291, pruned_loss=0.6776, over 2571257.38 frames. 
], batch size: 52, lr: 2.55e-02, grad_scale: 0.5 2024-06-19 16:37:01,175 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.875e+03 3.546e+03 4.246e+03 5.312e+03 1.969e+04, threshold=8.492e+03, percent-clipped=6.0 2024-06-19 16:37:02,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=20731.333333333332, ans=0.006362753623188406 2024-06-19 16:37:03,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=20731.333333333332, ans=0.0 2024-06-19 16:37:05,417 INFO [train.py:1028] (1/2) Epoch 2, batch 1200, loss[loss=0.9996, simple_loss=0.6398, pruned_loss=0.6796, over 13228.00 frames. ], tot_loss[loss=0.9907, simple_loss=0.6291, pruned_loss=0.6761, over 2573292.05 frames. ], batch size: 77, lr: 2.55e-02, grad_scale: 1.0 2024-06-19 16:37:11,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=20749.666666666668, ans=0.0 2024-06-19 16:37:15,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=20768.0, ans=0.125 2024-06-19 16:37:16,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.06 vs. limit=10.0 2024-06-19 16:37:19,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=20786.333333333332, ans=0.125 2024-06-19 16:37:20,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=20786.333333333332, ans=0.125 2024-06-19 16:37:27,030 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.43 vs. limit=22.5 2024-06-19 16:37:35,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=20804.666666666668, ans=0.0 2024-06-19 16:37:41,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=20823.0, ans=15.0 2024-06-19 16:37:44,645 INFO [train.py:1028] (1/2) Epoch 2, batch 1250, loss[loss=0.9457, simple_loss=0.6107, pruned_loss=0.6403, over 13138.00 frames. ], tot_loss[loss=0.9849, simple_loss=0.6267, pruned_loss=0.6716, over 2582446.34 frames. ], batch size: 112, lr: 2.55e-02, grad_scale: 0.125 2024-06-19 16:37:49,065 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.00 vs. limit=22.5 2024-06-19 16:37:49,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=20841.333333333332, ans=0.025 2024-06-19 16:37:59,301 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.50 vs. limit=10.0 2024-06-19 16:38:00,111 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=30.68 vs. 
limit=15.0 2024-06-19 16:38:02,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=20878.0, ans=0.07 2024-06-19 16:38:05,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=20878.0, ans=0.125 2024-06-19 16:38:23,465 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.349e+03 4.978e+03 6.251e+03 8.107e+03 6.184e+04, threshold=1.250e+04, percent-clipped=21.0 2024-06-19 16:38:26,062 INFO [train.py:1028] (1/2) Epoch 2, batch 1300, loss[loss=0.9695, simple_loss=0.6398, pruned_loss=0.6496, over 12708.00 frames. ], tot_loss[loss=0.9874, simple_loss=0.6276, pruned_loss=0.6736, over 2583354.86 frames. ], batch size: 176, lr: 2.54e-02, grad_scale: 0.25 2024-06-19 16:38:34,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.18 vs. limit=15.0 2024-06-19 16:38:35,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=20951.333333333332, ans=0.0 2024-06-19 16:38:40,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.33 vs. limit=15.0 2024-06-19 16:38:42,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=20969.666666666668, ans=0.1 2024-06-19 16:38:49,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=20969.666666666668, ans=0.025 2024-06-19 16:38:50,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=20988.0, ans=0.006306956521739131 2024-06-19 16:38:52,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=16.53 vs. limit=15.0 2024-06-19 16:39:08,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=21006.333333333332, ans=0.2 2024-06-19 16:39:14,581 INFO [train.py:1028] (1/2) Epoch 2, batch 1350, loss[loss=1.017, simple_loss=0.6439, pruned_loss=0.6946, over 13214.00 frames. ], tot_loss[loss=0.987, simple_loss=0.6277, pruned_loss=0.6732, over 2584989.49 frames. ], batch size: 59, lr: 2.54e-02, grad_scale: 0.25 2024-06-19 16:39:17,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=21024.666666666668, ans=0.0 2024-06-19 16:39:22,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.70 vs. limit=15.0 2024-06-19 16:39:23,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=45.90 vs. limit=22.5 2024-06-19 16:39:23,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.73 vs. 
limit=10.0 2024-06-19 16:39:27,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21043.0, ans=0.125 2024-06-19 16:39:29,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=21043.0, ans=0.0 2024-06-19 16:39:46,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=21079.666666666668, ans=0.2 2024-06-19 16:39:58,202 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.465e+03 4.270e+03 5.208e+03 6.837e+03 4.593e+04, threshold=1.042e+04, percent-clipped=7.0 2024-06-19 16:40:00,913 INFO [train.py:1028] (1/2) Epoch 2, batch 1400, loss[loss=0.9973, simple_loss=0.6207, pruned_loss=0.687, over 12483.00 frames. ], tot_loss[loss=0.9833, simple_loss=0.6262, pruned_loss=0.6702, over 2587588.23 frames. ], batch size: 25, lr: 2.54e-02, grad_scale: 0.5 2024-06-19 16:40:04,788 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2024-06-19 16:40:06,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21116.333333333332, ans=0.1 2024-06-19 16:40:06,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21116.333333333332, ans=0.1 2024-06-19 16:40:10,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=21134.666666666668, ans=0.0 2024-06-19 16:40:11,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.26 vs. limit=15.0 2024-06-19 16:40:14,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=21134.666666666668, ans=0.0 2024-06-19 16:40:16,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.54 vs. limit=15.0 2024-06-19 16:40:21,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=21153.0, ans=0.2 2024-06-19 16:40:25,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=21171.333333333332, ans=0.006267101449275362 2024-06-19 16:40:25,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=21171.333333333332, ans=0.0 2024-06-19 16:40:26,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=21171.333333333332, ans=0.125 2024-06-19 16:40:35,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=21189.666666666668, ans=12.0 2024-06-19 16:40:37,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=15.0 2024-06-19 16:40:40,964 INFO [train.py:1028] (1/2) Epoch 2, batch 1450, loss[loss=0.9103, simple_loss=0.593, pruned_loss=0.6138, over 13119.00 frames. ], tot_loss[loss=0.9802, simple_loss=0.6249, pruned_loss=0.6678, over 2586740.52 frames. 
], batch size: 121, lr: 2.53e-02, grad_scale: 0.5 2024-06-19 16:40:41,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=21208.0, ans=0.125 2024-06-19 16:40:41,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=21208.0, ans=0.125 2024-06-19 16:40:44,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.59 vs. limit=15.0 2024-06-19 16:40:50,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=21226.333333333332, ans=0.0 2024-06-19 16:40:52,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21226.333333333332, ans=0.125 2024-06-19 16:40:52,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2024-06-19 16:40:53,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=21226.333333333332, ans=0.125 2024-06-19 16:41:07,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=21263.0, ans=0.025 2024-06-19 16:41:15,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=21281.333333333332, ans=0.125 2024-06-19 16:41:16,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.77 vs. limit=15.0 2024-06-19 16:41:16,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=21281.333333333332, ans=0.025 2024-06-19 16:41:19,727 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.936e+03 5.345e+03 6.900e+03 9.769e+03 2.413e+04, threshold=1.380e+04, percent-clipped=21.0 2024-06-19 16:41:20,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=21281.333333333332, ans=15.0 2024-06-19 16:41:21,011 INFO [train.py:1028] (1/2) Epoch 2, batch 1500, loss[loss=1.003, simple_loss=0.6392, pruned_loss=0.6829, over 13185.00 frames. ], tot_loss[loss=0.9829, simple_loss=0.6257, pruned_loss=0.6701, over 2589005.59 frames. ], batch size: 83, lr: 2.53e-02, grad_scale: 0.5 2024-06-19 16:41:21,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=21299.666666666668, ans=0.125 2024-06-19 16:41:32,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=21318.0, ans=0.125 2024-06-19 16:41:34,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21318.0, ans=0.125 2024-06-19 16:41:37,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=21336.333333333332, ans=0.0 2024-06-19 16:41:46,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=54.82 vs. 
limit=15.0 2024-06-19 16:41:47,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21354.666666666668, ans=0.125 2024-06-19 16:41:57,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.52 vs. limit=15.0 2024-06-19 16:42:00,137 INFO [train.py:1028] (1/2) Epoch 2, batch 1550, loss[loss=0.9543, simple_loss=0.6052, pruned_loss=0.6517, over 13054.00 frames. ], tot_loss[loss=0.9836, simple_loss=0.6268, pruned_loss=0.6702, over 2584375.42 frames. ], batch size: 102, lr: 2.52e-02, grad_scale: 0.5 2024-06-19 16:42:10,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21409.666666666668, ans=0.125 2024-06-19 16:42:14,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.52 vs. limit=15.0 2024-06-19 16:42:17,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=21428.0, ans=0.09899494936611666 2024-06-19 16:42:17,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=21428.0, ans=15.0 2024-06-19 16:42:19,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=30.12 vs. limit=22.5 2024-06-19 16:42:32,104 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.455e+03 4.539e+03 5.393e+03 7.229e+03 3.192e+04, threshold=1.079e+04, percent-clipped=3.0 2024-06-19 16:42:32,761 INFO [train.py:1028] (1/2) Epoch 2, batch 1600, loss[loss=0.9867, simple_loss=0.6329, pruned_loss=0.6702, over 13161.00 frames. ], tot_loss[loss=0.9849, simple_loss=0.627, pruned_loss=0.6714, over 2579124.32 frames. ], batch size: 77, lr: 2.52e-02, grad_scale: 0.5 2024-06-19 16:42:46,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=21519.666666666668, ans=0.1 2024-06-19 16:42:50,847 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=46.21 vs. limit=15.0 2024-06-19 16:43:02,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=21556.333333333332, ans=0.07 2024-06-19 16:43:02,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.13 vs. limit=15.0 2024-06-19 16:43:05,135 INFO [train.py:1028] (1/2) Epoch 2, batch 1650, loss[loss=0.9386, simple_loss=0.6095, pruned_loss=0.6338, over 13136.00 frames. ], tot_loss[loss=0.9807, simple_loss=0.625, pruned_loss=0.6682, over 2575295.14 frames. ], batch size: 95, lr: 2.52e-02, grad_scale: 0.25 2024-06-19 16:43:08,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=21574.666666666668, ans=0.0 2024-06-19 16:43:09,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.54 vs. 
limit=15.0 2024-06-19 16:43:11,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=21593.0, ans=0.0 2024-06-19 16:43:12,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=21593.0, ans=0.0 2024-06-19 16:43:29,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.63 vs. limit=15.0 2024-06-19 16:43:34,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=21648.0, ans=0.0 2024-06-19 16:43:40,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=21648.0, ans=10.0 2024-06-19 16:43:41,498 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.132e+03 6.157e+03 7.365e+03 8.932e+03 2.808e+04, threshold=1.473e+04, percent-clipped=14.0 2024-06-19 16:43:41,527 INFO [train.py:1028] (1/2) Epoch 2, batch 1700, loss[loss=1.05, simple_loss=0.6559, pruned_loss=0.7224, over 12692.00 frames. ], tot_loss[loss=0.9805, simple_loss=0.6254, pruned_loss=0.6678, over 2581287.41 frames. ], batch size: 25, lr: 2.51e-02, grad_scale: 0.5 2024-06-19 16:43:48,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=21666.333333333332, ans=0.025 2024-06-19 16:43:50,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=21684.666666666668, ans=0.07 2024-06-19 16:43:58,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=21703.0, ans=0.006151521739130435 2024-06-19 16:44:00,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=21703.0, ans=0.2 2024-06-19 16:44:06,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.71 vs. limit=10.0 2024-06-19 16:44:06,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.61 vs. limit=15.0 2024-06-19 16:44:07,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=21721.333333333332, ans=0.125 2024-06-19 16:44:08,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21721.333333333332, ans=0.125 2024-06-19 16:44:09,128 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.94 vs. limit=22.5 2024-06-19 16:44:16,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=21739.666666666668, ans=0.125 2024-06-19 16:44:16,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21739.666666666668, ans=0.1 2024-06-19 16:44:17,684 INFO [train.py:1028] (1/2) Epoch 2, batch 1750, loss[loss=1.097, simple_loss=0.6743, pruned_loss=0.7603, over 12591.00 frames. ], tot_loss[loss=0.9813, simple_loss=0.6256, pruned_loss=0.6685, over 2582275.58 frames. 
], batch size: 22, lr: 2.51e-02, grad_scale: 0.5 2024-06-19 16:44:21,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=21758.0, ans=0.09899494936611666 2024-06-19 16:44:26,735 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.12 vs. limit=15.0 2024-06-19 16:44:32,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=21794.666666666668, ans=0.025 2024-06-19 16:44:36,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=25.08 vs. limit=15.0 2024-06-19 16:44:42,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=33.95 vs. limit=22.5 2024-06-19 16:44:43,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.53 vs. limit=15.0 2024-06-19 16:44:43,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=21831.333333333332, ans=0.125 2024-06-19 16:44:50,479 INFO [train.py:1028] (1/2) Epoch 2, batch 1800, loss[loss=0.9587, simple_loss=0.6109, pruned_loss=0.6532, over 13179.00 frames. ], tot_loss[loss=0.9805, simple_loss=0.6257, pruned_loss=0.6676, over 2582735.53 frames. ], batch size: 67, lr: 2.50e-02, grad_scale: 0.25 2024-06-19 16:44:51,814 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.354e+03 5.226e+03 6.742e+03 8.545e+03 2.402e+04, threshold=1.348e+04, percent-clipped=2.0 2024-06-19 16:44:52,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=21849.666666666668, ans=10.0 2024-06-19 16:44:57,243 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=7.248e+00 2024-06-19 16:44:58,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=21868.0, ans=0.0 2024-06-19 16:45:14,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=21904.666666666668, ans=0.0 2024-06-19 16:45:20,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=21923.0, ans=0.125 2024-06-19 16:45:21,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.28 vs. limit=15.0 2024-06-19 16:45:23,281 INFO [train.py:1028] (1/2) Epoch 2, batch 1850, loss[loss=0.9356, simple_loss=0.6016, pruned_loss=0.6348, over 13190.00 frames. ], tot_loss[loss=0.9805, simple_loss=0.6259, pruned_loss=0.6676, over 2582852.49 frames. ], batch size: 83, lr: 2.50e-02, grad_scale: 0.25 2024-06-19 16:45:28,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=23.85 vs. limit=15.0 2024-06-19 16:45:49,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.16 vs. 
limit=15.0 2024-06-19 16:46:05,238 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.012e+00 2024-06-19 16:46:06,368 INFO [train.py:1028] (1/2) Epoch 2, batch 1900, loss[loss=0.9688, simple_loss=0.6255, pruned_loss=0.656, over 13172.00 frames. ], tot_loss[loss=0.9752, simple_loss=0.6236, pruned_loss=0.6634, over 2585357.23 frames. ], batch size: 95, lr: 2.50e-02, grad_scale: 0.25 2024-06-19 16:46:07,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=22033.0, ans=0.125 2024-06-19 16:46:08,208 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.676e+03 6.252e+03 7.371e+03 1.003e+04 3.293e+04, threshold=1.474e+04, percent-clipped=10.0 2024-06-19 16:46:08,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=26.37 vs. limit=15.0 2024-06-19 16:46:09,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=22033.0, ans=0.2 2024-06-19 16:46:09,368 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=11.82 vs. limit=12.0 2024-06-19 16:46:15,484 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=83.59 vs. limit=15.0 2024-06-19 16:46:19,740 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.34 vs. limit=22.5 2024-06-19 16:46:21,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=22069.666666666668, ans=0.0 2024-06-19 16:46:28,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=22088.0, ans=0.125 2024-06-19 16:46:30,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=22088.0, ans=0.125 2024-06-19 16:46:30,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.68 vs. limit=15.0 2024-06-19 16:46:30,959 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.42 vs. limit=10.0 2024-06-19 16:46:32,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=22106.333333333332, ans=0.125 2024-06-19 16:46:34,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=27.55 vs. limit=15.0 2024-06-19 16:46:36,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=22106.333333333332, ans=0.125 2024-06-19 16:46:38,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=22106.333333333332, ans=0.0 2024-06-19 16:46:39,695 INFO [train.py:1028] (1/2) Epoch 2, batch 1950, loss[loss=0.9671, simple_loss=0.6125, pruned_loss=0.6608, over 13209.00 frames. ], tot_loss[loss=0.9726, simple_loss=0.6217, pruned_loss=0.6618, over 2591723.32 frames. 
], batch size: 52, lr: 2.49e-02, grad_scale: 0.25 2024-06-19 16:46:41,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22124.666666666668, ans=0.1 2024-06-19 16:46:42,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.69 vs. limit=22.5 2024-06-19 16:47:12,794 INFO [train.py:1028] (1/2) Epoch 2, batch 2000, loss[loss=1.009, simple_loss=0.6244, pruned_loss=0.6972, over 12507.00 frames. ], tot_loss[loss=0.9727, simple_loss=0.6223, pruned_loss=0.6615, over 2587691.37 frames. ], batch size: 22, lr: 2.49e-02, grad_scale: 0.25 2024-06-19 16:47:15,381 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.303e+03 6.102e+03 7.605e+03 1.055e+04 4.573e+04, threshold=1.521e+04, percent-clipped=11.0 2024-06-19 16:47:17,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=33.10 vs. limit=22.5 2024-06-19 16:47:25,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22234.666666666668, ans=0.1 2024-06-19 16:47:31,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=22253.0, ans=0.125 2024-06-19 16:47:33,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22253.0, ans=0.1 2024-06-19 16:47:48,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=22289.666666666668, ans=0.125 2024-06-19 16:47:48,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=22289.666666666668, ans=0.125 2024-06-19 16:47:49,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=22289.666666666668, ans=0.2 2024-06-19 16:47:52,381 INFO [train.py:1028] (1/2) Epoch 2, batch 2050, loss[loss=1.068, simple_loss=0.6514, pruned_loss=0.7418, over 12684.00 frames. ], tot_loss[loss=0.9745, simple_loss=0.6224, pruned_loss=0.6633, over 2583000.19 frames. ], batch size: 29, lr: 2.49e-02, grad_scale: 0.25 2024-06-19 16:48:04,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.49 vs. limit=22.5 2024-06-19 16:48:05,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.61 vs. limit=15.0 2024-06-19 16:48:06,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=22344.666666666668, ans=0.0 2024-06-19 16:48:06,867 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.21 vs. 
limit=22.5 2024-06-19 16:48:07,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=22344.666666666668, ans=0.125 2024-06-19 16:48:12,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=22363.0, ans=0.0 2024-06-19 16:48:14,386 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=35.01 vs. limit=15.0 2024-06-19 16:48:17,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=22363.0, ans=0.125 2024-06-19 16:48:23,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=19.12 vs. limit=15.0 2024-06-19 16:48:25,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=22399.666666666668, ans=0.125 2024-06-19 16:48:26,315 INFO [train.py:1028] (1/2) Epoch 2, batch 2100, loss[loss=0.9398, simple_loss=0.5983, pruned_loss=0.6407, over 13203.00 frames. ], tot_loss[loss=0.9777, simple_loss=0.6241, pruned_loss=0.6657, over 2585502.63 frames. ], batch size: 59, lr: 2.48e-02, grad_scale: 0.5 2024-06-19 16:48:27,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=23.35 vs. limit=15.0 2024-06-19 16:48:28,960 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.061e+03 3.980e+03 4.910e+03 5.983e+03 2.832e+04, threshold=9.821e+03, percent-clipped=1.0 2024-06-19 16:48:33,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=22418.0, ans=0.005996086956521739 2024-06-19 16:48:36,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=22418.0, ans=0.125 2024-06-19 16:48:38,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=22418.0, ans=0.04949747468305833 2024-06-19 16:48:42,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=22436.333333333332, ans=0.125 2024-06-19 16:48:43,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=22436.333333333332, ans=0.125 2024-06-19 16:48:43,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=22436.333333333332, ans=0.125 2024-06-19 16:48:49,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22454.666666666668, ans=0.1 2024-06-19 16:48:52,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=33.79 vs. limit=15.0 2024-06-19 16:48:57,926 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.68 vs. limit=15.0 2024-06-19 16:48:57,977 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.96 vs. 
limit=10.0 2024-06-19 16:48:58,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=22473.0, ans=0.07 2024-06-19 16:49:00,121 INFO [train.py:1028] (1/2) Epoch 2, batch 2150, loss[loss=0.9744, simple_loss=0.6173, pruned_loss=0.6658, over 13326.00 frames. ], tot_loss[loss=0.9764, simple_loss=0.623, pruned_loss=0.6649, over 2588696.32 frames. ], batch size: 52, lr: 2.48e-02, grad_scale: 0.5 2024-06-19 16:49:01,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=22491.333333333332, ans=0.2 2024-06-19 16:49:06,515 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.29 vs. limit=10.0 2024-06-19 16:49:06,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.13 vs. limit=15.0 2024-06-19 16:49:07,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.75 vs. limit=15.0 2024-06-19 16:49:08,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=18.26 vs. limit=15.0 2024-06-19 16:49:08,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=22509.666666666668, ans=0.125 2024-06-19 16:49:21,585 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. limit=6.0 2024-06-19 16:49:25,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=22546.333333333332, ans=0.125 2024-06-19 16:49:32,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=22564.666666666668, ans=0.125 2024-06-19 16:49:33,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=22583.0, ans=0.125 2024-06-19 16:49:33,508 INFO [train.py:1028] (1/2) Epoch 2, batch 2200, loss[loss=0.9405, simple_loss=0.6113, pruned_loss=0.6349, over 13194.00 frames. ], tot_loss[loss=0.9766, simple_loss=0.6231, pruned_loss=0.6651, over 2588558.89 frames. ], batch size: 83, lr: 2.47e-02, grad_scale: 0.25 2024-06-19 16:49:34,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=22583.0, ans=0.125 2024-06-19 16:49:41,120 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.031e+03 6.121e+03 7.867e+03 9.684e+03 3.625e+04, threshold=1.573e+04, percent-clipped=24.0 2024-06-19 16:49:48,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=22619.666666666668, ans=0.005952246376811594 2024-06-19 16:49:49,302 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=19.95 vs. limit=15.0 2024-06-19 16:49:52,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. 
limit=15.0 2024-06-19 16:50:08,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=22656.333333333332, ans=0.025 2024-06-19 16:50:11,819 INFO [train.py:1028] (1/2) Epoch 2, batch 2250, loss[loss=1.01, simple_loss=0.6459, pruned_loss=0.6869, over 13260.00 frames. ], tot_loss[loss=0.9756, simple_loss=0.6228, pruned_loss=0.6642, over 2586902.30 frames. ], batch size: 63, lr: 2.47e-02, grad_scale: 0.125 2024-06-19 16:50:12,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.04 vs. limit=5.0 2024-06-19 16:50:15,597 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=4.456e+01 2024-06-19 16:50:20,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=22693.0, ans=0.07 2024-06-19 16:50:20,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=22693.0, ans=0.0 2024-06-19 16:50:24,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.39 vs. limit=10.0 2024-06-19 16:50:25,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2024-06-19 16:50:26,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=22711.333333333332, ans=0.125 2024-06-19 16:50:27,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.48 vs. limit=15.0 2024-06-19 16:50:28,665 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.12 vs. limit=15.0 2024-06-19 16:50:33,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=22729.666666666668, ans=0.125 2024-06-19 16:50:35,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=22729.666666666668, ans=0.125 2024-06-19 16:50:36,016 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.63 vs. limit=15.0 2024-06-19 16:50:42,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.43 vs. limit=22.5 2024-06-19 16:50:44,353 INFO [train.py:1028] (1/2) Epoch 2, batch 2300, loss[loss=1.045, simple_loss=0.6543, pruned_loss=0.7181, over 12875.00 frames. ], tot_loss[loss=0.9804, simple_loss=0.6253, pruned_loss=0.6677, over 2581968.80 frames. ], batch size: 33, lr: 2.47e-02, grad_scale: 0.25 2024-06-19 16:50:44,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=25.08 vs. 
limit=15.0 2024-06-19 16:50:48,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=22766.333333333332, ans=0.125 2024-06-19 16:50:49,095 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.948e+03 5.933e+03 8.161e+03 9.991e+03 5.428e+04, threshold=1.632e+04, percent-clipped=5.0 2024-06-19 16:50:52,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.53 vs. limit=15.0 2024-06-19 16:51:02,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=22803.0, ans=0.125 2024-06-19 16:51:02,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=32.22 vs. limit=22.5 2024-06-19 16:51:17,892 INFO [train.py:1028] (1/2) Epoch 2, batch 2350, loss[loss=0.9968, simple_loss=0.6323, pruned_loss=0.6806, over 13218.00 frames. ], tot_loss[loss=0.9824, simple_loss=0.6264, pruned_loss=0.6693, over 2585016.30 frames. ], batch size: 67, lr: 2.46e-02, grad_scale: 0.25 2024-06-19 16:51:19,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=22858.0, ans=0.0 2024-06-19 16:51:19,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=22858.0, ans=0.0 2024-06-19 16:51:24,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=22876.333333333332, ans=0.09899494936611666 2024-06-19 16:51:36,352 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=23.72 vs. limit=22.5 2024-06-19 16:51:44,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=22913.0, ans=0.05 2024-06-19 16:51:49,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=22931.333333333332, ans=0.005884492753623189 2024-06-19 16:51:53,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=22931.333333333332, ans=0.005884492753623189 2024-06-19 16:51:54,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=22931.333333333332, ans=0.125 2024-06-19 16:51:56,935 INFO [train.py:1028] (1/2) Epoch 2, batch 2400, loss[loss=0.9979, simple_loss=0.6343, pruned_loss=0.6807, over 13292.00 frames. ], tot_loss[loss=0.98, simple_loss=0.6247, pruned_loss=0.6677, over 2587882.40 frames. 
], batch size: 46, lr: 2.46e-02, grad_scale: 0.5 2024-06-19 16:52:00,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=22949.666666666668, ans=0.0 2024-06-19 16:52:02,031 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.554e+03 6.217e+03 7.491e+03 9.810e+03 6.010e+04, threshold=1.498e+04, percent-clipped=6.0 2024-06-19 16:52:02,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=22949.666666666668, ans=0.125 2024-06-19 16:52:04,433 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=15.0 2024-06-19 16:52:12,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=27.36 vs. limit=22.5 2024-06-19 16:52:18,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.03 vs. limit=22.5 2024-06-19 16:52:25,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=23023.0, ans=0.125 2024-06-19 16:52:29,439 INFO [train.py:1028] (1/2) Epoch 2, batch 2450, loss[loss=0.956, simple_loss=0.607, pruned_loss=0.6525, over 13277.00 frames. ], tot_loss[loss=0.9768, simple_loss=0.6233, pruned_loss=0.6652, over 2584682.65 frames. ], batch size: 63, lr: 2.46e-02, grad_scale: 0.25 2024-06-19 16:52:34,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=23041.333333333332, ans=0.025 2024-06-19 16:52:48,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=24.90 vs. limit=15.0 2024-06-19 16:52:53,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=23096.333333333332, ans=0.0 2024-06-19 16:52:53,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=23096.333333333332, ans=0.025 2024-06-19 16:52:57,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.08 vs. limit=15.0 2024-06-19 16:52:59,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=23114.666666666668, ans=0.025 2024-06-19 16:53:02,370 INFO [train.py:1028] (1/2) Epoch 2, batch 2500, loss[loss=0.9669, simple_loss=0.6205, pruned_loss=0.6567, over 13242.00 frames. ], tot_loss[loss=0.9712, simple_loss=0.6205, pruned_loss=0.6609, over 2588027.14 frames. ], batch size: 83, lr: 2.45e-02, grad_scale: 0.125 2024-06-19 16:53:08,850 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.566e+03 9.530e+03 1.158e+04 1.651e+04 1.128e+05, threshold=2.317e+04, percent-clipped=29.0 2024-06-19 16:53:12,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=23151.333333333332, ans=0.125 2024-06-19 16:53:15,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=31.29 vs. 
limit=15.0 2024-06-19 16:53:16,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=23169.666666666668, ans=0.025 2024-06-19 16:53:16,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.33 vs. limit=10.0 2024-06-19 16:53:19,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=23169.666666666668, ans=15.0 2024-06-19 16:53:27,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=23188.0, ans=0.005828695652173913 2024-06-19 16:53:34,652 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2024-06-19 16:53:35,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=23206.333333333332, ans=0.125 2024-06-19 16:53:38,896 INFO [train.py:1028] (1/2) Epoch 2, batch 2550, loss[loss=1.05, simple_loss=0.6372, pruned_loss=0.7316, over 12356.00 frames. ], tot_loss[loss=0.9673, simple_loss=0.618, pruned_loss=0.6583, over 2587304.73 frames. ], batch size: 22, lr: 2.45e-02, grad_scale: 0.125 2024-06-19 16:53:39,261 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.10 vs. limit=15.0 2024-06-19 16:53:54,108 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.27 vs. limit=15.0 2024-06-19 16:53:57,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=23261.333333333332, ans=0.125 2024-06-19 16:53:59,968 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.40 vs. limit=15.0 2024-06-19 16:54:05,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=23279.666666666668, ans=0.125 2024-06-19 16:54:14,390 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=19.66 vs. limit=15.0 2024-06-19 16:54:14,669 INFO [train.py:1028] (1/2) Epoch 2, batch 2600, loss[loss=0.967, simple_loss=0.6023, pruned_loss=0.6659, over 13247.00 frames. ], tot_loss[loss=0.9647, simple_loss=0.6153, pruned_loss=0.6571, over 2586466.51 frames. ], batch size: 52, lr: 2.45e-02, grad_scale: 0.25 2024-06-19 16:54:18,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.54 vs. 
limit=15.0 2024-06-19 16:54:21,272 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.178e+03 5.812e+03 7.896e+03 9.891e+03 5.199e+04, threshold=1.579e+04, percent-clipped=4.0 2024-06-19 16:54:24,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=23334.666666666668, ans=0.125 2024-06-19 16:54:26,332 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=20.29 vs. limit=15.0 2024-06-19 16:54:31,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=23353.0, ans=0.1 2024-06-19 16:54:32,885 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.90 vs. limit=22.5 2024-06-19 16:54:44,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.27 vs. limit=12.0 2024-06-19 16:54:48,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.03 vs. limit=22.5 2024-06-19 16:54:48,403 INFO [train.py:1028] (1/2) Epoch 2, batch 2650, loss[loss=0.8664, simple_loss=0.5719, pruned_loss=0.5805, over 13048.00 frames. ], tot_loss[loss=0.9597, simple_loss=0.6122, pruned_loss=0.6536, over 2586626.97 frames. ], batch size: 144, lr: 2.44e-02, grad_scale: 0.25 2024-06-19 16:54:48,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=23408.0, ans=0.1 2024-06-19 16:54:57,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=23426.333333333332, ans=0.125 2024-06-19 16:54:57,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=23426.333333333332, ans=0.1 2024-06-19 16:54:58,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23426.333333333332, ans=0.1 2024-06-19 16:54:59,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=23426.333333333332, ans=0.0 2024-06-19 16:55:00,692 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.21 vs. limit=15.0 2024-06-19 16:55:00,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=15.0 2024-06-19 16:55:01,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=23444.666666666668, ans=0.125 2024-06-19 16:55:05,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=23444.666666666668, ans=0.005772898550724637 2024-06-19 16:55:05,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=23444.666666666668, ans=0.125 2024-06-19 16:55:06,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=19.88 vs. 
2024-06-19 16:55:07,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.71 vs. limit=15.0
2024-06-19 16:55:11,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=23463.0, ans=0.0
2024-06-19 16:55:12,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=23463.0, ans=0.125
2024-06-19 16:55:25,218 INFO [train.py:1028] (1/2) Epoch 2, batch 2700, loss[loss=0.9359, simple_loss=0.6126, pruned_loss=0.6296, over 13292.00 frames. ], tot_loss[loss=0.9514, simple_loss=0.6083, pruned_loss=0.6472, over 2585209.17 frames. ], batch size: 89, lr: 2.44e-02, grad_scale: 0.5
2024-06-19 16:55:31,848 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.261e+03 3.882e+03 4.889e+03 6.103e+03 2.908e+04, threshold=9.777e+03, percent-clipped=1.0
2024-06-19 16:55:40,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.37 vs. limit=15.0
2024-06-19 16:55:42,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. limit=15.0
2024-06-19 16:55:46,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.36 vs. limit=15.0
2024-06-19 16:55:55,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=23573.0, ans=0.025
2024-06-19 16:56:02,122 INFO [train.py:1028] (1/2) Epoch 2, batch 2750, loss[loss=0.9691, simple_loss=0.615, pruned_loss=0.6616, over 13218.00 frames. ], tot_loss[loss=0.9466, simple_loss=0.6058, pruned_loss=0.6437, over 2582840.54 frames. ], batch size: 43, lr: 2.44e-02, grad_scale: 0.25
2024-06-19 16:56:24,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=23646.333333333332, ans=0.005729057971014493
2024-06-19 16:56:25,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=23646.333333333332, ans=0.0
2024-06-19 16:56:35,585 INFO [train.py:1028] (1/2) Epoch 2, batch 2800, loss[loss=0.8464, simple_loss=0.5746, pruned_loss=0.5591, over 10901.00 frames. ], tot_loss[loss=0.9451, simple_loss=0.6049, pruned_loss=0.6426, over 2579994.53 frames. ], batch size: 303, lr: 2.43e-02, grad_scale: 0.5
2024-06-19 16:56:36,029 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=42.61 vs. limit=15.0
2024-06-19 16:56:42,909 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.304e+03 3.671e+03 4.627e+03 6.254e+03 2.311e+04, threshold=9.254e+03, percent-clipped=6.0
2024-06-19 16:56:47,200 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.60 vs. limit=15.0
2024-06-19 16:56:49,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=23719.666666666668, ans=0.125
2024-06-19 16:56:56,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=30.70 vs. limit=22.5
2024-06-19 17:57:04,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=23756.333333333332, ans=0.2
2024-06-19 16:57:11,362 INFO [train.py:1028] (1/2) Epoch 2, batch 2850, loss[loss=0.92, simple_loss=0.5875, pruned_loss=0.6262, over 13282.00 frames. ], tot_loss[loss=0.9383, simple_loss=0.6017, pruned_loss=0.6374, over 2577926.56 frames. ], batch size: 49, lr: 2.43e-02, grad_scale: 0.25
2024-06-19 16:57:11,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=23774.666666666668, ans=0.125
2024-06-19 16:57:12,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=23774.666666666668, ans=0.125
2024-06-19 16:57:14,316 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.12 vs. limit=15.0
2024-06-19 16:57:16,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0
2024-06-19 16:57:23,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=23811.333333333332, ans=0.125
2024-06-19 16:57:33,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=23829.666666666668, ans=0.125
2024-06-19 16:57:35,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=23829.666666666668, ans=0.025
2024-06-19 16:57:47,030 INFO [train.py:1028] (1/2) Epoch 2, batch 2900, loss[loss=0.9197, simple_loss=0.5795, pruned_loss=0.6299, over 13175.00 frames. ], tot_loss[loss=0.9311, simple_loss=0.5977, pruned_loss=0.6323, over 2585459.38 frames. ], batch size: 55, lr: 2.42e-02, grad_scale: 0.5
2024-06-19 16:57:55,347 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+03 2.936e+03 3.943e+03 5.282e+03 2.928e+04, threshold=7.887e+03, percent-clipped=7.0
2024-06-19 16:57:57,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=18.84 vs. limit=15.0
2024-06-19 16:58:08,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=23921.333333333332, ans=0.125
2024-06-19 16:58:10,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.84 vs. limit=10.0
2024-06-19 16:58:11,159 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=22.02 vs. limit=15.0
2024-06-19 16:58:12,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=23921.333333333332, ans=0.07
2024-06-19 16:58:13,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=23921.333333333332, ans=0.5
2024-06-19 16:58:20,857 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=17.57 vs. limit=15.0
2024-06-19 16:58:21,078 INFO [train.py:1028] (1/2) Epoch 2, batch 2950, loss[loss=0.9267, simple_loss=0.5854, pruned_loss=0.634, over 13241.00 frames. ], tot_loss[loss=0.9301, simple_loss=0.597, pruned_loss=0.6316, over 2579377.30 frames. ], batch size: 43, lr: 2.42e-02, grad_scale: 0.5
2024-06-19 16:58:43,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.88 vs. limit=15.0
2024-06-19 16:58:49,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=24031.333333333332, ans=0.125
2024-06-19 16:58:54,956 INFO [train.py:1028] (1/2) Epoch 2, batch 3000, loss[loss=0.9435, simple_loss=0.6005, pruned_loss=0.6433, over 13206.00 frames. ], tot_loss[loss=0.9268, simple_loss=0.5951, pruned_loss=0.6293, over 2579275.98 frames. ], batch size: 59, lr: 2.42e-02, grad_scale: 0.5
2024-06-19 16:58:54,957 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-19 16:59:05,800 INFO [train.py:1060] (1/2) Epoch 2, validation: loss=0.9755, simple_loss=0.6278, pruned_loss=0.6616, over 351949.00 frames.
2024-06-19 16:59:05,800 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 16946MB
2024-06-19 16:59:08,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0
2024-06-19 16:59:14,396 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+03 3.929e+03 5.019e+03 6.116e+03 2.164e+04, threshold=1.004e+04, percent-clipped=11.0
2024-06-19 16:59:14,832 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.45 vs. limit=10.0
2024-06-19 16:59:23,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.72 vs. limit=10.0
2024-06-19 16:59:28,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=24104.666666666668, ans=0.125
2024-06-19 16:59:29,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.76 vs. limit=15.0
2024-06-19 16:59:39,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=24123.0, ans=0.0
2024-06-19 16:59:40,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24123.0, ans=0.1
2024-06-19 16:59:42,150 INFO [train.py:1028] (1/2) Epoch 2, batch 3050, loss[loss=0.9466, simple_loss=0.5952, pruned_loss=0.6491, over 13254.00 frames. ], tot_loss[loss=0.922, simple_loss=0.5928, pruned_loss=0.6256, over 2579315.60 frames. ], batch size: 46, lr: 2.41e-02, grad_scale: 0.5
2024-06-19 16:59:42,660 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.77 vs. limit=15.0
2024-06-19 16:59:46,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=24141.333333333332, ans=0.125
2024-06-19 16:59:51,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=24159.666666666668, ans=0.125
2024-06-19 16:59:52,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=24159.666666666668, ans=0.125
2024-06-19 16:59:56,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.67 vs. limit=15.0
2024-06-19 17:00:04,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=24196.333333333332, ans=0.025
2024-06-19 17:00:14,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24214.666666666668, ans=0.1
2024-06-19 17:00:15,468 INFO [train.py:1028] (1/2) Epoch 2, batch 3100, loss[loss=0.8381, simple_loss=0.5557, pruned_loss=0.5603, over 13015.00 frames. ], tot_loss[loss=0.9197, simple_loss=0.5914, pruned_loss=0.624, over 2579187.30 frames. ], batch size: 144, lr: 2.41e-02, grad_scale: 0.5
2024-06-19 17:00:15,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24233.0, ans=0.1
2024-06-19 17:00:17,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=24233.0, ans=0.125
2024-06-19 17:00:24,660 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.302e+03
2024-06-19 17:00:25,952 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.912e+03 5.586e+03 7.131e+03 1.055e+04 3.160e+04, threshold=1.426e+04, percent-clipped=25.0
2024-06-19 17:00:26,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=24251.333333333332, ans=0.025
2024-06-19 17:00:30,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=24269.666666666668, ans=0.0
2024-06-19 17:00:38,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.50 vs. limit=15.0
2024-06-19 17:00:46,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=24306.333333333332, ans=0.125
2024-06-19 17:00:48,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=24306.333333333332, ans=0.125
2024-06-19 17:00:50,096 INFO [train.py:1028] (1/2) Epoch 2, batch 3150, loss[loss=0.8461, simple_loss=0.5648, pruned_loss=0.5637, over 12936.00 frames. ], tot_loss[loss=0.9168, simple_loss=0.59, pruned_loss=0.6218, over 2580863.14 frames. ], batch size: 158, lr: 2.41e-02, grad_scale: 0.25
2024-06-19 17:00:52,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24324.666666666668, ans=0.1
2024-06-19 17:00:57,746 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=7.853e+00
2024-06-19 17:00:59,214 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0
2024-06-19 17:00:59,957 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.36 vs. limit=10.0
2024-06-19 17:01:01,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=24343.0, ans=15.0
2024-06-19 17:01:15,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=24379.666666666668, ans=0.125
2024-06-19 17:01:19,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=24398.0, ans=0.125
2024-06-19 17:01:20,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=24398.0, ans=0.2
2024-06-19 17:01:21,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=24398.0, ans=0.125
2024-06-19 17:01:26,606 INFO [train.py:1028] (1/2) Epoch 2, batch 3200, loss[loss=0.9096, simple_loss=0.5759, pruned_loss=0.6217, over 13143.00 frames. ], tot_loss[loss=0.9189, simple_loss=0.5896, pruned_loss=0.6241, over 2581380.12 frames. ], batch size: 55, lr: 2.40e-02, grad_scale: 0.5
2024-06-19 17:01:37,273 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.665e+03 6.229e+03 7.152e+03 9.416e+03 4.728e+04, threshold=1.430e+04, percent-clipped=6.0
2024-06-19 17:01:38,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24434.666666666668, ans=0.1
2024-06-19 17:01:39,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=24453.0, ans=0.09899494936611666
2024-06-19 17:01:40,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=24453.0, ans=0.125
2024-06-19 17:01:55,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=24471.333333333332, ans=0.05
2024-06-19 17:01:59,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=24489.666666666668, ans=0.0
2024-06-19 17:02:03,371 INFO [train.py:1028] (1/2) Epoch 2, batch 3250, loss[loss=0.9175, simple_loss=0.5771, pruned_loss=0.6289, over 13284.00 frames. ], tot_loss[loss=0.9185, simple_loss=0.5884, pruned_loss=0.6243, over 2584920.88 frames. ], batch size: 72, lr: 2.40e-02, grad_scale: 0.125
2024-06-19 17:02:13,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.84 vs. limit=15.0
2024-06-19 17:02:14,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=24526.333333333332, ans=0.09899494936611666
2024-06-19 17:02:16,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=15.90 vs. limit=12.0
2024-06-19 17:02:30,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=24563.0, ans=0.1
2024-06-19 17:02:36,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=25.71 vs. limit=15.0
2024-06-19 17:02:38,668 INFO [train.py:1028] (1/2) Epoch 2, batch 3300, loss[loss=0.856, simple_loss=0.5743, pruned_loss=0.5689, over 12723.00 frames. ], tot_loss[loss=0.9181, simple_loss=0.5874, pruned_loss=0.6244, over 2581288.86 frames. ], batch size: 176, lr: 2.40e-02, grad_scale: 0.25
2024-06-19 17:02:39,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.88 vs. limit=10.0
2024-06-19 17:02:43,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=28.56 vs. limit=22.5
2024-06-19 17:02:47,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=24618.0, ans=0.0
2024-06-19 17:02:51,025 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.694e+03 4.848e+03 5.677e+03 7.153e+03 8.373e+04, threshold=1.135e+04, percent-clipped=8.0
2024-06-19 17:03:02,162 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.21 vs. limit=22.5
2024-06-19 17:03:10,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=24673.0, ans=0.125
2024-06-19 17:03:12,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=24673.0, ans=0.125
2024-06-19 17:03:13,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0
2024-06-19 17:03:14,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24673.0, ans=0.1
2024-06-19 17:03:15,475 INFO [train.py:1028] (1/2) Epoch 2, batch 3350, loss[loss=0.9018, simple_loss=0.6002, pruned_loss=0.6017, over 12924.00 frames. ], tot_loss[loss=0.912, simple_loss=0.5852, pruned_loss=0.6194, over 2575815.45 frames. ], batch size: 158, lr: 2.39e-02, grad_scale: 0.125
2024-06-19 17:03:15,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.84 vs. limit=15.0
2024-06-19 17:03:20,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=24691.333333333332, ans=0.95
2024-06-19 17:03:20,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=24691.333333333332, ans=0.2
2024-06-19 17:03:27,029 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.738e+01
2024-06-19 17:03:32,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.82 vs. limit=15.0
2024-06-19 17:03:40,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=24746.333333333332, ans=0.125
2024-06-19 17:03:44,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=24764.666666666668, ans=0.125
2024-06-19 17:03:46,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.76 vs. limit=15.0
2024-06-19 17:03:51,817 INFO [train.py:1028] (1/2) Epoch 2, batch 3400, loss[loss=0.9868, simple_loss=0.6062, pruned_loss=0.6837, over 12821.00 frames. ], tot_loss[loss=0.9092, simple_loss=0.5838, pruned_loss=0.6173, over 2574217.65 frames. ], batch size: 22, lr: 2.39e-02, grad_scale: 0.25
2024-06-19 17:03:53,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=24783.0, ans=0.1
2024-06-19 17:03:58,569 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=23.57 vs. limit=15.0
2024-06-19 17:04:04,088 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.343e+03 4.329e+03 5.335e+03 7.046e+03 2.045e+04, threshold=1.067e+04, percent-clipped=6.0
2024-06-19 17:04:23,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.51 vs. limit=15.0
2024-06-19 17:04:26,106 INFO [train.py:1028] (1/2) Epoch 2, batch 3450, loss[loss=0.8898, simple_loss=0.5912, pruned_loss=0.5942, over 12681.00 frames. ], tot_loss[loss=0.9081, simple_loss=0.5827, pruned_loss=0.6168, over 2576419.95 frames. ], batch size: 176, lr: 2.39e-02, grad_scale: 0.25
2024-06-19 17:04:30,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=24874.666666666668, ans=0.125
2024-06-19 17:04:31,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=24874.666666666668, ans=0.005462028985507247
2024-06-19 17:04:33,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=24893.0, ans=0.025
2024-06-19 17:04:34,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=24893.0, ans=0.0
2024-06-19 17:04:38,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24893.0, ans=0.1
2024-06-19 17:04:40,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.50 vs. limit=15.0
2024-06-19 17:04:41,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=24911.333333333332, ans=0.125
2024-06-19 17:04:42,838 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=41.12 vs. limit=15.0
2024-06-19 17:04:46,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.24 vs. limit=15.0
2024-06-19 17:04:53,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.46 vs. limit=12.0
2024-06-19 17:04:53,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=24948.0, ans=0.0
2024-06-19 17:05:00,609 INFO [train.py:1028] (1/2) Epoch 2, batch 3500, loss[loss=0.9282, simple_loss=0.5803, pruned_loss=0.638, over 12992.00 frames. ], tot_loss[loss=0.9073, simple_loss=0.5816, pruned_loss=0.6164, over 2575788.48 frames. ], batch size: 33, lr: 2.38e-02, grad_scale: 0.5
2024-06-19 17:05:08,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=24966.333333333332, ans=0.125
2024-06-19 17:05:13,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=24984.666666666668, ans=0.0054381159420289855
2024-06-19 17:05:13,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24984.666666666668, ans=0.1
2024-06-19 17:05:16,697 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.513e+03 3.727e+03 4.442e+03 5.346e+03 2.556e+04, threshold=8.884e+03, percent-clipped=6.0
2024-06-19 17:05:17,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=24984.666666666668, ans=15.0
2024-06-19 17:05:21,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=25003.0, ans=0.125
2024-06-19 17:05:39,056 INFO [train.py:1028] (1/2) Epoch 2, batch 3550, loss[loss=0.8689, simple_loss=0.5607, pruned_loss=0.5885, over 13098.00 frames. ], tot_loss[loss=0.9076, simple_loss=0.581, pruned_loss=0.6171, over 2578859.07 frames. ], batch size: 95, lr: 2.38e-02, grad_scale: 0.5
2024-06-19 17:05:45,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.14 vs. limit=10.0
2024-06-19 17:05:45,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=25058.0, ans=0.125
2024-06-19 17:05:50,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=163.04 vs. limit=22.5
2024-06-19 17:05:51,761 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.55 vs. limit=15.0
2024-06-19 17:05:55,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=25094.666666666668, ans=0.2
2024-06-19 17:05:59,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.94 vs. limit=15.0
2024-06-19 17:06:02,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=25113.0, ans=0.125
2024-06-19 17:06:07,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=25113.0, ans=0.125
2024-06-19 17:06:08,489 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=45.88 vs. limit=22.5
2024-06-19 17:06:12,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25131.333333333332, ans=0.1
2024-06-19 17:06:13,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=25131.333333333332, ans=0.125
2024-06-19 17:06:15,726 INFO [train.py:1028] (1/2) Epoch 2, batch 3600, loss[loss=0.9346, simple_loss=0.5901, pruned_loss=0.6396, over 13034.00 frames. ], tot_loss[loss=0.9048, simple_loss=0.5805, pruned_loss=0.6145, over 2582249.46 frames. ], batch size: 48, lr: 2.38e-02, grad_scale: 0.5
2024-06-19 17:06:16,100 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=41.73 vs. limit=22.5
2024-06-19 17:06:25,858 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=38.98 vs. limit=22.5
2024-06-19 17:06:28,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+03 4.574e+03 5.849e+03 8.000e+03 5.142e+04, threshold=1.170e+04, percent-clipped=17.0
2024-06-19 17:06:49,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=25241.333333333332, ans=0.125
2024-06-19 17:06:50,234 INFO [train.py:1028] (1/2) Epoch 2, batch 3650, loss[loss=0.8311, simple_loss=0.5394, pruned_loss=0.5613, over 12975.00 frames. ], tot_loss[loss=0.9048, simple_loss=0.5804, pruned_loss=0.6146, over 2579226.54 frames. ], batch size: 102, lr: 2.37e-02, grad_scale: 0.5
2024-06-19 17:06:51,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=25241.333333333332, ans=0.125
2024-06-19 17:06:55,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=25241.333333333332, ans=0.125
2024-06-19 17:06:57,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.39 vs. limit=10.0
2024-06-19 17:07:05,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=25278.0, ans=22.5
2024-06-19 17:07:19,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.47 vs. limit=15.0
2024-06-19 17:07:27,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=25314.666666666668, ans=0.125
2024-06-19 17:07:29,001 INFO [train.py:1028] (1/2) Epoch 2, batch 3700, loss[loss=0.8711, simple_loss=0.5612, pruned_loss=0.5905, over 13272.00 frames. ], tot_loss[loss=0.901, simple_loss=0.5784, pruned_loss=0.6118, over 2584559.50 frames. ], batch size: 72, lr: 2.37e-02, grad_scale: 1.0
2024-06-19 17:07:30,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0
2024-06-19 17:07:35,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.36 vs. limit=22.5
2024-06-19 17:07:37,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=25351.333333333332, ans=0.1
2024-06-19 17:07:41,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.50 vs. limit=15.0
2024-06-19 17:07:43,271 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.525e+03 5.418e+03 6.607e+03 8.945e+03 3.636e+04, threshold=1.321e+04, percent-clipped=15.0
2024-06-19 17:07:53,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=25388.0, ans=0.2
2024-06-19 17:07:54,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=25388.0, ans=0.125
2024-06-19 17:08:03,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0
2024-06-19 17:08:04,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.30 vs. limit=22.5
2024-06-19 17:08:05,413 INFO [train.py:1028] (1/2) Epoch 2, batch 3750, loss[loss=1.004, simple_loss=0.6104, pruned_loss=0.6989, over 12568.00 frames. ], tot_loss[loss=0.899, simple_loss=0.5771, pruned_loss=0.6104, over 2585924.78 frames. ], batch size: 22, lr: 2.37e-02, grad_scale: 0.125
2024-06-19 17:08:13,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=16.61 vs. limit=15.0
2024-06-19 17:08:20,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=35.52 vs. limit=15.0
2024-06-19 17:08:24,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25479.666666666668, ans=0.1
2024-06-19 17:08:38,372 INFO [train.py:1028] (1/2) Epoch 2, batch 3800, loss[loss=0.9423, simple_loss=0.5988, pruned_loss=0.6429, over 13209.00 frames. ], tot_loss[loss=0.9019, simple_loss=0.5778, pruned_loss=0.613, over 2583915.94 frames. ], batch size: 83, lr: 2.36e-02, grad_scale: 0.25
2024-06-19 17:08:46,832 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0
2024-06-19 17:08:53,128 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.455e+03 7.849e+03 9.486e+03 1.301e+04 3.817e+04, threshold=1.897e+04, percent-clipped=21.0
2024-06-19 17:08:55,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=25553.0, ans=0.2
2024-06-19 17:08:56,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=25553.0, ans=0.125
2024-06-19 17:08:58,884 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0
2024-06-19 17:08:59,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.26 vs. limit=15.0
2024-06-19 17:09:06,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=25589.666666666668, ans=0.125
2024-06-19 17:09:06,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.53 vs. limit=15.0
2024-06-19 17:09:12,236 INFO [train.py:1028] (1/2) Epoch 2, batch 3850, loss[loss=0.8385, simple_loss=0.5587, pruned_loss=0.5592, over 12988.00 frames. ], tot_loss[loss=0.903, simple_loss=0.577, pruned_loss=0.6145, over 2582901.45 frames. ], batch size: 144, lr: 2.36e-02, grad_scale: 0.25
2024-06-19 17:09:13,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=25608.0, ans=0.125
2024-06-19 17:09:18,292 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.24 vs. limit=15.0
2024-06-19 17:09:24,995 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.20 vs. limit=15.0
2024-06-19 17:09:29,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=22.04 vs. limit=15.0
2024-06-19 17:09:34,142 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.89 vs. limit=22.5
2024-06-19 17:09:36,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=25663.0, ans=0.0
2024-06-19 17:09:37,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=25663.0, ans=6.0
2024-06-19 17:09:39,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=25663.0, ans=0.025
2024-06-19 17:09:46,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=25681.333333333332, ans=0.125
2024-06-19 17:09:47,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=25681.333333333332, ans=0.04949747468305833
2024-06-19 17:09:48,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=25699.666666666668, ans=0.125
2024-06-19 17:09:49,221 INFO [train.py:1028] (1/2) Epoch 2, batch 3900, loss[loss=0.845, simple_loss=0.551, pruned_loss=0.5695, over 13218.00 frames. ], tot_loss[loss=0.8976, simple_loss=0.5755, pruned_loss=0.6099, over 2586329.06 frames. ], batch size: 83, lr: 2.36e-02, grad_scale: 0.25
2024-06-19 17:10:07,964 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.584e+03 6.112e+03 8.278e+03 1.239e+04 5.655e+04, threshold=1.656e+04, percent-clipped=6.0
2024-06-19 17:10:11,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=25754.666666666668, ans=0.07
2024-06-19 17:10:12,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=25754.666666666668, ans=0.2
2024-06-19 17:10:15,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=25754.666666666668, ans=0.5
2024-06-19 17:10:18,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.88 vs. limit=15.0
2024-06-19 17:10:23,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=25773.0, ans=0.125
2024-06-19 17:10:25,648 INFO [train.py:1028] (1/2) Epoch 2, batch 3950, loss[loss=0.7869, simple_loss=0.5216, pruned_loss=0.5261, over 13129.00 frames. ], tot_loss[loss=0.8939, simple_loss=0.5738, pruned_loss=0.607, over 2587911.41 frames. ], batch size: 132, lr: 2.35e-02, grad_scale: 0.125
2024-06-19 17:10:25,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=25791.333333333332, ans=0.0
2024-06-19 17:10:32,378 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.82 vs. limit=15.0
2024-06-19 17:10:42,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=25828.0, ans=0.025
2024-06-19 17:10:58,274 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.02 vs. limit=22.5
2024-06-19 17:11:00,022 INFO [train.py:1028] (1/2) Epoch 2, batch 4000, loss[loss=0.8959, simple_loss=0.5663, pruned_loss=0.6127, over 13002.00 frames. ], tot_loss[loss=0.8927, simple_loss=0.5735, pruned_loss=0.606, over 2583028.18 frames. ], batch size: 39, lr: 2.35e-02, grad_scale: 0.25
2024-06-19 17:11:01,250 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.59 vs. limit=15.0
2024-06-19 17:11:01,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=12.0
2024-06-19 17:11:02,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=19.61 vs. limit=15.0
2024-06-19 17:11:05,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.52 vs. limit=22.5
2024-06-19 17:11:16,508 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.331e+03 6.234e+03 8.336e+03 1.220e+04 4.733e+04, threshold=1.667e+04, percent-clipped=10.0
2024-06-19 17:11:17,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=25919.666666666668, ans=0.005234855072463768
2024-06-19 17:11:28,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.66 vs. limit=22.5
2024-06-19 17:11:37,183 INFO [train.py:1028] (1/2) Epoch 2, batch 4050, loss[loss=0.774, simple_loss=0.5268, pruned_loss=0.5106, over 11109.00 frames. ], tot_loss[loss=0.8914, simple_loss=0.573, pruned_loss=0.6049, over 2581406.04 frames. ], batch size: 304, lr: 2.35e-02, grad_scale: 0.25
2024-06-19 17:11:38,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=19.16 vs. limit=15.0
2024-06-19 17:11:41,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=25974.666666666668, ans=0.125
2024-06-19 17:11:41,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25974.666666666668, ans=0.1
2024-06-19 17:11:43,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=25993.0, ans=0.125
2024-06-19 17:11:44,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=25993.0, ans=0.1
2024-06-19 17:11:49,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=25993.0, ans=0.125
2024-06-19 17:11:51,983 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.37 vs. limit=15.0
2024-06-19 17:11:53,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26011.333333333332, ans=0.0
2024-06-19 17:11:57,919 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.08 vs. limit=10.0
2024-06-19 17:12:06,248 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=30.62 vs. limit=22.5
2024-06-19 17:12:07,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.23 vs. limit=6.0
2024-06-19 17:12:09,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=26048.0, ans=0.015
2024-06-19 17:12:09,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=26048.0, ans=0.2
2024-06-19 17:12:14,332 INFO [train.py:1028] (1/2) Epoch 2, batch 4100, loss[loss=0.8687, simple_loss=0.5648, pruned_loss=0.5863, over 13040.00 frames. ], tot_loss[loss=0.892, simple_loss=0.573, pruned_loss=0.6055, over 2577400.06 frames. ], batch size: 102, lr: 2.34e-02, grad_scale: 0.5
2024-06-19 17:12:16,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=26066.333333333332, ans=0.125
2024-06-19 17:12:20,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.92 vs. limit=5.0
2024-06-19 17:12:24,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=26084.666666666668, ans=0.025
2024-06-19 17:12:30,939 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.681e+03 6.194e+03 7.838e+03 1.003e+04 3.680e+04, threshold=1.568e+04, percent-clipped=6.0
2024-06-19 17:12:32,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=26103.0, ans=0.125
2024-06-19 17:12:45,988 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=21.94 vs. limit=15.0
2024-06-19 17:12:49,107 INFO [train.py:1028] (1/2) Epoch 2, batch 4150, loss[loss=0.8556, simple_loss=0.5421, pruned_loss=0.5845, over 13103.00 frames. ], tot_loss[loss=0.8921, simple_loss=0.5723, pruned_loss=0.6059, over 2575029.57 frames. ], batch size: 55, lr: 2.34e-02, grad_scale: 0.125
2024-06-19 17:12:58,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.45 vs. limit=15.0
2024-06-19 17:12:59,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=26176.333333333332, ans=0.125
2024-06-19 17:13:06,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=12.0
2024-06-19 17:13:13,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.48 vs. limit=15.0
2024-06-19 17:13:13,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=26213.0, ans=0.0
2024-06-19 17:13:16,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=26231.333333333332, ans=0.005167101449275363
2024-06-19 17:13:17,785 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=32.45 vs. limit=15.0
2024-06-19 17:13:21,960 INFO [train.py:1028] (1/2) Epoch 2, batch 4200, loss[loss=0.8716, simple_loss=0.5662, pruned_loss=0.5885, over 13171.00 frames. ], tot_loss[loss=0.8882, simple_loss=0.5707, pruned_loss=0.6029, over 2578013.37 frames. ], batch size: 103, lr: 2.34e-02, grad_scale: 0.25
2024-06-19 17:13:21,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=26249.666666666668, ans=0.125
2024-06-19 17:13:22,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=26249.666666666668, ans=0.2
2024-06-19 17:13:27,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=34.81 vs. limit=15.0
2024-06-19 17:13:37,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=14.78 vs. limit=15.0
2024-06-19 17:13:38,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=26286.333333333332, ans=0.125
2024-06-19 17:13:39,130 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.71 vs. limit=15.0
2024-06-19 17:13:41,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=26286.333333333332, ans=0.2
2024-06-19 17:13:41,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.46 vs. limit=12.0
2024-06-19 17:13:42,032 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.212e+03 6.757e+03 8.192e+03 1.010e+04 4.087e+04, threshold=1.638e+04, percent-clipped=5.0
2024-06-19 17:13:42,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=26286.333333333332, ans=0.125
2024-06-19 17:13:50,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26304.666666666668, ans=0.1
2024-06-19 17:13:51,344 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.62 vs. limit=6.0
2024-06-19 17:13:55,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=26323.0, ans=0.125
2024-06-19 17:13:56,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.57 vs. limit=15.0
2024-06-19 17:13:58,220 INFO [train.py:1028] (1/2) Epoch 2, batch 4250, loss[loss=0.9375, simple_loss=0.6009, pruned_loss=0.6371, over 13290.00 frames. ], tot_loss[loss=0.8864, simple_loss=0.5695, pruned_loss=0.6016, over 2580218.96 frames. ], batch size: 46, lr: 2.33e-02, grad_scale: 0.25
2024-06-19 17:13:59,838 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.41 vs. limit=22.5
2024-06-19 17:14:01,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=26341.333333333332, ans=0.2
2024-06-19 17:14:13,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=26378.0, ans=0.025
2024-06-19 17:14:18,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=26378.0, ans=0.2
2024-06-19 17:14:21,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=36.44 vs. limit=15.0
2024-06-19 17:14:22,713 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.46 vs. limit=15.0
2024-06-19 17:14:30,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.52 vs. limit=6.0
2024-06-19 17:14:34,574 INFO [train.py:1028] (1/2) Epoch 2, batch 4300, loss[loss=0.8886, simple_loss=0.565, pruned_loss=0.6061, over 13186.00 frames. ], tot_loss[loss=0.8862, simple_loss=0.5689, pruned_loss=0.6018, over 2580612.26 frames. ], batch size: 59, lr: 2.33e-02, grad_scale: 0.5
2024-06-19 17:14:47,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26469.666666666668, ans=0.1
2024-06-19 17:14:50,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=26469.666666666668, ans=0.025
2024-06-19 17:14:51,763 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.168e+03 6.667e+03 7.842e+03 1.035e+04 5.575e+04, threshold=1.568e+04, percent-clipped=11.0
2024-06-19 17:14:54,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=26488.0, ans=0.125
2024-06-19 17:14:54,273 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=20.39 vs. limit=15.0
2024-06-19 17:14:55,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=26488.0, ans=0.005111304347826087
2024-06-19 17:14:57,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=26488.0, ans=0.125
2024-06-19 17:15:00,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=26506.333333333332, ans=0.125
2024-06-19 17:15:03,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=26506.333333333332, ans=0.125
2024-06-19 17:15:04,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=26506.333333333332, ans=0.2
2024-06-19 17:15:07,460 INFO [train.py:1028] (1/2) Epoch 2, batch 4350, loss[loss=0.8823, simple_loss=0.5684, pruned_loss=0.5981, over 13199.00 frames. ], tot_loss[loss=0.8815, simple_loss=0.5668, pruned_loss=0.5981, over 2585616.77 frames. ], batch size: 59, lr: 2.33e-02, grad_scale: 0.125
2024-06-19 17:15:11,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=26524.666666666668, ans=0.125
2024-06-19 17:15:13,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=26524.666666666668, ans=0.07
2024-06-19 17:15:14,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=26543.0, ans=0.0
2024-06-19 17:15:22,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.75 vs. limit=10.0
2024-06-19 17:15:24,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=26561.333333333332, ans=0.0
2024-06-19 17:15:35,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=26579.666666666668, ans=0.025
2024-06-19 17:15:37,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=26598.0, ans=0.2
2024-06-19 17:15:37,760 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=2.973e+00
2024-06-19 17:15:40,760 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.62 vs. limit=15.0
2024-06-19 17:15:44,628 INFO [train.py:1028] (1/2) Epoch 2, batch 4400, loss[loss=0.8206, simple_loss=0.5384, pruned_loss=0.5514, over 13219.00 frames. ], tot_loss[loss=0.8802, simple_loss=0.5671, pruned_loss=0.5967, over 2585508.13 frames. ], batch size: 83, lr: 2.33e-02, grad_scale: 0.25
2024-06-19 17:15:46,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=26616.333333333332, ans=0.125
2024-06-19 17:15:55,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=26634.666666666668, ans=0.05
2024-06-19 17:15:57,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=26653.0, ans=0.04949747468305833
2024-06-19 17:16:00,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=26653.0, ans=0.125
2024-06-19 17:16:00,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=26653.0, ans=0.0
2024-06-19 17:16:01,185 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.66 vs. limit=12.0
2024-06-19 17:16:03,569 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.691e+03 8.130e+03 9.748e+03 1.231e+04 6.746e+04, threshold=1.950e+04, percent-clipped=14.0
2024-06-19 17:16:05,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=15.0
2024-06-19 17:16:10,090 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.20 vs. limit=22.5
2024-06-19 17:16:10,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=26671.333333333332, ans=0.2
2024-06-19 17:16:10,805 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.96 vs. limit=6.0
2024-06-19 17:16:15,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.46 vs. limit=15.0
2024-06-19 17:16:17,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=27.48 vs. limit=15.0
2024-06-19 17:16:19,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0
2024-06-19 17:16:21,632 INFO [train.py:1028] (1/2) Epoch 2, batch 4450, loss[loss=0.9068, simple_loss=0.5766, pruned_loss=0.6185, over 12930.00 frames. ], tot_loss[loss=0.8801, simple_loss=0.5677, pruned_loss=0.5963, over 2581119.09 frames. ], batch size: 33, lr: 2.32e-02, grad_scale: 0.125
2024-06-19 17:16:24,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.74 vs. limit=15.0
2024-06-19 17:16:27,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=26726.333333333332, ans=0.0
2024-06-19 17:16:36,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=26744.666666666668, ans=0.0
2024-06-19 17:16:41,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.39 vs. limit=10.0
2024-06-19 17:16:41,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=26763.0, ans=0.125
2024-06-19 17:16:48,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=26781.333333333332, ans=0.005047536231884058
2024-06-19 17:16:51,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=26781.333333333332, ans=0.125
2024-06-19 17:16:54,992 INFO [train.py:1028] (1/2) Epoch 2, batch 4500, loss[loss=0.8393, simple_loss=0.5395, pruned_loss=0.5695, over 13259.00 frames. ], tot_loss[loss=0.8798, simple_loss=0.5675, pruned_loss=0.5961, over 2584910.40 frames. ], batch size: 89, lr: 2.32e-02, grad_scale: 0.25
2024-06-19 17:16:56,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=26799.666666666668, ans=0.1
2024-06-19 17:16:56,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=26799.666666666668, ans=0.2
2024-06-19 17:17:01,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=38.50 vs. limit=15.0
2024-06-19 17:17:05,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=26818.0, ans=0.005039565217391305
2024-06-19 17:17:11,847 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.65 vs. limit=15.0
2024-06-19 17:17:13,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=26836.333333333332, ans=0.025
2024-06-19 17:17:13,805 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.73 vs. limit=15.0
2024-06-19 17:17:14,733 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.863e+03 9.925e+03 1.187e+04 1.777e+04 1.221e+05, threshold=2.375e+04, percent-clipped=23.0
2024-06-19 17:17:15,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=26854.666666666668, ans=0.0
2024-06-19 17:17:21,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. limit=6.0
2024-06-19 17:17:22,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=26873.0, ans=0.125
2024-06-19 17:17:22,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=38.15 vs. limit=15.0
2024-06-19 17:17:26,801 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=1.039e-02
2024-06-19 17:17:29,365 INFO [train.py:1028] (1/2) Epoch 2, batch 4550, loss[loss=0.869, simple_loss=0.547, pruned_loss=0.5955, over 13265.00 frames. ], tot_loss[loss=0.8792, simple_loss=0.5665, pruned_loss=0.596, over 2588303.53 frames. ], batch size: 52, lr: 2.32e-02, grad_scale: 0.25
2024-06-19 17:17:37,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=26891.333333333332, ans=0.0
2024-06-19 17:17:42,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=26909.666666666668, ans=0.2
2024-06-19 17:17:46,883 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.55 vs. limit=22.5
2024-06-19 17:18:02,002 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.94 vs. limit=15.0
2024-06-19 17:18:02,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.85 vs. limit=15.0
2024-06-19 17:18:05,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.04 vs. limit=15.0
2024-06-19 17:18:06,127 INFO [train.py:1028] (1/2) Epoch 2, batch 4600, loss[loss=0.8966, simple_loss=0.5852, pruned_loss=0.604, over 12565.00 frames. ], tot_loss[loss=0.8841, simple_loss=0.5682, pruned_loss=0.6, over 2584997.89 frames. ], batch size: 202, lr: 2.31e-02, grad_scale: 0.5
2024-06-19 17:18:06,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26983.0, ans=0.0
2024-06-19 17:18:07,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=26983.0, ans=0.1
2024-06-19 17:18:08,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.43 vs. limit=15.0
2024-06-19 17:18:16,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=15.0
2024-06-19 17:18:27,160 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=5.262e-03
2024-06-19 17:18:27,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=6.0
2024-06-19 17:18:28,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27038.0, ans=0.1
2024-06-19 17:18:30,211 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.985e+03 6.616e+03 8.756e+03 1.168e+04 5.373e+04, threshold=1.751e+04, percent-clipped=6.0
2024-06-19 17:18:34,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27038.0, ans=0.1
2024-06-19 17:18:40,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.22 vs. limit=15.0
2024-06-19 17:18:42,212 INFO [train.py:1028] (1/2) Epoch 2, batch 4650, loss[loss=0.8073, simple_loss=0.5278, pruned_loss=0.5434, over 13144.00 frames. ], tot_loss[loss=0.8831, simple_loss=0.5672, pruned_loss=0.5995, over 2588184.76 frames. ], batch size: 132, lr: 2.31e-02, grad_scale: 0.125
2024-06-19 17:18:42,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.12 vs. limit=15.0
2024-06-19 17:18:43,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=27074.666666666668, ans=0.125
2024-06-19 17:18:45,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=27074.666666666668, ans=0.125
2024-06-19 17:18:55,061 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=22.5
2024-06-19 17:19:10,684 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.81 vs. limit=22.5
2024-06-19 17:19:13,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=31.38 vs. limit=22.5
2024-06-19 17:19:15,934 INFO [train.py:1028] (1/2) Epoch 2, batch 4700, loss[loss=0.9987, simple_loss=0.622, pruned_loss=0.6877, over 12352.00 frames. ], tot_loss[loss=0.8832, simple_loss=0.5674, pruned_loss=0.5995, over 2581766.83 frames. ], batch size: 25, lr: 2.31e-02, grad_scale: 0.25
2024-06-19 17:19:22,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=27184.666666666668, ans=0.125
2024-06-19 17:19:23,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=27184.666666666668, ans=0.2
2024-06-19 17:19:31,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=27203.0, ans=0.5
2024-06-19 17:19:32,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=27203.0, ans=0.125
2024-06-19 17:19:39,771 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.452e+03 7.474e+03 9.848e+03 1.235e+04 3.821e+04, threshold=1.970e+04, percent-clipped=6.0
2024-06-19 17:19:40,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=27221.333333333332, ans=0.5
2024-06-19 17:19:46,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=27239.666666666668, ans=0.2
2024-06-19 17:19:46,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=27239.666666666668, ans=0.025
2024-06-19 17:19:47,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.88 vs. limit=15.0
2024-06-19 17:19:52,364 INFO [train.py:1028] (1/2) Epoch 2, batch 4750, loss[loss=0.8694, simple_loss=0.5852, pruned_loss=0.5768, over 12545.00 frames. ], tot_loss[loss=0.8797, simple_loss=0.5662, pruned_loss=0.5966, over 2578817.12 frames. ], batch size: 202, lr: 2.30e-02, grad_scale: 0.0625
2024-06-19 17:20:01,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=27276.333333333332, ans=0.125
2024-06-19 17:20:01,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=27276.333333333332, ans=0.2
2024-06-19 17:20:10,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.56 vs. limit=15.0
2024-06-19 17:20:15,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=27313.0, ans=0.015
2024-06-19 17:20:16,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=27313.0, ans=0.125
2024-06-19 17:20:26,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.41 vs. limit=15.0
2024-06-19 17:20:29,245 INFO [train.py:1028] (1/2) Epoch 2, batch 4800, loss[loss=0.8975, simple_loss=0.5725, pruned_loss=0.6113, over 13273.00 frames. ], tot_loss[loss=0.8803, simple_loss=0.5667, pruned_loss=0.5969, over 2576304.96 frames. ], batch size: 63, lr: 2.30e-02, grad_scale: 0.125
2024-06-19 17:20:29,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=23.73 vs. limit=22.5
2024-06-19 17:20:34,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=27349.666666666668, ans=0.2
2024-06-19 17:20:39,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.43 vs. limit=22.5
2024-06-19 17:20:51,173 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.322e+03 9.685e+03 1.241e+04 1.513e+04 6.193e+04, threshold=2.482e+04, percent-clipped=12.0
2024-06-19 17:20:52,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=27404.666666666668, ans=0.0
2024-06-19 17:21:02,362 INFO [train.py:1028] (1/2) Epoch 2, batch 4850, loss[loss=0.8465, simple_loss=0.5492, pruned_loss=0.5719, over 13245.00 frames. ], tot_loss[loss=0.8783, simple_loss=0.5658, pruned_loss=0.5954, over 2574436.30 frames. ], batch size: 89, lr: 2.30e-02, grad_scale: 0.125
2024-06-19 17:21:04,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.75 vs. limit=22.5
2024-06-19 17:21:04,721 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.21 vs. limit=12.0
2024-06-19 17:21:08,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=27459.666666666668, ans=0.2
2024-06-19 17:21:10,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=27459.666666666668, ans=0.125
2024-06-19 17:21:19,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.53 vs. limit=15.0
2024-06-19 17:21:19,850 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=15.0
2024-06-19 17:21:22,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=27496.333333333332, ans=0.0
2024-06-19 17:21:30,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=27496.333333333332, ans=0.95
2024-06-19 17:21:31,899 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.76 vs. limit=15.0
2024-06-19 17:21:32,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=27514.666666666668, ans=0.07
2024-06-19 17:21:38,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.88 vs. limit=10.0
2024-06-19 17:21:39,877 INFO [train.py:1028] (1/2) Epoch 2, batch 4900, loss[loss=0.8918, simple_loss=0.5625, pruned_loss=0.6105, over 13222.00 frames. ], tot_loss[loss=0.8793, simple_loss=0.5659, pruned_loss=0.5964, over 2574793.53 frames. ], batch size: 59, lr: 2.29e-02, grad_scale: 0.125
2024-06-19 17:21:45,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=27533.0, ans=0.0
2024-06-19 17:21:59,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.67 vs. limit=10.0
2024-06-19 17:22:03,572 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.661e+03 7.110e+03 9.637e+03 1.218e+04 7.815e+04, threshold=1.927e+04, percent-clipped=6.0
2024-06-19 17:22:16,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=27624.666666666668, ans=0.004864202898550725
2024-06-19 17:22:17,348 INFO [train.py:1028] (1/2) Epoch 2, batch 4950, loss[loss=0.8727, simple_loss=0.5881, pruned_loss=0.5786, over 11034.00 frames. ], tot_loss[loss=0.8778, simple_loss=0.5661, pruned_loss=0.5948, over 2568967.72 frames. ], batch size: 304, lr: 2.29e-02, grad_scale: 0.125
2024-06-19 17:22:18,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=27624.666666666668, ans=0.125
2024-06-19 17:22:26,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=27643.0, ans=0.004860217391304348
2024-06-19 17:22:27,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=26.64 vs. limit=12.0
2024-06-19 17:22:32,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=27661.333333333332, ans=0.004856231884057972
2024-06-19 17:22:38,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=27679.666666666668, ans=22.5
2024-06-19 17:22:45,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=27698.0, ans=0.125
2024-06-19 17:22:49,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=15.0
2024-06-19 17:22:49,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=27698.0, ans=0.125
2024-06-19 17:22:50,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=27716.333333333332, ans=0.125
2024-06-19 17:22:50,959 INFO [train.py:1028] (1/2) Epoch 2, batch 5000, loss[loss=0.8743, simple_loss=0.5546, pruned_loss=0.5969, over 13120.00 frames. ], tot_loss[loss=0.8787, simple_loss=0.5656, pruned_loss=0.5959, over 2572807.33 frames. ], batch size: 95, lr: 2.29e-02, grad_scale: 0.25
2024-06-19 17:22:52,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.92 vs. limit=10.0
2024-06-19 17:22:59,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=27734.666666666668, ans=0.015
2024-06-19 17:23:04,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.23 vs. limit=10.0
2024-06-19 17:23:11,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=27771.333333333332, ans=0.125
2024-06-19 17:23:14,579 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.824e+03 8.419e+03 1.067e+04 1.248e+04 2.932e+04, threshold=2.134e+04, percent-clipped=2.0
2024-06-19 17:23:21,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27789.666666666668, ans=0.1
2024-06-19 17:23:25,278 INFO [train.py:1028] (1/2) Epoch 2, batch 5050, loss[loss=0.8644, simple_loss=0.5431, pruned_loss=0.5928, over 12909.00 frames. ], tot_loss[loss=0.8836, simple_loss=0.5665, pruned_loss=0.6003, over 2572313.17 frames. ], batch size: 36, lr: 2.28e-02, grad_scale: 0.125
2024-06-19 17:23:26,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=39.46 vs. limit=15.0
2024-06-19 17:23:27,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=27808.0, ans=0.2
2024-06-19 17:23:30,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=27808.0, ans=0.125
2024-06-19 17:23:57,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=27881.333333333332, ans=0.125
2024-06-19 17:23:58,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=27881.333333333332, ans=0.125
2024-06-19 17:23:59,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27881.333333333332, ans=0.1
2024-06-19 17:24:02,638 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.31 vs. limit=22.5
2024-06-19 17:24:02,888 INFO [train.py:1028] (1/2) Epoch 2, batch 5100, loss[loss=0.9223, simple_loss=0.5874, pruned_loss=0.6286, over 12978.00 frames. ], tot_loss[loss=0.881, simple_loss=0.5664, pruned_loss=0.5978, over 2568528.88 frames. ], batch size: 39, lr: 2.28e-02, grad_scale: 0.125
2024-06-19 17:24:04,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=33.70 vs. limit=15.0
2024-06-19 17:24:17,988 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=19.81 vs. limit=15.0
2024-06-19 17:24:22,535 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.65 vs. limit=15.0
2024-06-19 17:24:30,717 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.514e+03 8.316e+03 1.075e+04 1.479e+04 1.402e+05, threshold=2.150e+04, percent-clipped=15.0
2024-06-19 17:24:32,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=27954.666666666668, ans=0.125
2024-06-19 17:24:32,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=27973.0, ans=0.125
2024-06-19 17:24:39,856 INFO [train.py:1028] (1/2) Epoch 2, batch 5150, loss[loss=0.7654, simple_loss=0.5067, pruned_loss=0.512, over 13138.00 frames. ], tot_loss[loss=0.8773, simple_loss=0.5647, pruned_loss=0.5949, over 2571741.11 frames. ], batch size: 132, lr: 2.28e-02, grad_scale: 0.125
2024-06-19 17:24:45,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=27991.333333333332, ans=0.125
2024-06-19 17:24:47,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=28009.666666666668, ans=0.2
2024-06-19 17:24:49,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=28009.666666666668, ans=0.025
2024-06-19 17:24:52,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.14 vs. limit=22.5
2024-06-19 17:24:53,436 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 17:24:57,305 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=12.0
2024-06-19 17:24:57,396 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.15 vs. limit=15.0
2024-06-19 17:25:04,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=28046.333333333332, ans=0.125
2024-06-19 17:25:13,552 INFO [train.py:1028] (1/2) Epoch 2, batch 5200, loss[loss=0.83, simple_loss=0.5468, pruned_loss=0.5566, over 13133.00 frames. ], tot_loss[loss=0.8763, simple_loss=0.5642, pruned_loss=0.5942, over 2575109.96 frames. ], batch size: 95, lr: 2.28e-02, grad_scale: 0.25
2024-06-19 17:25:30,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=28119.666666666668, ans=0.125
2024-06-19 17:25:35,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=28119.666666666668, ans=0.0
2024-06-19 17:25:37,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.54 vs. limit=15.0
2024-06-19 17:25:39,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=28138.0, ans=0.125
2024-06-19 17:25:41,068 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.391e+03 7.121e+03 9.053e+03 1.078e+04 2.821e+04, threshold=1.811e+04, percent-clipped=3.0
2024-06-19 17:25:43,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=28156.333333333332, ans=0.0
2024-06-19 17:25:43,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=28156.333333333332, ans=0.0
2024-06-19 17:25:50,410 INFO [train.py:1028] (1/2) Epoch 2, batch 5250, loss[loss=0.9426, simple_loss=0.5916, pruned_loss=0.6469, over 13252.00 frames. ], tot_loss[loss=0.8778, simple_loss=0.5644, pruned_loss=0.5957, over 2572951.50 frames. ], batch size: 52, lr: 2.27e-02, grad_scale: 0.25
2024-06-19 17:25:58,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.98 vs. limit=10.0
2024-06-19 17:26:07,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=28211.333333333332, ans=0.0
2024-06-19 17:26:11,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=28229.666666666668, ans=0.125
2024-06-19 17:26:16,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0
2024-06-19 17:26:16,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28229.666666666668, ans=0.1
2024-06-19 17:26:17,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=17.89 vs. limit=12.0
2024-06-19 17:26:19,135 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.03 vs. limit=15.0
2024-06-19 17:26:25,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.75 vs. limit=15.0
2024-06-19 17:26:27,459 INFO [train.py:1028] (1/2) Epoch 2, batch 5300, loss[loss=0.8464, simple_loss=0.5568, pruned_loss=0.5679, over 13064.00 frames. ], tot_loss[loss=0.8779, simple_loss=0.564, pruned_loss=0.5959, over 2568510.48 frames. ], batch size: 144, lr: 2.27e-02, grad_scale: 0.5
2024-06-19 17:26:42,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=28303.0, ans=0.125
2024-06-19 17:26:43,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=28303.0, ans=0.0
2024-06-19 17:26:44,349 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.73 vs. limit=22.5
2024-06-19 17:26:48,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=28321.333333333332, ans=0.125
2024-06-19 17:26:53,285 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.446e+03 7.835e+03 9.752e+03 1.158e+04 3.315e+04, threshold=1.950e+04, percent-clipped=8.0
2024-06-19 17:27:01,693 INFO [train.py:1028] (1/2) Epoch 2, batch 5350, loss[loss=0.9854, simple_loss=0.6106, pruned_loss=0.6801, over 11269.00 frames. ], tot_loss[loss=0.873, simple_loss=0.5615, pruned_loss=0.5922, over 2574422.03 frames. ], batch size: 16, lr: 2.27e-02, grad_scale: 0.0625
2024-06-19 17:27:06,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=40.04 vs. limit=15.0
2024-06-19 17:27:14,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.52 vs. limit=15.0
2024-06-19 17:27:15,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=28394.666666666668, ans=0.125
2024-06-19 17:27:21,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.98 vs. limit=22.5
2024-06-19 17:27:21,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=28413.0, ans=0.0
2024-06-19 17:27:22,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=28413.0, ans=0.025
2024-06-19 17:27:29,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=28413.0, ans=0.125
2024-06-19 17:27:30,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=28413.0, ans=0.125
2024-06-19 17:27:34,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=28431.333333333332, ans=0.004688840579710145
2024-06-19 17:27:34,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=28431.333333333332, ans=0.125
2024-06-19 17:27:37,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.06 vs. limit=15.0
2024-06-19 17:27:37,836 INFO [train.py:1028] (1/2) Epoch 2, batch 5400, loss[loss=0.8574, simple_loss=0.5774, pruned_loss=0.5687, over 12195.00 frames. ], tot_loss[loss=0.8703, simple_loss=0.5613, pruned_loss=0.5896, over 2567172.56 frames. ], batch size: 240, lr: 2.26e-02, grad_scale: 0.125
2024-06-19 17:27:42,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=28449.666666666668, ans=15.0
2024-06-19 17:27:46,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.94 vs. limit=15.0
2024-06-19 17:27:49,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=28468.0, ans=0.125
2024-06-19 17:27:56,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=28486.333333333332, ans=0.125
2024-06-19 17:27:56,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=28486.333333333332, ans=0.0
2024-06-19 17:27:58,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=28504.666666666668, ans=0.125
2024-06-19 17:27:58,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=28504.666666666668, ans=0.2
2024-06-19 17:27:59,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=28504.666666666668, ans=0.0
2024-06-19 17:28:04,682 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.551e+03 7.963e+03 9.925e+03 1.254e+04 6.192e+04, threshold=1.985e+04, percent-clipped=7.0
2024-06-19 17:28:15,397 INFO [train.py:1028] (1/2) Epoch 2, batch 5450, loss[loss=0.8925, simple_loss=0.5656, pruned_loss=0.6096, over 12373.00 frames. ], tot_loss[loss=0.8706, simple_loss=0.5616, pruned_loss=0.5897, over 2570702.51 frames. ], batch size: 25, lr: 2.26e-02, grad_scale: 0.125
2024-06-19 17:28:16,318 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.588e-01
2024-06-19 17:28:20,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=28541.333333333332, ans=0.125
2024-06-19 17:28:28,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=28559.666666666668, ans=0.025
2024-06-19 17:28:42,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=28614.666666666668, ans=0.125
2024-06-19 17:28:45,820 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.69 vs. limit=15.0
2024-06-19 17:28:46,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=28614.666666666668, ans=0.125
2024-06-19 17:28:48,460 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=15.0
2024-06-19 17:28:49,382 INFO [train.py:1028] (1/2) Epoch 2, batch 5500, loss[loss=0.8297, simple_loss=0.5549, pruned_loss=0.5523, over 12215.00 frames. ], tot_loss[loss=0.8697, simple_loss=0.5616, pruned_loss=0.5889, over 2564472.94 frames. ], batch size: 241, lr: 2.26e-02, grad_scale: 0.25
2024-06-19 17:28:49,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=28.26 vs. limit=15.0
2024-06-19 17:28:50,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=28633.0, ans=0.1
2024-06-19 17:28:51,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=28633.0, ans=0.125
2024-06-19 17:29:01,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=28651.333333333332, ans=0.125
2024-06-19 17:29:04,003 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=47.29 vs. limit=15.0
2024-06-19 17:29:05,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=28669.666666666668, ans=0.125
2024-06-19 17:29:07,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28669.666666666668, ans=0.1
2024-06-19 17:29:08,961 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.58 vs. limit=15.0
2024-06-19 17:29:10,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.14 vs. limit=10.0
2024-06-19 17:29:14,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=28688.0, ans=0.00463304347826087
2024-06-19 17:29:15,962 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.573e+03 9.456e+03 1.166e+04 1.441e+04 6.858e+04, threshold=2.333e+04, percent-clipped=12.0
2024-06-19 17:29:20,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=28706.333333333332, ans=0.0
2024-06-19 17:29:21,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=28706.333333333332, ans=0.125
2024-06-19 17:29:26,847 INFO [train.py:1028] (1/2) Epoch 2, batch 5550, loss[loss=0.9102, simple_loss=0.5728, pruned_loss=0.6238, over 13226.00 frames. ], tot_loss[loss=0.8712, simple_loss=0.5615, pruned_loss=0.5904, over 2568392.96 frames. ], batch size: 43, lr: 2.26e-02, grad_scale: 0.25
2024-06-19 17:29:29,181 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.04 vs. limit=22.5
2024-06-19 17:29:38,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=20.46 vs. limit=15.0
2024-06-19 17:29:40,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=28761.333333333332, ans=0.125
2024-06-19 17:29:42,846 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=42.25 vs. limit=15.0
2024-06-19 17:29:44,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=28761.333333333332, ans=0.2
2024-06-19 17:29:45,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=28761.333333333332, ans=0.0
2024-06-19 17:29:56,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=28798.0, ans=0.125
2024-06-19 17:29:59,928 INFO [train.py:1028] (1/2) Epoch 2, batch 5600, loss[loss=0.7984, simple_loss=0.517, pruned_loss=0.5399, over 13246.00 frames. ], tot_loss[loss=0.8675, simple_loss=0.5597, pruned_loss=0.5877, over 2570172.87 frames. ], batch size: 89, lr: 2.25e-02, grad_scale: 0.5
2024-06-19 17:30:00,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=28816.333333333332, ans=0.2
2024-06-19 17:30:07,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=28834.666666666668, ans=0.025
2024-06-19 17:30:19,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=15.0
2024-06-19 17:30:23,633 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.69 vs. limit=22.5
2024-06-19 17:30:28,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=28871.333333333332, ans=0.2
2024-06-19 17:30:30,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.90 vs. limit=15.0
2024-06-19 17:30:32,008 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.180e+03 1.028e+04 1.357e+04 2.041e+04 6.376e+04, threshold=2.714e+04, percent-clipped=17.0
2024-06-19 17:30:38,498 INFO [train.py:1028] (1/2) Epoch 2, batch 5650, loss[loss=0.8807, simple_loss=0.5914, pruned_loss=0.585, over 12561.00 frames. ], tot_loss[loss=0.8696, simple_loss=0.5607, pruned_loss=0.5893, over 2575003.84 frames. ], batch size: 202, lr: 2.25e-02, grad_scale: 0.125
2024-06-19 17:30:46,894 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0
2024-06-19 17:30:46,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.92 vs. limit=15.0
2024-06-19 17:30:59,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.40 vs. limit=10.0
2024-06-19 17:31:02,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=28963.0, ans=0.0
2024-06-19 17:31:10,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=28981.333333333332, ans=0.0
2024-06-19 17:31:13,270 INFO [train.py:1028] (1/2) Epoch 2, batch 5700, loss[loss=0.9207, simple_loss=0.583, pruned_loss=0.6292, over 13272.00 frames. ], tot_loss[loss=0.8689, simple_loss=0.5602, pruned_loss=0.5888, over 2579374.73 frames. ], batch size: 63, lr: 2.25e-02, grad_scale: 0.25
2024-06-19 17:31:38,502 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.509e+00
2024-06-19 17:31:38,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.97 vs. limit=10.0
2024-06-19 17:31:44,364 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.609e+03 5.248e+03 7.904e+03 1.265e+04 5.551e+04, threshold=1.581e+04, percent-clipped=5.0
2024-06-19 17:31:50,187 INFO [train.py:1028] (1/2) Epoch 2, batch 5750, loss[loss=0.883, simple_loss=0.5799, pruned_loss=0.593, over 12756.00 frames. ], tot_loss[loss=0.8763, simple_loss=0.5639, pruned_loss=0.5943, over 2581149.82 frames. ], batch size: 176, lr: 2.24e-02, grad_scale: 0.25
2024-06-19 17:31:51,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. limit=15.0
2024-06-19 17:31:53,721 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.59 vs. limit=15.0
2024-06-19 17:32:05,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.48 vs. limit=15.0
2024-06-19 17:32:23,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29164.666666666668, ans=0.1
2024-06-19 17:32:26,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=29164.666666666668, ans=0.2
2024-06-19 17:32:28,798 INFO [train.py:1028] (1/2) Epoch 2, batch 5800, loss[loss=0.8918, simple_loss=0.5911, pruned_loss=0.5963, over 12744.00 frames. ], tot_loss[loss=0.8779, simple_loss=0.5652, pruned_loss=0.5953, over 2580131.73 frames. ], batch size: 176, lr: 2.24e-02, grad_scale: 0.5
2024-06-19 17:32:44,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=15.0
2024-06-19 17:32:56,839 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.321e+03 5.380e+03 6.532e+03 7.733e+03 3.613e+04, threshold=1.306e+04, percent-clipped=3.0
2024-06-19 17:32:58,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=29256.333333333332, ans=0.0
2024-06-19 17:32:59,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=29256.333333333332, ans=0.125
2024-06-19 17:33:01,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=23.75 vs. limit=15.0
2024-06-19 17:33:02,229 INFO [train.py:1028] (1/2) Epoch 2, batch 5850, loss[loss=0.9596, simple_loss=0.6421, pruned_loss=0.6386, over 12494.00 frames. ], tot_loss[loss=0.8833, simple_loss=0.5688, pruned_loss=0.5989, over 2576689.84 frames. ], batch size: 202, lr: 2.24e-02, grad_scale: 0.25
2024-06-19 17:33:12,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=29293.0, ans=0.0
2024-06-19 17:33:12,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=29293.0, ans=0.0045015217391304345
2024-06-19 17:33:19,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=29311.333333333332, ans=0.5
2024-06-19 17:33:23,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29329.666666666668, ans=0.1
2024-06-19 17:33:49,541 INFO [train.py:1028] (1/2) Epoch 2, batch 5900, loss[loss=0.8403, simple_loss=0.5552, pruned_loss=0.5627, over 13113.00 frames. ], tot_loss[loss=0.8905, simple_loss=0.5734, pruned_loss=0.6038, over 2577679.33 frames. ], batch size: 121, lr: 2.23e-02, grad_scale: 0.5
2024-06-19 17:33:49,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=29366.333333333332, ans=0.2
2024-06-19 17:33:58,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=29384.666666666668, ans=0.125
2024-06-19 17:34:19,044 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.046e+03 7.162e+03 8.402e+03 1.029e+04 3.384e+04, threshold=1.680e+04, percent-clipped=13.0
2024-06-19 17:34:23,190 INFO [train.py:1028] (1/2) Epoch 2, batch 5950, loss[loss=0.8504, simple_loss=0.5533, pruned_loss=0.5738, over 13122.00 frames. ], tot_loss[loss=0.8923, simple_loss=0.5749, pruned_loss=0.6049, over 2582071.49 frames. ], batch size: 121, lr: 2.23e-02, grad_scale: 0.125
2024-06-19 17:34:30,492 WARNING [optim.py:503] (1/2) Scaling gradients by 0.08075393736362457, model_norm_threshold=16804.384765625
2024-06-19 17:34:30,649 WARNING [optim.py:575] (1/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.29, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.248e+10, grad_sumsq=3.936e+12, orig_rms_sq=3.171e-03
2024-06-19 17:34:32,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29458.0, ans=0.1
2024-06-19 17:34:32,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=29476.333333333332, ans=0.125
2024-06-19 17:34:37,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=29476.333333333332, ans=0.125
2024-06-19 17:34:43,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=29494.666666666668, ans=0.2
2024-06-19 17:34:44,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=29494.666666666668, ans=0.125
2024-06-19 17:34:47,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.03 vs. limit=15.0
2024-06-19 17:34:55,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=29531.333333333332, ans=0.025
2024-06-19 17:35:00,317 INFO [train.py:1028] (1/2) Epoch 2, batch 6000, loss[loss=0.8864, simple_loss=0.5995, pruned_loss=0.5867, over 12106.00 frames. ], tot_loss[loss=0.8956, simple_loss=0.5769, pruned_loss=0.6071, over 2574964.92 frames. ], batch size: 240, lr: 2.23e-02, grad_scale: 0.0625
2024-06-19 17:35:00,318 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-19 17:35:08,186 INFO [train.py:1060] (1/2) Epoch 2, validation: loss=0.9699, simple_loss=0.6265, pruned_loss=0.6567, over 351949.00 frames.
2024-06-19 17:35:08,187 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 16946MB
2024-06-19 17:35:08,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=29549.666666666668, ans=0.125
2024-06-19 17:35:08,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=25.94 vs. limit=15.0
2024-06-19 17:35:15,541 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.24 vs. limit=15.0
2024-06-19 17:35:23,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.97 vs. limit=10.0
2024-06-19 17:35:40,094 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.962e+03 4.634e+03 5.657e+03 7.630e+03 2.081e+05, threshold=1.131e+04, percent-clipped=6.0
2024-06-19 17:35:42,719 INFO [train.py:1028] (1/2) Epoch 2, batch 6050, loss[loss=0.9028, simple_loss=0.5733, pruned_loss=0.6161, over 12916.00 frames. ], tot_loss[loss=0.9027, simple_loss=0.5811, pruned_loss=0.6122, over 2577835.22 frames. ], batch size: 39, lr: 2.23e-02, grad_scale: 0.0625
2024-06-19 17:35:43,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.73 vs. limit=15.0
2024-06-19 17:35:50,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=29641.333333333332, ans=0.125
2024-06-19 17:35:51,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=19.16 vs. limit=15.0
2024-06-19 17:35:52,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=21.25 vs. limit=15.0
2024-06-19 17:36:01,418 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.32 vs. limit=15.0
2024-06-19 17:36:05,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.87 vs. limit=10.0
2024-06-19 17:36:06,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=15.81 vs. limit=15.0
2024-06-19 17:36:17,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.94 vs. limit=22.5
2024-06-19 17:36:19,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.43 vs. limit=22.5
2024-06-19 17:36:19,724 INFO [train.py:1028] (1/2) Epoch 2, batch 6100, loss[loss=0.863, simple_loss=0.5625, pruned_loss=0.5817, over 13094.00 frames. ], tot_loss[loss=0.9055, simple_loss=0.5828, pruned_loss=0.6141, over 2579526.12 frames. ], batch size: 121, lr: 2.22e-02, grad_scale: 0.125
2024-06-19 17:36:21,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=29733.0, ans=0.0
2024-06-19 17:36:25,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=24.27 vs. limit=15.0
2024-06-19 17:36:26,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=29751.333333333332, ans=0.125
2024-06-19 17:36:47,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=29788.0, ans=0.004393913043478261
2024-06-19 17:36:48,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=29788.0, ans=0.125
2024-06-19 17:36:51,835 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.88 vs. limit=22.5
2024-06-19 17:36:52,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=29806.333333333332, ans=0.125
2024-06-19 17:36:54,753 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.092e+03 5.735e+03 7.517e+03 9.589e+03 5.845e+04, threshold=1.503e+04, percent-clipped=17.0
2024-06-19 17:36:57,682 INFO [train.py:1028] (1/2) Epoch 2, batch 6150, loss[loss=0.8161, simple_loss=0.5521, pruned_loss=0.54, over 11020.00 frames. ], tot_loss[loss=0.9132, simple_loss=0.5873, pruned_loss=0.6195, over 2577313.98 frames. ], batch size: 304, lr: 2.22e-02, grad_scale: 0.125
2024-06-19 17:37:02,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=29824.666666666668, ans=0.125
2024-06-19 17:37:07,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=29843.0, ans=0.125
2024-06-19 17:37:10,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.13 vs. limit=10.0
2024-06-19 17:37:13,595 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.38 vs. limit=22.5
2024-06-19 17:37:14,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=29861.333333333332, ans=0.2
2024-06-19 17:37:20,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.91 vs. limit=15.0
2024-06-19 17:37:26,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=29898.0, ans=0.025
2024-06-19 17:37:32,178 INFO [train.py:1028] (1/2) Epoch 2, batch 6200, loss[loss=1.021, simple_loss=0.6682, pruned_loss=0.6867, over 13242.00 frames. ], tot_loss[loss=0.9181, simple_loss=0.5907, pruned_loss=0.6228, over 2577177.10 frames. ], batch size: 89, lr: 2.22e-02, grad_scale: 0.25
2024-06-19 17:37:33,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=29916.333333333332, ans=0.2
2024-06-19 17:37:34,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.80 vs. limit=10.0
2024-06-19 17:37:35,746 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=1.016e-02
2024-06-19 17:37:53,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.93 vs. limit=12.0
2024-06-19 17:37:59,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=29971.333333333332, ans=0.0
2024-06-19 17:38:00,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=29971.333333333332, ans=0.0
2024-06-19 17:38:00,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=25.36 vs. limit=15.0
2024-06-19 17:38:06,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=29989.666666666668, ans=0.125
2024-06-19 17:38:07,236 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.721e+03 5.123e+03 6.015e+03 7.826e+03 3.365e+04, threshold=1.203e+04, percent-clipped=1.0
2024-06-19 17:38:10,184 INFO [train.py:1028] (1/2) Epoch 2, batch 6250, loss[loss=0.9512, simple_loss=0.612, pruned_loss=0.6452, over 13218.00 frames. ], tot_loss[loss=0.921, simple_loss=0.593, pruned_loss=0.6245, over 2569759.32 frames. ], batch size: 83, lr: 2.22e-02, grad_scale: 0.25
2024-06-19 17:38:26,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=30044.666666666668, ans=0.125
2024-06-19 17:38:48,303 INFO [train.py:1028] (1/2) Epoch 2, batch 6300, loss[loss=0.9744, simple_loss=0.5877, pruned_loss=0.6806, over 11342.00 frames. ], tot_loss[loss=0.9278, simple_loss=0.5973, pruned_loss=0.6291, over 2565118.64 frames. ], batch size: 16, lr: 2.21e-02, grad_scale: 0.5
2024-06-19 17:39:03,696 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=15.90 vs. limit=15.0
2024-06-19 17:39:07,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=30136.333333333332, ans=0.004318188405797101
2024-06-19 17:39:07,948 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=3.971e-02
2024-06-19 17:39:20,167 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.025e+03 6.058e+03 8.330e+03 1.068e+04 3.495e+04, threshold=1.666e+04, percent-clipped=15.0
2024-06-19 17:39:22,224 INFO [train.py:1028] (1/2) Epoch 2, batch 6350, loss[loss=0.9518, simple_loss=0.6398, pruned_loss=0.6319, over 12537.00 frames. ], tot_loss[loss=0.9366, simple_loss=0.6015, pruned_loss=0.6358, over 2574631.92 frames. ], batch size: 202, lr: 2.21e-02, grad_scale: 0.25
2024-06-19 17:39:22,667 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=28.25 vs. limit=15.0
2024-06-19 17:39:23,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=30191.333333333332, ans=0.125
2024-06-19 17:39:25,360 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.36 vs. limit=22.5
2024-06-19 17:39:27,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=30191.333333333332, ans=0.0
2024-06-19 17:39:34,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=23.76 vs. limit=15.0
2024-06-19 17:39:40,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=30228.0, ans=15.0
2024-06-19 17:39:45,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.95 vs. limit=10.0
2024-06-19 17:39:55,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=30264.666666666668, ans=0.125
2024-06-19 17:39:59,425 INFO [train.py:1028] (1/2) Epoch 2, batch 6400, loss[loss=0.9492, simple_loss=0.6017, pruned_loss=0.6484, over 13207.00 frames. ], tot_loss[loss=0.946, simple_loss=0.6068, pruned_loss=0.6426, over 2575334.97 frames. ], batch size: 67, lr: 2.21e-02, grad_scale: 0.5
2024-06-19 17:40:04,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.92 vs. limit=22.5
2024-06-19 17:40:08,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=30301.333333333332, ans=0.00428231884057971
2024-06-19 17:40:09,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.24 vs. limit=15.0
2024-06-19 17:40:12,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=30319.666666666668, ans=0.025
2024-06-19 17:40:16,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=39.12 vs. limit=15.0
limit=15.0 2024-06-19 17:40:17,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.85 vs. limit=15.0 2024-06-19 17:40:19,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=30338.0, ans=0.07 2024-06-19 17:40:27,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.00 vs. limit=15.0 2024-06-19 17:40:28,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=30356.333333333332, ans=0.04949747468305833 2024-06-19 17:40:30,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.62 vs. limit=15.0 2024-06-19 17:40:30,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.20 vs. limit=12.0 2024-06-19 17:40:31,019 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.789e+03 5.308e+03 6.291e+03 7.965e+03 3.591e+04, threshold=1.258e+04, percent-clipped=3.0 2024-06-19 17:40:32,249 INFO [train.py:1028] (1/2) Epoch 2, batch 6450, loss[loss=0.9366, simple_loss=0.622, pruned_loss=0.6256, over 12635.00 frames. ], tot_loss[loss=0.9505, simple_loss=0.6095, pruned_loss=0.6457, over 2580772.14 frames. ], batch size: 202, lr: 2.20e-02, grad_scale: 0.25 2024-06-19 17:40:37,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=30374.666666666668, ans=0.09899494936611666 2024-06-19 17:40:47,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=30393.0, ans=0.0 2024-06-19 17:40:48,461 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=6.591e-02 2024-06-19 17:40:50,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=30411.333333333332, ans=6.0 2024-06-19 17:41:01,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=30448.0, ans=0.04949747468305833 2024-06-19 17:41:02,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=30448.0, ans=0.1 2024-06-19 17:41:06,378 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=12.0 2024-06-19 17:41:08,790 INFO [train.py:1028] (1/2) Epoch 2, batch 6500, loss[loss=0.8334, simple_loss=0.5706, pruned_loss=0.5481, over 10863.00 frames. ], tot_loss[loss=0.9544, simple_loss=0.6117, pruned_loss=0.6486, over 2584270.89 frames. 
], batch size: 303, lr: 2.20e-02, grad_scale: 0.25 2024-06-19 17:41:23,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=30503.0, ans=0.004238478260869565 2024-06-19 17:41:28,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=30521.333333333332, ans=0.125 2024-06-19 17:41:31,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=16.94 vs. limit=15.0 2024-06-19 17:41:40,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.05 vs. limit=6.0 2024-06-19 17:41:40,914 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.795e+03 5.977e+03 7.488e+03 1.002e+04 4.931e+04, threshold=1.498e+04, percent-clipped=13.0 2024-06-19 17:41:41,646 INFO [train.py:1028] (1/2) Epoch 2, batch 6550, loss[loss=1.03, simple_loss=0.6363, pruned_loss=0.712, over 12637.00 frames. ], tot_loss[loss=0.9611, simple_loss=0.6159, pruned_loss=0.6531, over 2588218.71 frames. ], batch size: 22, lr: 2.20e-02, grad_scale: 0.25 2024-06-19 17:41:42,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.73 vs. limit=22.5 2024-06-19 17:41:43,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.93 vs. limit=22.5 2024-06-19 17:41:45,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=22.03 vs. limit=15.0 2024-06-19 17:41:49,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=30576.333333333332, ans=0.004222536231884059 2024-06-19 17:41:49,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30576.333333333332, ans=0.1 2024-06-19 17:41:55,835 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.07 vs. limit=15.0 2024-06-19 17:42:02,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.92 vs. limit=10.0 2024-06-19 17:42:12,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=30631.333333333332, ans=0.0 2024-06-19 17:42:12,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=30631.333333333332, ans=0.0 2024-06-19 17:42:18,355 INFO [train.py:1028] (1/2) Epoch 2, batch 6600, loss[loss=0.9222, simple_loss=0.5865, pruned_loss=0.6289, over 13318.00 frames. ], tot_loss[loss=0.9632, simple_loss=0.617, pruned_loss=0.6547, over 2589940.10 frames. ], batch size: 72, lr: 2.20e-02, grad_scale: 0.25 2024-06-19 17:42:20,735 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=18.85 vs. 
limit=15.0 2024-06-19 17:42:24,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=30668.0, ans=0.0 2024-06-19 17:42:25,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=30668.0, ans=0.004202608695652174 2024-06-19 17:42:31,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=30668.0, ans=0.125 2024-06-19 17:42:33,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=30686.333333333332, ans=0.025 2024-06-19 17:42:50,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=30723.0, ans=0.5 2024-06-19 17:42:51,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=30723.0, ans=0.125 2024-06-19 17:42:52,791 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.407e+03 5.833e+03 6.857e+03 8.492e+03 7.723e+04, threshold=1.371e+04, percent-clipped=6.0 2024-06-19 17:42:52,822 INFO [train.py:1028] (1/2) Epoch 2, batch 6650, loss[loss=1.036, simple_loss=0.6718, pruned_loss=0.7, over 12971.00 frames. ], tot_loss[loss=0.9676, simple_loss=0.6196, pruned_loss=0.6578, over 2584691.25 frames. ], batch size: 158, lr: 2.19e-02, grad_scale: 0.25 2024-06-19 17:42:57,681 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.182e+02 2024-06-19 17:43:00,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=30741.333333333332, ans=0.1 2024-06-19 17:43:08,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=26.80 vs. limit=15.0 2024-06-19 17:43:21,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=15.0 2024-06-19 17:43:24,393 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=3.646e-02 2024-06-19 17:43:25,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=30814.666666666668, ans=0.0 2024-06-19 17:43:25,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=30814.666666666668, ans=0.125 2024-06-19 17:43:27,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=30814.666666666668, ans=0.0041707246376811585 2024-06-19 17:43:29,899 INFO [train.py:1028] (1/2) Epoch 2, batch 6700, loss[loss=0.9641, simple_loss=0.6395, pruned_loss=0.6443, over 12720.00 frames. ], tot_loss[loss=0.9696, simple_loss=0.6206, pruned_loss=0.6593, over 2583881.72 frames. ], batch size: 176, lr: 2.19e-02, grad_scale: 0.125 2024-06-19 17:43:34,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=7.72 vs. 
limit=6.0 2024-06-19 17:43:40,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=30851.333333333332, ans=0.125 2024-06-19 17:43:44,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.98 vs. limit=15.0 2024-06-19 17:43:46,578 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.468e+00 2024-06-19 17:43:47,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=30869.666666666668, ans=0.0 2024-06-19 17:43:50,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.11 vs. limit=15.0 2024-06-19 17:43:55,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=30888.0, ans=0.0 2024-06-19 17:44:00,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.41 vs. limit=15.0 2024-06-19 17:44:01,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.87 vs. limit=22.5 2024-06-19 17:44:06,683 INFO [train.py:1028] (1/2) Epoch 2, batch 6750, loss[loss=1.025, simple_loss=0.6808, pruned_loss=0.6847, over 12277.00 frames. ], tot_loss[loss=0.9729, simple_loss=0.6229, pruned_loss=0.6615, over 2578999.54 frames. ], batch size: 241, lr: 2.19e-02, grad_scale: 0.125 2024-06-19 17:44:08,020 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.388e+03 5.453e+03 7.214e+03 8.831e+03 9.283e+04, threshold=1.443e+04, percent-clipped=6.0 2024-06-19 17:44:16,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=30943.0, ans=0.125 2024-06-19 17:44:20,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=39.95 vs. limit=22.5 2024-06-19 17:44:32,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=30998.0, ans=0.2 2024-06-19 17:44:37,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=30998.0, ans=0.004130869565217391 2024-06-19 17:44:40,263 INFO [train.py:1028] (1/2) Epoch 2, batch 6800, loss[loss=0.9635, simple_loss=0.6004, pruned_loss=0.6633, over 13217.00 frames. ], tot_loss[loss=0.9761, simple_loss=0.6243, pruned_loss=0.6639, over 2580378.77 frames. ], batch size: 67, lr: 2.18e-02, grad_scale: 0.25 2024-06-19 17:44:44,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=31016.333333333332, ans=0.0 2024-06-19 17:44:49,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.02 vs. 
limit=22.5 2024-06-19 17:44:50,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31034.666666666668, ans=0.125 2024-06-19 17:44:54,402 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=28.76 vs. limit=22.5 2024-06-19 17:44:59,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=16.10 vs. limit=15.0 2024-06-19 17:45:00,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=31053.0, ans=0.05 2024-06-19 17:45:03,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=22.20 vs. limit=15.0 2024-06-19 17:45:08,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=26.11 vs. limit=15.0 2024-06-19 17:45:11,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.94 vs. limit=15.0 2024-06-19 17:45:17,398 INFO [train.py:1028] (1/2) Epoch 2, batch 6850, loss[loss=1.119, simple_loss=0.711, pruned_loss=0.7632, over 13244.00 frames. ], tot_loss[loss=0.9827, simple_loss=0.6269, pruned_loss=0.6693, over 2583674.83 frames. ], batch size: 63, lr: 2.18e-02, grad_scale: 0.125 2024-06-19 17:45:19,559 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.351e+03 6.614e+03 8.113e+03 1.195e+04 4.472e+04, threshold=1.623e+04, percent-clipped=12.0 2024-06-19 17:45:26,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=31126.333333333332, ans=0.125 2024-06-19 17:45:32,495 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.05 vs. limit=15.0 2024-06-19 17:45:32,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=15.0 2024-06-19 17:45:36,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=31144.666666666668, ans=0.0 2024-06-19 17:45:43,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=31163.0, ans=0.0 2024-06-19 17:45:45,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31181.333333333332, ans=0.125 2024-06-19 17:45:48,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31181.333333333332, ans=0.1 2024-06-19 17:45:51,700 INFO [train.py:1028] (1/2) Epoch 2, batch 6900, loss[loss=1.027, simple_loss=0.6464, pruned_loss=0.7035, over 13303.00 frames. ], tot_loss[loss=0.9838, simple_loss=0.6283, pruned_loss=0.6697, over 2585571.97 frames. 
], batch size: 49, lr: 2.18e-02, grad_scale: 0.25 2024-06-19 17:45:57,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=31218.0, ans=0.1 2024-06-19 17:46:00,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31218.0, ans=0.125 2024-06-19 17:46:06,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=31236.333333333332, ans=0.025 2024-06-19 17:46:08,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31236.333333333332, ans=0.1 2024-06-19 17:46:09,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.50 vs. limit=12.0 2024-06-19 17:46:10,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=31236.333333333332, ans=0.2 2024-06-19 17:46:25,754 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.65 vs. limit=15.0 2024-06-19 17:46:28,852 INFO [train.py:1028] (1/2) Epoch 2, batch 6950, loss[loss=1.014, simple_loss=0.623, pruned_loss=0.7028, over 10930.00 frames. ], tot_loss[loss=0.9851, simple_loss=0.629, pruned_loss=0.6706, over 2577977.56 frames. ], batch size: 16, lr: 2.18e-02, grad_scale: 0.25 2024-06-19 17:46:30,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=31291.333333333332, ans=0.0040671014492753635 2024-06-19 17:46:30,755 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.936e+03 8.290e+03 9.447e+03 1.194e+04 4.529e+04, threshold=1.889e+04, percent-clipped=17.0 2024-06-19 17:46:34,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=31291.333333333332, ans=0.09899494936611666 2024-06-19 17:46:35,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31309.666666666668, ans=0.1 2024-06-19 17:46:45,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=25.87 vs. limit=15.0 2024-06-19 17:46:51,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=31346.333333333332, ans=0.125 2024-06-19 17:46:58,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.13 vs. limit=22.5 2024-06-19 17:47:02,450 INFO [train.py:1028] (1/2) Epoch 2, batch 7000, loss[loss=0.9443, simple_loss=0.6224, pruned_loss=0.6331, over 12955.00 frames. ], tot_loss[loss=0.9865, simple_loss=0.6298, pruned_loss=0.6716, over 2574655.49 frames. 
], batch size: 158, lr: 2.17e-02, grad_scale: 0.5 2024-06-19 17:47:11,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31383.0, ans=0.125 2024-06-19 17:47:13,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=31401.333333333332, ans=0.004043188405797102 2024-06-19 17:47:29,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=31438.0, ans=0.004035217391304348 2024-06-19 17:47:41,394 INFO [train.py:1028] (1/2) Epoch 2, batch 7050, loss[loss=0.988, simple_loss=0.6612, pruned_loss=0.6574, over 12839.00 frames. ], tot_loss[loss=0.9916, simple_loss=0.6324, pruned_loss=0.6754, over 2582543.28 frames. ], batch size: 177, lr: 2.17e-02, grad_scale: 0.25 2024-06-19 17:47:44,125 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.072e+03 6.924e+03 8.022e+03 9.641e+03 3.370e+04, threshold=1.604e+04, percent-clipped=2.0 2024-06-19 17:47:51,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=40.83 vs. limit=22.5 2024-06-19 17:47:52,527 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=30.62 vs. limit=15.0 2024-06-19 17:47:59,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=31511.333333333332, ans=0.1 2024-06-19 17:48:04,127 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=1.653e-02 2024-06-19 17:48:05,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=31529.666666666668, ans=0.004015289855072463 2024-06-19 17:48:06,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31529.666666666668, ans=0.1 2024-06-19 17:48:14,571 INFO [train.py:1028] (1/2) Epoch 2, batch 7100, loss[loss=1.004, simple_loss=0.6531, pruned_loss=0.6774, over 13190.00 frames. ], tot_loss[loss=0.9896, simple_loss=0.6319, pruned_loss=0.6736, over 2575075.59 frames. ], batch size: 112, lr: 2.17e-02, grad_scale: 0.5 2024-06-19 17:48:14,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=31566.333333333332, ans=0.5 2024-06-19 17:48:25,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=31584.666666666668, ans=0.125 2024-06-19 17:48:28,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=31584.666666666668, ans=0.0 2024-06-19 17:48:28,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=31584.666666666668, ans=0.125 2024-06-19 17:48:29,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=31584.666666666668, ans=0.95 2024-06-19 17:48:41,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=20.75 vs. 
limit=15.0 2024-06-19 17:48:46,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.07 vs. limit=15.0 2024-06-19 17:48:49,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.99 vs. limit=15.0 2024-06-19 17:48:51,416 INFO [train.py:1028] (1/2) Epoch 2, batch 7150, loss[loss=0.99, simple_loss=0.6622, pruned_loss=0.6589, over 12550.00 frames. ], tot_loss[loss=0.9945, simple_loss=0.6347, pruned_loss=0.6771, over 2572820.32 frames. ], batch size: 202, lr: 2.17e-02, grad_scale: 0.125 2024-06-19 17:48:55,364 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.249e+03 6.596e+03 7.927e+03 9.449e+03 3.032e+04, threshold=1.585e+04, percent-clipped=5.0 2024-06-19 17:48:57,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31676.333333333332, ans=0.125 2024-06-19 17:49:07,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=31694.666666666668, ans=0.0 2024-06-19 17:49:15,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=31713.0, ans=0.003975434782608696 2024-06-19 17:49:27,194 INFO [train.py:1028] (1/2) Epoch 2, batch 7200, loss[loss=1, simple_loss=0.6623, pruned_loss=0.6692, over 13193.00 frames. ], tot_loss[loss=0.9958, simple_loss=0.6368, pruned_loss=0.6774, over 2578063.54 frames. ], batch size: 112, lr: 2.16e-02, grad_scale: 0.25 2024-06-19 17:49:27,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=31749.666666666668, ans=0.0 2024-06-19 17:49:34,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31768.0, ans=0.1 2024-06-19 17:49:49,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=31804.666666666668, ans=0.125 2024-06-19 17:49:51,622 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.29 vs. limit=22.5 2024-06-19 17:49:57,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-19 17:49:58,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=31823.0, ans=0.1 2024-06-19 17:50:00,930 INFO [train.py:1028] (1/2) Epoch 2, batch 7250, loss[loss=1.166, simple_loss=0.7173, pruned_loss=0.8075, over 12945.00 frames. ], tot_loss[loss=0.9992, simple_loss=0.6393, pruned_loss=0.6796, over 2579487.34 frames. 
], batch size: 36, lr: 2.16e-02, grad_scale: 0.125 2024-06-19 17:50:05,716 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.393e+03 5.949e+03 7.039e+03 8.681e+03 3.768e+04, threshold=1.408e+04, percent-clipped=4.0 2024-06-19 17:50:12,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=31859.666666666668, ans=15.0 2024-06-19 17:50:13,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31878.0, ans=0.1 2024-06-19 17:50:37,931 INFO [train.py:1028] (1/2) Epoch 2, batch 7300, loss[loss=0.9964, simple_loss=0.6177, pruned_loss=0.6876, over 13008.00 frames. ], tot_loss[loss=1.006, simple_loss=0.643, pruned_loss=0.6846, over 2579393.74 frames. ], batch size: 36, lr: 2.16e-02, grad_scale: 0.25 2024-06-19 17:50:41,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.04 vs. limit=15.0 2024-06-19 17:50:48,145 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.69 vs. limit=22.5 2024-06-19 17:50:56,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=31969.666666666668, ans=0.125 2024-06-19 17:51:05,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.41 vs. limit=10.0 2024-06-19 17:51:10,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=32006.333333333332, ans=0.125 2024-06-19 17:51:11,425 INFO [train.py:1028] (1/2) Epoch 2, batch 7350, loss[loss=1.043, simple_loss=0.6552, pruned_loss=0.7155, over 13262.00 frames. ], tot_loss[loss=1.009, simple_loss=0.6446, pruned_loss=0.6871, over 2580800.02 frames. ], batch size: 46, lr: 2.16e-02, grad_scale: 0.25 2024-06-19 17:51:16,118 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.148e+03 4.596e+03 5.810e+03 7.109e+03 3.624e+04, threshold=1.162e+04, percent-clipped=3.0 2024-06-19 17:51:22,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=32043.0, ans=15.0 2024-06-19 17:51:28,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.10 vs. limit=15.0 2024-06-19 17:51:28,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=32061.333333333332, ans=0.95 2024-06-19 17:51:29,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.91 vs. limit=6.0 2024-06-19 17:51:34,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=32079.666666666668, ans=0.1 2024-06-19 17:51:35,846 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.44 vs. 
limit=22.5 2024-06-19 17:51:39,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=32079.666666666668, ans=0.0 2024-06-19 17:51:45,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=19.50 vs. limit=15.0 2024-06-19 17:51:48,262 INFO [train.py:1028] (1/2) Epoch 2, batch 7400, loss[loss=1.036, simple_loss=0.6588, pruned_loss=0.7067, over 13262.00 frames. ], tot_loss[loss=1.008, simple_loss=0.6437, pruned_loss=0.6861, over 2585785.85 frames. ], batch size: 63, lr: 2.15e-02, grad_scale: 0.5 2024-06-19 17:51:58,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=32134.666666666668, ans=0.125 2024-06-19 17:51:58,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=32134.666666666668, ans=0.0 2024-06-19 17:51:59,760 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.89 vs. limit=6.0 2024-06-19 17:52:00,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=32134.666666666668, ans=0.0 2024-06-19 17:52:00,512 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=15.95 vs. limit=15.0 2024-06-19 17:52:01,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=32153.0, ans=0.0 2024-06-19 17:52:02,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=32153.0, ans=0.125 2024-06-19 17:52:07,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=20.68 vs. limit=15.0 2024-06-19 17:52:15,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=32189.666666666668, ans=15.0 2024-06-19 17:52:22,569 INFO [train.py:1028] (1/2) Epoch 2, batch 7450, loss[loss=0.9462, simple_loss=0.5893, pruned_loss=0.6516, over 12492.00 frames. ], tot_loss[loss=1.008, simple_loss=0.6437, pruned_loss=0.6861, over 2580415.98 frames. ], batch size: 29, lr: 2.15e-02, grad_scale: 0.125 2024-06-19 17:52:22,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=32208.0, ans=0.125 2024-06-19 17:52:23,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=32208.0, ans=0.125 2024-06-19 17:52:28,581 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.68 vs. 
limit=15.0 2024-06-19 17:52:28,733 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.752e+03 6.467e+03 8.542e+03 1.242e+04 6.905e+04, threshold=1.708e+04, percent-clipped=27.0 2024-06-19 17:52:44,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=32244.666666666668, ans=0.0 2024-06-19 17:53:00,516 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.80 vs. limit=15.0 2024-06-19 17:53:01,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=42.44 vs. limit=15.0 2024-06-19 17:53:01,294 INFO [train.py:1028] (1/2) Epoch 2, batch 7500, loss[loss=0.9263, simple_loss=0.63, pruned_loss=0.6114, over 10564.00 frames. ], tot_loss[loss=1.008, simple_loss=0.6449, pruned_loss=0.6858, over 2577954.44 frames. ], batch size: 303, lr: 2.15e-02, grad_scale: 0.25 2024-06-19 17:53:03,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=32299.666666666668, ans=0.2 2024-06-19 17:53:07,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=32318.0, ans=0.1 2024-06-19 17:53:09,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=32318.0, ans=0.125 2024-06-19 17:53:15,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=32336.333333333332, ans=0.125 2024-06-19 17:53:18,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=32336.333333333332, ans=0.05 2024-06-19 17:53:33,056 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.18 vs. limit=15.0 2024-06-19 17:53:38,041 INFO [train.py:1028] (1/2) Epoch 2, batch 7550, loss[loss=0.9175, simple_loss=0.6074, pruned_loss=0.6138, over 12900.00 frames. ], tot_loss[loss=1.007, simple_loss=0.646, pruned_loss=0.6836, over 2576845.52 frames. ], batch size: 158, lr: 2.15e-02, grad_scale: 0.25 2024-06-19 17:53:41,965 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.33 vs. limit=8.0 2024-06-19 17:53:44,091 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.234e+03 6.931e+03 8.347e+03 1.009e+04 5.236e+04, threshold=1.669e+04, percent-clipped=6.0 2024-06-19 17:53:45,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=32409.666666666668, ans=0.125 2024-06-19 17:53:47,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=32409.666666666668, ans=0.125 2024-06-19 17:53:47,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.49 vs. limit=15.0 2024-06-19 17:53:50,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. 
limit=15.0 2024-06-19 17:54:00,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=32446.333333333332, ans=0.0 2024-06-19 17:54:01,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=32446.333333333332, ans=0.0 2024-06-19 17:54:05,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=32464.666666666668, ans=0.125 2024-06-19 17:54:11,951 INFO [train.py:1028] (1/2) Epoch 2, batch 7600, loss[loss=1.05, simple_loss=0.6789, pruned_loss=0.7107, over 13258.00 frames. ], tot_loss[loss=1.006, simple_loss=0.6463, pruned_loss=0.6833, over 2577614.37 frames. ], batch size: 83, lr: 2.14e-02, grad_scale: 0.5 2024-06-19 17:54:12,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=32483.0, ans=0.025 2024-06-19 17:54:18,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.92 vs. limit=10.0 2024-06-19 17:54:22,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=32501.333333333332, ans=0.125 2024-06-19 17:54:22,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=32501.333333333332, ans=0.125 2024-06-19 17:54:24,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.40 vs. limit=12.0 2024-06-19 17:54:33,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=32519.666666666668, ans=0.0038000724637681166 2024-06-19 17:54:35,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=32.60 vs. limit=15.0 2024-06-19 17:54:39,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=32538.0, ans=0.003796086956521739 2024-06-19 17:54:43,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=32556.333333333332, ans=0.125 2024-06-19 17:54:48,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=32556.333333333332, ans=15.0 2024-06-19 17:54:49,737 INFO [train.py:1028] (1/2) Epoch 2, batch 7650, loss[loss=1.035, simple_loss=0.6568, pruned_loss=0.7067, over 12933.00 frames. ], tot_loss[loss=1.007, simple_loss=0.6472, pruned_loss=0.6837, over 2572838.09 frames. ], batch size: 33, lr: 2.14e-02, grad_scale: 0.125 2024-06-19 17:54:54,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=32574.666666666668, ans=0.125 2024-06-19 17:54:57,052 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.570e+03 5.053e+03 6.423e+03 8.367e+03 3.575e+04, threshold=1.285e+04, percent-clipped=4.0 2024-06-19 17:54:57,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=16.21 vs. 
limit=15.0 2024-06-19 17:55:04,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.70 vs. limit=22.5 2024-06-19 17:55:26,850 INFO [train.py:1028] (1/2) Epoch 2, batch 7700, loss[loss=1.098, simple_loss=0.6974, pruned_loss=0.749, over 13282.00 frames. ], tot_loss[loss=1.008, simple_loss=0.6479, pruned_loss=0.6844, over 2569975.74 frames. ], batch size: 63, lr: 2.14e-02, grad_scale: 0.25 2024-06-19 17:55:30,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.04 vs. limit=22.5 2024-06-19 17:55:34,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.39 vs. limit=10.0 2024-06-19 17:55:34,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.75 vs. limit=15.0 2024-06-19 17:55:38,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=32684.666666666668, ans=0.125 2024-06-19 17:55:45,301 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.46 vs. limit=15.0 2024-06-19 17:55:50,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.04 vs. limit=15.0 2024-06-19 17:55:54,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=32739.666666666668, ans=0.07 2024-06-19 17:55:58,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=32739.666666666668, ans=0.125 2024-06-19 17:56:00,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=32758.0, ans=0.0037482608695652182 2024-06-19 17:56:00,637 INFO [train.py:1028] (1/2) Epoch 2, batch 7750, loss[loss=1.047, simple_loss=0.6708, pruned_loss=0.7114, over 13204.00 frames. ], tot_loss[loss=1.009, simple_loss=0.6495, pruned_loss=0.6846, over 2573707.71 frames. ], batch size: 72, lr: 2.14e-02, grad_scale: 0.25 2024-06-19 17:56:02,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=32758.0, ans=0.2 2024-06-19 17:56:08,330 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.911e+03 4.639e+03 5.370e+03 6.411e+03 1.729e+04, threshold=1.074e+04, percent-clipped=7.0 2024-06-19 17:56:08,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=32776.333333333336, ans=0.125 2024-06-19 17:56:19,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=32794.666666666664, ans=0.05 2024-06-19 17:56:21,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=15.0 2024-06-19 17:56:21,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.29 vs. 
limit=10.0 2024-06-19 17:56:35,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=32831.333333333336, ans=15.0 2024-06-19 17:56:35,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.13 vs. limit=15.0 2024-06-19 17:56:38,266 INFO [train.py:1028] (1/2) Epoch 2, batch 7800, loss[loss=0.998, simple_loss=0.6449, pruned_loss=0.6756, over 13142.00 frames. ], tot_loss[loss=1.011, simple_loss=0.6505, pruned_loss=0.6854, over 2578570.96 frames. ], batch size: 95, lr: 2.13e-02, grad_scale: 0.5 2024-06-19 17:56:38,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32849.666666666664, ans=0.125 2024-06-19 17:56:39,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=32849.666666666664, ans=0.1 2024-06-19 17:56:50,428 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=28.84 vs. limit=15.0 2024-06-19 17:56:51,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.09 vs. limit=15.0 2024-06-19 17:56:55,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=23.09 vs. limit=15.0 2024-06-19 17:56:56,740 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.72 vs. limit=10.0 2024-06-19 17:57:01,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=32904.666666666664, ans=0.5 2024-06-19 17:57:02,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=32904.666666666664, ans=0.125 2024-06-19 17:57:06,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=32923.0, ans=0.125 2024-06-19 17:57:10,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=32923.0, ans=0.125 2024-06-19 17:57:12,891 INFO [train.py:1028] (1/2) Epoch 2, batch 7850, loss[loss=1.036, simple_loss=0.641, pruned_loss=0.7152, over 10837.00 frames. ], tot_loss[loss=1.013, simple_loss=0.6516, pruned_loss=0.6867, over 2572466.62 frames. 
], batch size: 16, lr: 2.13e-02, grad_scale: 0.25 2024-06-19 17:57:13,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=32941.333333333336, ans=0.125 2024-06-19 17:57:24,405 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+03 5.662e+03 6.789e+03 8.185e+03 3.824e+04, threshold=1.358e+04, percent-clipped=6.0 2024-06-19 17:57:25,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32959.666666666664, ans=0.1 2024-06-19 17:57:26,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=32959.666666666664, ans=0.0 2024-06-19 17:57:37,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=32996.333333333336, ans=10.0 2024-06-19 17:57:39,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=32996.333333333336, ans=0.125 2024-06-19 17:57:40,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=14.82 vs. limit=15.0 2024-06-19 17:57:49,813 INFO [train.py:1028] (1/2) Epoch 2, batch 7900, loss[loss=0.9951, simple_loss=0.6359, pruned_loss=0.6771, over 13164.00 frames. ], tot_loss[loss=1.012, simple_loss=0.6513, pruned_loss=0.6863, over 2572640.49 frames. ], batch size: 77, lr: 2.13e-02, grad_scale: 0.5 2024-06-19 17:57:53,839 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=26.15 vs. limit=15.0 2024-06-19 17:57:58,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=33051.333333333336, ans=0.125 2024-06-19 17:57:59,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=33051.333333333336, ans=0.2 2024-06-19 17:58:13,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=33088.0, ans=0.125 2024-06-19 17:58:23,120 INFO [train.py:1028] (1/2) Epoch 2, batch 7950, loss[loss=0.9568, simple_loss=0.6564, pruned_loss=0.6286, over 10567.00 frames. ], tot_loss[loss=1.013, simple_loss=0.6516, pruned_loss=0.6872, over 2575547.95 frames. ], batch size: 303, lr: 2.12e-02, grad_scale: 0.5 2024-06-19 17:58:32,247 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=2.363e-01 2024-06-19 17:58:32,530 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.46 vs. limit=15.0 2024-06-19 17:58:34,719 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+03 4.370e+03 5.379e+03 6.669e+03 1.454e+04, threshold=1.076e+04, percent-clipped=2.0 2024-06-19 17:58:37,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.46 vs. 
limit=15.0 2024-06-19 17:58:39,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=33143.0, ans=0.125 2024-06-19 17:58:39,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=33161.333333333336, ans=0.125 2024-06-19 17:58:50,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.03 vs. limit=15.0 2024-06-19 17:59:00,344 INFO [train.py:1028] (1/2) Epoch 2, batch 8000, loss[loss=1.147, simple_loss=0.7105, pruned_loss=0.7913, over 12632.00 frames. ], tot_loss[loss=1.019, simple_loss=0.6541, pruned_loss=0.6916, over 2571790.98 frames. ], batch size: 29, lr: 2.12e-02, grad_scale: 1.0 2024-06-19 17:59:01,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=33216.333333333336, ans=0.0 2024-06-19 17:59:08,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.40 vs. limit=22.5 2024-06-19 17:59:11,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=33234.666666666664, ans=0.2 2024-06-19 17:59:13,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.15 vs. limit=15.0 2024-06-19 17:59:15,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.63 vs. limit=8.0 2024-06-19 17:59:21,014 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=4.926e+00 2024-06-19 17:59:28,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=33271.333333333336, ans=0.2 2024-06-19 17:59:30,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=33289.666666666664, ans=0.125 2024-06-19 17:59:32,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33289.666666666664, ans=0.1 2024-06-19 17:59:36,727 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.17 vs. limit=15.0 2024-06-19 17:59:36,979 INFO [train.py:1028] (1/2) Epoch 2, batch 8050, loss[loss=1.001, simple_loss=0.6504, pruned_loss=0.6762, over 13206.00 frames. ], tot_loss[loss=1.02, simple_loss=0.6544, pruned_loss=0.6925, over 2571598.80 frames. 
], batch size: 83, lr: 2.12e-02, grad_scale: 0.25 2024-06-19 17:59:42,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=33326.333333333336, ans=0.05 2024-06-19 17:59:45,901 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.973e+03 3.672e+03 4.743e+03 5.647e+03 2.569e+04, threshold=9.485e+03, percent-clipped=5.0 2024-06-19 17:59:58,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=33363.0, ans=0.125 2024-06-19 18:00:07,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=33381.333333333336, ans=0.0 2024-06-19 18:00:09,205 INFO [train.py:1028] (1/2) Epoch 2, batch 8100, loss[loss=1.027, simple_loss=0.6719, pruned_loss=0.6913, over 13179.00 frames. ], tot_loss[loss=1.02, simple_loss=0.6561, pruned_loss=0.6922, over 2576245.08 frames. ], batch size: 112, lr: 2.12e-02, grad_scale: 0.5 2024-06-19 18:00:12,331 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 18:00:18,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=12.0 2024-06-19 18:00:19,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.69 vs. limit=15.0 2024-06-19 18:00:27,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33436.333333333336, ans=0.1 2024-06-19 18:00:37,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=33454.666666666664, ans=10.0 2024-06-19 18:00:42,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.70 vs. limit=22.5 2024-06-19 18:00:43,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=26.45 vs. limit=15.0 2024-06-19 18:00:47,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=33491.333333333336, ans=0.0 2024-06-19 18:00:47,692 INFO [train.py:1028] (1/2) Epoch 2, batch 8150, loss[loss=0.941, simple_loss=0.6226, pruned_loss=0.6297, over 13091.00 frames. ], tot_loss[loss=1.02, simple_loss=0.656, pruned_loss=0.6925, over 2578715.92 frames. 
], batch size: 121, lr: 2.12e-02, grad_scale: 0.25 2024-06-19 18:00:50,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=33491.333333333336, ans=0.1 2024-06-19 18:00:57,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=33509.666666666664, ans=0.125 2024-06-19 18:00:58,108 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.772e+03 5.329e+03 6.812e+03 8.456e+03 3.052e+04, threshold=1.362e+04, percent-clipped=10.0 2024-06-19 18:01:05,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33528.0, ans=0.1 2024-06-19 18:01:09,720 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.128e-02 2024-06-19 18:01:18,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33564.666666666664, ans=0.1 2024-06-19 18:01:20,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=33564.666666666664, ans=0.2 2024-06-19 18:01:22,116 INFO [train.py:1028] (1/2) Epoch 2, batch 8200, loss[loss=1.017, simple_loss=0.6709, pruned_loss=0.6814, over 13149.00 frames. ], tot_loss[loss=1.021, simple_loss=0.6571, pruned_loss=0.6929, over 2582299.79 frames. ], batch size: 112, lr: 2.11e-02, grad_scale: 0.5 2024-06-19 18:01:26,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=33583.0, ans=0.0 2024-06-19 18:01:35,922 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2024-06-19 18:01:46,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.15 vs. limit=10.0 2024-06-19 18:02:00,406 INFO [train.py:1028] (1/2) Epoch 2, batch 8250, loss[loss=1.02, simple_loss=0.6454, pruned_loss=0.6969, over 13298.00 frames. ], tot_loss[loss=1.02, simple_loss=0.6568, pruned_loss=0.6914, over 2582483.36 frames. ], batch size: 52, lr: 2.11e-02, grad_scale: 0.125 2024-06-19 18:02:07,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=33693.0, ans=12.0 2024-06-19 18:02:09,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=33693.0, ans=0.125 2024-06-19 18:02:11,429 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.323e+03 8.435e+03 1.162e+04 1.808e+04 3.960e+04, threshold=2.323e+04, percent-clipped=39.0 2024-06-19 18:02:12,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=33693.0, ans=0.125 2024-06-19 18:02:29,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=33748.0, ans=0.125 2024-06-19 18:02:30,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=33748.0, ans=0.125 2024-06-19 18:02:33,694 INFO [train.py:1028] (1/2) Epoch 2, batch 8300, loss[loss=1.003, simple_loss=0.6477, pruned_loss=0.6787, over 12995.00 frames. 
], tot_loss[loss=1.019, simple_loss=0.6565, pruned_loss=0.6908, over 2580124.84 frames. ], batch size: 102, lr: 2.11e-02, grad_scale: 0.25 2024-06-19 18:02:41,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=33766.333333333336, ans=0.0 2024-06-19 18:02:46,672 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=11.11 vs. limit=10.0 2024-06-19 18:02:50,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=33803.0, ans=0.125 2024-06-19 18:03:00,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=33821.333333333336, ans=0.2 2024-06-19 18:03:11,298 INFO [train.py:1028] (1/2) Epoch 2, batch 8350, loss[loss=1.046, simple_loss=0.6842, pruned_loss=0.7041, over 13160.00 frames. ], tot_loss[loss=1.025, simple_loss=0.6585, pruned_loss=0.6955, over 2580513.22 frames. ], batch size: 112, lr: 2.11e-02, grad_scale: 0.25 2024-06-19 18:03:14,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=33858.0, ans=0.125 2024-06-19 18:03:22,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=33876.333333333336, ans=10.0 2024-06-19 18:03:23,336 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.977e+03 8.620e+03 1.124e+04 1.451e+04 5.858e+04, threshold=2.248e+04, percent-clipped=6.0 2024-06-19 18:03:24,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=33894.666666666664, ans=0.125 2024-06-19 18:03:31,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.30 vs. limit=15.0 2024-06-19 18:03:43,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=33931.333333333336, ans=0.2 2024-06-19 18:03:48,669 INFO [train.py:1028] (1/2) Epoch 2, batch 8400, loss[loss=1.01, simple_loss=0.6288, pruned_loss=0.6955, over 13031.00 frames. ], tot_loss[loss=1.023, simple_loss=0.6568, pruned_loss=0.6942, over 2578197.81 frames. ], batch size: 39, lr: 2.10e-02, grad_scale: 0.25 2024-06-19 18:03:49,173 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.74 vs. limit=15.0 2024-06-19 18:03:59,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.98 vs. 
limit=15.0 2024-06-19 18:03:59,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=33968.0, ans=0.0 2024-06-19 18:04:09,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=34004.666666666664, ans=0.0 2024-06-19 18:04:14,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=34023.0, ans=0.025 2024-06-19 18:04:14,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34023.0, ans=0.1 2024-06-19 18:04:15,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=34023.0, ans=0.0 2024-06-19 18:04:16,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=34023.0, ans=0.0 2024-06-19 18:04:22,020 INFO [train.py:1028] (1/2) Epoch 2, batch 8450, loss[loss=1.01, simple_loss=0.6607, pruned_loss=0.6801, over 13201.00 frames. ], tot_loss[loss=1.024, simple_loss=0.6578, pruned_loss=0.6948, over 2580892.64 frames. ], batch size: 112, lr: 2.10e-02, grad_scale: 0.25 2024-06-19 18:04:29,987 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.62 vs. limit=10.0 2024-06-19 18:04:34,282 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.056e+03 8.668e+03 1.037e+04 1.252e+04 4.644e+04, threshold=2.074e+04, percent-clipped=5.0 2024-06-19 18:04:42,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=34078.0, ans=15.0 2024-06-19 18:04:43,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=34078.0, ans=0.0 2024-06-19 18:04:54,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=34114.666666666664, ans=0.125 2024-06-19 18:04:54,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=34114.666666666664, ans=0.0 2024-06-19 18:04:59,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=34114.666666666664, ans=0.0 2024-06-19 18:05:01,194 INFO [train.py:1028] (1/2) Epoch 2, batch 8500, loss[loss=1.023, simple_loss=0.6405, pruned_loss=0.7028, over 12695.00 frames. ], tot_loss[loss=1.023, simple_loss=0.6577, pruned_loss=0.6941, over 2579137.92 frames. ], batch size: 29, lr: 2.10e-02, grad_scale: 0.25 2024-06-19 18:05:05,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=25.95 vs. limit=15.0 2024-06-19 18:05:07,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=44.68 vs. limit=15.0 2024-06-19 18:05:08,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.23 vs. 
limit=6.0 2024-06-19 18:05:10,293 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.786e+00 2024-06-19 18:05:10,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=34151.333333333336, ans=0.0 2024-06-19 18:05:11,948 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=19.64 vs. limit=15.0 2024-06-19 18:05:12,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34151.333333333336, ans=0.1 2024-06-19 18:05:18,922 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.93 vs. limit=22.5 2024-06-19 18:05:21,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=23.43 vs. limit=15.0 2024-06-19 18:05:25,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=34188.0, ans=0.125 2024-06-19 18:05:26,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.50 vs. limit=10.0 2024-06-19 18:05:26,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=34188.0, ans=0.2 2024-06-19 18:05:27,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34188.0, ans=0.1 2024-06-19 18:05:28,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=34206.333333333336, ans=0.2 2024-06-19 18:05:38,929 INFO [train.py:1028] (1/2) Epoch 2, batch 8550, loss[loss=0.9735, simple_loss=0.6062, pruned_loss=0.6704, over 12606.00 frames. ], tot_loss[loss=1.021, simple_loss=0.6567, pruned_loss=0.6926, over 2576764.97 frames. ], batch size: 22, lr: 2.10e-02, grad_scale: 0.25 2024-06-19 18:05:39,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34224.666666666664, ans=0.1 2024-06-19 18:05:39,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=34224.666666666664, ans=0.0 2024-06-19 18:05:47,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.58 vs. limit=15.0 2024-06-19 18:05:48,181 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.55 vs. limit=22.5 2024-06-19 18:05:49,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.73 vs. 
limit=15.0 2024-06-19 18:05:51,493 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.523e+03 8.931e+03 1.134e+04 1.489e+04 5.166e+04, threshold=2.268e+04, percent-clipped=8.0 2024-06-19 18:05:52,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=34261.333333333336, ans=0.0 2024-06-19 18:05:56,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34261.333333333336, ans=0.1 2024-06-19 18:06:00,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=34279.666666666664, ans=0.003417463768115943 2024-06-19 18:06:01,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=34279.666666666664, ans=0.05 2024-06-19 18:06:12,072 INFO [train.py:1028] (1/2) Epoch 2, batch 8600, loss[loss=0.9544, simple_loss=0.63, pruned_loss=0.6394, over 13120.00 frames. ], tot_loss[loss=1.022, simple_loss=0.658, pruned_loss=0.6933, over 2574388.48 frames. ], batch size: 121, lr: 2.09e-02, grad_scale: 0.5 2024-06-19 18:06:12,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=34316.333333333336, ans=0.125 2024-06-19 18:06:17,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=34316.333333333336, ans=0.125 2024-06-19 18:06:27,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=34353.0, ans=0.125 2024-06-19 18:06:30,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=34353.0, ans=0.2 2024-06-19 18:06:36,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=34371.333333333336, ans=0.05 2024-06-19 18:06:48,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34389.666666666664, ans=0.125 2024-06-19 18:06:49,535 INFO [train.py:1028] (1/2) Epoch 2, batch 8650, loss[loss=0.9825, simple_loss=0.6495, pruned_loss=0.6578, over 13151.00 frames. ], tot_loss[loss=1.024, simple_loss=0.6592, pruned_loss=0.694, over 2577636.33 frames. ], batch size: 103, lr: 2.09e-02, grad_scale: 0.5 2024-06-19 18:06:54,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=34408.0, ans=0.0 2024-06-19 18:06:57,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.53 vs. limit=15.0 2024-06-19 18:06:59,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=34426.333333333336, ans=0.2 2024-06-19 18:07:02,502 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.826e+03 7.037e+03 8.400e+03 1.020e+04 2.596e+04, threshold=1.680e+04, percent-clipped=2.0 2024-06-19 18:07:04,993 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. 
limit=6.0 2024-06-19 18:07:05,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=34444.666666666664, ans=0.125 2024-06-19 18:07:13,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=15.0 2024-06-19 18:07:18,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=34481.333333333336, ans=0.0033736231884057962 2024-06-19 18:07:22,757 INFO [train.py:1028] (1/2) Epoch 2, batch 8700, loss[loss=1.084, simple_loss=0.6879, pruned_loss=0.7396, over 13178.00 frames. ], tot_loss[loss=1.02, simple_loss=0.6584, pruned_loss=0.6913, over 2573661.08 frames. ], batch size: 59, lr: 2.09e-02, grad_scale: 0.5 2024-06-19 18:07:27,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=34499.666666666664, ans=0.125 2024-06-19 18:07:27,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=34499.666666666664, ans=0.125 2024-06-19 18:07:45,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.86 vs. limit=22.5 2024-06-19 18:07:50,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=34554.666666666664, ans=0.0 2024-06-19 18:07:51,717 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.02 vs. limit=15.0 2024-06-19 18:07:53,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=34573.0, ans=0.125 2024-06-19 18:07:55,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.02 vs. limit=10.0 2024-06-19 18:07:56,317 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.76 vs. limit=15.0 2024-06-19 18:07:58,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.59 vs. limit=15.0 2024-06-19 18:07:59,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=34573.0, ans=0.2 2024-06-19 18:08:00,560 INFO [train.py:1028] (1/2) Epoch 2, batch 8750, loss[loss=0.9888, simple_loss=0.644, pruned_loss=0.6668, over 13083.00 frames. ], tot_loss[loss=1.02, simple_loss=0.6584, pruned_loss=0.6908, over 2568626.67 frames. ], batch size: 121, lr: 2.09e-02, grad_scale: 0.125 2024-06-19 18:08:09,614 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.31 vs. limit=10.0 2024-06-19 18:08:14,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=40.77 vs. 
limit=15.0 2024-06-19 18:08:15,175 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.385e+03 6.994e+03 9.550e+03 1.196e+04 3.720e+04, threshold=1.910e+04, percent-clipped=8.0 2024-06-19 18:08:15,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34628.0, ans=0.1 2024-06-19 18:08:24,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=34646.333333333336, ans=0.0 2024-06-19 18:08:31,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.81 vs. limit=22.5 2024-06-19 18:08:34,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=34683.0, ans=0.0 2024-06-19 18:08:34,851 INFO [train.py:1028] (1/2) Epoch 2, batch 8800, loss[loss=1.038, simple_loss=0.6598, pruned_loss=0.7084, over 13094.00 frames. ], tot_loss[loss=1.022, simple_loss=0.6591, pruned_loss=0.692, over 2573302.21 frames. ], batch size: 71, lr: 2.08e-02, grad_scale: 0.25 2024-06-19 18:08:35,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=34683.0, ans=22.5 2024-06-19 18:08:37,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=34683.0, ans=0.025 2024-06-19 18:08:42,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=34683.0, ans=0.125 2024-06-19 18:08:50,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=28.33 vs. limit=15.0 2024-06-19 18:08:55,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34719.666666666664, ans=0.1 2024-06-19 18:09:04,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34738.0, ans=0.125 2024-06-19 18:09:09,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34756.333333333336, ans=0.125 2024-06-19 18:09:14,002 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.73 vs. limit=15.0 2024-06-19 18:09:14,095 INFO [train.py:1028] (1/2) Epoch 2, batch 8850, loss[loss=0.9797, simple_loss=0.6496, pruned_loss=0.6549, over 12574.00 frames. ], tot_loss[loss=1.023, simple_loss=0.6595, pruned_loss=0.6933, over 2562145.75 frames. 
], batch size: 202, lr: 2.08e-02, grad_scale: 0.25 2024-06-19 18:09:14,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=34774.666666666664, ans=0.125 2024-06-19 18:09:30,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=34793.0, ans=0.0033058695652173915 2024-06-19 18:09:32,461 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.327e+03 4.716e+03 6.151e+03 7.583e+03 1.456e+04, threshold=1.230e+04, percent-clipped=0.0 2024-06-19 18:09:33,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.37 vs. limit=15.0 2024-06-19 18:09:49,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=34848.0, ans=0.2 2024-06-19 18:09:49,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=15.0 2024-06-19 18:09:51,479 INFO [train.py:1028] (1/2) Epoch 2, batch 8900, loss[loss=1.115, simple_loss=0.7008, pruned_loss=0.7651, over 12837.00 frames. ], tot_loss[loss=1.024, simple_loss=0.6598, pruned_loss=0.6941, over 2561021.18 frames. ], batch size: 33, lr: 2.08e-02, grad_scale: 0.5 2024-06-19 18:10:05,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.40 vs. limit=12.0 2024-06-19 18:10:07,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=34903.0, ans=0.125 2024-06-19 18:10:11,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=34921.333333333336, ans=0.125 2024-06-19 18:10:11,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.28 vs. limit=15.0 2024-06-19 18:10:16,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34921.333333333336, ans=0.1 2024-06-19 18:10:25,069 INFO [train.py:1028] (1/2) Epoch 2, batch 8950, loss[loss=0.9861, simple_loss=0.6618, pruned_loss=0.6552, over 12568.00 frames. ], tot_loss[loss=1.026, simple_loss=0.6611, pruned_loss=0.6953, over 2560600.11 frames. ], batch size: 202, lr: 2.08e-02, grad_scale: 0.25 2024-06-19 18:10:25,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=34958.0, ans=0.0 2024-06-19 18:10:32,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34976.333333333336, ans=0.1 2024-06-19 18:10:44,255 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.515e+03 5.590e+03 7.381e+03 9.061e+03 3.015e+04, threshold=1.476e+04, percent-clipped=10.0 2024-06-19 18:10:46,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.83 vs. 
limit=15.0 2024-06-19 18:10:49,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=35013.0, ans=0.2 2024-06-19 18:10:49,736 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.20 vs. limit=15.0 2024-06-19 18:11:02,887 INFO [train.py:1028] (1/2) Epoch 2, batch 9000, loss[loss=1.119, simple_loss=0.7017, pruned_loss=0.768, over 13311.00 frames. ], tot_loss[loss=1.032, simple_loss=0.6637, pruned_loss=0.7003, over 2566819.85 frames. ], batch size: 46, lr: 2.07e-02, grad_scale: 0.5 2024-06-19 18:11:02,887 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 18:11:08,256 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.5016, 0.1761, 0.3912, 0.6977, 0.2452, 0.0794, 3.3608, 0.4319], device='cuda:1') 2024-06-19 18:11:10,682 INFO [train.py:1060] (1/2) Epoch 2, validation: loss=0.9784, simple_loss=0.6238, pruned_loss=0.6665, over 351949.00 frames. 2024-06-19 18:11:10,683 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 16946MB 2024-06-19 18:11:14,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=35049.666666666664, ans=0.0 2024-06-19 18:11:23,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=35086.333333333336, ans=0.0 2024-06-19 18:11:34,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.45 vs. limit=10.0 2024-06-19 18:11:36,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35123.0, ans=0.1 2024-06-19 18:11:43,227 INFO [train.py:1028] (1/2) Epoch 2, batch 9050, loss[loss=0.9587, simple_loss=0.5971, pruned_loss=0.6601, over 11096.00 frames. ], tot_loss[loss=1.037, simple_loss=0.6665, pruned_loss=0.7041, over 2566239.15 frames. ], batch size: 16, lr: 2.07e-02, grad_scale: 0.25 2024-06-19 18:11:48,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=35141.333333333336, ans=0.2 2024-06-19 18:12:01,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=35178.0, ans=0.025 2024-06-19 18:12:01,550 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.455e+03 5.302e+03 6.733e+03 8.293e+03 2.331e+04, threshold=1.347e+04, percent-clipped=3.0 2024-06-19 18:12:03,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=35178.0, ans=0.2 2024-06-19 18:12:16,939 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.78 vs. limit=15.0 2024-06-19 18:12:19,240 INFO [train.py:1028] (1/2) Epoch 2, batch 9100, loss[loss=1.047, simple_loss=0.6647, pruned_loss=0.7143, over 13236.00 frames. ], tot_loss[loss=1.035, simple_loss=0.6638, pruned_loss=0.7028, over 2567241.35 frames. ], batch size: 72, lr: 2.07e-02, grad_scale: 0.5 2024-06-19 18:12:21,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.06 vs. 
limit=10.0 2024-06-19 18:12:24,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=27.64 vs. limit=22.5 2024-06-19 18:12:29,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=35251.333333333336, ans=0.125 2024-06-19 18:12:35,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=35269.666666666664, ans=0.0032022463768115947 2024-06-19 18:12:40,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=35288.0, ans=0.125 2024-06-19 18:12:44,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.56 vs. limit=10.0 2024-06-19 18:12:46,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=35306.333333333336, ans=0.0 2024-06-19 18:12:46,635 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.69 vs. limit=12.0 2024-06-19 18:12:51,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=35324.666666666664, ans=0.003190289855072464 2024-06-19 18:12:52,035 INFO [train.py:1028] (1/2) Epoch 2, batch 9150, loss[loss=0.9915, simple_loss=0.6359, pruned_loss=0.6736, over 13208.00 frames. ], tot_loss[loss=1.034, simple_loss=0.6636, pruned_loss=0.7022, over 2567307.40 frames. ], batch size: 77, lr: 2.07e-02, grad_scale: 0.25 2024-06-19 18:12:54,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=35324.666666666664, ans=0.025 2024-06-19 18:12:59,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=35343.0, ans=0.125 2024-06-19 18:13:02,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.09 vs. limit=6.0 2024-06-19 18:13:08,078 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.406e+03 5.533e+03 6.542e+03 8.494e+03 4.349e+04, threshold=1.308e+04, percent-clipped=9.0 2024-06-19 18:13:16,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=35379.666666666664, ans=0.05 2024-06-19 18:13:16,044 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.553e+00 2024-06-19 18:13:16,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=35379.666666666664, ans=0.125 2024-06-19 18:13:17,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=35398.0, ans=0.125 2024-06-19 18:13:18,870 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.80 vs. limit=22.5 2024-06-19 18:13:22,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.51 vs. 
limit=10.0 2024-06-19 18:13:24,367 INFO [train.py:1028] (1/2) Epoch 2, batch 9200, loss[loss=1.097, simple_loss=0.6879, pruned_loss=0.7535, over 12887.00 frames. ], tot_loss[loss=1.037, simple_loss=0.6647, pruned_loss=0.7042, over 2570941.90 frames. ], batch size: 36, lr: 2.06e-02, grad_scale: 0.5 2024-06-19 18:13:26,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=35416.333333333336, ans=0.125 2024-06-19 18:13:35,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=35434.666666666664, ans=0.0 2024-06-19 18:13:36,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=35434.666666666664, ans=0.0031663768115942036 2024-06-19 18:13:43,241 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.91 vs. limit=15.0 2024-06-19 18:13:56,296 INFO [train.py:1028] (1/2) Epoch 2, batch 9250, loss[loss=1.093, simple_loss=0.6964, pruned_loss=0.7451, over 13198.00 frames. ], tot_loss[loss=1.039, simple_loss=0.6652, pruned_loss=0.706, over 2572241.44 frames. ], batch size: 67, lr: 2.06e-02, grad_scale: 0.125 2024-06-19 18:13:57,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=35508.0, ans=0.125 2024-06-19 18:14:17,815 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.650e+03 6.395e+03 7.586e+03 9.001e+03 3.996e+04, threshold=1.517e+04, percent-clipped=3.0 2024-06-19 18:14:18,818 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=35.71 vs. limit=22.5 2024-06-19 18:14:33,106 INFO [train.py:1028] (1/2) Epoch 2, batch 9300, loss[loss=1.023, simple_loss=0.6305, pruned_loss=0.7077, over 13314.00 frames. ], tot_loss[loss=1.045, simple_loss=0.6667, pruned_loss=0.712, over 2571227.42 frames. ], batch size: 40, lr: 2.06e-02, grad_scale: 0.25 2024-06-19 18:14:34,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=24.73 vs. limit=15.0 2024-06-19 18:14:46,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=35636.333333333336, ans=0.0031225362318840584 2024-06-19 18:14:47,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=19.55 vs. limit=15.0 2024-06-19 18:14:49,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=35636.333333333336, ans=0.125 2024-06-19 18:14:56,515 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=26.92 vs. limit=15.0 2024-06-19 18:14:57,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=35654.666666666664, ans=0.2 2024-06-19 18:15:05,507 INFO [train.py:1028] (1/2) Epoch 2, batch 9350, loss[loss=1.105, simple_loss=0.6844, pruned_loss=0.7631, over 12962.00 frames. ], tot_loss[loss=1.048, simple_loss=0.6675, pruned_loss=0.7142, over 2568402.37 frames. 
], batch size: 22, lr: 2.06e-02, grad_scale: 0.25 2024-06-19 18:15:06,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=35691.333333333336, ans=0.1 2024-06-19 18:15:08,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=6.02 vs. limit=12.0 2024-06-19 18:15:17,323 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.57 vs. limit=15.0 2024-06-19 18:15:19,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=28.32 vs. limit=22.5 2024-06-19 18:15:19,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=24.18 vs. limit=15.0 2024-06-19 18:15:22,572 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.668e+03 3.576e+03 4.368e+03 5.539e+03 2.115e+04, threshold=8.737e+03, percent-clipped=2.0 2024-06-19 18:15:22,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=35728.0, ans=0.125 2024-06-19 18:15:33,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35764.666666666664, ans=0.1 2024-06-19 18:15:34,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=35764.666666666664, ans=0.125 2024-06-19 18:15:37,173 INFO [train.py:1028] (1/2) Epoch 2, batch 9400, loss[loss=1.087, simple_loss=0.6917, pruned_loss=0.7407, over 13252.00 frames. ], tot_loss[loss=1.048, simple_loss=0.6673, pruned_loss=0.7143, over 2567008.12 frames. ], batch size: 52, lr: 2.06e-02, grad_scale: 0.5 2024-06-19 18:15:37,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=35783.0, ans=0.0 2024-06-19 18:15:40,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=35783.0, ans=0.125 2024-06-19 18:15:42,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35801.333333333336, ans=0.1 2024-06-19 18:15:52,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=35819.666666666664, ans=0.125 2024-06-19 18:15:55,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=35819.666666666664, ans=0.025 2024-06-19 18:16:05,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.87 vs. limit=15.0 2024-06-19 18:16:10,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=35874.666666666664, ans=0.125 2024-06-19 18:16:11,459 INFO [train.py:1028] (1/2) Epoch 2, batch 9450, loss[loss=1.08, simple_loss=0.6644, pruned_loss=0.7473, over 12477.00 frames. ], tot_loss[loss=1.052, simple_loss=0.6694, pruned_loss=0.717, over 2566979.41 frames. 
], batch size: 22, lr: 2.05e-02, grad_scale: 0.25 2024-06-19 18:16:14,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=35874.666666666664, ans=0.2 2024-06-19 18:16:16,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=35874.666666666664, ans=0.125 2024-06-19 18:16:16,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=35874.666666666664, ans=0.125 2024-06-19 18:16:28,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35911.333333333336, ans=0.1 2024-06-19 18:16:28,750 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+03 4.735e+03 5.988e+03 7.794e+03 5.964e+04, threshold=1.198e+04, percent-clipped=17.0 2024-06-19 18:16:36,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=35948.0, ans=0.003054782608695652 2024-06-19 18:16:43,057 INFO [train.py:1028] (1/2) Epoch 2, batch 9500, loss[loss=1.066, simple_loss=0.6647, pruned_loss=0.7333, over 13266.00 frames. ], tot_loss[loss=1.05, simple_loss=0.6683, pruned_loss=0.7163, over 2574946.50 frames. ], batch size: 43, lr: 2.05e-02, grad_scale: 0.5 2024-06-19 18:16:51,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.29 vs. limit=15.0 2024-06-19 18:16:54,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=35984.666666666664, ans=0.125 2024-06-19 18:16:54,651 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0 2024-06-19 18:16:56,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=36003.0, ans=0.003042826086956521 2024-06-19 18:16:59,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=36003.0, ans=0.0 2024-06-19 18:17:02,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=16.20 vs. limit=15.0 2024-06-19 18:17:03,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.24 vs. limit=15.0 2024-06-19 18:17:04,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.49 vs. limit=22.5 2024-06-19 18:17:07,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.77 vs. limit=22.5 2024-06-19 18:17:09,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.94 vs. limit=22.5 2024-06-19 18:17:09,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=36039.666666666664, ans=0.125 2024-06-19 18:17:10,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.13 vs. 
limit=15.0 2024-06-19 18:17:16,354 INFO [train.py:1028] (1/2) Epoch 2, batch 9550, loss[loss=1.03, simple_loss=0.652, pruned_loss=0.7041, over 12920.00 frames. ], tot_loss[loss=1.046, simple_loss=0.6663, pruned_loss=0.7124, over 2571451.66 frames. ], batch size: 39, lr: 2.05e-02, grad_scale: 0.25 2024-06-19 18:17:21,315 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.58 vs. limit=6.0 2024-06-19 18:17:21,664 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=7.971e+02 2024-06-19 18:17:28,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=36094.666666666664, ans=0.0 2024-06-19 18:17:30,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=36094.666666666664, ans=0.025 2024-06-19 18:17:31,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.74 vs. limit=15.0 2024-06-19 18:17:34,675 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.583e+03 5.388e+03 6.731e+03 7.845e+03 1.805e+04, threshold=1.346e+04, percent-clipped=6.0 2024-06-19 18:17:36,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=36113.0, ans=0.0 2024-06-19 18:17:37,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=36113.0, ans=0.0030189130434782616 2024-06-19 18:17:41,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=36131.333333333336, ans=0.0 2024-06-19 18:17:44,885 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.80 vs. limit=15.0 2024-06-19 18:17:47,815 INFO [train.py:1028] (1/2) Epoch 2, batch 9600, loss[loss=0.8742, simple_loss=0.5923, pruned_loss=0.5781, over 10366.00 frames. ], tot_loss[loss=1.039, simple_loss=0.6642, pruned_loss=0.7067, over 2569779.04 frames. ], batch size: 303, lr: 2.05e-02, grad_scale: 0.5 2024-06-19 18:17:52,832 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.04 vs. limit=15.0 2024-06-19 18:17:54,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=36168.0, ans=0.125 2024-06-19 18:17:54,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36168.0, ans=0.1 2024-06-19 18:17:56,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=36168.0, ans=0.125 2024-06-19 18:18:01,305 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.85 vs. limit=12.0 2024-06-19 18:18:05,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=36.12 vs. 
limit=22.5 2024-06-19 18:18:06,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=36204.666666666664, ans=0.125 2024-06-19 18:18:07,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=36204.666666666664, ans=0.0 2024-06-19 18:18:14,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0 2024-06-19 18:18:20,686 INFO [train.py:1028] (1/2) Epoch 2, batch 9650, loss[loss=0.9409, simple_loss=0.6191, pruned_loss=0.6314, over 13109.00 frames. ], tot_loss[loss=1.032, simple_loss=0.6618, pruned_loss=0.7011, over 2561477.62 frames. ], batch size: 132, lr: 2.04e-02, grad_scale: 0.25 2024-06-19 18:18:23,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36241.333333333336, ans=0.1 2024-06-19 18:18:28,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=36259.666666666664, ans=0.125 2024-06-19 18:18:29,897 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 18:18:36,594 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=8.854e+00 2024-06-19 18:18:38,872 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.472e+03 6.088e+03 7.459e+03 8.682e+03 2.157e+04, threshold=1.492e+04, percent-clipped=2.0 2024-06-19 18:18:46,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.45 vs. limit=15.0 2024-06-19 18:18:49,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36314.666666666664, ans=0.1 2024-06-19 18:18:51,758 INFO [train.py:1028] (1/2) Epoch 2, batch 9700, loss[loss=0.9786, simple_loss=0.6503, pruned_loss=0.6535, over 13032.00 frames. ], tot_loss[loss=1.028, simple_loss=0.6604, pruned_loss=0.6979, over 2555154.00 frames. ], batch size: 144, lr: 2.04e-02, grad_scale: 0.5 2024-06-19 18:18:55,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=36333.0, ans=10.0 2024-06-19 18:18:58,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=36351.333333333336, ans=0.2 2024-06-19 18:19:00,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=36351.333333333336, ans=0.07 2024-06-19 18:19:04,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=36369.666666666664, ans=0.2 2024-06-19 18:19:06,660 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.68 vs. 
limit=5.0 2024-06-19 18:19:16,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=36406.333333333336, ans=0.125 2024-06-19 18:19:16,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=36406.333333333336, ans=0.0 2024-06-19 18:19:23,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.11 vs. limit=15.0 2024-06-19 18:19:24,417 INFO [train.py:1028] (1/2) Epoch 2, batch 9750, loss[loss=0.9537, simple_loss=0.6344, pruned_loss=0.6365, over 13077.00 frames. ], tot_loss[loss=1.025, simple_loss=0.659, pruned_loss=0.6951, over 2552076.88 frames. ], batch size: 132, lr: 2.04e-02, grad_scale: 0.25 2024-06-19 18:19:29,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.55 vs. limit=5.0 2024-06-19 18:19:42,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=36479.666666666664, ans=0.0 2024-06-19 18:19:43,783 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.312e+03 6.708e+03 8.321e+03 9.772e+03 3.624e+04, threshold=1.664e+04, percent-clipped=6.0 2024-06-19 18:19:55,676 INFO [train.py:1028] (1/2) Epoch 2, batch 9800, loss[loss=1.031, simple_loss=0.6434, pruned_loss=0.7096, over 12955.00 frames. ], tot_loss[loss=1.025, simple_loss=0.6582, pruned_loss=0.6955, over 2545208.30 frames. ], batch size: 39, lr: 2.04e-02, grad_scale: 0.5 2024-06-19 18:19:56,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.18 vs. limit=15.0 2024-06-19 18:19:59,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=36516.333333333336, ans=0.025 2024-06-19 18:20:00,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.57 vs. limit=15.0 2024-06-19 18:20:02,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=18.04 vs. limit=15.0 2024-06-19 18:20:04,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=36534.666666666664, ans=0.125 2024-06-19 18:20:04,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=36534.666666666664, ans=0.125 2024-06-19 18:20:05,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=36534.666666666664, ans=15.0 2024-06-19 18:20:07,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.48 vs. 
limit=15.0 2024-06-19 18:20:14,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=36553.0, ans=0.025 2024-06-19 18:20:17,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=36571.333333333336, ans=0.125 2024-06-19 18:20:22,804 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.64 vs. limit=22.5 2024-06-19 18:20:24,042 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.98 vs. limit=15.0 2024-06-19 18:20:27,257 INFO [train.py:1028] (1/2) Epoch 2, batch 9850, loss[loss=1.012, simple_loss=0.6605, pruned_loss=0.6815, over 13046.00 frames. ], tot_loss[loss=1.021, simple_loss=0.6573, pruned_loss=0.6923, over 2537999.16 frames. ], batch size: 102, lr: 2.04e-02, grad_scale: 0.25 2024-06-19 18:20:29,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.05 vs. limit=22.5 2024-06-19 18:20:34,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=36626.333333333336, ans=0.125 2024-06-19 18:20:39,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.22 vs. limit=6.0 2024-06-19 18:20:42,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.28 vs. limit=15.0 2024-06-19 18:20:42,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=36644.666666666664, ans=0.125 2024-06-19 18:20:45,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=36644.666666666664, ans=0.125 2024-06-19 18:20:45,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=36644.666666666664, ans=0.125 2024-06-19 18:20:46,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.71 vs. limit=15.0 2024-06-19 18:20:46,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=36644.666666666664, ans=0.125 2024-06-19 18:20:48,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=36663.0, ans=0.025 2024-06-19 18:20:53,960 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.929e+03 5.250e+03 6.975e+03 8.596e+03 1.760e+04, threshold=1.395e+04, percent-clipped=1.0 2024-06-19 18:20:56,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.83 vs. limit=10.0 2024-06-19 18:21:05,066 INFO [train.py:1028] (1/2) Epoch 2, batch 9900, loss[loss=1.023, simple_loss=0.6586, pruned_loss=0.6935, over 12849.00 frames. ], tot_loss[loss=1.014, simple_loss=0.6548, pruned_loss=0.6866, over 2531552.86 frames. 
], batch size: 39, lr: 2.03e-02, grad_scale: 0.5 2024-06-19 18:21:06,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.66 vs. limit=22.5 2024-06-19 18:21:11,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=36718.0, ans=0.025 2024-06-19 18:21:16,558 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.07 vs. limit=22.5 2024-06-19 18:21:16,669 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.36 vs. limit=15.0 2024-06-19 18:21:16,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=15.0 2024-06-19 18:21:27,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=36754.666666666664, ans=0.125 2024-06-19 18:21:29,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=36754.666666666664, ans=0.1 2024-06-19 18:21:37,113 INFO [train.py:1028] (1/2) Epoch 2, batch 9950, loss[loss=1.077, simple_loss=0.6796, pruned_loss=0.7371, over 12642.00 frames. ], tot_loss[loss=1.005, simple_loss=0.6502, pruned_loss=0.6803, over 2522724.23 frames. ], batch size: 29, lr: 2.03e-02, grad_scale: 0.125 2024-06-19 18:21:41,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=19.56 vs. limit=15.0 2024-06-19 18:21:43,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=36809.666666666664, ans=0.025 2024-06-19 18:21:48,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.38 vs. limit=22.5 2024-06-19 18:21:56,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=36846.333333333336, ans=0.0 2024-06-19 18:21:59,342 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.311e+03 5.742e+03 7.351e+03 9.688e+03 4.180e+04, threshold=1.470e+04, percent-clipped=8.0 2024-06-19 18:22:08,022 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=8.620e-02 2024-06-19 18:22:09,826 INFO [train.py:1028] (1/2) Epoch 2, batch 10000, loss[loss=1.052, simple_loss=0.6637, pruned_loss=0.7205, over 12606.00 frames. ], tot_loss[loss=1.006, simple_loss=0.6497, pruned_loss=0.6807, over 2485421.06 frames. ], batch size: 22, lr: 2.03e-02, grad_scale: 0.25 2024-06-19 18:22:15,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=36883.0, ans=0.125 2024-06-19 18:22:17,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=36901.333333333336, ans=0.1 2024-06-19 18:22:19,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.57 vs. 
limit=10.0 2024-06-19 18:22:22,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=36919.666666666664, ans=0.04949747468305833 2024-06-19 18:22:22,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.43 vs. limit=10.0 2024-06-19 18:22:24,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.99 vs. limit=15.0 2024-06-19 18:22:29,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=36938.0, ans=0.025 2024-06-19 18:22:31,706 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 18:22:38,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=36956.333333333336, ans=0.0028355797101449268 2024-06-19 18:22:40,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=23.01 vs. limit=15.0 2024-06-19 18:22:42,355 INFO [train.py:1028] (1/2) Epoch 2, batch 10050, loss[loss=0.9799, simple_loss=0.6045, pruned_loss=0.6776, over 12541.00 frames. ], tot_loss[loss=1.012, simple_loss=0.6523, pruned_loss=0.6856, over 2443884.71 frames. ], batch size: 22, lr: 2.03e-02, grad_scale: 0.25 2024-06-19 18:22:43,318 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=28.60 vs. limit=22.5 2024-06-19 18:22:47,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=36974.666666666664, ans=15.0 2024-06-19 18:22:53,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=36993.0, ans=0.2 2024-06-19 18:22:55,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.54 vs. limit=10.0 2024-06-19 18:22:59,002 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-19 18:22:59,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.19 vs. limit=6.0 2024-06-19 18:23:03,469 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+03 2.909e+03 4.061e+03 4.962e+03 1.023e+04, threshold=8.122e+03, percent-clipped=0.0 2024-06-19 18:23:09,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=15.0 2024-06-19 18:23:12,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.89 vs. limit=15.0 2024-06-19 18:23:13,625 INFO [train.py:1028] (1/2) Epoch 2, batch 10100, loss[loss=0.9256, simple_loss=0.564, pruned_loss=0.6435, over 11420.00 frames. ], tot_loss[loss=1.016, simple_loss=0.6513, pruned_loss=0.6906, over 2423899.15 frames. 
], batch size: 17, lr: 2.02e-02, grad_scale: 0.5 2024-06-19 18:23:18,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=37066.333333333336, ans=0.0 2024-06-19 18:23:18,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=37066.333333333336, ans=0.0 2024-06-19 18:25:11,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=37095.666666666664, ans=0.07 2024-06-19 18:25:11,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.91 vs. limit=22.5 2024-06-19 18:25:27,261 INFO [train.py:1028] (1/2) Epoch 3, batch 0, loss[loss=0.974, simple_loss=0.6104, pruned_loss=0.6688, over 12920.00 frames. ], tot_loss[loss=0.974, simple_loss=0.6104, pruned_loss=0.6688, over 12920.00 frames. ], batch size: 36, lr: 1.92e-02, grad_scale: 1.0 2024-06-19 18:25:27,261 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 18:25:34,561 INFO [train.py:1060] (1/2) Epoch 3, validation: loss=0.9885, simple_loss=0.6309, pruned_loss=0.673, over 351949.00 frames. 2024-06-19 18:25:34,562 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB 2024-06-19 18:25:36,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=37095.666666666664, ans=0.0 2024-06-19 18:25:45,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=37114.0, ans=0.125 2024-06-19 18:25:47,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=37132.333333333336, ans=0.2 2024-06-19 18:25:59,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=37150.666666666664, ans=0.0 2024-06-19 18:25:59,495 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=22.74 vs. limit=15.0 2024-06-19 18:26:01,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=37150.666666666664, ans=0.0 2024-06-19 18:26:09,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=37169.0, ans=0.125 2024-06-19 18:26:12,007 INFO [train.py:1028] (1/2) Epoch 3, batch 50, loss[loss=1.051, simple_loss=0.6577, pruned_loss=0.7217, over 12815.00 frames. ], tot_loss[loss=0.9586, simple_loss=0.6204, pruned_loss=0.6484, over 574832.10 frames. ], batch size: 29, lr: 1.92e-02, grad_scale: 0.125 2024-06-19 18:26:17,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.91 vs. 
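A note on reading the loss fields above: the logged loss decomposes consistently as 0.5 * simple_loss + pruned_loss. For the Epoch 3, batch 0 record, 0.5 * 0.6104 + 0.6688 = 0.974, and the validation line (0.5 * 0.6309 + 0.673 = 0.9885) checks out the same way. A minimal sketch of that combination; the 0.5 weight is inferred from the logged numbers, and the warm-up ramping of the pruned term that icefall applies early in training is omitted here:

```python
def combine_transducer_losses(simple_loss: float, pruned_loss: float,
                              simple_loss_scale: float = 0.5) -> float:
    # Logged `loss` = simple_loss_scale * simple_loss + pruned_loss.
    return simple_loss_scale * simple_loss + pruned_loss

# Reproduces the Epoch 3, batch 0 record: 0.5 * 0.6104 + 0.6688 == 0.974
assert abs(combine_transducer_losses(0.6104, 0.6688) - 0.974) < 1e-4
```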
limit=15.0 2024-06-19 18:26:21,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37205.666666666664, ans=0.1 2024-06-19 18:26:21,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=37205.666666666664, ans=10.0 2024-06-19 18:26:25,840 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+03 2.994e+03 3.984e+03 5.286e+03 1.594e+04, threshold=7.968e+03, percent-clipped=8.0 2024-06-19 18:26:34,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=37242.333333333336, ans=10.0 2024-06-19 18:26:44,527 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.32 vs. limit=5.0 2024-06-19 18:26:47,242 INFO [train.py:1028] (1/2) Epoch 3, batch 100, loss[loss=0.9843, simple_loss=0.6319, pruned_loss=0.6684, over 13306.00 frames. ], tot_loss[loss=0.9484, simple_loss=0.6169, pruned_loss=0.64, over 1018082.67 frames. ], batch size: 46, lr: 1.92e-02, grad_scale: 0.25 2024-06-19 18:26:51,100 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.37 vs. limit=10.0 2024-06-19 18:26:53,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.31 vs. limit=15.0 2024-06-19 18:26:58,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=37297.333333333336, ans=0.125 2024-06-19 18:26:59,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37315.666666666664, ans=0.125 2024-06-19 18:27:00,480 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=4.173e-02 2024-06-19 18:27:11,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.29 vs. limit=15.0 2024-06-19 18:27:18,764 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.83 vs. limit=15.0 2024-06-19 18:27:19,777 INFO [train.py:1028] (1/2) Epoch 3, batch 150, loss[loss=0.9702, simple_loss=0.6093, pruned_loss=0.6655, over 12777.00 frames. ], tot_loss[loss=0.944, simple_loss=0.6128, pruned_loss=0.6376, over 1365477.29 frames. ], batch size: 29, lr: 1.92e-02, grad_scale: 0.25 2024-06-19 18:27:20,161 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.78 vs. limit=15.0 2024-06-19 18:27:24,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.55 vs. limit=22.5 2024-06-19 18:27:32,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.24 vs. 
limit=10.0 2024-06-19 18:27:33,778 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.831e+03 4.451e+03 5.431e+03 6.629e+03 3.161e+04, threshold=1.086e+04, percent-clipped=12.0 2024-06-19 18:27:38,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=37407.333333333336, ans=0.0 2024-06-19 18:27:39,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=37425.666666666664, ans=0.025 2024-06-19 18:27:55,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.69 vs. limit=22.5 2024-06-19 18:27:56,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37444.0, ans=0.1 2024-06-19 18:27:57,884 INFO [train.py:1028] (1/2) Epoch 3, batch 200, loss[loss=0.9066, simple_loss=0.6123, pruned_loss=0.6004, over 12489.00 frames. ], tot_loss[loss=0.9469, simple_loss=0.6149, pruned_loss=0.6394, over 1634717.61 frames. ], batch size: 202, lr: 1.91e-02, grad_scale: 0.5 2024-06-19 18:28:02,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=37462.333333333336, ans=0.125 2024-06-19 18:28:18,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=37517.333333333336, ans=0.2 2024-06-19 18:28:18,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=37517.333333333336, ans=0.025 2024-06-19 18:28:20,717 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.73 vs. limit=22.5 2024-06-19 18:28:30,720 INFO [train.py:1028] (1/2) Epoch 3, batch 250, loss[loss=0.8693, simple_loss=0.5804, pruned_loss=0.5791, over 13047.00 frames. ], tot_loss[loss=0.941, simple_loss=0.612, pruned_loss=0.635, over 1845211.77 frames. ], batch size: 144, lr: 1.91e-02, grad_scale: 0.25 2024-06-19 18:28:31,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=12.0 2024-06-19 18:28:42,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=37572.333333333336, ans=0.125 2024-06-19 18:28:42,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=37572.333333333336, ans=0.125 2024-06-19 18:28:45,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.50 vs. 
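The Clipping_scale warnings follow a fixed pattern: the five numbers are the minimum, 25th percentile, median, 75th percentile, and maximum of recently observed gradient norms, and the threshold equals Clipping_scale times the median (in the warning above, 2.0 * 5.431e+03 = 1.086e+04). A sketch of that rule; the buffer bookkeeping in icefall's optim.py is assumed here, not copied:

```python
import torch

def clipping_stats(recent_grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """recent_grad_norms: 1-D float tensor of per-batch gradient norms."""
    q = torch.quantile(recent_grad_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]            # e.g. 2.0 * 5.431e3 = 1.086e4
    # "percent-clipped": fraction of recent batches whose norm was capped.
    percent_clipped = 100.0 * (recent_grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped
```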
limit=15.0 2024-06-19 18:28:48,286 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.818e+03 5.994e+03 7.404e+03 9.653e+03 2.050e+04, threshold=1.481e+04, percent-clipped=15.0 2024-06-19 18:28:55,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=37609.0, ans=0.0 2024-06-19 18:28:57,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=37609.0, ans=0.125 2024-06-19 18:28:57,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=37609.0, ans=22.5 2024-06-19 18:28:58,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=37609.0, ans=0.125 2024-06-19 18:29:06,625 INFO [train.py:1028] (1/2) Epoch 3, batch 300, loss[loss=0.8922, simple_loss=0.5826, pruned_loss=0.6009, over 13183.00 frames. ], tot_loss[loss=0.9437, simple_loss=0.6131, pruned_loss=0.6371, over 2008545.91 frames. ], batch size: 112, lr: 1.91e-02, grad_scale: 0.5 2024-06-19 18:29:06,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=37645.666666666664, ans=0.2 2024-06-19 18:29:10,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.00 vs. limit=15.0 2024-06-19 18:29:12,900 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=68.23 vs. limit=15.0 2024-06-19 18:29:16,486 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.668e+01 2024-06-19 18:29:17,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=37664.0, ans=0.2 2024-06-19 18:29:17,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=37664.0, ans=0.0 2024-06-19 18:29:24,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=37682.333333333336, ans=10.0 2024-06-19 18:29:27,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.69 vs. limit=22.5 2024-06-19 18:29:36,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.49 vs. limit=22.5 2024-06-19 18:29:37,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=37719.0, ans=0.125 2024-06-19 18:29:38,103 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.61 vs. limit=15.0 2024-06-19 18:29:38,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.38 vs. limit=15.0 2024-06-19 18:29:39,138 INFO [train.py:1028] (1/2) Epoch 3, batch 350, loss[loss=0.9487, simple_loss=0.6047, pruned_loss=0.6464, over 12846.00 frames. ], tot_loss[loss=0.9455, simple_loss=0.6132, pruned_loss=0.6389, over 2137353.87 frames. 
], batch size: 33, lr: 1.91e-02, grad_scale: 0.25 2024-06-19 18:29:46,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=29.41 vs. limit=22.5 2024-06-19 18:29:55,737 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.67 vs. limit=15.0 2024-06-19 18:29:57,946 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.380e+03 6.298e+03 8.242e+03 1.107e+04 3.750e+04, threshold=1.648e+04, percent-clipped=9.0 2024-06-19 18:30:07,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=37792.333333333336, ans=0.09899494936611666 2024-06-19 18:30:13,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=8.0 2024-06-19 18:30:15,977 INFO [train.py:1028] (1/2) Epoch 3, batch 400, loss[loss=0.9976, simple_loss=0.6316, pruned_loss=0.6818, over 13278.00 frames. ], tot_loss[loss=0.9499, simple_loss=0.6145, pruned_loss=0.6427, over 2238038.63 frames. ], batch size: 63, lr: 1.91e-02, grad_scale: 0.5 2024-06-19 18:30:21,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=37829.0, ans=0.025 2024-06-19 18:30:25,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=37847.333333333336, ans=0.002641884057971014 2024-06-19 18:30:26,586 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.83 vs. limit=15.0 2024-06-19 18:30:27,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37847.333333333336, ans=0.1 2024-06-19 18:30:31,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=37865.666666666664, ans=0.125 2024-06-19 18:30:34,515 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.31 vs. limit=22.5 2024-06-19 18:30:36,206 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.09 vs. limit=15.0 2024-06-19 18:30:38,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=37884.0, ans=0.125 2024-06-19 18:30:47,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=37902.333333333336, ans=0.0 2024-06-19 18:30:53,185 INFO [train.py:1028] (1/2) Epoch 3, batch 450, loss[loss=1.004, simple_loss=0.6487, pruned_loss=0.6801, over 13187.00 frames. ], tot_loss[loss=0.9513, simple_loss=0.6152, pruned_loss=0.6437, over 2312398.48 frames. ], batch size: 67, lr: 1.90e-02, grad_scale: 0.25 2024-06-19 18:30:58,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=37920.666666666664, ans=0.0026259420289855076 2024-06-19 18:31:00,160 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.81 vs. 
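The ScheduledFloat lines record hyperparameters (dropout probabilities, skip rates, balancer bounds) whose values are functions of batch_count rather than constants, which is why the same name reappears with drifting ans= values. A minimal stand-in, assuming piecewise-linear interpolation between (batch_count, value) breakpoints as in icefall's scaling.py; the breakpoints below are illustrative, not taken from the recipe:

```python
class ScheduledFloatSketch:
    """A float that is piecewise-linear in batch_count, clamped at the ends."""
    def __init__(self, *points):                 # e.g. (0, 0.2), (20000, 0.0)
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:                # linear interpolation segment
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        return pts[-1][1]

conv_skip_rate = ScheduledFloatSketch((0.0, 0.2), (20000.0, 0.0))
print(conv_skip_rate.value(37829.0))             # -> 0.0, fully decayed
```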
limit=15.0 2024-06-19 18:31:03,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=37939.0, ans=0.2 2024-06-19 18:31:08,959 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.193e+03 4.137e+03 4.930e+03 5.709e+03 2.950e+04, threshold=9.859e+03, percent-clipped=2.0 2024-06-19 18:31:10,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=19.40 vs. limit=15.0 2024-06-19 18:31:19,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37994.0, ans=0.1 2024-06-19 18:31:26,164 INFO [train.py:1028] (1/2) Epoch 3, batch 500, loss[loss=0.9256, simple_loss=0.6072, pruned_loss=0.622, over 13123.00 frames. ], tot_loss[loss=0.9514, simple_loss=0.6146, pruned_loss=0.6441, over 2375049.67 frames. ], batch size: 121, lr: 1.90e-02, grad_scale: 0.5 2024-06-19 18:31:27,897 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.20 vs. limit=15.0 2024-06-19 18:31:30,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.13 vs. limit=22.5 2024-06-19 18:31:31,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=38012.333333333336, ans=0.0 2024-06-19 18:31:52,321 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=6.899e-02 2024-06-19 18:31:57,486 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.31 vs. limit=22.5 2024-06-19 18:31:59,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=38085.666666666664, ans=0.2 2024-06-19 18:32:01,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=38104.0, ans=0.5 2024-06-19 18:32:01,595 INFO [train.py:1028] (1/2) Epoch 3, batch 550, loss[loss=0.9026, simple_loss=0.6071, pruned_loss=0.5991, over 12920.00 frames. ], tot_loss[loss=0.9512, simple_loss=0.6148, pruned_loss=0.6438, over 2419957.50 frames. ], batch size: 158, lr: 1.90e-02, grad_scale: 0.5 2024-06-19 18:32:10,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38122.333333333336, ans=0.1 2024-06-19 18:32:12,442 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2024-06-19 18:32:14,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=15.0 2024-06-19 18:32:15,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.70 vs. limit=22.5 2024-06-19 18:32:17,740 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=26.93 vs. 
limit=15.0 2024-06-19 18:32:17,900 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.783e+03 4.897e+03 5.981e+03 8.007e+03 2.975e+04, threshold=1.196e+04, percent-clipped=16.0 2024-06-19 18:32:18,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=38140.666666666664, ans=0.025 2024-06-19 18:32:21,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=38159.0, ans=0.125 2024-06-19 18:32:24,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=38159.0, ans=0.0 2024-06-19 18:32:31,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.26 vs. limit=22.5 2024-06-19 18:32:33,853 INFO [train.py:1028] (1/2) Epoch 3, batch 600, loss[loss=0.8232, simple_loss=0.5539, pruned_loss=0.5463, over 13011.00 frames. ], tot_loss[loss=0.9481, simple_loss=0.6134, pruned_loss=0.6415, over 2457335.85 frames. ], batch size: 144, lr: 1.90e-02, grad_scale: 0.5 2024-06-19 18:32:35,557 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.08 vs. limit=12.0 2024-06-19 18:32:35,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.37 vs. limit=15.0 2024-06-19 18:32:46,024 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.998e+02 2024-06-19 18:32:47,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.71 vs. limit=22.5 2024-06-19 18:32:48,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=38232.333333333336, ans=0.125 2024-06-19 18:32:54,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=38232.333333333336, ans=0.09899494936611666 2024-06-19 18:32:57,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=38250.666666666664, ans=0.125 2024-06-19 18:32:59,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=38250.666666666664, ans=0.125 2024-06-19 18:33:00,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=38250.666666666664, ans=0.0025542028985507254 2024-06-19 18:33:03,560 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.62 vs. 
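In the Whitening lines, metric measures how far the channel covariance of an activation is from a multiple of the identity: it is 1.0 for a perfectly white signal and approaches num_channels when the energy collapses into a single direction, and the module only pushes back (via a gradient penalty) once the metric exceeds limit. A sketch of one metric with exactly those properties, assuming it approximates what scaling.py computes (the per-group split implied by num_groups is omitted):

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels). Returns 1.0 when the channel covariance
    C is a multiple of the identity, num_channels when C is rank-1."""
    x = x - x.mean(dim=0)
    c = (x.t() @ x) / x.shape[0]                 # channel covariance, (C, C)
    d = c.shape[0]
    # Scale-invariant anisotropy: trace(C @ C) * d / trace(C)**2.
    return (c * c).sum() * d / (c.diagonal().sum() ** 2)

print(whitening_metric(torch.randn(4000, 192)))  # white input -> close to 1.0
```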
limit=15.0 2024-06-19 18:33:05,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38269.0, ans=0.1 2024-06-19 18:33:06,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=38269.0, ans=0.2 2024-06-19 18:33:06,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=38269.0, ans=0.2 2024-06-19 18:33:09,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=38287.333333333336, ans=0.125 2024-06-19 18:33:09,480 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=15.0 2024-06-19 18:33:09,663 INFO [train.py:1028] (1/2) Epoch 3, batch 650, loss[loss=0.9309, simple_loss=0.5979, pruned_loss=0.6319, over 13184.00 frames. ], tot_loss[loss=0.948, simple_loss=0.614, pruned_loss=0.641, over 2490244.53 frames. ], batch size: 59, lr: 1.90e-02, grad_scale: 0.5 2024-06-19 18:33:12,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=38287.333333333336, ans=0.125 2024-06-19 18:33:21,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=38305.666666666664, ans=0.125 2024-06-19 18:33:25,888 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.762e+03 5.534e+03 7.156e+03 8.254e+03 5.489e+04, threshold=1.431e+04, percent-clipped=6.0 2024-06-19 18:33:30,282 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.70 vs. limit=15.0 2024-06-19 18:33:38,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=38360.666666666664, ans=0.125 2024-06-19 18:33:39,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=38360.666666666664, ans=0.2 2024-06-19 18:33:40,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.77 vs. limit=15.0 2024-06-19 18:33:41,672 INFO [train.py:1028] (1/2) Epoch 3, batch 700, loss[loss=0.9709, simple_loss=0.6235, pruned_loss=0.6592, over 13358.00 frames. ], tot_loss[loss=0.9429, simple_loss=0.6125, pruned_loss=0.6367, over 2513607.97 frames. ], batch size: 46, lr: 1.89e-02, grad_scale: 0.5 2024-06-19 18:33:41,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=38379.0, ans=0.125 2024-06-19 18:33:43,169 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.66 vs. limit=22.5 2024-06-19 18:33:50,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=38397.333333333336, ans=0.125 2024-06-19 18:33:51,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=22.77 vs. 
limit=15.0 2024-06-19 18:33:56,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=38415.666666666664, ans=0.125 2024-06-19 18:33:56,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=38415.666666666664, ans=0.0025183333333333342 2024-06-19 18:33:58,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=38415.666666666664, ans=0.0 2024-06-19 18:33:59,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=38434.0, ans=0.1 2024-06-19 18:34:00,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.87 vs. limit=15.0 2024-06-19 18:34:08,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=38434.0, ans=0.0 2024-06-19 18:34:11,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=38452.333333333336, ans=0.125 2024-06-19 18:34:16,166 INFO [train.py:1028] (1/2) Epoch 3, batch 750, loss[loss=0.9736, simple_loss=0.6202, pruned_loss=0.6635, over 13275.00 frames. ], tot_loss[loss=0.947, simple_loss=0.6143, pruned_loss=0.6399, over 2529344.57 frames. ], batch size: 63, lr: 1.89e-02, grad_scale: 0.5 2024-06-19 18:34:18,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=38470.666666666664, ans=0.0 2024-06-19 18:34:21,019 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0 2024-06-19 18:34:21,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-19 18:34:24,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.21 vs. limit=22.5 2024-06-19 18:34:26,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=38489.0, ans=0.0 2024-06-19 18:34:29,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.64 vs. limit=15.0 2024-06-19 18:34:32,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=12.0 2024-06-19 18:34:32,995 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.877e+03 4.119e+03 5.447e+03 6.871e+03 1.880e+04, threshold=1.089e+04, percent-clipped=2.0 2024-06-19 18:34:33,620 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.52 vs. limit=15.0 2024-06-19 18:34:35,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.87 vs. limit=10.0 2024-06-19 18:34:42,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.73 vs. 
limit=15.0 2024-06-19 18:34:48,706 INFO [train.py:1028] (1/2) Epoch 3, batch 800, loss[loss=0.9754, simple_loss=0.6184, pruned_loss=0.6662, over 12909.00 frames. ], tot_loss[loss=0.9499, simple_loss=0.6154, pruned_loss=0.6422, over 2541565.17 frames. ], batch size: 36, lr: 1.89e-02, grad_scale: 1.0 2024-06-19 18:34:58,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=38580.666666666664, ans=0.0 2024-06-19 18:35:19,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.38 vs. limit=10.0 2024-06-19 18:35:21,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=38635.666666666664, ans=0.0024705072463768124 2024-06-19 18:35:23,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.45 vs. limit=15.0 2024-06-19 18:35:24,407 INFO [train.py:1028] (1/2) Epoch 3, batch 850, loss[loss=0.9594, simple_loss=0.6366, pruned_loss=0.6411, over 13163.00 frames. ], tot_loss[loss=0.9475, simple_loss=0.6137, pruned_loss=0.6407, over 2551505.87 frames. ], batch size: 95, lr: 1.89e-02, grad_scale: 0.5 2024-06-19 18:35:35,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=38672.333333333336, ans=0.2 2024-06-19 18:35:41,678 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.654e+03 3.389e+03 4.557e+03 6.298e+03 1.584e+04, threshold=9.114e+03, percent-clipped=2.0 2024-06-19 18:35:42,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=38690.666666666664, ans=0.2 2024-06-19 18:35:49,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=38709.0, ans=0.125 2024-06-19 18:35:51,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=38727.333333333336, ans=0.0 2024-06-19 18:35:55,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38727.333333333336, ans=0.1 2024-06-19 18:35:56,971 INFO [train.py:1028] (1/2) Epoch 3, batch 900, loss[loss=0.9195, simple_loss=0.593, pruned_loss=0.6229, over 12914.00 frames. ], tot_loss[loss=0.9436, simple_loss=0.6122, pruned_loss=0.6374, over 2556657.05 frames. ], batch size: 36, lr: 1.89e-02, grad_scale: 1.0 2024-06-19 18:35:58,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=38745.666666666664, ans=0.0 2024-06-19 18:36:06,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38764.0, ans=0.1 2024-06-19 18:36:14,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=38782.333333333336, ans=0.0 2024-06-19 18:36:20,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=38800.666666666664, ans=0.0 2024-06-19 18:36:33,162 INFO [train.py:1028] (1/2) Epoch 3, batch 950, loss[loss=1.018, simple_loss=0.6398, pruned_loss=0.6986, over 13240.00 frames. 
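The grad_scale field cycling through 0.125, 0.25, 0.5, 1.0 across the batches above is the fp16 loss-scaling factor: it is halved whenever a batch produces inf/NaN gradients and doubled back after a run of clean steps. A sketch of that dynamic, modeled on torch.cuda.amp.GradScaler semantics with an initial scale of 1.0; the growth interval below is an assumption, not read from the log:

```python
class LossScaleSketch:
    """Dynamic fp16 loss scaling, as suggested by the grad_scale values."""
    def __init__(self, init_scale: float = 1.0, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval   # assumption, not in the log
        self._good_steps = 0

    def update(self, found_inf: bool) -> float:
        if found_inf:                            # overflow: halve, start over
            self.scale *= 0.5
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= 2.0                # stable for a while: double
                self._good_steps = 0
        return self.scale
```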
], tot_loss[loss=0.9439, simple_loss=0.6125, pruned_loss=0.6377, over 2560999.00 frames. ], batch size: 40, lr: 1.88e-02, grad_scale: 0.25 2024-06-19 18:36:38,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=38837.333333333336, ans=10.0 2024-06-19 18:36:41,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38855.666666666664, ans=0.1 2024-06-19 18:36:45,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=38855.666666666664, ans=0.125 2024-06-19 18:36:52,258 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.494e+03 4.271e+03 5.142e+03 6.220e+03 2.085e+04, threshold=1.028e+04, percent-clipped=9.0 2024-06-19 18:36:53,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=38892.333333333336, ans=0.0 2024-06-19 18:36:58,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.10 vs. limit=10.0 2024-06-19 18:37:02,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2024-06-19 18:37:08,669 INFO [train.py:1028] (1/2) Epoch 3, batch 1000, loss[loss=0.9952, simple_loss=0.632, pruned_loss=0.6792, over 13350.00 frames. ], tot_loss[loss=0.9416, simple_loss=0.6119, pruned_loss=0.6357, over 2563630.72 frames. ], batch size: 49, lr: 1.88e-02, grad_scale: 0.5 2024-06-19 18:37:09,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=38929.0, ans=0.125 2024-06-19 18:37:10,267 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.13 vs. limit=22.5 2024-06-19 18:37:18,107 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.23 vs. limit=10.0 2024-06-19 18:37:18,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=38947.333333333336, ans=0.0 2024-06-19 18:37:31,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=38984.0, ans=0.025 2024-06-19 18:37:36,892 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.22 vs. limit=15.0 2024-06-19 18:37:39,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=39002.333333333336, ans=0.0 2024-06-19 18:37:39,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=39002.333333333336, ans=0.125 2024-06-19 18:37:41,744 INFO [train.py:1028] (1/2) Epoch 3, batch 1050, loss[loss=0.9355, simple_loss=0.617, pruned_loss=0.627, over 13157.00 frames. ], tot_loss[loss=0.9449, simple_loss=0.6137, pruned_loss=0.6381, over 2567909.55 frames. 
], batch size: 77, lr: 1.88e-02, grad_scale: 0.5 2024-06-19 18:37:49,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=39039.0, ans=0.125 2024-06-19 18:37:50,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=39039.0, ans=0.125 2024-06-19 18:37:51,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.48 vs. limit=15.0 2024-06-19 18:37:56,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=39057.333333333336, ans=0.0023788405797101437 2024-06-19 18:38:00,481 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.820e+03 4.352e+03 5.063e+03 6.175e+03 1.714e+04, threshold=1.013e+04, percent-clipped=2.0 2024-06-19 18:38:06,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=39075.666666666664, ans=0.95 2024-06-19 18:38:16,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=39094.0, ans=0.125 2024-06-19 18:38:18,032 INFO [train.py:1028] (1/2) Epoch 3, batch 1100, loss[loss=0.9242, simple_loss=0.5972, pruned_loss=0.6256, over 13212.00 frames. ], tot_loss[loss=0.9464, simple_loss=0.6157, pruned_loss=0.6386, over 2572976.01 frames. ], batch size: 52, lr: 1.88e-02, grad_scale: 1.0 2024-06-19 18:38:23,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.51 vs. limit=15.0 2024-06-19 18:38:26,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=39130.666666666664, ans=0.2 2024-06-19 18:38:30,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39130.666666666664, ans=0.1 2024-06-19 18:38:34,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.70 vs. limit=15.0 2024-06-19 18:38:37,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=39167.333333333336, ans=0.2 2024-06-19 18:38:46,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=39185.666666666664, ans=0.0023509420289855084 2024-06-19 18:38:47,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.34 vs. limit=15.0 2024-06-19 18:38:50,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=39185.666666666664, ans=0.04949747468305833 2024-06-19 18:38:51,868 INFO [train.py:1028] (1/2) Epoch 3, batch 1150, loss[loss=0.9448, simple_loss=0.6118, pruned_loss=0.6389, over 13257.00 frames. ], tot_loss[loss=0.9442, simple_loss=0.6152, pruned_loss=0.6366, over 2572599.61 frames. 
], batch size: 52, lr: 1.88e-02, grad_scale: 0.25 2024-06-19 18:38:57,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=39204.0, ans=0.1 2024-06-19 18:39:15,074 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.781e+03 5.650e+03 7.309e+03 9.130e+03 2.030e+04, threshold=1.462e+04, percent-clipped=14.0 2024-06-19 18:39:17,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=39259.0, ans=0.125 2024-06-19 18:39:20,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=39277.333333333336, ans=0.002331014492753623 2024-06-19 18:39:27,150 INFO [train.py:1028] (1/2) Epoch 3, batch 1200, loss[loss=0.9118, simple_loss=0.6012, pruned_loss=0.6112, over 13191.00 frames. ], tot_loss[loss=0.9399, simple_loss=0.6138, pruned_loss=0.633, over 2574705.84 frames. ], batch size: 77, lr: 1.87e-02, grad_scale: 0.5 2024-06-19 18:39:29,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.40 vs. limit=15.0 2024-06-19 18:39:30,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=15.0 2024-06-19 18:39:30,383 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.26 vs. limit=22.5 2024-06-19 18:39:34,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=39314.0, ans=0.125 2024-06-19 18:39:35,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=16.47 vs. limit=15.0 2024-06-19 18:39:41,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.08 vs. limit=15.0 2024-06-19 18:39:59,769 INFO [train.py:1028] (1/2) Epoch 3, batch 1250, loss[loss=0.8837, simple_loss=0.5876, pruned_loss=0.5898, over 13169.00 frames. ], tot_loss[loss=0.9388, simple_loss=0.6137, pruned_loss=0.6319, over 2584046.31 frames. ], batch size: 112, lr: 1.87e-02, grad_scale: 0.5 2024-06-19 18:40:00,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=39387.333333333336, ans=0.125 2024-06-19 18:40:01,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=39387.333333333336, ans=0.04949747468305833 2024-06-19 18:40:02,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=39387.333333333336, ans=0.0023071014492753615 2024-06-19 18:40:04,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2024-06-19 18:40:11,530 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.62 vs. 
limit=15.0 2024-06-19 18:40:11,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=39405.666666666664, ans=0.125 2024-06-19 18:40:17,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=39424.0, ans=0.025 2024-06-19 18:40:23,634 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.725e+03 5.399e+03 6.509e+03 7.810e+03 3.117e+04, threshold=1.302e+04, percent-clipped=4.0 2024-06-19 18:40:35,159 INFO [train.py:1028] (1/2) Epoch 3, batch 1300, loss[loss=0.8723, simple_loss=0.5884, pruned_loss=0.5781, over 12717.00 frames. ], tot_loss[loss=0.9371, simple_loss=0.6129, pruned_loss=0.6307, over 2583862.14 frames. ], batch size: 176, lr: 1.87e-02, grad_scale: 0.5 2024-06-19 18:40:41,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=39497.333333333336, ans=0.0 2024-06-19 18:40:41,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=39497.333333333336, ans=0.125 2024-06-19 18:40:43,841 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.32 vs. limit=10.0 2024-06-19 18:40:47,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.39 vs. limit=15.0 2024-06-19 18:40:48,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=39515.666666666664, ans=0.0 2024-06-19 18:40:49,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.65 vs. limit=22.5 2024-06-19 18:40:49,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.28 vs. limit=10.0 2024-06-19 18:40:50,386 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.76 vs. limit=22.5 2024-06-19 18:40:50,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=39515.666666666664, ans=0.1 2024-06-19 18:40:55,768 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.074e+03 2024-06-19 18:40:57,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=39534.0, ans=0.125 2024-06-19 18:40:58,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39534.0, ans=0.1 2024-06-19 18:41:00,665 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=6.515e+02 2024-06-19 18:41:02,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=39552.333333333336, ans=0.125 2024-06-19 18:41:10,755 INFO [train.py:1028] (1/2) Epoch 3, batch 1350, loss[loss=0.8996, simple_loss=0.595, pruned_loss=0.6021, over 13194.00 frames. ], tot_loss[loss=0.9365, simple_loss=0.613, pruned_loss=0.63, over 2584987.83 frames. 
], batch size: 59, lr: 1.87e-02, grad_scale: 0.25 2024-06-19 18:41:12,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.20 vs. limit=15.0 2024-06-19 18:41:33,366 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.273e+03 4.394e+03 5.413e+03 6.474e+03 2.657e+04, threshold=1.083e+04, percent-clipped=4.0 2024-06-19 18:41:36,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=39625.666666666664, ans=0.125 2024-06-19 18:41:44,832 INFO [train.py:1028] (1/2) Epoch 3, batch 1400, loss[loss=0.9451, simple_loss=0.6072, pruned_loss=0.6415, over 12406.00 frames. ], tot_loss[loss=0.9342, simple_loss=0.6123, pruned_loss=0.6281, over 2587043.74 frames. ], batch size: 25, lr: 1.87e-02, grad_scale: 0.5 2024-06-19 18:41:49,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39662.333333333336, ans=0.1 2024-06-19 18:41:59,970 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.61 vs. limit=15.0 2024-06-19 18:42:08,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.33 vs. limit=12.0 2024-06-19 18:42:09,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.96 vs. limit=10.0 2024-06-19 18:42:17,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.01 vs. limit=6.0 2024-06-19 18:42:18,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.07 vs. limit=10.0 2024-06-19 18:42:20,320 INFO [train.py:1028] (1/2) Epoch 3, batch 1450, loss[loss=0.8545, simple_loss=0.574, pruned_loss=0.5675, over 13121.00 frames. ], tot_loss[loss=0.9301, simple_loss=0.6107, pruned_loss=0.6247, over 2586567.52 frames. ], batch size: 121, lr: 1.86e-02, grad_scale: 0.5 2024-06-19 18:42:23,242 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.70 vs. limit=10.0 2024-06-19 18:42:26,246 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.73 vs. limit=22.5 2024-06-19 18:42:28,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=15.0 2024-06-19 18:42:36,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=27.93 vs. limit=15.0 2024-06-19 18:42:37,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.88 vs. 
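The fractional frame counts in the tot_loss records (e.g. "over 2587043.74 frames" above) show that tot_loss is not a plain sum since the start of the epoch but a geometrically decayed accumulator, so the reported average is dominated by recent batches. A sketch, assuming the update tot = tot * (1 - 1/reset_interval) + batch, matching the accumulation pattern in icefall's train.py:

```python
def update_tot_loss(tot_loss_sum, tot_frames, batch_loss_sum, batch_frames,
                    reset_interval):
    """reset_interval comes from the training config; the decayed sums explain
    the non-integer frame counts logged alongside tot_loss."""
    decay = 1.0 - 1.0 / reset_interval           # old batches fade out
    tot_loss_sum = decay * tot_loss_sum + batch_loss_sum
    tot_frames = decay * tot_frames + batch_frames
    return tot_loss_sum, tot_frames              # reported loss = sum / frames
```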
limit=15.0 2024-06-19 18:42:40,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=39809.0, ans=0.1 2024-06-19 18:42:42,112 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.014e+03 3.850e+03 4.974e+03 5.928e+03 1.537e+04, threshold=9.949e+03, percent-clipped=3.0 2024-06-19 18:42:48,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0 2024-06-19 18:42:49,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=39827.333333333336, ans=0.1 2024-06-19 18:42:53,285 INFO [train.py:1028] (1/2) Epoch 3, batch 1500, loss[loss=0.9066, simple_loss=0.6031, pruned_loss=0.6051, over 13263.00 frames. ], tot_loss[loss=0.9261, simple_loss=0.6094, pruned_loss=0.6214, over 2589639.41 frames. ], batch size: 83, lr: 1.86e-02, grad_scale: 1.0 2024-06-19 18:42:54,665 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=1.554e-02 2024-06-19 18:42:55,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=39845.666666666664, ans=0.125 2024-06-19 18:43:11,420 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.59 vs. limit=15.0 2024-06-19 18:43:14,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=39882.333333333336, ans=0.125 2024-06-19 18:43:17,281 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 18:43:28,812 INFO [train.py:1028] (1/2) Epoch 3, batch 1550, loss[loss=0.8955, simple_loss=0.6015, pruned_loss=0.5948, over 13108.00 frames. ], tot_loss[loss=0.9258, simple_loss=0.6098, pruned_loss=0.621, over 2585144.29 frames. ], batch size: 103, lr: 1.86e-02, grad_scale: 0.25 2024-06-19 18:43:32,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=39937.333333333336, ans=0.0 2024-06-19 18:43:41,376 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.54 vs. limit=15.0 2024-06-19 18:43:47,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=39974.0, ans=0.125 2024-06-19 18:43:47,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39974.0, ans=0.1 2024-06-19 18:43:52,009 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.633e+03 5.266e+03 5.930e+03 7.301e+03 2.204e+04, threshold=1.186e+04, percent-clipped=10.0 2024-06-19 18:44:01,574 INFO [train.py:1028] (1/2) Epoch 3, batch 1600, loss[loss=0.9244, simple_loss=0.6141, pruned_loss=0.6174, over 13188.00 frames. ], tot_loss[loss=0.9248, simple_loss=0.6102, pruned_loss=0.6197, over 2580209.14 frames. 
], batch size: 77, lr: 1.86e-02, grad_scale: 0.5 2024-06-19 18:44:19,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=40065.666666666664, ans=0.0021596376811594212 2024-06-19 18:44:26,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=40084.0, ans=0.2 2024-06-19 18:44:31,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=40102.333333333336, ans=0.125 2024-06-19 18:44:33,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=40102.333333333336, ans=0.0021516666666666663 2024-06-19 18:44:35,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=40102.333333333336, ans=0.2 2024-06-19 18:44:36,561 INFO [train.py:1028] (1/2) Epoch 3, batch 1650, loss[loss=0.9308, simple_loss=0.6252, pruned_loss=0.6183, over 13165.00 frames. ], tot_loss[loss=0.924, simple_loss=0.6109, pruned_loss=0.6185, over 2575865.94 frames. ], batch size: 95, lr: 1.86e-02, grad_scale: 0.5 2024-06-19 18:44:41,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=40120.666666666664, ans=0.125 2024-06-19 18:44:43,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=40139.0, ans=0.125 2024-06-19 18:44:45,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=40139.0, ans=0.125 2024-06-19 18:44:51,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.75 vs. limit=15.0 2024-06-19 18:45:03,787 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.181e+03 5.290e+03 6.209e+03 7.609e+03 2.355e+04, threshold=1.242e+04, percent-clipped=4.0 2024-06-19 18:45:05,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=40175.666666666664, ans=0.00213572463768116 2024-06-19 18:45:13,638 INFO [train.py:1028] (1/2) Epoch 3, batch 1700, loss[loss=0.9837, simple_loss=0.6435, pruned_loss=0.662, over 12827.00 frames. ], tot_loss[loss=0.9244, simple_loss=0.6119, pruned_loss=0.6184, over 2580890.01 frames. ], batch size: 26, lr: 1.86e-02, grad_scale: 1.0 2024-06-19 18:45:13,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=40212.333333333336, ans=0.125 2024-06-19 18:45:17,336 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.04 vs. limit=10.0 2024-06-19 18:45:29,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=40249.0, ans=0.125 2024-06-19 18:45:36,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40267.333333333336, ans=0.1 2024-06-19 18:45:39,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.98 vs. 
limit=5.0 2024-06-19 18:45:40,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=40285.666666666664, ans=0.0 2024-06-19 18:45:41,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=40285.666666666664, ans=0.2 2024-06-19 18:45:46,242 INFO [train.py:1028] (1/2) Epoch 3, batch 1750, loss[loss=0.9708, simple_loss=0.6337, pruned_loss=0.6539, over 12400.00 frames. ], tot_loss[loss=0.9235, simple_loss=0.6122, pruned_loss=0.6174, over 2581713.40 frames. ], batch size: 22, lr: 1.85e-02, grad_scale: 0.25 2024-06-19 18:45:49,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=12.0 2024-06-19 18:46:00,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40340.666666666664, ans=0.1 2024-06-19 18:46:13,343 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.995e+03 5.130e+03 6.218e+03 7.636e+03 3.397e+04, threshold=1.244e+04, percent-clipped=8.0 2024-06-19 18:46:21,801 INFO [train.py:1028] (1/2) Epoch 3, batch 1800, loss[loss=0.8933, simple_loss=0.5995, pruned_loss=0.5936, over 13244.00 frames. ], tot_loss[loss=0.9205, simple_loss=0.6116, pruned_loss=0.6147, over 2581729.80 frames. ], batch size: 67, lr: 1.85e-02, grad_scale: 0.5 2024-06-19 18:46:27,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=40395.666666666664, ans=0.125 2024-06-19 18:46:31,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=40414.0, ans=0.0 2024-06-19 18:46:33,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=40414.0, ans=0.125 2024-06-19 18:46:46,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=40450.666666666664, ans=0.125 2024-06-19 18:46:47,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40469.0, ans=0.1 2024-06-19 18:46:50,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.77 vs. limit=15.0 2024-06-19 18:46:55,107 INFO [train.py:1028] (1/2) Epoch 3, batch 1850, loss[loss=0.8789, simple_loss=0.5878, pruned_loss=0.585, over 13212.00 frames. ], tot_loss[loss=0.9182, simple_loss=0.6115, pruned_loss=0.6124, over 2584139.33 frames. ], batch size: 83, lr: 1.85e-02, grad_scale: 0.5 2024-06-19 18:46:57,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=40487.333333333336, ans=0.07 2024-06-19 18:47:01,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.19 vs. limit=15.0 2024-06-19 18:47:03,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=40487.333333333336, ans=0.0 2024-06-19 18:47:09,970 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.94 vs. 
2024-06-19 18:47:12,589 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.08 vs. limit=15.0
2024-06-19 18:47:16,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=40524.0, ans=0.09899494936611666
2024-06-19 18:47:22,542 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.019e+03 4.777e+03 5.751e+03 7.040e+03 2.086e+04, threshold=1.150e+04, percent-clipped=4.0
2024-06-19 18:47:24,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40560.666666666664, ans=0.1
2024-06-19 18:47:25,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=40560.666666666664, ans=0.125
2024-06-19 18:47:26,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.52 vs. limit=22.5
2024-06-19 18:47:30,550 INFO [train.py:1028] (1/2) Epoch 3, batch 1900, loss[loss=0.9154, simple_loss=0.6153, pruned_loss=0.6077, over 13125.00 frames. ], tot_loss[loss=0.9126, simple_loss=0.6089, pruned_loss=0.6082, over 2587070.18 frames. ], batch size: 95, lr: 1.85e-02, grad_scale: 0.5
2024-06-19 18:47:31,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.19 vs. limit=15.0
2024-06-19 18:47:31,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=40579.0, ans=0.125
2024-06-19 18:47:31,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=40579.0, ans=0.125
2024-06-19 18:47:35,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=40579.0, ans=0.125
2024-06-19 18:47:41,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=40597.333333333336, ans=0.125
2024-06-19 18:47:47,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=40615.666666666664, ans=0.125
2024-06-19 18:47:49,564 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.28 vs. limit=15.0
2024-06-19 18:47:53,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40634.0, ans=0.1
2024-06-19 18:47:56,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=40634.0, ans=0.125
2024-06-19 18:48:03,242 INFO [train.py:1028] (1/2) Epoch 3, batch 1950, loss[loss=0.9761, simple_loss=0.657, pruned_loss=0.6476, over 13242.00 frames. ], tot_loss[loss=0.9087, simple_loss=0.608, pruned_loss=0.6047, over 2593045.38 frames. ], batch size: 52, lr: 1.85e-02, grad_scale: 0.5
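The scaling.py:214 entries record, for each ScheduledFloat hyperparameter (skip rates, balancer probabilities, dropout rates, minimum scales), the value `ans` in effect at the current `batch_count`. A minimal sketch of a piecewise-linear schedule keyed on batch count follows; the class name, breakpoints, and interpolation rule here are assumptions for illustration only.

```python
# Illustrative piecewise-linear float schedule, in the spirit of the
# "ScheduledFloat: name=..., batch_count=..., ans=..." lines above.
class ScheduledFloatSketch:
    def __init__(self, *points: tuple):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        (x0, y0) = self.points[0]
        if batch_count <= x0:
            return y0  # clamp before the first breakpoint
        for (x1, y1) in self.points[1:]:
            if batch_count <= x1:
                # linear interpolation between neighbouring breakpoints
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
            x0, y0 = x1, y1
        return y0  # clamp past the last breakpoint

# e.g. a hypothetical skip rate decaying from 0.5 to 0.0 over 20k batches:
skip_rate = ScheduledFloatSketch((0.0, 0.5), (20000.0, 0.0))
print(skip_rate.value(40102.3))  # 0.0 once the schedule has finished
```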
2024-06-19 18:48:22,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=40707.333333333336, ans=0.125
2024-06-19 18:48:29,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=40725.666666666664, ans=0.2
2024-06-19 18:48:31,266 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.04 vs. limit=22.5
2024-06-19 18:48:31,448 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.366e+03 5.194e+03 5.876e+03 7.303e+03 3.239e+04, threshold=1.175e+04, percent-clipped=6.0
2024-06-19 18:48:34,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.27 vs. limit=15.0
2024-06-19 18:48:39,676 INFO [train.py:1028] (1/2) Epoch 3, batch 2000, loss[loss=0.9147, simple_loss=0.6148, pruned_loss=0.6072, over 12537.00 frames. ], tot_loss[loss=0.9068, simple_loss=0.6084, pruned_loss=0.6025, over 2588357.28 frames. ], batch size: 22, lr: 1.84e-02, grad_scale: 1.0
2024-06-19 18:48:40,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=15.0
2024-06-19 18:48:47,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=40780.666666666664, ans=0.0
2024-06-19 18:48:50,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=40780.666666666664, ans=0.0
2024-06-19 18:48:52,476 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.26 vs. limit=15.0
2024-06-19 18:49:00,493 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.71 vs. limit=22.5
2024-06-19 18:49:07,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=40817.333333333336, ans=0.0
2024-06-19 18:49:09,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=40835.666666666664, ans=0.125
2024-06-19 18:49:15,428 INFO [train.py:1028] (1/2) Epoch 3, batch 2050, loss[loss=0.8519, simple_loss=0.5762, pruned_loss=0.5638, over 12837.00 frames. ], tot_loss[loss=0.9038, simple_loss=0.6079, pruned_loss=0.5998, over 2583713.50 frames. ], batch size: 29, lr: 1.84e-02, grad_scale: 0.5
2024-06-19 18:49:22,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.86 vs. limit=15.0
2024-06-19 18:49:26,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.18 vs.
limit=15.0 2024-06-19 18:49:26,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40872.333333333336, ans=0.1 2024-06-19 18:49:26,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=40872.333333333336, ans=0.125 2024-06-19 18:49:27,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=40872.333333333336, ans=0.125 2024-06-19 18:49:28,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=40890.666666666664, ans=0.0 2024-06-19 18:49:33,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40890.666666666664, ans=0.1 2024-06-19 18:49:39,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=40909.0, ans=0.07 2024-06-19 18:49:41,298 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.988e+03 4.935e+03 6.106e+03 7.334e+03 2.024e+04, threshold=1.221e+04, percent-clipped=5.0 2024-06-19 18:49:45,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=40927.333333333336, ans=0.125 2024-06-19 18:49:47,805 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.16 vs. limit=15.0 2024-06-19 18:49:48,011 INFO [train.py:1028] (1/2) Epoch 3, batch 2100, loss[loss=0.864, simple_loss=0.5811, pruned_loss=0.5735, over 13173.00 frames. ], tot_loss[loss=0.9026, simple_loss=0.6074, pruned_loss=0.5989, over 2587406.10 frames. ], batch size: 59, lr: 1.84e-02, grad_scale: 0.5 2024-06-19 18:49:49,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=40945.666666666664, ans=0.125 2024-06-19 18:49:50,633 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.512e-03 2024-06-19 18:49:57,146 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.05 vs. limit=6.0 2024-06-19 18:50:06,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=15.0 2024-06-19 18:50:13,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=41019.0, ans=0.125 2024-06-19 18:50:21,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=41019.0, ans=0.125 2024-06-19 18:50:23,675 INFO [train.py:1028] (1/2) Epoch 3, batch 2150, loss[loss=0.9582, simple_loss=0.6454, pruned_loss=0.6355, over 13309.00 frames. ], tot_loss[loss=0.903, simple_loss=0.6078, pruned_loss=0.5991, over 2589980.98 frames. ], batch size: 52, lr: 1.84e-02, grad_scale: 0.5 2024-06-19 18:50:29,569 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.82 vs. 
limit=15.0 2024-06-19 18:50:36,515 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.99 vs. limit=15.0 2024-06-19 18:50:37,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2024-06-19 18:50:37,620 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.64 vs. limit=22.5 2024-06-19 18:50:40,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=41074.0, ans=0.05 2024-06-19 18:50:40,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=41074.0, ans=0.025 2024-06-19 18:50:43,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=41092.333333333336, ans=0.125 2024-06-19 18:50:50,460 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.648e+03 3.926e+03 4.705e+03 5.559e+03 2.632e+04, threshold=9.410e+03, percent-clipped=1.0 2024-06-19 18:50:53,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=41110.666666666664, ans=0.125 2024-06-19 18:50:56,983 INFO [train.py:1028] (1/2) Epoch 3, batch 2200, loss[loss=0.8595, simple_loss=0.5853, pruned_loss=0.5669, over 13210.00 frames. ], tot_loss[loss=0.9013, simple_loss=0.6078, pruned_loss=0.5974, over 2589293.74 frames. ], batch size: 83, lr: 1.84e-02, grad_scale: 1.0 2024-06-19 18:50:58,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.71 vs. limit=22.5 2024-06-19 18:51:04,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=41129.0, ans=0.125 2024-06-19 18:51:05,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=41129.0, ans=0.125 2024-06-19 18:51:05,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=19.36 vs. limit=15.0 2024-06-19 18:51:24,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=41184.0, ans=0.5 2024-06-19 18:51:31,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=41202.333333333336, ans=0.0 2024-06-19 18:51:32,292 INFO [train.py:1028] (1/2) Epoch 3, batch 2250, loss[loss=0.8297, simple_loss=0.563, pruned_loss=0.5482, over 13234.00 frames. ], tot_loss[loss=0.8969, simple_loss=0.6058, pruned_loss=0.5939, over 2587769.64 frames. ], batch size: 63, lr: 1.83e-02, grad_scale: 0.5 2024-06-19 18:51:32,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.82 vs. limit=12.0 2024-06-19 18:51:33,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.52 vs. 
limit=15.0 2024-06-19 18:51:36,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41220.666666666664, ans=0.1 2024-06-19 18:51:47,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=41257.333333333336, ans=0.2 2024-06-19 18:51:52,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=41275.666666666664, ans=0.125 2024-06-19 18:51:54,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=41275.666666666664, ans=0.125 2024-06-19 18:51:57,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41275.666666666664, ans=0.1 2024-06-19 18:51:59,453 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.665e+03 4.281e+03 4.893e+03 5.711e+03 1.693e+04, threshold=9.786e+03, percent-clipped=6.0 2024-06-19 18:52:00,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=41294.0, ans=0.0 2024-06-19 18:52:05,674 INFO [train.py:1028] (1/2) Epoch 3, batch 2300, loss[loss=0.9325, simple_loss=0.6256, pruned_loss=0.6197, over 12950.00 frames. ], tot_loss[loss=0.8962, simple_loss=0.6061, pruned_loss=0.5932, over 2582128.03 frames. ], batch size: 33, lr: 1.83e-02, grad_scale: 1.0 2024-06-19 18:52:23,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=41349.0, ans=0.0018806521739130437 2024-06-19 18:52:27,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.55 vs. limit=15.0 2024-06-19 18:52:30,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=41367.333333333336, ans=0.025 2024-06-19 18:52:41,866 INFO [train.py:1028] (1/2) Epoch 3, batch 2350, loss[loss=0.8541, simple_loss=0.5855, pruned_loss=0.5613, over 13201.00 frames. ], tot_loss[loss=0.8919, simple_loss=0.6046, pruned_loss=0.5896, over 2585505.14 frames. ], batch size: 67, lr: 1.83e-02, grad_scale: 0.5 2024-06-19 18:52:54,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.99 vs. limit=10.0 2024-06-19 18:52:56,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=41440.666666666664, ans=0.125 2024-06-19 18:53:00,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.51 vs. limit=22.5 2024-06-19 18:53:11,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=41477.333333333336, ans=0.125 2024-06-19 18:53:13,275 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.98 vs. 
limit=10.0 2024-06-19 18:53:13,518 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.769e+03 4.073e+03 4.804e+03 5.808e+03 2.098e+04, threshold=9.609e+03, percent-clipped=4.0 2024-06-19 18:53:14,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.21 vs. limit=15.0 2024-06-19 18:53:18,288 INFO [train.py:1028] (1/2) Epoch 3, batch 2400, loss[loss=0.867, simple_loss=0.592, pruned_loss=0.571, over 13278.00 frames. ], tot_loss[loss=0.8865, simple_loss=0.6024, pruned_loss=0.5853, over 2588105.48 frames. ], batch size: 46, lr: 1.83e-02, grad_scale: 0.5 2024-06-19 18:53:25,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=41514.0, ans=0.125 2024-06-19 18:53:37,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=41550.666666666664, ans=0.2 2024-06-19 18:53:38,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2024-06-19 18:53:43,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=41569.0, ans=0.0 2024-06-19 18:53:50,821 INFO [train.py:1028] (1/2) Epoch 3, batch 2450, loss[loss=0.8955, simple_loss=0.6178, pruned_loss=0.5866, over 13289.00 frames. ], tot_loss[loss=0.8787, simple_loss=0.599, pruned_loss=0.5792, over 2584811.81 frames. ], batch size: 63, lr: 1.83e-02, grad_scale: 0.5 2024-06-19 18:53:51,914 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.91 vs. limit=10.0 2024-06-19 18:53:52,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=15.0 2024-06-19 18:53:58,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=41605.666666666664, ans=0.04949747468305833 2024-06-19 18:54:00,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=41605.666666666664, ans=0.125 2024-06-19 18:54:05,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.20 vs. limit=15.0 2024-06-19 18:54:14,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=41642.333333333336, ans=0.125 2024-06-19 18:54:15,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.70 vs. limit=15.0 2024-06-19 18:54:18,861 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.75 vs. 
limit=10.0 2024-06-19 18:54:20,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=41660.666666666664, ans=10.0 2024-06-19 18:54:21,871 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.163e+03 3.963e+03 4.743e+03 5.334e+03 1.602e+04, threshold=9.487e+03, percent-clipped=4.0 2024-06-19 18:54:26,860 INFO [train.py:1028] (1/2) Epoch 3, batch 2500, loss[loss=0.813, simple_loss=0.5666, pruned_loss=0.5297, over 13209.00 frames. ], tot_loss[loss=0.8703, simple_loss=0.5954, pruned_loss=0.5726, over 2586730.35 frames. ], batch size: 83, lr: 1.83e-02, grad_scale: 1.0 2024-06-19 18:54:29,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=41679.0, ans=8.0 2024-06-19 18:54:29,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=41679.0, ans=0.125 2024-06-19 18:54:33,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=41697.333333333336, ans=0.2 2024-06-19 18:54:39,492 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.33 vs. limit=22.5 2024-06-19 18:54:40,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-19 18:54:46,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=41734.0, ans=0.1 2024-06-19 18:54:52,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=15.0 2024-06-19 18:54:58,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=41752.333333333336, ans=0.125 2024-06-19 18:55:02,871 INFO [train.py:1028] (1/2) Epoch 3, batch 2550, loss[loss=0.9208, simple_loss=0.6435, pruned_loss=0.5991, over 12681.00 frames. ], tot_loss[loss=0.8633, simple_loss=0.593, pruned_loss=0.5668, over 2585793.93 frames. ], batch size: 22, lr: 1.82e-02, grad_scale: 1.0 2024-06-19 18:55:09,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=18.49 vs. limit=15.0 2024-06-19 18:55:21,399 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.99 vs. limit=15.0 2024-06-19 18:55:32,213 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.899e+03 4.611e+03 5.346e+03 6.458e+03 2.438e+04, threshold=1.069e+04, percent-clipped=5.0 2024-06-19 18:55:32,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=41844.0, ans=0.125 2024-06-19 18:55:35,435 INFO [train.py:1028] (1/2) Epoch 3, batch 2600, loss[loss=0.8039, simple_loss=0.5612, pruned_loss=0.5233, over 13240.00 frames. ], tot_loss[loss=0.855, simple_loss=0.5893, pruned_loss=0.5603, over 2586564.96 frames. ], batch size: 52, lr: 1.82e-02, grad_scale: 0.5 2024-06-19 18:55:37,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.28 vs. 
limit=15.0 2024-06-19 18:55:49,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=41899.0, ans=0.025 2024-06-19 18:55:49,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=41899.0, ans=0.125 2024-06-19 18:55:53,410 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.97 vs. limit=22.5 2024-06-19 18:55:55,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=41917.333333333336, ans=0.0 2024-06-19 18:56:04,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=41935.666666666664, ans=0.125 2024-06-19 18:56:09,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2024-06-19 18:56:10,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=15.0 2024-06-19 18:56:11,526 INFO [train.py:1028] (1/2) Epoch 3, batch 2650, loss[loss=0.7426, simple_loss=0.523, pruned_loss=0.4811, over 13049.00 frames. ], tot_loss[loss=0.847, simple_loss=0.5859, pruned_loss=0.5541, over 2587140.25 frames. ], batch size: 144, lr: 1.82e-02, grad_scale: 0.25 2024-06-19 18:56:22,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.95 vs. limit=15.0 2024-06-19 18:56:25,469 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.62 vs. limit=15.0 2024-06-19 18:56:28,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.20 vs. limit=15.0 2024-06-19 18:56:30,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=42009.0, ans=0.125 2024-06-19 18:56:31,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=42009.0, ans=0.0017371739130434775 2024-06-19 18:56:32,622 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=12.0 2024-06-19 18:56:41,826 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+03 4.192e+03 4.700e+03 5.564e+03 2.083e+04, threshold=9.401e+03, percent-clipped=2.0 2024-06-19 18:56:44,344 INFO [train.py:1028] (1/2) Epoch 3, batch 2700, loss[loss=0.7826, simple_loss=0.5499, pruned_loss=0.5077, over 13284.00 frames. ], tot_loss[loss=0.8371, simple_loss=0.5809, pruned_loss=0.5466, over 2584574.32 frames. 
], batch size: 89, lr: 1.82e-02, grad_scale: 0.5 2024-06-19 18:56:50,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=42045.666666666664, ans=0.125 2024-06-19 18:56:52,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=42045.666666666664, ans=0.07 2024-06-19 18:56:55,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.06 vs. limit=10.0 2024-06-19 18:56:58,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=42064.0, ans=0.0 2024-06-19 18:56:59,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42064.0, ans=0.1 2024-06-19 18:56:59,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.44 vs. limit=6.0 2024-06-19 18:57:04,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=42082.333333333336, ans=0.0 2024-06-19 18:57:15,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.51 vs. limit=15.0 2024-06-19 18:57:20,216 INFO [train.py:1028] (1/2) Epoch 3, batch 2750, loss[loss=0.8288, simple_loss=0.5854, pruned_loss=0.5361, over 13231.00 frames. ], tot_loss[loss=0.8296, simple_loss=0.5783, pruned_loss=0.5405, over 2581583.87 frames. ], batch size: 43, lr: 1.82e-02, grad_scale: 0.5 2024-06-19 18:57:26,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42155.666666666664, ans=0.1 2024-06-19 18:57:30,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=42155.666666666664, ans=0.0017052898550724647 2024-06-19 18:57:41,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=42192.333333333336, ans=0.125 2024-06-19 18:57:42,590 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.92 vs. limit=15.0 2024-06-19 18:57:50,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=42210.666666666664, ans=0.2 2024-06-19 18:57:54,472 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.994e+03 4.317e+03 4.963e+03 6.140e+03 1.989e+04, threshold=9.927e+03, percent-clipped=5.0 2024-06-19 18:57:57,140 INFO [train.py:1028] (1/2) Epoch 3, batch 2800, loss[loss=0.7364, simple_loss=0.5231, pruned_loss=0.4749, over 10868.00 frames. ], tot_loss[loss=0.8204, simple_loss=0.574, pruned_loss=0.5333, over 2580039.36 frames. 
], batch size: 304, lr: 1.81e-02, grad_scale: 1.0 2024-06-19 18:58:03,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=42247.333333333336, ans=0.0 2024-06-19 18:58:08,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=42247.333333333336, ans=0.2 2024-06-19 18:58:11,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=42265.666666666664, ans=0.0016813768115942034 2024-06-19 18:58:15,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=42265.666666666664, ans=0.125 2024-06-19 18:58:15,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=42265.666666666664, ans=15.0 2024-06-19 18:58:20,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=42284.0, ans=0.125 2024-06-19 18:58:27,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=42302.333333333336, ans=0.125 2024-06-19 18:58:29,879 INFO [train.py:1028] (1/2) Epoch 3, batch 2850, loss[loss=0.7725, simple_loss=0.5491, pruned_loss=0.4979, over 13323.00 frames. ], tot_loss[loss=0.8131, simple_loss=0.571, pruned_loss=0.5276, over 2578616.53 frames. ], batch size: 49, lr: 1.81e-02, grad_scale: 0.5 2024-06-19 18:58:47,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.65 vs. limit=10.0 2024-06-19 18:58:57,013 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=4.258e+01 2024-06-19 18:58:58,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=42394.0, ans=0.5 2024-06-19 18:58:59,843 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=15.0 2024-06-19 18:59:01,798 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.09 vs. limit=10.0 2024-06-19 18:59:02,667 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.537e+03 4.093e+03 4.703e+03 5.436e+03 1.850e+04, threshold=9.407e+03, percent-clipped=4.0 2024-06-19 18:59:04,715 INFO [train.py:1028] (1/2) Epoch 3, batch 2900, loss[loss=0.8368, simple_loss=0.6007, pruned_loss=0.5364, over 13121.00 frames. ], tot_loss[loss=0.8009, simple_loss=0.5648, pruned_loss=0.5185, over 2586477.92 frames. ], batch size: 55, lr: 1.81e-02, grad_scale: 1.0 2024-06-19 18:59:05,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=42412.333333333336, ans=0.125 2024-06-19 18:59:09,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=42412.333333333336, ans=0.2 2024-06-19 18:59:16,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.89 vs. 
limit=15.0 2024-06-19 18:59:17,090 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 18:59:18,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.38 vs. limit=10.0 2024-06-19 18:59:18,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.06 vs. limit=15.0 2024-06-19 18:59:19,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=42449.0, ans=0.125 2024-06-19 18:59:19,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=42449.0, ans=0.125 2024-06-19 18:59:29,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=42467.333333333336, ans=0.95 2024-06-19 18:59:32,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.19 vs. limit=10.0 2024-06-19 18:59:38,182 INFO [train.py:1028] (1/2) Epoch 3, batch 2950, loss[loss=0.7607, simple_loss=0.5443, pruned_loss=0.4886, over 13239.00 frames. ], tot_loss[loss=0.7949, simple_loss=0.5626, pruned_loss=0.5135, over 2581689.45 frames. ], batch size: 43, lr: 1.81e-02, grad_scale: 0.5 2024-06-19 18:59:38,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=42504.0, ans=0.125 2024-06-19 18:59:53,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0 2024-06-19 18:59:55,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.30 vs. limit=15.0 2024-06-19 18:59:58,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.48 vs. limit=15.0 2024-06-19 18:59:59,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=42540.666666666664, ans=0.125 2024-06-19 19:00:02,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=11.71 vs. limit=12.0 2024-06-19 19:00:07,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2024-06-19 19:00:09,248 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.44 vs. limit=15.0 2024-06-19 19:00:14,838 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.566e+03 4.148e+03 4.866e+03 5.618e+03 2.094e+04, threshold=9.733e+03, percent-clipped=6.0 2024-06-19 19:00:16,308 INFO [train.py:1028] (1/2) Epoch 3, batch 3000, loss[loss=0.7371, simple_loss=0.5406, pruned_loss=0.4668, over 13210.00 frames. ], tot_loss[loss=0.7858, simple_loss=0.5582, pruned_loss=0.5067, over 2579824.87 frames. 
], batch size: 59, lr: 1.81e-02, grad_scale: 1.0
2024-06-19 19:00:16,309 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-19 19:00:24,402 INFO [train.py:1060] (1/2) Epoch 3, validation: loss=0.7243, simple_loss=0.5476, pruned_loss=0.4504, over 351949.00 frames.
2024-06-19 19:00:24,403 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB
2024-06-19 19:00:33,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=42614.0, ans=0.04949747468305833
2024-06-19 19:00:42,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.28 vs. limit=15.0
2024-06-19 19:00:55,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=42650.666666666664, ans=0.0015976811594202896
2024-06-19 19:01:02,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=42669.0, ans=0.001593695652173913
2024-06-19 19:01:03,462 INFO [train.py:1028] (1/2) Epoch 3, batch 3050, loss[loss=0.7408, simple_loss=0.5463, pruned_loss=0.4677, over 13297.00 frames. ], tot_loss[loss=0.7769, simple_loss=0.5546, pruned_loss=0.4996, over 2578984.59 frames. ], batch size: 46, lr: 1.81e-02, grad_scale: 0.5
2024-06-19 19:01:05,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42687.333333333336, ans=0.1
2024-06-19 19:01:05,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=42687.333333333336, ans=0.0
2024-06-19 19:01:07,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42687.333333333336, ans=0.125
2024-06-19 19:01:18,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.05 vs. limit=10.0
2024-06-19 19:01:19,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42724.0, ans=0.1
2024-06-19 19:01:21,019 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.15 vs. limit=15.0
2024-06-19 19:01:21,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=42724.0, ans=0.05
2024-06-19 19:01:21,761 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=14.07 vs. limit=15.0
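The train.py:1028 entries report a per-batch loss and a running tot_loss, each broken into a simple_loss and a pruned_loss term. The figures in this log are consistent, up to rounding, with the combination loss = 0.5 * simple_loss + pruned_loss (for the validation entry above, 0.5 * 0.5476 + 0.4504 = 0.7242 vs. the reported 0.7243). A minimal sketch under that assumption follows; the fixed 0.5 weight and the decaying frame-weighted accumulator are illustrative guesses, not train.py's exact code, and in practice the scales may themselves be scheduled.

```python
import torch

# Combination consistent with the logged loss/simple_loss/pruned_loss values;
# the 0.5 weighting is inferred from the log, not taken from train.py itself.
def combined_loss(simple_loss: torch.Tensor, pruned_loss: torch.Tensor) -> torch.Tensor:
    return 0.5 * simple_loss + pruned_loss

# tot_loss is reported "over N frames": a frame-weighted running average.
# The decay factor here is a hypothetical choice for the sketch.
class RunningLoss:
    def __init__(self, decay: float = 0.99):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of (loss * frames)
        self.frames = 0.0     # decayed sum of frames

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```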
2024-06-19 19:01:32,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=42760.666666666664, ans=0.125
2024-06-19 19:01:32,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=42760.666666666664, ans=0.00157376811594203
2024-06-19 19:01:35,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=42760.666666666664, ans=0.125
2024-06-19 19:01:35,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=42760.666666666664, ans=0.0
2024-06-19 19:01:36,075 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.412e+03 4.195e+03 4.873e+03 5.548e+03 1.971e+04, threshold=9.747e+03, percent-clipped=5.0
2024-06-19 19:01:36,719 INFO [train.py:1028] (1/2) Epoch 3, batch 3100, loss[loss=0.7202, simple_loss=0.5203, pruned_loss=0.4601, over 13066.00 frames. ], tot_loss[loss=0.7724, simple_loss=0.5527, pruned_loss=0.496, over 2579839.63 frames. ], batch size: 144, lr: 1.80e-02, grad_scale: 1.0
2024-06-19 19:01:38,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42779.0, ans=0.1
2024-06-19 19:01:53,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42815.666666666664, ans=0.1
2024-06-19 19:02:01,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=42834.0, ans=0.125
2024-06-19 19:02:03,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=42834.0, ans=0.125
2024-06-19 19:02:09,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=42852.333333333336, ans=0.125
2024-06-19 19:02:12,840 INFO [train.py:1028] (1/2) Epoch 3, batch 3150, loss[loss=0.7255, simple_loss=0.5212, pruned_loss=0.4649, over 12921.00 frames. ], tot_loss[loss=0.7644, simple_loss=0.5489, pruned_loss=0.49, over 2581471.97 frames. ], batch size: 158, lr: 1.80e-02, grad_scale: 0.5
2024-06-19 19:02:14,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=42870.666666666664, ans=0.025
2024-06-19 19:02:15,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=42870.666666666664, ans=0.125
2024-06-19 19:02:16,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=42870.666666666664, ans=0.125
2024-06-19 19:02:18,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.76 vs. limit=22.5
2024-06-19 19:02:22,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=42889.0, ans=0.125
2024-06-19 19:02:24,316 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.01 vs. limit=15.0
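The scaling.py:1023 Whitening entries compare a per-module metric against a limit; the metric measures how far the channel covariance of a module's output is from white (all eigenvalues equal). One common formulation, consistent with the ranges seen in this log (close to 1 for nearly-white activations, growing toward num_channels as energy collapses into few directions), is metric = num_channels * sum(l_i^2) / (sum(l_i))^2 over the covariance eigenvalues l_i. The exact formula used by scaling.py is not shown in this log, so the sketch below is illustrative only.

```python
import torch

# Hedged sketch of a whitening metric in the spirit of the
# "Whitening: ... metric=... vs. limit=..." lines above.
def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (num_frames, num_channels); channels split into num_groups groups
    n, c = x.shape
    d = c // num_groups
    x = x.reshape(n, num_groups, d).permute(1, 0, 2)   # (groups, frames, d)
    x = x - x.mean(dim=1, keepdim=True)                # centre each group
    cov = x.transpose(1, 2) @ x / n                    # per-group covariance
    trace_cov = cov.diagonal(dim1=1, dim2=2).sum(dim=-1)
    # For symmetric C, sum(C**2) == trace(C @ C) == sum of squared eigenvalues.
    sum_sq_eigs = (cov * cov).sum(dim=(1, 2))
    metric = d * sum_sq_eigs / trace_cov ** 2          # 1.0 when fully white
    return metric.mean()

x = torch.randn(1000, 192)
print(whitening_metric(x))  # near 1.0 for white noise, up to sampling error
```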
2024-06-19 19:02:34,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=42925.666666666664, ans=0.2
2024-06-19 19:02:44,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=42944.0, ans=0.0015339130434782605
2024-06-19 19:02:45,316 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=4.317e+02
2024-06-19 19:02:49,735 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.793e+03 4.752e+03 5.502e+03 6.391e+03 3.022e+04, threshold=1.100e+04, percent-clipped=7.0
2024-06-19 19:02:49,763 INFO [train.py:1028] (1/2) Epoch 3, batch 3200, loss[loss=0.7261, simple_loss=0.5308, pruned_loss=0.4608, over 13261.00 frames. ], tot_loss[loss=0.7589, simple_loss=0.5467, pruned_loss=0.4855, over 2580944.40 frames. ], batch size: 55, lr: 1.80e-02, grad_scale: 1.0
2024-06-19 19:02:53,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=42962.333333333336, ans=0.125
2024-06-19 19:03:05,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=42999.0, ans=0.0
2024-06-19 19:03:06,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=42999.0, ans=0.95
2024-06-19 19:03:07,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=42999.0, ans=0.025
2024-06-19 19:03:07,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=42999.0, ans=0.125
2024-06-19 19:03:12,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=43017.333333333336, ans=0.05
2024-06-19 19:03:15,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=43035.666666666664, ans=0.125
2024-06-19 19:03:15,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.97 vs. limit=22.5
2024-06-19 19:03:22,510 INFO [train.py:1028] (1/2) Epoch 3, batch 3250, loss[loss=0.7614, simple_loss=0.5609, pruned_loss=0.481, over 13242.00 frames. ], tot_loss[loss=0.7513, simple_loss=0.543, pruned_loss=0.4798, over 2585332.77 frames. ], batch size: 72, lr: 1.80e-02, grad_scale: 0.5
2024-06-19 19:03:24,890 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=3.285e-02
2024-06-19 19:03:31,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=43072.333333333336, ans=0.125
2024-06-19 19:03:38,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=43090.666666666664, ans=0.125
2024-06-19 19:03:43,810 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.52 vs. limit=10.0
2024-06-19 19:03:47,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.07 vs.
limit=15.0 2024-06-19 19:03:51,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=43127.333333333336, ans=0.125 2024-06-19 19:04:00,242 INFO [train.py:1028] (1/2) Epoch 3, batch 3300, loss[loss=0.751, simple_loss=0.5482, pruned_loss=0.4769, over 12714.00 frames. ], tot_loss[loss=0.7453, simple_loss=0.541, pruned_loss=0.4748, over 2581060.11 frames. ], batch size: 176, lr: 1.80e-02, grad_scale: 1.0 2024-06-19 19:04:00,902 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.465e+03 4.309e+03 4.998e+03 5.820e+03 8.440e+03, threshold=9.996e+03, percent-clipped=0.0 2024-06-19 19:04:04,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=43145.666666666664, ans=0.125 2024-06-19 19:04:13,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=43182.333333333336, ans=0.0 2024-06-19 19:04:15,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.81 vs. limit=6.0 2024-06-19 19:04:15,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=43182.333333333336, ans=0.025 2024-06-19 19:04:16,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=43182.333333333336, ans=0.2 2024-06-19 19:04:18,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.53 vs. limit=15.0 2024-06-19 19:04:30,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=43219.0, ans=0.125 2024-06-19 19:04:32,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=43219.0, ans=0.025 2024-06-19 19:04:37,196 INFO [train.py:1028] (1/2) Epoch 3, batch 3350, loss[loss=0.6849, simple_loss=0.4981, pruned_loss=0.4358, over 12909.00 frames. ], tot_loss[loss=0.7377, simple_loss=0.5372, pruned_loss=0.4691, over 2575884.80 frames. ], batch size: 158, lr: 1.80e-02, grad_scale: 1.0 2024-06-19 19:04:41,572 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-19 19:04:46,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=43255.666666666664, ans=0.0014661594202898549 2024-06-19 19:04:51,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43274.0, ans=0.1 2024-06-19 19:04:53,631 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.96 vs. limit=15.0 2024-06-19 19:04:59,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=43292.333333333336, ans=0.125 2024-06-19 19:05:10,575 INFO [train.py:1028] (1/2) Epoch 3, batch 3400, loss[loss=0.7862, simple_loss=0.5856, pruned_loss=0.4934, over 12690.00 frames. ], tot_loss[loss=0.7296, simple_loss=0.5335, pruned_loss=0.4629, over 2575633.55 frames. 
], batch size: 22, lr: 1.79e-02, grad_scale: 1.0 2024-06-19 19:05:11,940 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.396e+03 4.379e+03 4.796e+03 5.450e+03 1.460e+04, threshold=9.591e+03, percent-clipped=2.0 2024-06-19 19:05:23,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.29 vs. limit=15.0 2024-06-19 19:05:33,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=43384.0, ans=0.125 2024-06-19 19:05:44,315 INFO [train.py:1028] (1/2) Epoch 3, batch 3450, loss[loss=0.7448, simple_loss=0.5457, pruned_loss=0.4719, over 12785.00 frames. ], tot_loss[loss=0.7203, simple_loss=0.5292, pruned_loss=0.4556, over 2577496.48 frames. ], batch size: 176, lr: 1.79e-02, grad_scale: 0.5 2024-06-19 19:05:45,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=43420.666666666664, ans=0.125 2024-06-19 19:05:53,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=43439.0, ans=0.125 2024-06-19 19:06:01,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=43457.333333333336, ans=0.0 2024-06-19 19:06:10,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.37 vs. limit=15.0 2024-06-19 19:06:17,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=43494.0, ans=0.125 2024-06-19 19:06:19,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43494.0, ans=0.1 2024-06-19 19:06:19,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.43 vs. limit=22.5 2024-06-19 19:06:20,553 INFO [train.py:1028] (1/2) Epoch 3, batch 3500, loss[loss=0.6552, simple_loss=0.4934, pruned_loss=0.4084, over 12936.00 frames. ], tot_loss[loss=0.7158, simple_loss=0.5273, pruned_loss=0.4521, over 2576658.96 frames. ], batch size: 33, lr: 1.79e-02, grad_scale: 0.5 2024-06-19 19:06:21,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=43512.333333333336, ans=0.0 2024-06-19 19:06:23,242 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.338e+03 4.525e+03 5.690e+03 7.285e+03 1.724e+04, threshold=1.138e+04, percent-clipped=9.0 2024-06-19 19:06:31,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=43530.666666666664, ans=0.025 2024-06-19 19:06:40,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.03 vs. 
limit=22.5 2024-06-19 19:06:47,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=43567.333333333336, ans=0.09899494936611666 2024-06-19 19:06:48,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=43567.333333333336, ans=0.0 2024-06-19 19:06:48,991 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=15.0 2024-06-19 19:06:54,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=43585.666666666664, ans=0.125 2024-06-19 19:06:57,355 INFO [train.py:1028] (1/2) Epoch 3, batch 3550, loss[loss=0.6902, simple_loss=0.5127, pruned_loss=0.4339, over 13154.00 frames. ], tot_loss[loss=0.7065, simple_loss=0.5226, pruned_loss=0.4452, over 2577302.15 frames. ], batch size: 95, lr: 1.79e-02, grad_scale: 0.5 2024-06-19 19:06:58,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=43604.0, ans=0.125 2024-06-19 19:07:01,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=43604.0, ans=0.0 2024-06-19 19:07:01,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=43604.0, ans=0.001390434782608696 2024-06-19 19:07:06,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=43622.333333333336, ans=0.125 2024-06-19 19:07:08,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=43622.333333333336, ans=0.125 2024-06-19 19:07:12,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=43640.666666666664, ans=0.125 2024-06-19 19:07:18,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.32 vs. limit=15.0 2024-06-19 19:07:18,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=43659.0, ans=0.125 2024-06-19 19:07:30,331 INFO [train.py:1028] (1/2) Epoch 3, batch 3600, loss[loss=0.6592, simple_loss=0.5062, pruned_loss=0.4061, over 13320.00 frames. ], tot_loss[loss=0.6956, simple_loss=0.517, pruned_loss=0.4372, over 2580483.57 frames. ], batch size: 49, lr: 1.79e-02, grad_scale: 1.0 2024-06-19 19:07:33,091 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+03 3.679e+03 4.215e+03 4.931e+03 1.419e+04, threshold=8.430e+03, percent-clipped=1.0 2024-06-19 19:07:38,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43714.0, ans=0.1 2024-06-19 19:07:43,276 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.75 vs. 
limit=15.0 2024-06-19 19:08:00,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=43750.666666666664, ans=0.125 2024-06-19 19:08:05,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.51 vs. limit=22.5 2024-06-19 19:08:08,477 INFO [train.py:1028] (1/2) Epoch 3, batch 3650, loss[loss=0.6609, simple_loss=0.5054, pruned_loss=0.4083, over 13000.00 frames. ], tot_loss[loss=0.6887, simple_loss=0.5141, pruned_loss=0.4316, over 2579264.80 frames. ], batch size: 102, lr: 1.79e-02, grad_scale: 0.5 2024-06-19 19:08:10,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=43787.333333333336, ans=0.125 2024-06-19 19:08:12,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=43787.333333333336, ans=0.125 2024-06-19 19:08:14,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=43805.666666666664, ans=0.2 2024-06-19 19:08:21,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=43824.0, ans=0.025 2024-06-19 19:08:32,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.12 vs. limit=15.0 2024-06-19 19:08:37,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=43860.666666666664, ans=0.001334637681159422 2024-06-19 19:08:44,669 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.48 vs. limit=10.0 2024-06-19 19:08:45,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=43860.666666666664, ans=0.035 2024-06-19 19:08:46,947 INFO [train.py:1028] (1/2) Epoch 3, batch 3700, loss[loss=0.6547, simple_loss=0.5019, pruned_loss=0.4038, over 13257.00 frames. ], tot_loss[loss=0.6819, simple_loss=0.5108, pruned_loss=0.4265, over 2584107.15 frames. ], batch size: 72, lr: 1.78e-02, grad_scale: 1.0 2024-06-19 19:08:47,402 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.12 vs. limit=15.0 2024-06-19 19:08:50,157 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.217e+03 3.217e+03 3.890e+03 4.383e+03 2.165e+04, threshold=7.780e+03, percent-clipped=6.0 2024-06-19 19:08:53,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=43897.333333333336, ans=0.025 2024-06-19 19:08:55,900 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.79 vs. 
limit=15.0 2024-06-19 19:09:01,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43915.666666666664, ans=0.1 2024-06-19 19:09:09,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=43934.0, ans=0.125 2024-06-19 19:09:19,533 INFO [train.py:1028] (1/2) Epoch 3, batch 3750, loss[loss=0.7422, simple_loss=0.5588, pruned_loss=0.4628, over 12665.00 frames. ], tot_loss[loss=0.6757, simple_loss=0.508, pruned_loss=0.4217, over 2586344.10 frames. ], batch size: 22, lr: 1.78e-02, grad_scale: 0.5 2024-06-19 19:09:24,969 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.91 vs. limit=22.5 2024-06-19 19:09:25,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.16 vs. limit=15.0 2024-06-19 19:09:44,979 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.61 vs. limit=15.0 2024-06-19 19:09:46,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.95 vs. limit=15.0 2024-06-19 19:09:50,093 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.93 vs. limit=15.0 2024-06-19 19:09:52,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=44044.0, ans=0.125 2024-06-19 19:09:56,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=44062.333333333336, ans=0.125 2024-06-19 19:09:56,805 INFO [train.py:1028] (1/2) Epoch 3, batch 3800, loss[loss=0.6597, simple_loss=0.5078, pruned_loss=0.4058, over 13162.00 frames. ], tot_loss[loss=0.6687, simple_loss=0.5048, pruned_loss=0.4163, over 2585028.15 frames. ], batch size: 83, lr: 1.78e-02, grad_scale: 1.0 2024-06-19 19:09:58,295 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.188e+00 2024-06-19 19:10:00,759 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.058e+03 3.287e+03 3.846e+03 4.476e+03 7.364e+03, threshold=7.691e+03, percent-clipped=0.0 2024-06-19 19:10:02,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=44062.333333333336, ans=0.0 2024-06-19 19:10:03,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=12.0 2024-06-19 19:10:04,606 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.72 vs. 
limit=22.5 2024-06-19 19:10:05,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=44080.666666666664, ans=0.07 2024-06-19 19:10:13,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=44080.666666666664, ans=22.5 2024-06-19 19:10:19,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.25 vs. limit=22.5 2024-06-19 19:10:21,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=44117.333333333336, ans=0.0 2024-06-19 19:10:21,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.59 vs. limit=15.0 2024-06-19 19:10:29,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=44135.666666666664, ans=0.125 2024-06-19 19:10:34,475 INFO [train.py:1028] (1/2) Epoch 3, batch 3850, loss[loss=0.6403, simple_loss=0.4806, pruned_loss=0.4, over 13047.00 frames. ], tot_loss[loss=0.6606, simple_loss=0.5008, pruned_loss=0.4102, over 2585064.06 frames. ], batch size: 144, lr: 1.78e-02, grad_scale: 0.25 2024-06-19 19:10:39,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44154.0, ans=0.1 2024-06-19 19:10:40,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=44172.333333333336, ans=0.125 2024-06-19 19:10:42,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=44172.333333333336, ans=0.0 2024-06-19 19:10:42,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.68 vs. limit=15.0 2024-06-19 19:10:43,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=44172.333333333336, ans=0.125 2024-06-19 19:10:45,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=44172.333333333336, ans=0.125 2024-06-19 19:10:47,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=44190.666666666664, ans=0.125 2024-06-19 19:10:49,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=44190.666666666664, ans=0.125 2024-06-19 19:10:56,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=44209.0, ans=0.0 2024-06-19 19:10:58,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=44209.0, ans=0.0012589130434782596 2024-06-19 19:11:07,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.50 vs. limit=15.0 2024-06-19 19:11:10,406 INFO [train.py:1028] (1/2) Epoch 3, batch 3900, loss[loss=0.6097, simple_loss=0.4753, pruned_loss=0.3721, over 13204.00 frames. 
], tot_loss[loss=0.6542, simple_loss=0.4973, pruned_loss=0.4055, over 2588183.86 frames. ], batch size: 83, lr: 1.78e-02, grad_scale: 0.5 2024-06-19 19:11:10,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.42 vs. limit=22.5 2024-06-19 19:11:13,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=44245.666666666664, ans=0.025 2024-06-19 19:11:13,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=44245.666666666664, ans=0.5 2024-06-19 19:11:15,674 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.750e+03 4.809e+03 5.812e+03 6.715e+03 2.324e+04, threshold=1.162e+04, percent-clipped=11.0 2024-06-19 19:11:17,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=44264.0, ans=0.035 2024-06-19 19:11:17,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=44264.0, ans=0.125 2024-06-19 19:11:19,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=44264.0, ans=0.125 2024-06-19 19:11:20,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=44264.0, ans=0.0 2024-06-19 19:11:29,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=44282.333333333336, ans=15.0 2024-06-19 19:11:43,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=44337.333333333336, ans=0.125 2024-06-19 19:11:43,864 INFO [train.py:1028] (1/2) Epoch 3, batch 3950, loss[loss=0.5967, simple_loss=0.453, pruned_loss=0.3702, over 13095.00 frames. ], tot_loss[loss=0.6452, simple_loss=0.4929, pruned_loss=0.3988, over 2589373.97 frames. ], batch size: 132, lr: 1.77e-02, grad_scale: 0.5 2024-06-19 19:11:45,761 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.86 vs. limit=15.0 2024-06-19 19:11:47,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.62 vs. limit=15.0 2024-06-19 19:11:54,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.21 vs. limit=15.0 2024-06-19 19:11:58,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=44374.0, ans=0.0 2024-06-19 19:11:59,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.17 vs. 
limit=15.0 2024-06-19 19:12:08,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=44392.333333333336, ans=0.125 2024-06-19 19:12:11,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=44410.666666666664, ans=0.125 2024-06-19 19:12:22,200 INFO [train.py:1028] (1/2) Epoch 3, batch 4000, loss[loss=0.6569, simple_loss=0.5022, pruned_loss=0.4058, over 12876.00 frames. ], tot_loss[loss=0.6397, simple_loss=0.4901, pruned_loss=0.3947, over 2583662.67 frames. ], batch size: 39, lr: 1.77e-02, grad_scale: 0.5 2024-06-19 19:12:22,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.80 vs. limit=15.0 2024-06-19 19:12:28,378 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.353e+03 3.801e+03 4.647e+03 5.271e+03 9.446e+03, threshold=9.294e+03, percent-clipped=0.0 2024-06-19 19:12:29,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=44447.333333333336, ans=0.0 2024-06-19 19:12:33,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.91 vs. limit=22.5 2024-06-19 19:12:35,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=44465.666666666664, ans=0.125 2024-06-19 19:12:39,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.26 vs. limit=22.5 2024-06-19 19:12:49,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=44502.333333333336, ans=0.125 2024-06-19 19:12:53,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=44502.333333333336, ans=0.125 2024-06-19 19:13:00,060 INFO [train.py:1028] (1/2) Epoch 3, batch 4050, loss[loss=0.6335, simple_loss=0.4862, pruned_loss=0.3904, over 11044.00 frames. ], tot_loss[loss=0.6379, simple_loss=0.4895, pruned_loss=0.3932, over 2580928.16 frames. ], batch size: 304, lr: 1.77e-02, grad_scale: 0.5 2024-06-19 19:13:01,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=44520.666666666664, ans=0.1 2024-06-19 19:13:16,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=44557.333333333336, ans=0.0011831884057971007 2024-06-19 19:13:29,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=44594.0, ans=0.1 2024-06-19 19:13:33,504 INFO [train.py:1028] (1/2) Epoch 3, batch 4100, loss[loss=0.6137, simple_loss=0.4748, pruned_loss=0.3763, over 13169.00 frames. ], tot_loss[loss=0.6307, simple_loss=0.4857, pruned_loss=0.3879, over 2577332.70 frames. 
], batch size: 103, lr: 1.77e-02, grad_scale: 1.0 2024-06-19 19:13:39,393 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.499e+03 3.752e+03 4.260e+03 4.556e+03 1.053e+04, threshold=8.519e+03, percent-clipped=1.0 2024-06-19 19:13:39,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=44630.666666666664, ans=0.2 2024-06-19 19:13:40,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=44630.666666666664, ans=0.05 2024-06-19 19:13:40,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.32 vs. limit=10.0 2024-06-19 19:13:43,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=44630.666666666664, ans=0.125 2024-06-19 19:13:47,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44649.0, ans=0.1 2024-06-19 19:13:49,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=44649.0, ans=0.0 2024-06-19 19:13:51,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.39 vs. limit=15.0 2024-06-19 19:13:54,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=44667.333333333336, ans=0.0011592753623188411 2024-06-19 19:14:01,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=44685.666666666664, ans=15.0 2024-06-19 19:14:07,214 INFO [train.py:1028] (1/2) Epoch 3, batch 4150, loss[loss=0.5826, simple_loss=0.4647, pruned_loss=0.3503, over 13075.00 frames. ], tot_loss[loss=0.6256, simple_loss=0.4835, pruned_loss=0.3838, over 2576129.19 frames. ], batch size: 55, lr: 1.77e-02, grad_scale: 1.0 2024-06-19 19:14:12,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.56 vs. limit=12.0 2024-06-19 19:14:14,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=44722.333333333336, ans=0.125 2024-06-19 19:14:39,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0 2024-06-19 19:14:40,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=44777.333333333336, ans=0.125 2024-06-19 19:14:42,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.81 vs. limit=15.0 2024-06-19 19:14:44,236 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=12.0 2024-06-19 19:14:45,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.69 vs. 
limit=15.0 2024-06-19 19:14:45,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=44795.666666666664, ans=0.025 2024-06-19 19:14:45,987 INFO [train.py:1028] (1/2) Epoch 3, batch 4200, loss[loss=0.5553, simple_loss=0.44, pruned_loss=0.3353, over 13018.00 frames. ], tot_loss[loss=0.6167, simple_loss=0.4789, pruned_loss=0.3772, over 2578486.34 frames. ], batch size: 102, lr: 1.77e-02, grad_scale: 2.0 2024-06-19 19:14:48,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=44795.666666666664, ans=0.125 2024-06-19 19:14:49,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=15.0 2024-06-19 19:14:50,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=44795.666666666664, ans=0.125 2024-06-19 19:14:52,471 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+03 2.895e+03 3.419e+03 3.761e+03 1.069e+04, threshold=6.837e+03, percent-clipped=2.0 2024-06-19 19:14:54,668 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.75 vs. limit=15.0 2024-06-19 19:15:21,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=44869.0, ans=0.125 2024-06-19 19:15:22,915 INFO [train.py:1028] (1/2) Epoch 3, batch 4250, loss[loss=0.5886, simple_loss=0.4809, pruned_loss=0.3482, over 13295.00 frames. ], tot_loss[loss=0.6124, simple_loss=0.4772, pruned_loss=0.3738, over 2581926.64 frames. ], batch size: 46, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:15:24,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=44887.333333333336, ans=0.125 2024-06-19 19:15:25,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=44887.333333333336, ans=0.1 2024-06-19 19:15:27,273 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.66 vs. limit=15.0 2024-06-19 19:15:42,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=44942.333333333336, ans=0.125 2024-06-19 19:15:42,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=44942.333333333336, ans=0.0010994927536231887 2024-06-19 19:15:45,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=44942.333333333336, ans=0.125 2024-06-19 19:15:45,588 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.64 vs. limit=22.5 2024-06-19 19:15:55,841 INFO [train.py:1028] (1/2) Epoch 3, batch 4300, loss[loss=0.5898, simple_loss=0.4716, pruned_loss=0.354, over 13176.00 frames. ], tot_loss[loss=0.6083, simple_loss=0.4756, pruned_loss=0.3705, over 2580997.19 frames. 
], batch size: 59, lr: 1.76e-02, grad_scale: 2.0 2024-06-19 19:15:57,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=44979.0, ans=0.125 2024-06-19 19:16:00,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=44979.0, ans=0.125 2024-06-19 19:16:02,669 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+03 2.606e+03 2.886e+03 3.311e+03 1.153e+04, threshold=5.771e+03, percent-clipped=1.0 2024-06-19 19:16:03,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.34 vs. limit=15.0 2024-06-19 19:16:04,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=44997.333333333336, ans=0.025 2024-06-19 19:16:23,278 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.15 vs. limit=12.0 2024-06-19 19:16:23,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=45052.333333333336, ans=0.0010755797101449274 2024-06-19 19:16:33,505 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.59 vs. limit=22.5 2024-06-19 19:16:34,521 INFO [train.py:1028] (1/2) Epoch 3, batch 4350, loss[loss=0.5774, simple_loss=0.4624, pruned_loss=0.3462, over 13172.00 frames. ], tot_loss[loss=0.6023, simple_loss=0.4726, pruned_loss=0.366, over 2585782.11 frames. ], batch size: 59, lr: 1.76e-02, grad_scale: 0.5 2024-06-19 19:16:34,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=45070.666666666664, ans=0.125 2024-06-19 19:16:34,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=45070.666666666664, ans=0.125 2024-06-19 19:16:37,650 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.14 vs. limit=15.0 2024-06-19 19:16:39,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=45070.666666666664, ans=0.025 2024-06-19 19:16:45,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.84 vs. limit=6.0 2024-06-19 19:16:48,614 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.45 vs. limit=15.0 2024-06-19 19:16:55,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=45125.666666666664, ans=10.0 2024-06-19 19:16:55,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=45125.666666666664, ans=0.125 2024-06-19 19:16:57,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.91 vs. 
limit=15.0 2024-06-19 19:16:59,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=45125.666666666664, ans=0.0 2024-06-19 19:17:10,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=45144.0, ans=0.0 2024-06-19 19:17:11,548 INFO [train.py:1028] (1/2) Epoch 3, batch 4400, loss[loss=0.5764, simple_loss=0.46, pruned_loss=0.3464, over 13230.00 frames. ], tot_loss[loss=0.5974, simple_loss=0.4699, pruned_loss=0.3625, over 2585688.29 frames. ], batch size: 83, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:17:16,645 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.74 vs. limit=22.5 2024-06-19 19:17:17,969 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.85 vs. limit=15.0 2024-06-19 19:17:19,669 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.676e+03 3.555e+03 3.998e+03 4.760e+03 1.239e+04, threshold=7.996e+03, percent-clipped=11.0 2024-06-19 19:17:21,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45180.666666666664, ans=0.0 2024-06-19 19:17:23,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=45180.666666666664, ans=0.125 2024-06-19 19:17:29,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=45199.0, ans=0.025 2024-06-19 19:17:35,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.00 vs. limit=15.0 2024-06-19 19:17:40,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2024-06-19 19:17:44,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45254.0, ans=0.1 2024-06-19 19:17:44,584 INFO [train.py:1028] (1/2) Epoch 3, batch 4450, loss[loss=0.5402, simple_loss=0.4378, pruned_loss=0.3213, over 12886.00 frames. ], tot_loss[loss=0.5937, simple_loss=0.4686, pruned_loss=0.3594, over 2581353.79 frames. ], batch size: 33, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:17:46,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=45254.0, ans=0.025 2024-06-19 19:17:47,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=45254.0, ans=0.125 2024-06-19 19:18:12,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=45327.333333333336, ans=0.001015797101449275 2024-06-19 19:18:14,003 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.37 vs. limit=15.0 2024-06-19 19:18:16,837 INFO [train.py:1028] (1/2) Epoch 3, batch 4500, loss[loss=0.5306, simple_loss=0.4189, pruned_loss=0.3211, over 13222.00 frames. ], tot_loss[loss=0.5905, simple_loss=0.4664, pruned_loss=0.3573, over 2585861.71 frames. 
], batch size: 89, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:18:27,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=45364.0, ans=0.125 2024-06-19 19:18:28,587 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+03 3.021e+03 3.472e+03 4.246e+03 1.154e+04, threshold=6.944e+03, percent-clipped=1.0 2024-06-19 19:18:35,669 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.40 vs. limit=22.5 2024-06-19 19:18:42,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=45400.666666666664, ans=0.5 2024-06-19 19:18:49,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-19 19:18:50,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=45419.0, ans=0.025 2024-06-19 19:18:50,952 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.16 vs. limit=10.0 2024-06-19 19:18:53,178 INFO [train.py:1028] (1/2) Epoch 3, batch 4550, loss[loss=0.5393, simple_loss=0.4397, pruned_loss=0.3194, over 13262.00 frames. ], tot_loss[loss=0.5853, simple_loss=0.4638, pruned_loss=0.3534, over 2589682.65 frames. ], batch size: 52, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:18:55,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45437.333333333336, ans=0.125 2024-06-19 19:18:58,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45437.333333333336, ans=0.0 2024-06-19 19:19:11,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=45474.0, ans=0.125 2024-06-19 19:19:15,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=45474.0, ans=0.125 2024-06-19 19:19:16,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.71 vs. limit=15.0 2024-06-19 19:19:20,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=45492.333333333336, ans=0.2 2024-06-19 19:19:24,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.35 vs. limit=22.5 2024-06-19 19:19:25,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=45510.666666666664, ans=0.0009759420289855089 2024-06-19 19:19:30,382 INFO [train.py:1028] (1/2) Epoch 3, batch 4600, loss[loss=0.5908, simple_loss=0.4619, pruned_loss=0.3599, over 12474.00 frames. ], tot_loss[loss=0.5823, simple_loss=0.4625, pruned_loss=0.351, over 2585918.63 frames. 
], batch size: 202, lr: 1.75e-02, grad_scale: 2.0 2024-06-19 19:19:33,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=45529.0, ans=0.04949747468305833 2024-06-19 19:19:37,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.41 vs. limit=22.5 2024-06-19 19:19:39,405 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.360e+03 2.329e+03 2.704e+03 3.038e+03 8.611e+03, threshold=5.407e+03, percent-clipped=1.0 2024-06-19 19:20:00,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=45602.333333333336, ans=0.125 2024-06-19 19:20:04,175 INFO [train.py:1028] (1/2) Epoch 3, batch 4650, loss[loss=0.5412, simple_loss=0.4332, pruned_loss=0.3246, over 13104.00 frames. ], tot_loss[loss=0.5762, simple_loss=0.4593, pruned_loss=0.3465, over 2588048.37 frames. ], batch size: 132, lr: 1.75e-02, grad_scale: 1.0 2024-06-19 19:20:16,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=45657.333333333336, ans=0.025 2024-06-19 19:20:18,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=45657.333333333336, ans=0.0009440579710144927 2024-06-19 19:20:23,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=45675.666666666664, ans=0.07 2024-06-19 19:20:23,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.66 vs. limit=22.5 2024-06-19 19:20:28,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=45675.666666666664, ans=0.0 2024-06-19 19:20:34,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=45694.0, ans=0.0 2024-06-19 19:20:40,865 INFO [train.py:1028] (1/2) Epoch 3, batch 4700, loss[loss=0.5799, simple_loss=0.4711, pruned_loss=0.3444, over 12738.00 frames. ], tot_loss[loss=0.5715, simple_loss=0.4573, pruned_loss=0.3429, over 2583852.56 frames. ], batch size: 26, lr: 1.75e-02, grad_scale: 1.0 2024-06-19 19:20:43,505 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.75 vs. limit=22.5 2024-06-19 19:20:43,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=45712.333333333336, ans=0.2 2024-06-19 19:20:48,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=45730.666666666664, ans=0.125 2024-06-19 19:20:51,025 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+03 2.530e+03 2.879e+03 3.262e+03 9.392e+03, threshold=5.758e+03, percent-clipped=4.0 2024-06-19 19:20:52,788 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.72 vs. 
limit=10.0 2024-06-19 19:20:55,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=45749.0, ans=0.07 2024-06-19 19:20:58,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45749.0, ans=0.1 2024-06-19 19:21:08,987 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.40 vs. limit=15.0 2024-06-19 19:21:13,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=45785.666666666664, ans=0.95 2024-06-19 19:21:17,925 INFO [train.py:1028] (1/2) Epoch 3, batch 4750, loss[loss=0.6227, simple_loss=0.4911, pruned_loss=0.3771, over 12488.00 frames. ], tot_loss[loss=0.5661, simple_loss=0.4542, pruned_loss=0.339, over 2580117.71 frames. ], batch size: 202, lr: 1.75e-02, grad_scale: 1.0 2024-06-19 19:21:18,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.95 vs. limit=10.0 2024-06-19 19:21:34,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=45840.666666666664, ans=0.1 2024-06-19 19:21:44,381 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.18 vs. limit=22.5 2024-06-19 19:21:51,907 INFO [train.py:1028] (1/2) Epoch 3, batch 4800, loss[loss=0.5872, simple_loss=0.4715, pruned_loss=0.3515, over 13254.00 frames. ], tot_loss[loss=0.5639, simple_loss=0.4535, pruned_loss=0.3372, over 2576794.38 frames. ], batch size: 63, lr: 1.75e-02, grad_scale: 2.0 2024-06-19 19:21:53,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.12 vs. limit=15.0 2024-06-19 19:21:57,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45914.0, ans=0.125 2024-06-19 19:22:00,939 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.67 vs. limit=15.0 2024-06-19 19:22:01,853 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+03 2.639e+03 2.961e+03 3.417e+03 8.350e+03, threshold=5.921e+03, percent-clipped=3.0 2024-06-19 19:22:06,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.11 vs. limit=6.0 2024-06-19 19:22:08,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45932.333333333336, ans=0.1 2024-06-19 19:22:09,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.94 vs. limit=15.0 2024-06-19 19:22:13,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. 
limit=15.0 2024-06-19 19:22:15,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=45950.666666666664, ans=0.2 2024-06-19 19:22:25,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=45969.0, ans=0.125 2024-06-19 19:22:28,961 INFO [train.py:1028] (1/2) Epoch 3, batch 4850, loss[loss=0.5328, simple_loss=0.4393, pruned_loss=0.3132, over 13264.00 frames. ], tot_loss[loss=0.56, simple_loss=0.4515, pruned_loss=0.3343, over 2574182.44 frames. ], batch size: 89, lr: 1.75e-02, grad_scale: 2.0 2024-06-19 19:22:31,989 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.51 vs. limit=15.0 2024-06-19 19:22:36,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=46005.666666666664, ans=0.125 2024-06-19 19:22:36,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=46005.666666666664, ans=0.125 2024-06-19 19:22:37,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.91 vs. limit=15.0 2024-06-19 19:22:51,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=46042.333333333336, ans=0.125 2024-06-19 19:23:06,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=46060.666666666664, ans=0.125 2024-06-19 19:23:07,402 INFO [train.py:1028] (1/2) Epoch 3, batch 4900, loss[loss=0.5274, simple_loss=0.4317, pruned_loss=0.3115, over 13211.00 frames. ], tot_loss[loss=0.5559, simple_loss=0.4494, pruned_loss=0.3312, over 2574216.24 frames. 
], batch size: 59, lr: 1.74e-02, grad_scale: 2.0 2024-06-19 19:23:07,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46079.0, ans=0.1 2024-06-19 19:23:10,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46079.0, ans=0.1 2024-06-19 19:23:11,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46079.0, ans=0.1 2024-06-19 19:23:12,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=46079.0, ans=0.04949747468305833 2024-06-19 19:23:18,997 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.211e+03 3.047e+03 3.481e+03 3.892e+03 6.998e+03, threshold=6.963e+03, percent-clipped=3.0 2024-06-19 19:23:22,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=46115.666666666664, ans=0.125 2024-06-19 19:23:27,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=46134.0, ans=0.125 2024-06-19 19:23:32,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=46134.0, ans=0.125 2024-06-19 19:23:36,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=46152.333333333336, ans=0.0 2024-06-19 19:23:36,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=46152.333333333336, ans=0.125 2024-06-19 19:23:40,464 INFO [train.py:1028] (1/2) Epoch 3, batch 4950, loss[loss=0.5774, simple_loss=0.4547, pruned_loss=0.35, over 10910.00 frames. ], tot_loss[loss=0.5565, simple_loss=0.4497, pruned_loss=0.3316, over 2568929.11 frames. ], batch size: 303, lr: 1.74e-02, grad_scale: 0.5 2024-06-19 19:23:48,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46189.0, ans=0.1 2024-06-19 19:23:54,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.71 vs. limit=15.0 2024-06-19 19:23:57,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.61 vs. limit=22.5 2024-06-19 19:23:57,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.86 vs. limit=15.0 2024-06-19 19:23:58,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=46207.333333333336, ans=0.0008244927536231877 2024-06-19 19:24:04,305 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.48 vs. limit=10.0 2024-06-19 19:24:07,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=46244.0, ans=0.035 2024-06-19 19:24:13,532 INFO [train.py:1028] (1/2) Epoch 3, batch 5000, loss[loss=0.4988, simple_loss=0.4035, pruned_loss=0.2971, over 13191.00 frames. 
], tot_loss[loss=0.551, simple_loss=0.4469, pruned_loss=0.3275, over 2572820.17 frames. ], batch size: 95, lr: 1.74e-02, grad_scale: 1.0 2024-06-19 19:24:13,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=46262.333333333336, ans=0.125 2024-06-19 19:24:19,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=46280.666666666664, ans=0.2 2024-06-19 19:24:20,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=46280.666666666664, ans=0.125 2024-06-19 19:24:21,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46280.666666666664, ans=0.1 2024-06-19 19:24:22,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.26 vs. limit=22.5 2024-06-19 19:24:25,898 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+03 2.469e+03 2.950e+03 3.535e+03 1.221e+04, threshold=5.900e+03, percent-clipped=5.0 2024-06-19 19:24:28,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=46299.0, ans=0.2 2024-06-19 19:24:30,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.98 vs. limit=22.5 2024-06-19 19:24:31,107 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.53 vs. limit=22.5 2024-06-19 19:24:38,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=46317.333333333336, ans=0.125 2024-06-19 19:24:51,194 INFO [train.py:1028] (1/2) Epoch 3, batch 5050, loss[loss=0.5124, simple_loss=0.4316, pruned_loss=0.2966, over 12892.00 frames. ], tot_loss[loss=0.5481, simple_loss=0.446, pruned_loss=0.3251, over 2571888.40 frames. ], batch size: 36, lr: 1.74e-02, grad_scale: 1.0 2024-06-19 19:24:51,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=46354.0, ans=0.05 2024-06-19 19:24:55,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=46354.0, ans=0.125 2024-06-19 19:24:58,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=46372.333333333336, ans=0.125 2024-06-19 19:25:02,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46372.333333333336, ans=0.1 2024-06-19 19:25:11,909 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.99 vs. limit=6.0 2024-06-19 19:25:23,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=46427.333333333336, ans=0.125 2024-06-19 19:25:28,062 INFO [train.py:1028] (1/2) Epoch 3, batch 5100, loss[loss=0.5641, simple_loss=0.474, pruned_loss=0.3271, over 12869.00 frames. ], tot_loss[loss=0.5451, simple_loss=0.4446, pruned_loss=0.3228, over 2568134.41 frames. 
], batch size: 39, lr: 1.74e-02, grad_scale: 2.0 2024-06-19 19:25:29,402 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.69 vs. limit=6.0 2024-06-19 19:25:31,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.94 vs. limit=22.5 2024-06-19 19:25:31,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.66 vs. limit=10.0 2024-06-19 19:25:31,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=46445.666666666664, ans=0.2 2024-06-19 19:25:36,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=46464.0, ans=0.0007686956521739136 2024-06-19 19:25:40,553 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.451e+03 2.537e+03 2.908e+03 3.402e+03 6.063e+03, threshold=5.816e+03, percent-clipped=1.0 2024-06-19 19:25:49,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=46500.666666666664, ans=12.0 2024-06-19 19:25:49,983 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.54 vs. limit=15.0 2024-06-19 19:25:51,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=46500.666666666664, ans=0.04949747468305833 2024-06-19 19:25:53,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.37 vs. limit=10.0 2024-06-19 19:25:58,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=46519.0, ans=0.125 2024-06-19 19:25:58,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.86 vs. limit=15.0 2024-06-19 19:26:01,115 INFO [train.py:1028] (1/2) Epoch 3, batch 5150, loss[loss=0.4542, simple_loss=0.3769, pruned_loss=0.2658, over 13080.00 frames. ], tot_loss[loss=0.5441, simple_loss=0.444, pruned_loss=0.3221, over 2571263.20 frames. ], batch size: 132, lr: 1.74e-02, grad_scale: 1.0 2024-06-19 19:26:02,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=46537.333333333336, ans=0.05 2024-06-19 19:26:04,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=46537.333333333336, ans=10.0 2024-06-19 19:26:07,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=46555.666666666664, ans=0.025 2024-06-19 19:26:08,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=46555.666666666664, ans=0.125 2024-06-19 19:26:08,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.32 vs. 
limit=10.0 2024-06-19 19:26:12,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=46555.666666666664, ans=0.09899494936611666 2024-06-19 19:26:19,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46574.0, ans=0.1 2024-06-19 19:26:28,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=46592.333333333336, ans=0.0007407971014492757 2024-06-19 19:26:30,631 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.86 vs. limit=22.5 2024-06-19 19:26:38,092 INFO [train.py:1028] (1/2) Epoch 3, batch 5200, loss[loss=0.5279, simple_loss=0.436, pruned_loss=0.3099, over 13159.00 frames. ], tot_loss[loss=0.5422, simple_loss=0.4435, pruned_loss=0.3204, over 2573279.04 frames. ], batch size: 95, lr: 1.73e-02, grad_scale: 2.0 2024-06-19 19:26:42,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46629.0, ans=0.1 2024-06-19 19:26:42,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=46629.0, ans=0.0007328260869565208 2024-06-19 19:26:43,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=46629.0, ans=0.2 2024-06-19 19:26:50,528 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.878e+03 2.484e+03 2.722e+03 3.148e+03 8.839e+03, threshold=5.445e+03, percent-clipped=3.0 2024-06-19 19:26:52,775 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.533e+00 2024-06-19 19:27:10,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=46702.333333333336, ans=0.125 2024-06-19 19:27:11,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=46702.333333333336, ans=0.0 2024-06-19 19:27:12,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=46702.333333333336, ans=0.125 2024-06-19 19:27:14,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=46720.666666666664, ans=0.0 2024-06-19 19:27:14,749 INFO [train.py:1028] (1/2) Epoch 3, batch 5250, loss[loss=0.4515, simple_loss=0.3889, pruned_loss=0.2571, over 13341.00 frames. ], tot_loss[loss=0.5385, simple_loss=0.4417, pruned_loss=0.3176, over 2569581.51 frames. ], batch size: 52, lr: 1.73e-02, grad_scale: 1.0 2024-06-19 19:27:22,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=46739.0, ans=0.125 2024-06-19 19:27:23,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.38 vs. limit=22.5 2024-06-19 19:27:24,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46739.0, ans=0.125 2024-06-19 19:27:27,064 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.85 vs. 
limit=12.0 2024-06-19 19:27:28,667 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=17.25 vs. limit=15.0 2024-06-19 19:27:45,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.78 vs. limit=15.0 2024-06-19 19:27:48,221 INFO [train.py:1028] (1/2) Epoch 3, batch 5300, loss[loss=0.5463, simple_loss=0.4457, pruned_loss=0.3234, over 13067.00 frames. ], tot_loss[loss=0.5352, simple_loss=0.4399, pruned_loss=0.3152, over 2566636.50 frames. ], batch size: 144, lr: 1.73e-02, grad_scale: 2.0 2024-06-19 19:27:55,294 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.93 vs. limit=15.0 2024-06-19 19:27:58,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.38 vs. limit=15.0 2024-06-19 19:28:02,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=46849.0, ans=0.0 2024-06-19 19:28:03,226 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.358e+03 2.340e+03 2.676e+03 3.141e+03 8.225e+03, threshold=5.352e+03, percent-clipped=3.0 2024-06-19 19:28:04,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.38 vs. limit=15.0 2024-06-19 19:28:06,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=46849.0, ans=0.0 2024-06-19 19:28:07,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=46849.0, ans=0.2 2024-06-19 19:28:08,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=46867.333333333336, ans=0.2 2024-06-19 19:28:09,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.23 vs. limit=15.0 2024-06-19 19:28:15,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46885.666666666664, ans=0.1 2024-06-19 19:28:19,897 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=23.62 vs. limit=22.5 2024-06-19 19:28:27,102 INFO [train.py:1028] (1/2) Epoch 3, batch 5350, loss[loss=0.4694, simple_loss=0.398, pruned_loss=0.2704, over 11736.00 frames. ], tot_loss[loss=0.5293, simple_loss=0.4366, pruned_loss=0.311, over 2573728.69 frames. ], batch size: 17, lr: 1.73e-02, grad_scale: 0.5 2024-06-19 19:28:37,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.60 vs. 
limit=10.0 2024-06-19 19:28:39,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=46922.333333333336, ans=0.025 2024-06-19 19:28:49,127 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.472e+01 2024-06-19 19:28:55,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=46977.333333333336, ans=0.125 2024-06-19 19:29:03,522 INFO [train.py:1028] (1/2) Epoch 3, batch 5400, loss[loss=0.5822, simple_loss=0.4646, pruned_loss=0.3499, over 12195.00 frames. ], tot_loss[loss=0.5256, simple_loss=0.4344, pruned_loss=0.3084, over 2565523.30 frames. ], batch size: 240, lr: 1.73e-02, grad_scale: 1.0 2024-06-19 19:29:08,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=46995.666666666664, ans=0.07 2024-06-19 19:29:12,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=47014.0, ans=0.125 2024-06-19 19:29:12,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=47014.0, ans=0.125 2024-06-19 19:29:18,338 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.352e+03 2.122e+03 2.365e+03 2.676e+03 5.183e+03, threshold=4.730e+03, percent-clipped=0.0 2024-06-19 19:29:22,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=47032.333333333336, ans=0.125 2024-06-19 19:29:22,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=47032.333333333336, ans=0.2 2024-06-19 19:29:25,162 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.65 vs. limit=15.0 2024-06-19 19:29:26,598 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=17.77 vs. limit=15.0 2024-06-19 19:29:28,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=47050.666666666664, ans=0.07 2024-06-19 19:29:33,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=47069.0, ans=0.125 2024-06-19 19:29:37,057 INFO [train.py:1028] (1/2) Epoch 3, batch 5450, loss[loss=0.5137, simple_loss=0.4381, pruned_loss=0.2947, over 12828.00 frames. ], tot_loss[loss=0.5202, simple_loss=0.4321, pruned_loss=0.3041, over 2569875.04 frames. ], batch size: 26, lr: 1.73e-02, grad_scale: 1.0 2024-06-19 19:29:38,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=47087.333333333336, ans=0.125 2024-06-19 19:29:39,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2024-06-19 19:29:42,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.31 vs. 
limit=15.0 2024-06-19 19:29:43,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=47105.666666666664, ans=0.0 2024-06-19 19:29:55,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47124.0, ans=0.1 2024-06-19 19:29:57,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=47142.333333333336, ans=0.125 2024-06-19 19:30:02,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.84 vs. limit=22.5 2024-06-19 19:30:04,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=47160.666666666664, ans=0.125 2024-06-19 19:30:10,582 INFO [train.py:1028] (1/2) Epoch 3, batch 5500, loss[loss=0.5647, simple_loss=0.4503, pruned_loss=0.3396, over 12236.00 frames. ], tot_loss[loss=0.5158, simple_loss=0.4296, pruned_loss=0.301, over 2563660.02 frames. ], batch size: 241, lr: 1.73e-02, grad_scale: 2.0 2024-06-19 19:30:21,310 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.85 vs. limit=22.5 2024-06-19 19:30:31,762 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.496e+03 2.155e+03 2.560e+03 3.017e+03 5.450e+03, threshold=5.120e+03, percent-clipped=2.0 2024-06-19 19:30:39,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=47234.0, ans=0.0 2024-06-19 19:30:42,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47234.0, ans=0.1 2024-06-19 19:30:50,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=47270.666666666664, ans=0.0005933333333333346 2024-06-19 19:30:51,538 INFO [train.py:1028] (1/2) Epoch 3, batch 5550, loss[loss=0.5105, simple_loss=0.4384, pruned_loss=0.2913, over 13280.00 frames. ], tot_loss[loss=0.5139, simple_loss=0.4289, pruned_loss=0.2995, over 2567700.89 frames. ], batch size: 43, lr: 1.72e-02, grad_scale: 1.0 2024-06-19 19:31:01,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=47289.0, ans=0.125 2024-06-19 19:31:06,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=47289.0, ans=0.125 2024-06-19 19:31:14,172 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.43 vs. 
limit=10.0 2024-06-19 19:31:14,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=47325.666666666664, ans=0.125 2024-06-19 19:31:22,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=47344.0, ans=0.125 2024-06-19 19:31:27,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=47362.333333333336, ans=0.125 2024-06-19 19:31:28,458 INFO [train.py:1028] (1/2) Epoch 3, batch 5600, loss[loss=0.4975, simple_loss=0.4219, pruned_loss=0.2865, over 13233.00 frames. ], tot_loss[loss=0.5092, simple_loss=0.4261, pruned_loss=0.2962, over 2569889.40 frames. ], batch size: 89, lr: 1.72e-02, grad_scale: 2.0 2024-06-19 19:31:33,117 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.02 vs. limit=10.0 2024-06-19 19:31:35,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=47380.666666666664, ans=0.125 2024-06-19 19:31:35,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=47380.666666666664, ans=0.125 2024-06-19 19:31:37,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=47380.666666666664, ans=0.125 2024-06-19 19:31:42,308 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.51 vs. limit=15.0 2024-06-19 19:31:42,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=47399.0, ans=0.125 2024-06-19 19:31:45,228 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.504e+03 2.337e+03 2.666e+03 3.079e+03 1.375e+04, threshold=5.333e+03, percent-clipped=6.0 2024-06-19 19:31:46,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=47399.0, ans=0.125 2024-06-19 19:31:47,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.76 vs. limit=22.5 2024-06-19 19:31:50,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=47417.333333333336, ans=0.1 2024-06-19 19:31:51,838 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2024-06-19 19:31:58,555 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.17 vs. limit=22.5 2024-06-19 19:32:00,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=47435.666666666664, ans=0.125 2024-06-19 19:32:02,677 INFO [train.py:1028] (1/2) Epoch 3, batch 5650, loss[loss=0.5638, simple_loss=0.4583, pruned_loss=0.3346, over 12490.00 frames. ], tot_loss[loss=0.5079, simple_loss=0.4257, pruned_loss=0.295, over 2574821.92 frames. 
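
The per-batch `loss[...]` entries above can be reproduced from `simple_loss` and `pruned_loss` using the configured `simple_loss_scale` of 0.5. A minimal sketch of that relation (not the icefall implementation; warm-up scheduling can change the effective scales early in training):

```python
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    """loss = simple_loss_scale * simple_loss + pruned_loss."""
    return simple_loss_scale * simple_loss + pruned_loss

# The values logged for epoch 3, batch 5650 above:
assert abs(combine_losses(0.4583, 0.3346) - 0.5638) < 1e-3
```
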
], batch size: 202, lr: 1.72e-02, grad_scale: 1.0 2024-06-19 19:32:06,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=12.0 2024-06-19 19:32:10,480 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.97 vs. limit=15.0 2024-06-19 19:32:29,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=47527.333333333336, ans=0.2 2024-06-19 19:32:39,763 INFO [train.py:1028] (1/2) Epoch 3, batch 5700, loss[loss=0.5402, simple_loss=0.4595, pruned_loss=0.3104, over 13311.00 frames. ], tot_loss[loss=0.5052, simple_loss=0.4244, pruned_loss=0.293, over 2578643.32 frames. ], batch size: 63, lr: 1.72e-02, grad_scale: 2.0 2024-06-19 19:32:42,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=47545.666666666664, ans=0.0005335507246376821 2024-06-19 19:32:55,958 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.564e+03 2.274e+03 2.536e+03 2.749e+03 5.319e+03, threshold=5.072e+03, percent-clipped=0.0 2024-06-19 19:32:56,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=47582.333333333336, ans=0.2 2024-06-19 19:33:00,344 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.70 vs. limit=15.0 2024-06-19 19:33:12,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47619.0, ans=0.1 2024-06-19 19:33:14,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=47619.0, ans=0.125 2024-06-19 19:33:16,045 INFO [train.py:1028] (1/2) Epoch 3, batch 5750, loss[loss=0.532, simple_loss=0.4385, pruned_loss=0.3128, over 12719.00 frames. ], tot_loss[loss=0.5048, simple_loss=0.4246, pruned_loss=0.2925, over 2579138.54 frames. ], batch size: 176, lr: 1.72e-02, grad_scale: 1.0 2024-06-19 19:33:20,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=47637.333333333336, ans=0.0 2024-06-19 19:33:21,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.73 vs. limit=15.0 2024-06-19 19:33:23,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=47655.666666666664, ans=0.025 2024-06-19 19:33:28,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.35 vs. limit=10.0 2024-06-19 19:33:29,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=47674.0, ans=0.125 2024-06-19 19:33:30,527 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.09 vs. 
limit=22.5 2024-06-19 19:33:41,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=47692.333333333336, ans=0.0 2024-06-19 19:33:43,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=47710.666666666664, ans=0.1 2024-06-19 19:33:47,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=47710.666666666664, ans=0.000497681159420291 2024-06-19 19:33:49,471 INFO [train.py:1028] (1/2) Epoch 3, batch 5800, loss[loss=0.564, simple_loss=0.459, pruned_loss=0.3345, over 12744.00 frames. ], tot_loss[loss=0.5076, simple_loss=0.4264, pruned_loss=0.2944, over 2578430.18 frames. ], batch size: 176, lr: 1.72e-02, grad_scale: 1.0 2024-06-19 19:33:51,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.81 vs. limit=12.0 2024-06-19 19:33:58,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=47747.333333333336, ans=10.0 2024-06-19 19:34:07,228 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.893e+03 2.831e+03 3.397e+03 4.042e+03 1.204e+04, threshold=6.795e+03, percent-clipped=5.0 2024-06-19 19:34:11,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.69 vs. limit=22.5 2024-06-19 19:34:16,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.62 vs. limit=10.0 2024-06-19 19:34:22,563 INFO [train.py:1028] (1/2) Epoch 3, batch 5850, loss[loss=0.5887, simple_loss=0.4811, pruned_loss=0.3481, over 12522.00 frames. ], tot_loss[loss=0.51, simple_loss=0.4287, pruned_loss=0.2957, over 2576021.83 frames. ], batch size: 202, lr: 1.71e-02, grad_scale: 0.5 2024-06-19 19:34:37,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.55 vs. limit=22.5 2024-06-19 19:34:40,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=47857.333333333336, ans=0.0 2024-06-19 19:34:57,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.95 vs. limit=15.0 2024-06-19 19:34:59,178 INFO [train.py:1028] (1/2) Epoch 3, batch 5900, loss[loss=0.5113, simple_loss=0.4311, pruned_loss=0.2957, over 13125.00 frames. ], tot_loss[loss=0.5143, simple_loss=0.4323, pruned_loss=0.2981, over 2577225.31 frames. ], batch size: 121, lr: 1.71e-02, grad_scale: 1.0 2024-06-19 19:35:00,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=47912.333333333336, ans=0.2 2024-06-19 19:35:10,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.95 vs. 
limit=15.0 2024-06-19 19:35:15,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=47930.666666666664, ans=0.00044985507246376837 2024-06-19 19:35:19,313 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.63 vs. limit=15.0 2024-06-19 19:35:19,904 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.48 vs. limit=10.0 2024-06-19 19:35:22,065 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.802e+03 2.737e+03 2.992e+03 3.452e+03 5.961e+03, threshold=5.985e+03, percent-clipped=0.0 2024-06-19 19:35:25,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=15.0 2024-06-19 19:35:36,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=48004.0, ans=0.125 2024-06-19 19:35:37,281 INFO [train.py:1028] (1/2) Epoch 3, batch 5950, loss[loss=0.4857, simple_loss=0.4137, pruned_loss=0.2788, over 13132.00 frames. ], tot_loss[loss=0.5147, simple_loss=0.4336, pruned_loss=0.2979, over 2580921.63 frames. ], batch size: 121, lr: 1.71e-02, grad_scale: 1.0 2024-06-19 19:35:39,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.89 vs. limit=10.0 2024-06-19 19:35:42,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.62 vs. limit=15.0 2024-06-19 19:35:47,061 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 19:35:52,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.26 vs. limit=22.5 2024-06-19 19:35:53,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.29 vs. limit=15.0 2024-06-19 19:35:54,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=48040.666666666664, ans=0.025 2024-06-19 19:35:57,037 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.09 vs. limit=6.0 2024-06-19 19:36:04,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.42 vs. limit=10.0 2024-06-19 19:36:07,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.44 vs. limit=10.0 2024-06-19 19:36:09,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=48077.333333333336, ans=0.0 2024-06-19 19:36:10,742 INFO [train.py:1028] (1/2) Epoch 3, batch 6000, loss[loss=0.6296, simple_loss=0.501, pruned_loss=0.3791, over 12112.00 frames. ], tot_loss[loss=0.5168, simple_loss=0.4355, pruned_loss=0.299, over 2573907.71 frames. 
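
Each `ScheduledFloat: name=..., batch_count=..., ans=...` entry reports a hyperparameter whose current value is a function of training progress. A minimal sketch, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the breakpoints below are illustrative, not taken from the recipe:

```python
import bisect

class PiecewiseLinearSchedule:
    """A ScheduledFloat-style value: linear interpolation between
    (batch_count, value) breakpoints, clamped outside the range."""
    def __init__(self, *points):
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count) - 1
        t = (batch_count - self.xs[i]) / (self.xs[i + 1] - self.xs[i])
        return self.ys[i] + t * (self.ys[i + 1] - self.ys[i])

# e.g. a skip rate decaying from 0.1 to 0.0 over the first 20k batches:
conv_skip_rate = PiecewiseLinearSchedule((0.0, 0.1), (20000.0, 0.0))
print(conv_skip_rate(48004.0))  # -> 0.0, like the `ans=0.0` entries above
```
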
], batch size: 240, lr: 1.71e-02, grad_scale: 2.0 2024-06-19 19:36:10,743 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 19:36:15,481 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.5007, 1.2542, 1.4374, 1.3740, 1.3872, 1.4263, 1.3832, 1.3494], device='cuda:1') 2024-06-19 19:36:15,605 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.8199, 1.3354, 2.1278, 1.6466], device='cuda:1') 2024-06-19 19:36:18,521 INFO [train.py:1060] (1/2) Epoch 3, validation: loss=0.4102, simple_loss=0.3952, pruned_loss=0.2126, over 351949.00 frames. 2024-06-19 19:36:18,522 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB 2024-06-19 19:36:21,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=48095.666666666664, ans=0.2 2024-06-19 19:36:23,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.53 vs. limit=12.0 2024-06-19 19:36:23,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=48095.666666666664, ans=0.125 2024-06-19 19:36:40,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=48132.333333333336, ans=0.025 2024-06-19 19:36:40,454 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+03 2.626e+03 2.925e+03 3.337e+03 6.181e+03, threshold=5.850e+03, percent-clipped=1.0 2024-06-19 19:36:45,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=48150.666666666664, ans=0.125 2024-06-19 19:36:51,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=48169.0, ans=0.125 2024-06-19 19:36:56,000 INFO [train.py:1028] (1/2) Epoch 3, batch 6050, loss[loss=0.477, simple_loss=0.4113, pruned_loss=0.2714, over 12889.00 frames. ], tot_loss[loss=0.5168, simple_loss=0.4364, pruned_loss=0.2986, over 2576653.47 frames. ], batch size: 39, lr: 1.71e-02, grad_scale: 2.0 2024-06-19 19:37:04,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=48205.666666666664, ans=0.125 2024-06-19 19:37:08,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=48205.666666666664, ans=0.125 2024-06-19 19:37:22,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=48242.333333333336, ans=0.95 2024-06-19 19:37:28,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=48260.666666666664, ans=0.1 2024-06-19 19:37:34,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=48279.0, ans=0.0003741304347826078 2024-06-19 19:37:34,570 INFO [train.py:1028] (1/2) Epoch 3, batch 6100, loss[loss=0.5169, simple_loss=0.4369, pruned_loss=0.2984, over 13113.00 frames. ], tot_loss[loss=0.5177, simple_loss=0.438, pruned_loss=0.2986, over 2578863.37 frames. 
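
The `zipformer.py:1858` diagnostics printed during validation report the entropy of each attention head's weight distribution; near-zero entropy means a head has collapsed onto single positions. A minimal sketch of that statistic, assuming weights of shape (num_heads, batch, query, key) normalized over the key axis — the shapes are assumptions, not taken from zipformer.py:

```python
import torch

def attn_weights_entropy(attn_weights: torch.Tensor,
                         eps: float = 1e-20) -> torch.Tensor:
    """Mean entropy of the attention distributions, one value per head."""
    p = attn_weights.clamp(min=eps)
    ent = -(p * p.log()).sum(dim=-1)   # entropy over the key axis
    return ent.mean(dim=(1, 2))        # average over batch and query positions

# 8 heads of uniform attention over 16 keys -> entropy = log(16) ~= 2.77 each
w = torch.full((8, 4, 10, 16), 1 / 16)
print(attn_weights_entropy(w))
```
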
], batch size: 121, lr: 1.71e-02, grad_scale: 2.0 2024-06-19 19:37:35,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=48279.0, ans=0.125 2024-06-19 19:37:36,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=48279.0, ans=0.07 2024-06-19 19:37:51,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=48315.666666666664, ans=0.125 2024-06-19 19:37:54,905 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.455e+03 2.147e+03 2.599e+03 2.971e+03 1.154e+04, threshold=5.197e+03, percent-clipped=1.0 2024-06-19 19:37:56,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=48334.0, ans=0.125 2024-06-19 19:38:00,237 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.09 vs. limit=15.0 2024-06-19 19:38:04,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48352.333333333336, ans=0.1 2024-06-19 19:38:05,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=48352.333333333336, ans=0.125 2024-06-19 19:38:08,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=48352.333333333336, ans=0.125 2024-06-19 19:38:09,232 INFO [train.py:1028] (1/2) Epoch 3, batch 6150, loss[loss=0.5254, simple_loss=0.4327, pruned_loss=0.3091, over 10951.00 frames. ], tot_loss[loss=0.5213, simple_loss=0.441, pruned_loss=0.3008, over 2578135.15 frames. ], batch size: 303, lr: 1.71e-02, grad_scale: 1.0 2024-06-19 19:38:19,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48389.0, ans=0.1 2024-06-19 19:38:25,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=48407.333333333336, ans=10.0 2024-06-19 19:38:28,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=48407.333333333336, ans=0.5 2024-06-19 19:38:35,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=48425.666666666664, ans=0.2 2024-06-19 19:38:43,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=48444.0, ans=0.125 2024-06-19 19:38:43,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.51 vs. limit=15.0 2024-06-19 19:38:46,614 INFO [train.py:1028] (1/2) Epoch 3, batch 6200, loss[loss=0.6275, simple_loss=0.527, pruned_loss=0.364, over 13225.00 frames. ], tot_loss[loss=0.5244, simple_loss=0.4438, pruned_loss=0.3025, over 2574913.91 frames. 
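
In the recurring `optim.py:487` warnings, the five numbers are the min/25%/50%/75%/max of recently observed gradient norms, and the logged threshold equals `Clipping_scale` times the median (this holds for the entries above, e.g. 2.0 x 2.599e+03 ~= 5.197e+03). A minimal sketch of that report; the norm-history bookkeeping around it is assumed:

```python
import torch

def grad_norm_report(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Quartiles of recent gradient norms, the clipping threshold
    (clipping_scale * median), and the fraction of norms above it."""
    q = torch.quantile(recent_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()
    pct = 100.0 * (recent_norms > threshold).float().mean().item()
    quartiles = " ".join(f"{v:.3e}" for v in q.tolist())
    print(f"Clipping_scale={clipping_scale}, grad-norm quartiles {quartiles}, "
          f"threshold={threshold:.3e}, percent-clipped={pct:.1f}")

grad_norm_report(1.0e3 * (1.0 + torch.rand(200) ** 4 * 10))
```
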
], batch size: 89, lr: 1.70e-02, grad_scale: 2.0 2024-06-19 19:38:50,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48462.333333333336, ans=0.1 2024-06-19 19:38:52,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=48480.666666666664, ans=0.0 2024-06-19 19:38:58,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=48480.666666666664, ans=0.2 2024-06-19 19:39:06,065 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.250e+03 2.119e+03 2.354e+03 2.707e+03 9.483e+03, threshold=4.709e+03, percent-clipped=2.0 2024-06-19 19:39:06,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=48517.333333333336, ans=0.025 2024-06-19 19:39:12,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=48517.333333333336, ans=0.00032231884057971026 2024-06-19 19:39:14,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.81 vs. limit=6.0 2024-06-19 19:39:16,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.92 vs. limit=6.0 2024-06-19 19:39:18,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=48535.666666666664, ans=0.1 2024-06-19 19:39:23,481 INFO [train.py:1028] (1/2) Epoch 3, batch 6250, loss[loss=0.522, simple_loss=0.4412, pruned_loss=0.3014, over 13216.00 frames. ], tot_loss[loss=0.525, simple_loss=0.4449, pruned_loss=0.3025, over 2567451.59 frames. ], batch size: 83, lr: 1.70e-02, grad_scale: 0.5 2024-06-19 19:39:23,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48554.0, ans=0.125 2024-06-19 19:39:33,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=48572.333333333336, ans=0.0 2024-06-19 19:39:38,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.52 vs. limit=10.0 2024-06-19 19:39:42,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=48609.0, ans=0.125 2024-06-19 19:39:43,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=48609.0, ans=0.125 2024-06-19 19:39:46,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=16.12 vs. limit=15.0 2024-06-19 19:39:46,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48609.0, ans=0.1 2024-06-19 19:39:56,531 INFO [train.py:1028] (1/2) Epoch 3, batch 6300, loss[loss=0.4809, simple_loss=0.4283, pruned_loss=0.2667, over 11485.00 frames. ], tot_loss[loss=0.5252, simple_loss=0.4458, pruned_loss=0.3023, over 2562814.59 frames. 
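
Each `Whitening: ... metric=X vs. limit=Y` entry fires when a whiteness statistic of a module's output exceeds its limit, triggering a corrective gradient. A minimal sketch of one plausible such metric, the mean squared eigenvalue of the channel covariance over the squared mean eigenvalue (1.0 for perfectly white features); this is an assumed form, not necessarily the one in scaling.py:

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Whiteness of activations x with shape (num_frames, num_channels):
    mean(eig^2) / mean(eig)^2 of the channel covariance, which is 1.0 for
    white features and grows with channel correlation or uneven scaling."""
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()

x = torch.randn(2000, 192)
print(whitening_metric(x))                                   # ~1: white
print(whitening_metric(x * torch.linspace(0.1, 10.0, 192)))  # far above 1
```
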
], batch size: 16, lr: 1.70e-02, grad_scale: 1.0 2024-06-19 19:40:08,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=48664.0, ans=0.125 2024-06-19 19:40:15,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=48682.333333333336, ans=0.125 2024-06-19 19:40:18,262 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+03 2.445e+03 3.043e+03 3.791e+03 8.866e+03, threshold=6.086e+03, percent-clipped=9.0 2024-06-19 19:40:23,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=48719.0, ans=0.125 2024-06-19 19:40:30,116 INFO [train.py:1028] (1/2) Epoch 3, batch 6350, loss[loss=0.6348, simple_loss=0.5213, pruned_loss=0.3741, over 12515.00 frames. ], tot_loss[loss=0.5244, simple_loss=0.4465, pruned_loss=0.3012, over 2572937.88 frames. ], batch size: 202, lr: 1.70e-02, grad_scale: 0.5 2024-06-19 19:40:36,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=48755.666666666664, ans=0.125 2024-06-19 19:41:07,205 INFO [train.py:1028] (1/2) Epoch 3, batch 6400, loss[loss=0.4986, simple_loss=0.4323, pruned_loss=0.2824, over 13209.00 frames. ], tot_loss[loss=0.5256, simple_loss=0.4482, pruned_loss=0.3015, over 2574305.44 frames. ], batch size: 67, lr: 1.70e-02, grad_scale: 1.0 2024-06-19 19:41:22,667 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.13 vs. limit=10.0 2024-06-19 19:41:24,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=48847.333333333336, ans=0.2 2024-06-19 19:41:29,076 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.79 vs. limit=22.5 2024-06-19 19:41:33,201 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.413e+03 2.468e+03 2.756e+03 3.196e+03 9.569e+03, threshold=5.512e+03, percent-clipped=2.0 2024-06-19 19:41:33,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=48884.0, ans=0.0 2024-06-19 19:41:34,381 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.51 vs. limit=15.0 2024-06-19 19:41:38,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=48902.333333333336, ans=0.125 2024-06-19 19:41:45,156 INFO [train.py:1028] (1/2) Epoch 3, batch 6450, loss[loss=0.5719, simple_loss=0.4746, pruned_loss=0.3346, over 12542.00 frames. ], tot_loss[loss=0.5286, simple_loss=0.451, pruned_loss=0.3031, over 2580622.13 frames. ], batch size: 202, lr: 1.70e-02, grad_scale: 1.0 2024-06-19 19:41:48,428 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.27 vs. 
limit=15.0 2024-06-19 19:41:51,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=48939.0, ans=0.015 2024-06-19 19:41:52,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.79 vs. limit=12.0 2024-06-19 19:41:53,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=48939.0, ans=0.125 2024-06-19 19:42:05,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48975.666666666664, ans=0.1 2024-06-19 19:42:08,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=48975.666666666664, ans=0.125 2024-06-19 19:42:13,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=48994.0, ans=0.07 2024-06-19 19:42:16,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=48994.0, ans=0.00021869565217391349 2024-06-19 19:42:18,262 INFO [train.py:1028] (1/2) Epoch 3, batch 6500, loss[loss=0.599, simple_loss=0.4889, pruned_loss=0.3546, over 10742.00 frames. ], tot_loss[loss=0.5299, simple_loss=0.4529, pruned_loss=0.3034, over 2583281.00 frames. ], batch size: 303, lr: 1.69e-02, grad_scale: 2.0 2024-06-19 19:42:19,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=49012.333333333336, ans=0.2 2024-06-19 19:42:21,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=49012.333333333336, ans=0.1 2024-06-19 19:42:23,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49012.333333333336, ans=0.1 2024-06-19 19:42:24,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=49030.666666666664, ans=0.0 2024-06-19 19:42:27,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=49030.666666666664, ans=0.0 2024-06-19 19:42:39,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.66 vs. limit=15.0 2024-06-19 19:42:39,852 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.445e+03 2.653e+03 3.062e+03 3.365e+03 9.178e+03, threshold=6.123e+03, percent-clipped=3.0 2024-06-19 19:42:42,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=49067.333333333336, ans=0.2 2024-06-19 19:42:51,153 INFO [train.py:1028] (1/2) Epoch 3, batch 6550, loss[loss=0.5438, simple_loss=0.4703, pruned_loss=0.3087, over 12480.00 frames. ], tot_loss[loss=0.5304, simple_loss=0.4539, pruned_loss=0.3035, over 2587782.47 frames. ], batch size: 22, lr: 1.69e-02, grad_scale: 1.0 2024-06-19 19:43:01,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.36 vs. 
limit=15.0 2024-06-19 19:43:10,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=49140.666666666664, ans=0.0 2024-06-19 19:43:15,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=49159.0, ans=0.0 2024-06-19 19:43:18,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=49159.0, ans=0.125 2024-06-19 19:43:18,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=49159.0, ans=0.125 2024-06-19 19:43:22,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=49177.333333333336, ans=0.125 2024-06-19 19:43:30,220 INFO [train.py:1028] (1/2) Epoch 3, batch 6600, loss[loss=0.5302, simple_loss=0.4523, pruned_loss=0.3041, over 13212.00 frames. ], tot_loss[loss=0.5289, simple_loss=0.4533, pruned_loss=0.3022, over 2590136.18 frames. ], batch size: 72, lr: 1.69e-02, grad_scale: 2.0 2024-06-19 19:43:31,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=49195.666666666664, ans=0.1 2024-06-19 19:43:33,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.07 vs. limit=15.0 2024-06-19 19:43:38,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=49214.0, ans=0.1 2024-06-19 19:43:40,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=49214.0, ans=0.125 2024-06-19 19:43:42,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=49232.333333333336, ans=0.125 2024-06-19 19:43:47,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=12.43 vs. limit=12.0 2024-06-19 19:43:53,026 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.571e+03 2.337e+03 2.747e+03 3.259e+03 5.199e+03, threshold=5.494e+03, percent-clipped=0.0 2024-06-19 19:43:53,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=49250.666666666664, ans=0.125 2024-06-19 19:43:53,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=49250.666666666664, ans=0.025 2024-06-19 19:43:58,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=49269.0, ans=0.0 2024-06-19 19:44:00,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=49269.0, ans=0.00015891304347826103 2024-06-19 19:44:03,452 INFO [train.py:1028] (1/2) Epoch 3, batch 6650, loss[loss=0.5724, simple_loss=0.485, pruned_loss=0.3298, over 12913.00 frames. ], tot_loss[loss=0.5326, simple_loss=0.4569, pruned_loss=0.3042, over 2585037.11 frames. 
], batch size: 158, lr: 1.69e-02, grad_scale: 1.0 2024-06-19 19:44:03,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=49287.333333333336, ans=0.125 2024-06-19 19:44:21,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=49324.0, ans=0.125 2024-06-19 19:44:25,183 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.796e+01 2024-06-19 19:44:25,466 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0 2024-06-19 19:44:28,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-19 19:44:30,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.83 vs. limit=15.0 2024-06-19 19:44:31,133 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.663e+02 2024-06-19 19:44:31,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=49360.666666666664, ans=0.025 2024-06-19 19:44:34,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49360.666666666664, ans=0.1 2024-06-19 19:44:36,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=49360.666666666664, ans=0.025 2024-06-19 19:44:37,261 INFO [train.py:1028] (1/2) Epoch 3, batch 6700, loss[loss=0.5343, simple_loss=0.4521, pruned_loss=0.3082, over 12819.00 frames. ], tot_loss[loss=0.533, simple_loss=0.4578, pruned_loss=0.3041, over 2583826.53 frames. ], batch size: 176, lr: 1.69e-02, grad_scale: 2.0 2024-06-19 19:44:40,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=49379.0, ans=0.2 2024-06-19 19:44:45,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=49397.333333333336, ans=0.0 2024-06-19 19:44:49,511 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.89 vs. limit=15.0 2024-06-19 19:44:53,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=49415.666666666664, ans=0.0 2024-06-19 19:45:04,092 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+03 2.205e+03 2.587e+03 2.916e+03 9.266e+03, threshold=5.174e+03, percent-clipped=2.0 2024-06-19 19:45:04,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=49434.0, ans=0.04949747468305833 2024-06-19 19:45:14,483 INFO [train.py:1028] (1/2) Epoch 3, batch 6750, loss[loss=0.664, simple_loss=0.5311, pruned_loss=0.3984, over 12269.00 frames. ], tot_loss[loss=0.5321, simple_loss=0.4574, pruned_loss=0.3033, over 2576915.53 frames. 
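
The `WithLoss: name=..., loss-sum=...` entries report an auxiliary penalty attached to a tensor of attention weights while the tensor itself passes through unchanged. A minimal sketch of that interface only; the penalty below (distance from uniform attention) is purely illustrative and the real module differs:

```python
import torch

class WithAuxLoss(torch.nn.Module):
    """Pass attention weights through unchanged while recording an
    auxiliary penalty on them for the trainer to add to the total loss."""
    def __init__(self, name: str, scale: float = 1e-4):
        super().__init__()
        self.name = name
        self.scale = scale
        self.aux_loss = torch.tensor(0.0)

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        uniform = 1.0 / attn_weights.shape[-1]
        self.aux_loss = self.scale * (attn_weights - uniform).abs().sum()
        print(f"WithLoss: name={self.name}, "
              f"loss-sum={self.aux_loss.item():.3e}")
        return attn_weights  # unchanged; trainer adds self.aux_loss to the loss
```
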
], batch size: 241, lr: 1.69e-02, grad_scale: 1.0 2024-06-19 19:45:20,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49489.0, ans=0.1 2024-06-19 19:45:21,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=49489.0, ans=0.09899494936611666 2024-06-19 19:45:22,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=49489.0, ans=0.0 2024-06-19 19:45:34,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.24 vs. limit=6.0 2024-06-19 19:45:38,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=49525.666666666664, ans=0.125 2024-06-19 19:45:39,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.67 vs. limit=22.5 2024-06-19 19:45:40,076 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.921e+03 2024-06-19 19:45:50,684 INFO [train.py:1028] (1/2) Epoch 3, batch 6800, loss[loss=0.5562, simple_loss=0.4706, pruned_loss=0.3209, over 13222.00 frames. ], tot_loss[loss=0.5329, simple_loss=0.4588, pruned_loss=0.3035, over 2579367.87 frames. ], batch size: 67, lr: 1.69e-02, grad_scale: 2.0 2024-06-19 19:45:55,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=49562.333333333336, ans=0.125 2024-06-19 19:46:06,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2024-06-19 19:46:08,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.89 vs. limit=15.0 2024-06-19 19:46:08,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=49599.0, ans=0.025 2024-06-19 19:46:12,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49617.333333333336, ans=0.1 2024-06-19 19:46:14,853 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.661e+03 2.718e+03 3.103e+03 3.677e+03 1.008e+04, threshold=6.206e+03, percent-clipped=6.0 2024-06-19 19:46:23,654 INFO [train.py:1028] (1/2) Epoch 3, batch 6850, loss[loss=0.5309, simple_loss=0.4673, pruned_loss=0.2972, over 13236.00 frames. ], tot_loss[loss=0.5322, simple_loss=0.459, pruned_loss=0.3027, over 2583259.28 frames. ], batch size: 63, lr: 1.68e-02, grad_scale: 0.5 2024-06-19 19:46:25,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. 
limit=15.0 2024-06-19 19:46:30,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=49672.333333333336, ans=15.0 2024-06-19 19:46:31,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=49672.333333333336, ans=0.0 2024-06-19 19:46:47,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=49709.0, ans=0.0 2024-06-19 19:46:55,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=49727.333333333336, ans=0.025 2024-06-19 19:46:57,467 INFO [train.py:1028] (1/2) Epoch 3, batch 6900, loss[loss=0.5423, simple_loss=0.4719, pruned_loss=0.3064, over 13032.00 frames. ], tot_loss[loss=0.5325, simple_loss=0.4597, pruned_loss=0.3026, over 2585161.10 frames. ], batch size: 48, lr: 1.68e-02, grad_scale: 1.0 2024-06-19 19:47:03,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.39 vs. limit=12.0 2024-06-19 19:47:04,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=49764.0, ans=0.125 2024-06-19 19:47:14,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=49782.333333333336, ans=0.1 2024-06-19 19:47:15,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=12.21 vs. limit=12.0 2024-06-19 19:47:25,324 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+03 2.568e+03 2.922e+03 3.526e+03 1.182e+04, threshold=5.843e+03, percent-clipped=5.0 2024-06-19 19:47:38,581 INFO [train.py:1028] (1/2) Epoch 3, batch 6950, loss[loss=0.4938, simple_loss=0.4217, pruned_loss=0.283, over 11333.00 frames. ], tot_loss[loss=0.5321, simple_loss=0.4602, pruned_loss=0.3021, over 2579572.32 frames. ], batch size: 16, lr: 1.68e-02, grad_scale: 1.0 2024-06-19 19:47:44,783 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2024-06-19 19:47:46,713 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=19.76 vs. limit=15.0 2024-06-19 19:47:52,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=49874.0, ans=0.125 2024-06-19 19:47:53,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=49874.0, ans=0.125 2024-06-19 19:48:00,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=49892.333333333336, ans=0.2 2024-06-19 19:48:01,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=49892.333333333336, ans=0.125 2024-06-19 19:48:05,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=49910.666666666664, ans=22.5 2024-06-19 19:48:11,850 INFO [train.py:1028] (1/2) Epoch 3, batch 7000, loss[loss=0.5816, simple_loss=0.4919, pruned_loss=0.3356, over 12916.00 frames. 
], tot_loss[loss=0.5313, simple_loss=0.4596, pruned_loss=0.3015, over 2576110.56 frames. ], batch size: 158, lr: 1.68e-02, grad_scale: 2.0 2024-06-19 19:48:37,286 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.217e+03 1.891e+03 2.256e+03 2.678e+03 7.422e+03, threshold=4.512e+03, percent-clipped=1.0 2024-06-19 19:48:45,701 INFO [train.py:1028] (1/2) Epoch 3, batch 7050, loss[loss=0.5751, simple_loss=0.4825, pruned_loss=0.3338, over 12745.00 frames. ], tot_loss[loss=0.5302, simple_loss=0.4596, pruned_loss=0.3004, over 2582351.01 frames. ], batch size: 176, lr: 1.68e-02, grad_scale: 1.0 2024-06-19 19:48:46,783 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.99 vs. limit=15.0 2024-06-19 19:48:49,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.08 vs. limit=15.0 2024-06-19 19:48:50,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=50020.666666666664, ans=0.125 2024-06-19 19:48:52,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=50039.0, ans=0.0 2024-06-19 19:48:53,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.31 vs. limit=10.0 2024-06-19 19:49:13,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=15.0 2024-06-19 19:49:16,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.68 vs. limit=15.0 2024-06-19 19:49:19,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=50094.0, ans=0.125 2024-06-19 19:49:21,173 INFO [train.py:1028] (1/2) Epoch 3, batch 7100, loss[loss=0.5135, simple_loss=0.452, pruned_loss=0.2875, over 13165.00 frames. ], tot_loss[loss=0.5278, simple_loss=0.4583, pruned_loss=0.2987, over 2574819.20 frames. ], batch size: 112, lr: 1.68e-02, grad_scale: 2.0 2024-06-19 19:49:30,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=50130.666666666664, ans=0.125 2024-06-19 19:49:36,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.03 vs. limit=10.0 2024-06-19 19:49:37,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=50130.666666666664, ans=0.125 2024-06-19 19:49:48,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=50167.333333333336, ans=0.125 2024-06-19 19:49:50,857 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.390e+03 1.978e+03 2.223e+03 2.581e+03 4.326e+03, threshold=4.446e+03, percent-clipped=0.0 2024-06-19 19:49:58,791 INFO [train.py:1028] (1/2) Epoch 3, batch 7150, loss[loss=0.614, simple_loss=0.5145, pruned_loss=0.3568, over 12593.00 frames. ], tot_loss[loss=0.5267, simple_loss=0.4582, pruned_loss=0.2976, over 2573225.02 frames. 
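
`tot_loss[... over N frames.]` is a running, frame-weighted average rather than a single-batch value; the fractional frame counts (e.g. 2573225.02) suggest the accumulated sums are decayed each step. A minimal sketch under that assumption — the decay constant is a guess tied to the reset_interval of 200 from the configuration:

```python
class RunningFrameLoss:
    """Decayed, frame-weighted running loss: per update, scale the
    accumulated frame count and loss sum by `decay`, then add the batch."""
    def __init__(self, decay: float = 1.0 - 1.0 / 200):
        self.decay = decay
        self.frames = 0.0
        self.loss_sum = 0.0

    def update(self, loss_per_frame: float, num_frames: float) -> None:
        self.frames = self.frames * self.decay + num_frames
        self.loss_sum = self.loss_sum * self.decay + loss_per_frame * num_frames

    @property
    def average(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tot = RunningFrameLoss()
for loss, frames in [(0.546, 13067.0), (0.469, 11736.0), (0.582, 12195.0)]:
    tot.update(loss, frames)
print(f"tot_loss[loss={tot.average:.4f}, over {tot.frames:.2f} frames. ]")
```
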
], batch size: 202, lr: 1.68e-02, grad_scale: 2.0 2024-06-19 19:50:07,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.37 vs. limit=12.0 2024-06-19 19:50:07,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.17 vs. limit=10.0 2024-06-19 19:50:13,273 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=17.91 vs. limit=15.0 2024-06-19 19:50:16,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=50240.666666666664, ans=0.125 2024-06-19 19:50:20,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=50259.0, ans=0.0 2024-06-19 19:50:21,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=50259.0, ans=0.04949747468305833 2024-06-19 19:50:23,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.06 vs. limit=15.0 2024-06-19 19:50:24,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=50259.0, ans=0.0 2024-06-19 19:50:26,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=50277.333333333336, ans=0.125 2024-06-19 19:50:31,534 INFO [train.py:1028] (1/2) Epoch 3, batch 7200, loss[loss=0.5487, simple_loss=0.4793, pruned_loss=0.309, over 13191.00 frames. ], tot_loss[loss=0.5264, simple_loss=0.4587, pruned_loss=0.2971, over 2577840.70 frames. ], batch size: 112, lr: 1.67e-02, grad_scale: 2.0 2024-06-19 19:50:35,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=50295.666666666664, ans=0.125 2024-06-19 19:50:38,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50314.0, ans=0.125 2024-06-19 19:50:44,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50332.333333333336, ans=0.1 2024-06-19 19:50:53,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=50350.666666666664, ans=0.1 2024-06-19 19:50:57,905 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+03 2.198e+03 2.572e+03 2.959e+03 1.036e+04, threshold=5.145e+03, percent-clipped=4.0 2024-06-19 19:51:04,810 INFO [train.py:1028] (1/2) Epoch 3, batch 7250, loss[loss=0.4946, simple_loss=0.4505, pruned_loss=0.2693, over 12962.00 frames. ], tot_loss[loss=0.5232, simple_loss=0.4574, pruned_loss=0.2945, over 2578658.43 frames. 
], batch size: 36, lr: 1.67e-02, grad_scale: 1.0 2024-06-19 19:51:28,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=50442.333333333336, ans=0.125 2024-06-19 19:51:31,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=50442.333333333336, ans=0.05 2024-06-19 19:51:31,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=50442.333333333336, ans=0.125 2024-06-19 19:51:34,745 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.47 vs. limit=10.0 2024-06-19 19:51:41,578 INFO [train.py:1028] (1/2) Epoch 3, batch 7300, loss[loss=0.502, simple_loss=0.4468, pruned_loss=0.2786, over 12888.00 frames. ], tot_loss[loss=0.5236, simple_loss=0.4585, pruned_loss=0.2943, over 2577806.21 frames. ], batch size: 36, lr: 1.67e-02, grad_scale: 2.0 2024-06-19 19:51:49,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=50479.0, ans=0.125 2024-06-19 19:51:53,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=50497.333333333336, ans=0.125 2024-06-19 19:51:54,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=50497.333333333336, ans=0.025 2024-06-19 19:51:59,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=50515.666666666664, ans=0.02 2024-06-19 19:52:08,491 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.71 vs. limit=15.0 2024-06-19 19:52:12,623 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.381e+03 2.035e+03 2.430e+03 2.938e+03 7.032e+03, threshold=4.859e+03, percent-clipped=2.0 2024-06-19 19:52:15,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=50552.333333333336, ans=0.0 2024-06-19 19:52:15,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.42 vs. limit=15.0 2024-06-19 19:52:18,157 INFO [train.py:1028] (1/2) Epoch 3, batch 7350, loss[loss=0.5658, simple_loss=0.4931, pruned_loss=0.3192, over 13299.00 frames. ], tot_loss[loss=0.5234, simple_loss=0.4586, pruned_loss=0.2941, over 2579002.87 frames. ], batch size: 46, lr: 1.67e-02, grad_scale: 0.5 2024-06-19 19:52:19,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.08 vs. limit=15.0 2024-06-19 19:52:33,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.23 vs. limit=15.0 2024-06-19 19:52:36,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.30 vs. 
limit=22.5 2024-06-19 19:52:37,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=50625.666666666664, ans=0.0 2024-06-19 19:52:38,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50625.666666666664, ans=0.125 2024-06-19 19:52:42,705 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.42 vs. limit=15.0 2024-06-19 19:52:49,442 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.67 vs. limit=15.0 2024-06-19 19:52:51,398 INFO [train.py:1028] (1/2) Epoch 3, batch 7400, loss[loss=0.56, simple_loss=0.4936, pruned_loss=0.3132, over 13263.00 frames. ], tot_loss[loss=0.5229, simple_loss=0.4588, pruned_loss=0.2935, over 2584317.43 frames. ], batch size: 63, lr: 1.67e-02, grad_scale: 1.0 2024-06-19 19:52:55,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=50662.333333333336, ans=10.0 2024-06-19 19:52:56,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=50662.333333333336, ans=0.125 2024-06-19 19:52:56,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=50662.333333333336, ans=0.2 2024-06-19 19:52:59,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=50680.666666666664, ans=0.125 2024-06-19 19:52:59,438 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=13.70 vs. limit=12.0 2024-06-19 19:53:05,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.98 vs. limit=10.0 2024-06-19 19:53:07,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50699.0, ans=0.1 2024-06-19 19:53:15,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=50717.333333333336, ans=0.125 2024-06-19 19:53:19,809 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.169e+03 1.896e+03 2.230e+03 2.793e+03 9.441e+03, threshold=4.460e+03, percent-clipped=1.0 2024-06-19 19:53:25,313 INFO [train.py:1028] (1/2) Epoch 3, batch 7450, loss[loss=0.4742, simple_loss=0.4214, pruned_loss=0.2634, over 12630.00 frames. ], tot_loss[loss=0.5206, simple_loss=0.4575, pruned_loss=0.2919, over 2578428.54 frames. ], batch size: 29, lr: 1.67e-02, grad_scale: 1.0 2024-06-19 19:53:26,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50754.0, ans=0.125 2024-06-19 19:53:34,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.84 vs. limit=10.0 2024-06-19 19:53:58,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.47 vs. 
limit=22.5 2024-06-19 19:53:59,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=12.0 2024-06-19 19:54:01,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2024-06-19 19:54:02,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.37 vs. limit=15.0 2024-06-19 19:54:06,481 INFO [train.py:1028] (1/2) Epoch 3, batch 7500, loss[loss=0.5971, simple_loss=0.4995, pruned_loss=0.3474, over 10531.00 frames. ], tot_loss[loss=0.522, simple_loss=0.4591, pruned_loss=0.2924, over 2576824.90 frames. ], batch size: 303, lr: 1.67e-02, grad_scale: 2.0 2024-06-19 19:54:06,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=50845.666666666664, ans=0.125 2024-06-19 19:54:07,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=50845.666666666664, ans=0.0 2024-06-19 19:54:08,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=50845.666666666664, ans=0.125 2024-06-19 19:54:11,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=50845.666666666664, ans=0.125 2024-06-19 19:54:15,857 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=36.15 vs. limit=22.5 2024-06-19 19:54:18,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=50864.0, ans=0.125 2024-06-19 19:54:23,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=50882.333333333336, ans=0.1 2024-06-19 19:54:24,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=50882.333333333336, ans=0.125 2024-06-19 19:54:28,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.08 vs. limit=15.0 2024-06-19 19:54:32,413 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.53 vs. limit=15.0 2024-06-19 19:54:32,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.60 vs. limit=15.0 2024-06-19 19:54:34,404 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.688e+02 1.675e+03 1.923e+03 2.263e+03 4.119e+03, threshold=3.846e+03, percent-clipped=0.0 2024-06-19 19:54:39,026 INFO [train.py:1028] (1/2) Epoch 3, batch 7550, loss[loss=0.5237, simple_loss=0.4598, pruned_loss=0.2938, over 12965.00 frames. ], tot_loss[loss=0.522, simple_loss=0.4594, pruned_loss=0.2923, over 2576511.16 frames. 
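The ScheduledFloat entries that dominate the log report, for each named sub-module, the current value (ans=...) of a hyperparameter scheduled as a function of batch_count: dropout rates, skip rates, balancer probabilities and so on change as training progresses. As a rough sketch of the idea only (icefall's actual ScheduledFloat lives in scaling.py and has a different constructor and model integration), a piecewise-linear schedule keyed on batch count looks like this:

class ScheduledFloat:
    # Minimal piecewise-linear schedule keyed on batch count (illustrative;
    # not icefall's real API). points: (batch_count, value) pairs.
    def __init__(self, *points):
        self.points = sorted(points)

    def value_at(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value_at(50699.0))  # -> 0.1, cf. the dropout_p ... ans=0.1 entry above

The endpoints here (0.3 decaying to 0.1 by batch 20000) are invented for the example; the logged ans values are simply what the real schedules evaluate to at the logged batch_count.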
], batch size: 158, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:54:39,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=50937.333333333336, ans=0.0 2024-06-19 19:54:43,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.91 vs. limit=15.0 2024-06-19 19:54:49,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.41 vs. limit=10.0 2024-06-19 19:54:50,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=50955.666666666664, ans=0.125 2024-06-19 19:54:52,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50974.0, ans=0.1 2024-06-19 19:54:55,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=50974.0, ans=0.0 2024-06-19 19:54:57,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.67 vs. limit=10.0 2024-06-19 19:54:59,748 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.78 vs. limit=15.0 2024-06-19 19:55:02,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=50992.333333333336, ans=0.125 2024-06-19 19:55:05,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=51010.666666666664, ans=0.125 2024-06-19 19:55:12,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=51029.0, ans=0.2 2024-06-19 19:55:12,569 INFO [train.py:1028] (1/2) Epoch 3, batch 7600, loss[loss=0.4688, simple_loss=0.425, pruned_loss=0.2563, over 13191.00 frames. ], tot_loss[loss=0.5234, simple_loss=0.4605, pruned_loss=0.2932, over 2574832.17 frames. ], batch size: 83, lr: 1.66e-02, grad_scale: 2.0 2024-06-19 19:55:19,069 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.15 vs. limit=15.0 2024-06-19 19:55:41,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=51084.0, ans=0.125 2024-06-19 19:55:44,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=51102.333333333336, ans=0.0 2024-06-19 19:55:46,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.08 vs. limit=15.0 2024-06-19 19:55:46,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=51102.333333333336, ans=0.125 2024-06-19 19:55:47,106 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.118e+03 2.116e+03 2.487e+03 3.103e+03 7.993e+03, threshold=4.974e+03, percent-clipped=9.0 2024-06-19 19:55:53,825 INFO [train.py:1028] (1/2) Epoch 3, batch 7650, loss[loss=0.538, simple_loss=0.4765, pruned_loss=0.2997, over 12937.00 frames. 
], tot_loss[loss=0.5222, simple_loss=0.4604, pruned_loss=0.292, over 2572474.75 frames. ], batch size: 33, lr: 1.66e-02, grad_scale: 0.5 2024-06-19 19:55:54,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=12.18 vs. limit=12.0 2024-06-19 19:55:55,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=51120.666666666664, ans=0.125 2024-06-19 19:55:59,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=51120.666666666664, ans=0.125 2024-06-19 19:56:01,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=51139.0, ans=0.125 2024-06-19 19:56:02,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=51139.0, ans=0.125 2024-06-19 19:56:11,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=51157.333333333336, ans=0.125 2024-06-19 19:56:12,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=51157.333333333336, ans=0.0 2024-06-19 19:56:17,037 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.51 vs. limit=15.0 2024-06-19 19:56:24,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=51194.0, ans=0.0 2024-06-19 19:56:27,797 INFO [train.py:1028] (1/2) Epoch 3, batch 7700, loss[loss=0.5142, simple_loss=0.4704, pruned_loss=0.279, over 13234.00 frames. ], tot_loss[loss=0.5219, simple_loss=0.4605, pruned_loss=0.2917, over 2568799.68 frames. ], batch size: 63, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:56:34,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=51230.666666666664, ans=0.0 2024-06-19 19:56:50,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=51267.333333333336, ans=0.07 2024-06-19 19:56:52,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=51267.333333333336, ans=0.2 2024-06-19 19:56:56,466 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.292e+03 2.060e+03 2.328e+03 2.775e+03 5.827e+03, threshold=4.656e+03, percent-clipped=1.0 2024-06-19 19:56:59,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=51304.0, ans=0.0 2024-06-19 19:56:59,620 INFO [train.py:1028] (1/2) Epoch 3, batch 7750, loss[loss=0.4989, simple_loss=0.4475, pruned_loss=0.2752, over 13257.00 frames. ], tot_loss[loss=0.5219, simple_loss=0.4606, pruned_loss=0.2916, over 2573355.24 frames. ], batch size: 72, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:57:20,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51340.666666666664, ans=0.1 2024-06-19 19:57:21,225 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.21 vs. 
limit=15.0 2024-06-19 19:57:22,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=51340.666666666664, ans=0.125 2024-06-19 19:57:37,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=51377.333333333336, ans=0.125 2024-06-19 19:57:40,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=51377.333333333336, ans=0.125 2024-06-19 19:57:41,356 INFO [train.py:1028] (1/2) Epoch 3, batch 7800, loss[loss=0.5061, simple_loss=0.4554, pruned_loss=0.2784, over 13118.00 frames. ], tot_loss[loss=0.5232, simple_loss=0.4621, pruned_loss=0.2921, over 2577570.11 frames. ], batch size: 95, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:57:42,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=51395.666666666664, ans=0.125 2024-06-19 19:57:49,692 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2024-06-19 19:57:51,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.88 vs. limit=15.0 2024-06-19 19:58:07,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=51450.666666666664, ans=0.0 2024-06-19 19:58:10,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=51450.666666666664, ans=0.125 2024-06-19 19:58:10,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=51450.666666666664, ans=0.2 2024-06-19 19:58:13,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51469.0, ans=0.1 2024-06-19 19:58:14,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=51469.0, ans=0.2 2024-06-19 19:58:15,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51469.0, ans=0.1 2024-06-19 19:58:16,582 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.471e+03 2.260e+03 2.550e+03 2.903e+03 7.779e+03, threshold=5.101e+03, percent-clipped=7.0 2024-06-19 19:58:18,629 INFO [train.py:1028] (1/2) Epoch 3, batch 7850, loss[loss=0.4243, simple_loss=0.3858, pruned_loss=0.2314, over 11687.00 frames. ], tot_loss[loss=0.5237, simple_loss=0.4628, pruned_loss=0.2923, over 2572473.36 frames. ], batch size: 17, lr: 1.66e-02, grad_scale: 0.5 2024-06-19 19:58:21,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=51487.333333333336, ans=0.125 2024-06-19 19:58:34,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=51524.0, ans=0.0 2024-06-19 19:58:43,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.74 vs. 
limit=22.5 2024-06-19 19:58:49,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2024-06-19 19:58:50,549 INFO [train.py:1028] (1/2) Epoch 3, batch 7900, loss[loss=0.5195, simple_loss=0.4645, pruned_loss=0.2872, over 13136.00 frames. ], tot_loss[loss=0.5214, simple_loss=0.4613, pruned_loss=0.2908, over 2572167.25 frames. ], batch size: 77, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:58:53,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.05 vs. limit=6.0 2024-06-19 19:58:55,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=51579.0, ans=0.125 2024-06-19 19:59:02,857 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.03 vs. limit=10.0 2024-06-19 19:59:05,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=51615.666666666664, ans=0.0 2024-06-19 19:59:06,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.82 vs. limit=10.0 2024-06-19 19:59:25,367 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.213e+03 2.039e+03 2.484e+03 2.855e+03 6.061e+03, threshold=4.968e+03, percent-clipped=2.0 2024-06-19 19:59:26,775 INFO [train.py:1028] (1/2) Epoch 3, batch 7950, loss[loss=0.5293, simple_loss=0.4519, pruned_loss=0.3033, over 10785.00 frames. ], tot_loss[loss=0.52, simple_loss=0.4607, pruned_loss=0.2896, over 2576016.80 frames. ], batch size: 305, lr: 1.65e-02, grad_scale: 0.5 2024-06-19 19:59:28,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=51670.666666666664, ans=0.125 2024-06-19 19:59:29,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=51670.666666666664, ans=0.125 2024-06-19 19:59:33,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=51689.0, ans=0.125 2024-06-19 19:59:41,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=51707.333333333336, ans=0.0 2024-06-19 19:59:47,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=51707.333333333336, ans=0.0 2024-06-19 19:59:56,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.14 vs. limit=15.0 2024-06-19 20:00:00,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.80 vs. limit=22.5 2024-06-19 20:00:01,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.10 vs. limit=6.0 2024-06-19 20:00:04,105 INFO [train.py:1028] (1/2) Epoch 3, batch 8000, loss[loss=0.5208, simple_loss=0.4622, pruned_loss=0.2897, over 12630.00 frames. ], tot_loss[loss=0.5219, simple_loss=0.4625, pruned_loss=0.2907, over 2572494.02 frames. 
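The Whitening lines (scaling.py:1023) compare a metric against a limit, e.g. metric=10.82 vs. limit=10.0 just above: when the covariance of a module's output drifts too far from a scaled identity, the whitening module intervenes. One self-contained way to define such a metric (this captures the general idea, it is not copied from icefall's scaling.py) is the eigenvalue-spread ratio below, which is exactly 1.0 for perfectly "white" features and grows toward the channel count as variance collapses onto a few directions:

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels). Returns a scalar >= 1.0 that equals 1.0
    # iff the feature covariance is a multiple of the identity. A sketch of
    # the idea behind the "metric=... vs. limit=..." lines, not icefall's code.
    x = x - x.mean(dim=0)                    # zero-mean per channel
    num_frames, num_channels = x.shape
    cov = (x.T @ x) / num_frames             # (C, C) covariance
    # Cauchy-Schwarz on the eigenvalues: trace(cov @ cov) * C >= trace(cov)**2,
    # with equality iff all eigenvalues are equal (fully white features).
    return (cov @ cov).trace() * num_channels / cov.trace() ** 2

white = torch.randn(1000, 384)                   # covariance ~ identity
collapsed = torch.randn(1000, 1).repeat(1, 384)  # rank-1, maximally non-white
print(whitening_metric(white))       # ~ 1.0
print(whitening_metric(collapsed))   # ~ 384.0 (the channel count)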
], batch size: 29, lr: 1.65e-02, grad_scale: 1.0 2024-06-19 20:00:12,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=51780.666666666664, ans=0.125 2024-06-19 20:00:12,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=51780.666666666664, ans=0.125 2024-06-19 20:00:16,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.80 vs. limit=15.0 2024-06-19 20:00:18,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.79 vs. limit=22.5 2024-06-19 20:00:23,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=51799.0, ans=0.125 2024-06-19 20:00:37,785 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.045e+03 2.037e+03 2.546e+03 3.145e+03 7.740e+03, threshold=5.092e+03, percent-clipped=2.0 2024-06-19 20:00:39,159 INFO [train.py:1028] (1/2) Epoch 3, batch 8050, loss[loss=0.5044, simple_loss=0.4499, pruned_loss=0.2794, over 13182.00 frames. ], tot_loss[loss=0.5199, simple_loss=0.4612, pruned_loss=0.2893, over 2571947.38 frames. ], batch size: 83, lr: 1.65e-02, grad_scale: 1.0 2024-06-19 20:00:45,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=51872.333333333336, ans=0.0 2024-06-19 20:00:49,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.87 vs. limit=15.0 2024-06-19 20:01:08,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=51927.333333333336, ans=0.1 2024-06-19 20:01:11,815 INFO [train.py:1028] (1/2) Epoch 3, batch 8100, loss[loss=0.5196, simple_loss=0.4631, pruned_loss=0.288, over 13152.00 frames. ], tot_loss[loss=0.5196, simple_loss=0.4612, pruned_loss=0.289, over 2576601.69 frames. ], batch size: 112, lr: 1.65e-02, grad_scale: 2.0 2024-06-19 20:01:12,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=51945.666666666664, ans=0.125 2024-06-19 20:01:22,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51964.0, ans=0.1 2024-06-19 20:01:25,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=51982.333333333336, ans=0.125 2024-06-19 20:01:51,661 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+03 2.296e+03 2.602e+03 2.906e+03 5.900e+03, threshold=5.204e+03, percent-clipped=2.0 2024-06-19 20:01:52,365 INFO [train.py:1028] (1/2) Epoch 3, batch 8150, loss[loss=0.4998, simple_loss=0.4421, pruned_loss=0.2787, over 13096.00 frames. ], tot_loss[loss=0.5175, simple_loss=0.4606, pruned_loss=0.2872, over 2579664.16 frames. ], batch size: 121, lr: 1.65e-02, grad_scale: 1.0 2024-06-19 20:01:54,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. 
limit=15.0 2024-06-19 20:02:11,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=52092.333333333336, ans=0.2 2024-06-19 20:02:13,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=12.0 2024-06-19 20:02:15,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=52092.333333333336, ans=0.0 2024-06-19 20:02:25,966 INFO [train.py:1028] (1/2) Epoch 3, batch 8200, loss[loss=0.5318, simple_loss=0.4742, pruned_loss=0.2946, over 13123.00 frames. ], tot_loss[loss=0.5163, simple_loss=0.4603, pruned_loss=0.2862, over 2583013.70 frames. ], batch size: 112, lr: 1.65e-02, grad_scale: 2.0 2024-06-19 20:02:28,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=52129.0, ans=0.0 2024-06-19 20:02:30,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=17.84 vs. limit=15.0 2024-06-19 20:02:33,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=52147.333333333336, ans=0.125 2024-06-19 20:02:35,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=52147.333333333336, ans=0.125 2024-06-19 20:02:43,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=52165.666666666664, ans=0.0 2024-06-19 20:02:43,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=52165.666666666664, ans=10.0 2024-06-19 20:02:46,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.64 vs. limit=15.0 2024-06-19 20:02:51,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.29 vs. limit=15.0 2024-06-19 20:03:00,661 INFO [train.py:1028] (1/2) Epoch 3, batch 8250, loss[loss=0.4944, simple_loss=0.4643, pruned_loss=0.2622, over 13269.00 frames. ], tot_loss[loss=0.5168, simple_loss=0.4607, pruned_loss=0.2864, over 2583731.48 frames. ], batch size: 52, lr: 1.65e-02, grad_scale: 0.5 2024-06-19 20:03:01,237 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.432e+03 2.041e+03 2.352e+03 2.928e+03 1.240e+04, threshold=4.704e+03, percent-clipped=5.0 2024-06-19 20:03:17,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=52257.333333333336, ans=0.125 2024-06-19 20:03:27,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52294.0, ans=0.1 2024-06-19 20:03:35,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=52294.0, ans=0.025 2024-06-19 20:03:36,787 INFO [train.py:1028] (1/2) Epoch 3, batch 8300, loss[loss=0.5066, simple_loss=0.4608, pruned_loss=0.2762, over 13000.00 frames. ], tot_loss[loss=0.513, simple_loss=0.4582, pruned_loss=0.2839, over 2580403.71 frames. 
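Most of the ScheduledFloat names above are balancer parameters (prob, min_positive, max_positive, min_abs, max_abs): zipformer "balancers" nudge per-channel activation statistics back into a target range, applied stochastically with probability prob. icefall implements this with a custom autograd function that edits gradients; as a deliberately simplified stand-in, the same constraint can be expressed as an auxiliary penalty (the thresholds and the soft-count temperature below are illustrative):

import torch

def balancer_penalty(x: torch.Tensor,
                     min_positive: float = 0.05,
                     max_abs: float = 10.0) -> torch.Tensor:
    # Toy stand-in for a zipformer-style balancer. x: (num_frames, num_channels).
    # Penalizes channels whose (soft) fraction of positive values falls below
    # min_positive, or whose mean |activation| exceeds max_abs. The real module
    # corrects gradients directly instead of adding a loss term.
    pos_frac = torch.sigmoid(x / 0.1).mean(dim=0)  # differentiable proxy for P(x > 0)
    mean_abs = x.abs().mean(dim=0)
    too_few_positive = (min_positive - pos_frac).clamp(min=0.0)
    too_large = (mean_abs - max_abs).clamp(min=0.0)
    return too_few_positive.sum() + too_large.sum()

acts = torch.randn(1000, 256) - 3.0   # mostly-negative channels violate min_positive
print(balancer_penalty(acts))         # > 0: statistics would be pushed back in range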
], batch size: 102, lr: 1.64e-02, grad_scale: 1.0 2024-06-19 20:03:38,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=52312.333333333336, ans=0.2 2024-06-19 20:03:45,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52330.666666666664, ans=0.1 2024-06-19 20:03:45,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=52330.666666666664, ans=0.125 2024-06-19 20:03:54,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.36 vs. limit=15.0 2024-06-19 20:03:56,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=52349.0, ans=0.125 2024-06-19 20:03:59,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.46 vs. limit=15.0 2024-06-19 20:04:03,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=52367.333333333336, ans=0.0 2024-06-19 20:04:03,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=52367.333333333336, ans=0.125 2024-06-19 20:04:13,350 INFO [train.py:1028] (1/2) Epoch 3, batch 8350, loss[loss=0.5046, simple_loss=0.4493, pruned_loss=0.2799, over 13145.00 frames. ], tot_loss[loss=0.5103, simple_loss=0.457, pruned_loss=0.2818, over 2581529.71 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 1.0 2024-06-19 20:04:13,883 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.728e+02 1.698e+03 2.012e+03 2.412e+03 8.628e+03, threshold=4.023e+03, percent-clipped=1.0 2024-06-19 20:04:21,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=52422.333333333336, ans=0.125 2024-06-19 20:04:23,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.83 vs. limit=10.0 2024-06-19 20:04:25,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=52422.333333333336, ans=0.035 2024-06-19 20:04:26,942 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=5.792e+02 2024-06-19 20:04:38,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=52459.0, ans=0.125 2024-06-19 20:04:38,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=21.74 vs. limit=15.0 2024-06-19 20:04:39,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=52477.333333333336, ans=0.025 2024-06-19 20:04:41,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52477.333333333336, ans=0.1 2024-06-19 20:04:42,136 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.29 vs. 
limit=15.0 2024-06-19 20:04:47,153 INFO [train.py:1028] (1/2) Epoch 3, batch 8400, loss[loss=0.4331, simple_loss=0.4076, pruned_loss=0.2293, over 12994.00 frames. ], tot_loss[loss=0.51, simple_loss=0.4571, pruned_loss=0.2814, over 2577821.31 frames. ], batch size: 39, lr: 1.64e-02, grad_scale: 2.0 2024-06-19 20:04:56,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.15 vs. limit=15.0 2024-06-19 20:05:04,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=52532.333333333336, ans=0.0 2024-06-19 20:05:07,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=52550.666666666664, ans=0.0 2024-06-19 20:05:17,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=52569.0, ans=0.125 2024-06-19 20:05:20,250 INFO [train.py:1028] (1/2) Epoch 3, batch 8450, loss[loss=0.5122, simple_loss=0.4603, pruned_loss=0.2821, over 13186.00 frames. ], tot_loss[loss=0.5092, simple_loss=0.4575, pruned_loss=0.2805, over 2579174.08 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 1.0 2024-06-19 20:05:20,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=52587.333333333336, ans=0.2 2024-06-19 20:05:21,559 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.776e+02 1.637e+03 1.870e+03 2.198e+03 3.721e+03, threshold=3.739e+03, percent-clipped=0.0 2024-06-19 20:05:42,802 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.77 vs. limit=15.0 2024-06-19 20:05:53,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52660.666666666664, ans=0.1 2024-06-19 20:05:56,043 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=4.127e+02 2024-06-19 20:05:58,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=52660.666666666664, ans=0.025 2024-06-19 20:06:01,370 INFO [train.py:1028] (1/2) Epoch 3, batch 8500, loss[loss=0.5497, simple_loss=0.4866, pruned_loss=0.3064, over 12626.00 frames. ], tot_loss[loss=0.5098, simple_loss=0.4585, pruned_loss=0.2805, over 2577506.57 frames. ], batch size: 29, lr: 1.64e-02, grad_scale: 2.0 2024-06-19 20:06:19,705 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.36 vs. limit=6.0 2024-06-19 20:06:25,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=52734.0, ans=0.125 2024-06-19 20:06:36,151 INFO [train.py:1028] (1/2) Epoch 3, batch 8550, loss[loss=0.4268, simple_loss=0.4101, pruned_loss=0.2218, over 12740.00 frames. ], tot_loss[loss=0.5064, simple_loss=0.4564, pruned_loss=0.2782, over 2575292.28 frames. 
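The periodic WARNING lines from optim.py:487 (Clipping_scale=2.0, grad-norm quartiles ..., threshold=..., percent-clipped=...) come from adaptive gradient clipping in the optimizer: it keeps a history of recent gradient norms, logs their spread (the five ascending values read like min/25%/50%/75%/max), and clips against a threshold derived from that history. The sketch below shows one plausible mechanism of that shape, threshold = clipping_scale x median of recent norms; the exact rule lives in icefall's optim.py (ScaledAdam), and the history length here is an assumption:

import collections
import statistics
import torch

class QuartileClipper:
    # Plausible reconstruction of the behaviour behind the
    # "grad-norm quartiles ... threshold=... percent-clipped=..." warnings.
    def __init__(self, model, clipping_scale=2.0, history=128):
        self.params = [p for p in model.parameters() if p.requires_grad]
        self.clipping_scale = clipping_scale
        self.norms = collections.deque(maxlen=history)
        self.num_steps = 0
        self.num_clipped = 0

    def clip_(self):
        grads = [p.grad.norm() for p in self.params if p.grad is not None]
        norm = torch.norm(torch.stack(grads)).item()   # global grad norm
        self.norms.append(norm)
        self.num_steps += 1
        threshold = self.clipping_scale * statistics.median(self.norms)
        if norm > threshold:                           # scale all grads down
            self.num_clipped += 1
            for p in self.params:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)

    def stats(self):
        quartiles = statistics.quantiles(self.norms, n=4)  # Q1, median, Q3
        return quartiles, 100.0 * self.num_clipped / max(1, self.num_steps)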
], batch size: 22, lr: 1.64e-02, grad_scale: 2.0 2024-06-19 20:06:37,393 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.013e+03 1.659e+03 2.071e+03 2.530e+03 8.065e+03, threshold=4.143e+03, percent-clipped=4.0 2024-06-19 20:06:37,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.38 vs. limit=10.0 2024-06-19 20:06:38,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=52770.666666666664, ans=0.0 2024-06-19 20:06:39,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=52770.666666666664, ans=0.125 2024-06-19 20:06:50,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=52807.333333333336, ans=0.125 2024-06-19 20:06:54,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52807.333333333336, ans=0.1 2024-06-19 20:06:55,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=18.83 vs. limit=15.0 2024-06-19 20:07:10,232 INFO [train.py:1028] (1/2) Epoch 3, batch 8600, loss[loss=0.5339, simple_loss=0.4748, pruned_loss=0.2965, over 13072.00 frames. ], tot_loss[loss=0.5074, simple_loss=0.4569, pruned_loss=0.2789, over 2573165.70 frames. ], batch size: 121, lr: 1.64e-02, grad_scale: 1.0 2024-06-19 20:07:22,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=52880.666666666664, ans=0.5 2024-06-19 20:07:31,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=52917.333333333336, ans=0.2 2024-06-19 20:07:36,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=52935.666666666664, ans=0.025 2024-06-19 20:07:46,884 INFO [train.py:1028] (1/2) Epoch 3, batch 8650, loss[loss=0.5221, simple_loss=0.4677, pruned_loss=0.2882, over 13209.00 frames. ], tot_loss[loss=0.506, simple_loss=0.4561, pruned_loss=0.278, over 2576071.79 frames. ], batch size: 103, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:07:49,399 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.340e+03 2.132e+03 2.660e+03 3.051e+03 5.530e+03, threshold=5.321e+03, percent-clipped=7.0 2024-06-19 20:08:02,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=52972.333333333336, ans=0.125 2024-06-19 20:08:02,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=52972.333333333336, ans=0.05 2024-06-19 20:08:05,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52990.666666666664, ans=0.1 2024-06-19 20:08:08,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.62 vs. limit=15.0 2024-06-19 20:08:23,033 INFO [train.py:1028] (1/2) Epoch 3, batch 8700, loss[loss=0.5457, simple_loss=0.5, pruned_loss=0.2957, over 13201.00 frames. 
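use_fp16: True in the config, together with the grad_scale values bouncing between 0.5, 1.0 and 2.0 in the batch lines, is ordinary mixed-precision dynamics: the loss scale grows while optimizer steps succeed and is halved whenever scaled gradients overflow to inf/NaN. A generic PyTorch pattern with the same shape (placeholder model and optimizer, not the recipe's actual training loop; assumes a GPU, as this run has):

import torch

model = torch.nn.Linear(80, 5000).cuda()      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=1.0, growth_interval=2000)

for step in range(100):
    x = torch.randn(16, 80, device="cuda")
    with torch.cuda.amp.autocast():           # forward in fp16 where safe
        loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # step is skipped on overflow
    scaler.update()                           # grows or halves the scale
    # scaler.get_scale() is the analogue of the logged grad_scale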
], tot_loss[loss=0.5084, simple_loss=0.4577, pruned_loss=0.2796, over 2571777.66 frames. ], batch size: 59, lr: 1.63e-02, grad_scale: 2.0 2024-06-19 20:08:25,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=53045.666666666664, ans=0.025 2024-06-19 20:08:28,301 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.99 vs. limit=15.0 2024-06-19 20:08:30,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=53064.0, ans=0.125 2024-06-19 20:08:38,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=53082.333333333336, ans=0.025 2024-06-19 20:08:39,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=53082.333333333336, ans=0.05 2024-06-19 20:08:42,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=53100.666666666664, ans=0.0 2024-06-19 20:08:53,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=53119.0, ans=0.125 2024-06-19 20:08:57,112 INFO [train.py:1028] (1/2) Epoch 3, batch 8750, loss[loss=0.495, simple_loss=0.442, pruned_loss=0.274, over 13132.00 frames. ], tot_loss[loss=0.5084, simple_loss=0.4577, pruned_loss=0.2795, over 2569131.24 frames. ], batch size: 121, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:08:59,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.87 vs. limit=22.5 2024-06-19 20:09:00,513 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.614e+03 2.456e+03 2.786e+03 3.379e+03 7.357e+03, threshold=5.571e+03, percent-clipped=3.0 2024-06-19 20:09:11,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=53174.0, ans=0.125 2024-06-19 20:09:15,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=16.33 vs. limit=15.0 2024-06-19 20:09:30,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=53229.0, ans=0.025 2024-06-19 20:09:31,037 INFO [train.py:1028] (1/2) Epoch 3, batch 8800, loss[loss=0.4781, simple_loss=0.433, pruned_loss=0.2616, over 13201.00 frames. ], tot_loss[loss=0.5078, simple_loss=0.4575, pruned_loss=0.279, over 2574591.67 frames. 
], batch size: 72, lr: 1.63e-02, grad_scale: 0.5 2024-06-19 20:09:43,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=53247.333333333336, ans=0.125 2024-06-19 20:09:48,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=53247.333333333336, ans=0.0 2024-06-19 20:10:02,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=53284.0, ans=0.125 2024-06-19 20:10:06,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53302.333333333336, ans=0.1 2024-06-19 20:10:07,414 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.15 vs. limit=10.0 2024-06-19 20:10:11,894 INFO [train.py:1028] (1/2) Epoch 3, batch 8850, loss[loss=0.5951, simple_loss=0.5112, pruned_loss=0.3395, over 12526.00 frames. ], tot_loss[loss=0.5074, simple_loss=0.4569, pruned_loss=0.279, over 2563140.55 frames. ], batch size: 202, lr: 1.63e-02, grad_scale: 0.5 2024-06-19 20:10:13,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.88 vs. limit=15.0 2024-06-19 20:10:13,754 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.30 vs. limit=15.0 2024-06-19 20:10:16,718 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.874e+03 2.764e+03 3.391e+03 3.865e+03 9.837e+03, threshold=6.782e+03, percent-clipped=6.0 2024-06-19 20:10:18,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=53339.0, ans=0.0 2024-06-19 20:10:21,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=53339.0, ans=0.0 2024-06-19 20:10:21,428 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.65 vs. limit=15.0 2024-06-19 20:10:23,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.79 vs. limit=22.5 2024-06-19 20:10:28,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2024-06-19 20:10:34,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=53375.666666666664, ans=0.125 2024-06-19 20:10:35,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=53375.666666666664, ans=0.125 2024-06-19 20:10:45,382 INFO [train.py:1028] (1/2) Epoch 3, batch 8900, loss[loss=0.5374, simple_loss=0.4837, pruned_loss=0.2956, over 13009.00 frames. ], tot_loss[loss=0.5072, simple_loss=0.4567, pruned_loss=0.2789, over 2562328.55 frames. 
], batch size: 33, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:10:45,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=53412.333333333336, ans=0.0 2024-06-19 20:10:51,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=53430.666666666664, ans=0.125 2024-06-19 20:10:55,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=53430.666666666664, ans=0.125 2024-06-19 20:11:18,360 INFO [train.py:1028] (1/2) Epoch 3, batch 8950, loss[loss=0.568, simple_loss=0.4885, pruned_loss=0.3237, over 12564.00 frames. ], tot_loss[loss=0.505, simple_loss=0.4558, pruned_loss=0.2771, over 2562118.91 frames. ], batch size: 202, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:11:22,950 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.211e+03 1.959e+03 2.261e+03 2.595e+03 5.998e+03, threshold=4.523e+03, percent-clipped=0.0 2024-06-19 20:11:23,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=53504.0, ans=0.125 2024-06-19 20:11:32,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=53540.666666666664, ans=0.125 2024-06-19 20:11:42,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=53559.0, ans=0.125 2024-06-19 20:11:47,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=53559.0, ans=0.125 2024-06-19 20:11:57,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=15.0 2024-06-19 20:11:59,003 INFO [train.py:1028] (1/2) Epoch 3, batch 9000, loss[loss=0.4532, simple_loss=0.4311, pruned_loss=0.2377, over 13349.00 frames. ], tot_loss[loss=0.5021, simple_loss=0.4544, pruned_loss=0.2749, over 2567873.80 frames. ], batch size: 46, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:11:59,003 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 20:12:07,111 INFO [train.py:1060] (1/2) Epoch 3, validation: loss=0.328, simple_loss=0.3513, pruned_loss=0.1524, over 351949.00 frames. 2024-06-19 20:12:07,111 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB 2024-06-19 20:12:07,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.54 vs. 
limit=22.5 2024-06-19 20:12:24,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=53632.333333333336, ans=0.0 2024-06-19 20:12:25,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=53632.333333333336, ans=0.0 2024-06-19 20:12:28,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=53650.666666666664, ans=0.125 2024-06-19 20:12:29,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=53650.666666666664, ans=0.0 2024-06-19 20:12:31,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=53650.666666666664, ans=0.125 2024-06-19 20:12:31,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=53650.666666666664, ans=0.0 2024-06-19 20:12:38,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=53669.0, ans=0.2 2024-06-19 20:12:39,594 INFO [train.py:1028] (1/2) Epoch 3, batch 9050, loss[loss=0.5019, simple_loss=0.4538, pruned_loss=0.275, over 11125.00 frames. ], tot_loss[loss=0.5018, simple_loss=0.4546, pruned_loss=0.2745, over 2566431.40 frames. ], batch size: 16, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:12:42,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=53687.333333333336, ans=0.09899494936611666 2024-06-19 20:12:44,599 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.232e+03 1.899e+03 2.228e+03 2.618e+03 1.177e+04, threshold=4.455e+03, percent-clipped=3.0 2024-06-19 20:12:45,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0 2024-06-19 20:12:51,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=53724.0, ans=0.125 2024-06-19 20:12:56,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=53724.0, ans=0.0 2024-06-19 20:12:59,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=53742.333333333336, ans=0.05 2024-06-19 20:13:04,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.86 vs. limit=22.5 2024-06-19 20:13:07,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=53760.666666666664, ans=0.0 2024-06-19 20:13:08,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=53760.666666666664, ans=0.125 2024-06-19 20:13:12,086 INFO [train.py:1028] (1/2) Epoch 3, batch 9100, loss[loss=0.4867, simple_loss=0.4501, pruned_loss=0.2617, over 13278.00 frames. ], tot_loss[loss=0.5, simple_loss=0.4535, pruned_loss=0.2732, over 2568685.94 frames. 
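In each batch line, loss[... over N frames] is the current batch while tot_loss[... over 2568685.94 frames] aggregates recent history; the fractional frame counts are a giveaway that the aggregate is decayed each step rather than summed outright. A small frame-weighted running average of that shape (the decay constant is an assumption; icefall tracks this with a MetricsTracker in train.py):

class RunningLoss:
    # Decayed, frame-weighted running loss of the shape logged as
    # "tot_loss[loss=... over N frames]". decay=0.999 is illustrative.
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed frame count (hence fractional)

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames   # the reported tot_loss

tracker = RunningLoss()
for loss, frames in [(0.53, 12833.0), (0.51, 13100.0), (0.49, 12888.0)]:
    print(tracker.update(loss, frames))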
], batch size: 72, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:13:24,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=53815.666666666664, ans=0.05 2024-06-19 20:13:32,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.98 vs. limit=12.0 2024-06-19 20:13:35,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=53834.0, ans=0.125 2024-06-19 20:13:44,002 INFO [train.py:1028] (1/2) Epoch 3, batch 9150, loss[loss=0.4821, simple_loss=0.4456, pruned_loss=0.2593, over 13182.00 frames. ], tot_loss[loss=0.4989, simple_loss=0.4532, pruned_loss=0.2723, over 2568888.19 frames. ], batch size: 77, lr: 1.62e-02, grad_scale: 0.5 2024-06-19 20:13:50,210 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.324e+03 1.889e+03 2.207e+03 2.487e+03 6.311e+03, threshold=4.415e+03, percent-clipped=3.0 2024-06-19 20:13:52,485 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:14:03,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=53925.666666666664, ans=0.0 2024-06-19 20:14:07,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2024-06-19 20:14:09,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=53944.0, ans=0.025 2024-06-19 20:14:16,275 INFO [train.py:1028] (1/2) Epoch 3, batch 9200, loss[loss=0.4924, simple_loss=0.4589, pruned_loss=0.2629, over 12973.00 frames. ], tot_loss[loss=0.4955, simple_loss=0.4516, pruned_loss=0.2697, over 2573120.79 frames. ], batch size: 36, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:14:18,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.32 vs. limit=10.0 2024-06-19 20:14:21,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=53980.666666666664, ans=0.025 2024-06-19 20:14:31,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=53999.0, ans=0.0 2024-06-19 20:14:32,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=53999.0, ans=0.125 2024-06-19 20:14:47,997 INFO [train.py:1028] (1/2) Epoch 3, batch 9250, loss[loss=0.4682, simple_loss=0.4313, pruned_loss=0.2526, over 13190.00 frames. ], tot_loss[loss=0.4939, simple_loss=0.4506, pruned_loss=0.2686, over 2573409.05 frames. ], batch size: 67, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:14:54,428 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.053e+03 1.831e+03 2.143e+03 2.534e+03 4.902e+03, threshold=4.286e+03, percent-clipped=2.0 2024-06-19 20:14:55,515 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.97 vs. 
limit=15.0 2024-06-19 20:14:57,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54072.333333333336, ans=0.1 2024-06-19 20:15:01,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=54090.666666666664, ans=0.1 2024-06-19 20:15:15,101 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.59 vs. limit=22.5 2024-06-19 20:15:21,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=54127.333333333336, ans=0.025 2024-06-19 20:15:23,050 INFO [train.py:1028] (1/2) Epoch 3, batch 9300, loss[loss=0.4439, simple_loss=0.4154, pruned_loss=0.2362, over 12927.00 frames. ], tot_loss[loss=0.4922, simple_loss=0.4497, pruned_loss=0.2673, over 2571666.09 frames. ], batch size: 39, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:15:23,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=54145.666666666664, ans=0.0 2024-06-19 20:15:39,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=54182.333333333336, ans=10.0 2024-06-19 20:15:39,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=54182.333333333336, ans=0.0 2024-06-19 20:15:42,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=54182.333333333336, ans=0.0 2024-06-19 20:15:42,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=42.98 vs. limit=15.0 2024-06-19 20:15:50,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=54219.0, ans=0.1 2024-06-19 20:15:53,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54219.0, ans=0.1 2024-06-19 20:15:55,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=54219.0, ans=0.025 2024-06-19 20:15:56,056 INFO [train.py:1028] (1/2) Epoch 3, batch 9350, loss[loss=0.4274, simple_loss=0.3962, pruned_loss=0.2293, over 12465.00 frames. ], tot_loss[loss=0.4933, simple_loss=0.4503, pruned_loss=0.2681, over 2568323.72 frames. ], batch size: 22, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:15:58,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=54237.333333333336, ans=0.2 2024-06-19 20:15:59,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=54237.333333333336, ans=0.125 2024-06-19 20:15:59,979 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.81 vs. 
limit=15.0 2024-06-19 20:16:02,723 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.164e+03 1.678e+03 2.009e+03 2.319e+03 4.280e+03, threshold=4.019e+03, percent-clipped=0.0 2024-06-19 20:16:06,668 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.09 vs. limit=22.5 2024-06-19 20:16:15,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54292.333333333336, ans=0.1 2024-06-19 20:16:15,745 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.88 vs. limit=6.0 2024-06-19 20:16:20,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.80 vs. limit=10.0 2024-06-19 20:16:22,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=54310.666666666664, ans=10.0 2024-06-19 20:16:25,981 INFO [train.py:1028] (1/2) Epoch 3, batch 9400, loss[loss=0.4974, simple_loss=0.4618, pruned_loss=0.2665, over 13327.00 frames. ], tot_loss[loss=0.4938, simple_loss=0.4508, pruned_loss=0.2684, over 2567307.14 frames. ], batch size: 52, lr: 1.62e-02, grad_scale: 2.0 2024-06-19 20:16:31,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=54347.333333333336, ans=0.125 2024-06-19 20:16:32,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=54347.333333333336, ans=0.125 2024-06-19 20:16:37,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=54365.666666666664, ans=0.0 2024-06-19 20:16:40,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54365.666666666664, ans=0.1 2024-06-19 20:16:54,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=54402.333333333336, ans=0.2 2024-06-19 20:16:56,571 INFO [train.py:1028] (1/2) Epoch 3, batch 9450, loss[loss=0.5474, simple_loss=0.4853, pruned_loss=0.3048, over 12622.00 frames. ], tot_loss[loss=0.4948, simple_loss=0.4516, pruned_loss=0.269, over 2567893.63 frames. ], batch size: 22, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:16:58,698 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.57 vs. limit=10.0 2024-06-19 20:16:58,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.83 vs. limit=12.0 2024-06-19 20:16:59,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=54420.666666666664, ans=0.125 2024-06-19 20:17:02,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.01 vs. 
limit=22.5 2024-06-19 20:17:04,047 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.163e+03 1.719e+03 2.027e+03 2.480e+03 4.764e+03, threshold=4.054e+03, percent-clipped=4.0 2024-06-19 20:17:20,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.36 vs. limit=15.0 2024-06-19 20:17:20,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=54494.0, ans=0.125 2024-06-19 20:17:24,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.96 vs. limit=10.0 2024-06-19 20:17:24,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.89 vs. limit=15.0 2024-06-19 20:17:26,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54512.333333333336, ans=0.1 2024-06-19 20:17:26,793 INFO [train.py:1028] (1/2) Epoch 3, batch 9500, loss[loss=0.4717, simple_loss=0.4433, pruned_loss=0.25, over 13200.00 frames. ], tot_loss[loss=0.4918, simple_loss=0.45, pruned_loss=0.2668, over 2576449.25 frames. ], batch size: 43, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:17:30,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.83 vs. limit=15.0 2024-06-19 20:17:30,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.93 vs. limit=12.0 2024-06-19 20:17:32,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.51 vs. limit=22.5 2024-06-19 20:17:34,102 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.474e+01 2024-06-19 20:17:46,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=54567.333333333336, ans=0.125 2024-06-19 20:17:48,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=54567.333333333336, ans=0.125 2024-06-19 20:17:56,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.44 vs. limit=6.0 2024-06-19 20:17:57,135 INFO [train.py:1028] (1/2) Epoch 3, batch 9550, loss[loss=0.5069, simple_loss=0.4702, pruned_loss=0.2718, over 12970.00 frames. ], tot_loss[loss=0.4899, simple_loss=0.4488, pruned_loss=0.2655, over 2572726.31 frames. ], batch size: 39, lr: 1.61e-02, grad_scale: 1.0 2024-06-19 20:18:00,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=54604.0, ans=15.0 2024-06-19 20:18:05,541 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.095e+03 1.587e+03 1.878e+03 2.169e+03 5.493e+03, threshold=3.756e+03, percent-clipped=3.0 2024-06-19 20:18:07,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.60 vs. 
limit=6.0 2024-06-19 20:18:27,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=54677.333333333336, ans=0.2 2024-06-19 20:18:29,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=54677.333333333336, ans=0.0 2024-06-19 20:18:31,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54677.333333333336, ans=0.1 2024-06-19 20:18:32,427 INFO [train.py:1028] (1/2) Epoch 3, batch 9600, loss[loss=0.5558, simple_loss=0.4739, pruned_loss=0.3188, over 10500.00 frames. ], tot_loss[loss=0.4882, simple_loss=0.4472, pruned_loss=0.2646, over 2570987.63 frames. ], batch size: 303, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:18:34,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=54695.666666666664, ans=0.125 2024-06-19 20:18:35,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=54695.666666666664, ans=0.05 2024-06-19 20:18:35,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.76 vs. limit=15.0 2024-06-19 20:18:46,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=54732.333333333336, ans=0.0 2024-06-19 20:18:52,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=54750.666666666664, ans=0.5 2024-06-19 20:18:53,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.34 vs. limit=15.0 2024-06-19 20:18:54,946 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.70 vs. limit=22.5 2024-06-19 20:18:55,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=54750.666666666664, ans=0.025 2024-06-19 20:19:01,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=54769.0, ans=0.0 2024-06-19 20:19:02,297 INFO [train.py:1028] (1/2) Epoch 3, batch 9650, loss[loss=0.5039, simple_loss=0.4462, pruned_loss=0.2809, over 13113.00 frames. ], tot_loss[loss=0.4885, simple_loss=0.447, pruned_loss=0.265, over 2561973.54 frames. ], batch size: 132, lr: 1.61e-02, grad_scale: 1.0 2024-06-19 20:19:10,620 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.208e+02 1.450e+03 1.703e+03 2.023e+03 5.489e+03, threshold=3.407e+03, percent-clipped=2.0 2024-06-19 20:19:30,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=54860.666666666664, ans=0.125 2024-06-19 20:19:31,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=54860.666666666664, ans=0.125 2024-06-19 20:19:32,614 INFO [train.py:1028] (1/2) Epoch 3, batch 9700, loss[loss=0.4622, simple_loss=0.4249, pruned_loss=0.2497, over 13072.00 frames. ], tot_loss[loss=0.4872, simple_loss=0.4462, pruned_loss=0.2641, over 2557043.86 frames. 
], batch size: 144, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:19:35,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=54879.0, ans=0.05 2024-06-19 20:19:42,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=54897.333333333336, ans=0.125 2024-06-19 20:19:43,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=54897.333333333336, ans=0.125 2024-06-19 20:19:44,484 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.64 vs. limit=12.0 2024-06-19 20:19:46,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=54915.666666666664, ans=0.95 2024-06-19 20:20:01,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=54952.333333333336, ans=0.125 2024-06-19 20:20:02,970 INFO [train.py:1028] (1/2) Epoch 3, batch 9750, loss[loss=0.4646, simple_loss=0.4257, pruned_loss=0.2518, over 13099.00 frames. ], tot_loss[loss=0.4848, simple_loss=0.445, pruned_loss=0.2623, over 2553525.88 frames. ], batch size: 132, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:20:13,814 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.686e+02 1.765e+03 2.023e+03 2.264e+03 5.206e+03, threshold=4.045e+03, percent-clipped=3.0 2024-06-19 20:20:25,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=55025.666666666664, ans=0.125 2024-06-19 20:20:30,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=55025.666666666664, ans=0.0 2024-06-19 20:20:30,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.82 vs. limit=22.5 2024-06-19 20:20:34,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=55044.0, ans=0.125 2024-06-19 20:20:35,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=55044.0, ans=0.0 2024-06-19 20:20:36,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=17.53 vs. limit=15.0 2024-06-19 20:20:37,710 INFO [train.py:1028] (1/2) Epoch 3, batch 9800, loss[loss=0.4323, simple_loss=0.4063, pruned_loss=0.2291, over 12970.00 frames. ], tot_loss[loss=0.4821, simple_loss=0.4431, pruned_loss=0.2606, over 2546580.88 frames. 
], batch size: 39, lr: 1.61e-02, grad_scale: 4.0 2024-06-19 20:20:42,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=55062.333333333336, ans=15.0 2024-06-19 20:20:45,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55080.666666666664, ans=0.1 2024-06-19 20:20:48,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=55080.666666666664, ans=0.125 2024-06-19 20:21:08,133 INFO [train.py:1028] (1/2) Epoch 3, batch 9850, loss[loss=0.4796, simple_loss=0.445, pruned_loss=0.257, over 13057.00 frames. ], tot_loss[loss=0.4813, simple_loss=0.4422, pruned_loss=0.2602, over 2539131.17 frames. ], batch size: 102, lr: 1.60e-02, grad_scale: 1.0 2024-06-19 20:21:08,476 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.29 vs. limit=10.0 2024-06-19 20:21:09,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=55154.0, ans=0.125 2024-06-19 20:21:12,633 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.44 vs. limit=22.5 2024-06-19 20:21:13,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=55172.333333333336, ans=0.2 2024-06-19 20:21:17,609 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.429e+03 2.086e+03 2.480e+03 2.951e+03 6.745e+03, threshold=4.961e+03, percent-clipped=5.0 2024-06-19 20:21:21,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.01 vs. limit=6.0 2024-06-19 20:21:24,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=55190.666666666664, ans=10.0 2024-06-19 20:21:25,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.19 vs. limit=12.0 2024-06-19 20:21:32,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=55227.333333333336, ans=0.07 2024-06-19 20:21:36,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=55227.333333333336, ans=0.0 2024-06-19 20:21:40,612 INFO [train.py:1028] (1/2) Epoch 3, batch 9900, loss[loss=0.4843, simple_loss=0.4411, pruned_loss=0.2638, over 12977.00 frames. ], tot_loss[loss=0.4818, simple_loss=0.442, pruned_loss=0.2607, over 2532351.72 frames. ], batch size: 39, lr: 1.60e-02, grad_scale: 1.0 2024-06-19 20:21:42,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.85 vs. limit=15.0 2024-06-19 20:21:43,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.61 vs. limit=15.0 2024-06-19 20:21:53,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.39 vs. 
limit=10.0 2024-06-19 20:21:54,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=55282.333333333336, ans=0.0 2024-06-19 20:22:00,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=55300.666666666664, ans=0.0 2024-06-19 20:22:01,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=55300.666666666664, ans=0.125 2024-06-19 20:22:07,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=55319.0, ans=0.125 2024-06-19 20:22:09,281 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.67 vs. limit=15.0 2024-06-19 20:22:10,753 INFO [train.py:1028] (1/2) Epoch 3, batch 9950, loss[loss=0.5493, simple_loss=0.4941, pruned_loss=0.3022, over 12614.00 frames. ], tot_loss[loss=0.4836, simple_loss=0.4421, pruned_loss=0.2625, over 2524195.04 frames. ], batch size: 29, lr: 1.60e-02, grad_scale: 1.0 2024-06-19 20:22:12,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=55337.333333333336, ans=0.125 2024-06-19 20:22:13,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.61 vs. limit=10.0 2024-06-19 20:22:21,271 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.380e+03 2.908e+03 3.430e+03 4.325e+03 8.261e+03, threshold=6.861e+03, percent-clipped=16.0 2024-06-19 20:22:32,736 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.35 vs. limit=10.0 2024-06-19 20:22:36,581 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.14 vs. limit=15.0 2024-06-19 20:22:43,486 INFO [train.py:1028] (1/2) Epoch 3, batch 10000, loss[loss=0.4577, simple_loss=0.4364, pruned_loss=0.2395, over 12488.00 frames. ], tot_loss[loss=0.4854, simple_loss=0.443, pruned_loss=0.2639, over 2487163.42 frames. ], batch size: 22, lr: 1.60e-02, grad_scale: 1.0 2024-06-19 20:22:53,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=55447.333333333336, ans=0.025 2024-06-19 20:22:56,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.83 vs. limit=12.0 2024-06-19 20:22:57,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.24 vs. limit=15.0 2024-06-19 20:23:00,405 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:23:00,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.05 vs. 
limit=10.0 2024-06-19 20:23:03,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=55484.0, ans=0.0 2024-06-19 20:23:06,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.93 vs. limit=6.0 2024-06-19 20:23:10,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=55502.333333333336, ans=0.2 2024-06-19 20:23:14,466 INFO [train.py:1028] (1/2) Epoch 3, batch 10050, loss[loss=0.5121, simple_loss=0.4773, pruned_loss=0.2735, over 12314.00 frames. ], tot_loss[loss=0.4865, simple_loss=0.443, pruned_loss=0.265, over 2444465.63 frames. ], batch size: 22, lr: 1.60e-02, grad_scale: 0.5 2024-06-19 20:23:19,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=55520.666666666664, ans=0.0 2024-06-19 20:23:26,056 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+03 2.953e+03 3.395e+03 3.963e+03 7.510e+03, threshold=6.790e+03, percent-clipped=1.0 2024-06-19 20:23:26,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55557.333333333336, ans=0.1 2024-06-19 20:23:31,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55557.333333333336, ans=0.1 2024-06-19 20:23:31,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.43 vs. limit=10.0 2024-06-19 20:23:40,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=19.07 vs. limit=15.0 2024-06-19 20:23:44,772 INFO [train.py:1028] (1/2) Epoch 3, batch 10100, loss[loss=0.4115, simple_loss=0.3843, pruned_loss=0.2194, over 11017.00 frames. ], tot_loss[loss=0.4855, simple_loss=0.4423, pruned_loss=0.2643, over 2425523.35 frames. ], batch size: 16, lr: 1.60e-02, grad_scale: 1.0 2024-06-19 20:26:02,116 INFO [train.py:1028] (1/2) Epoch 4, batch 0, loss[loss=0.4631, simple_loss=0.4351, pruned_loss=0.2455, over 12956.00 frames. ], tot_loss[loss=0.4631, simple_loss=0.4351, pruned_loss=0.2455, over 12956.00 frames. ], batch size: 36, lr: 1.49e-02, grad_scale: 2.0 2024-06-19 20:26:02,116 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 20:26:09,119 INFO [train.py:1060] (1/2) Epoch 4, validation: loss=0.321, simple_loss=0.3487, pruned_loss=0.1466, over 351949.00 frames. 
2024-06-19 20:26:09,120 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB 2024-06-19 20:26:25,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=55678.333333333336, ans=0.125 2024-06-19 20:26:28,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=55678.333333333336, ans=0.125 2024-06-19 20:26:28,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=55678.333333333336, ans=0.125 2024-06-19 20:26:35,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=55696.666666666664, ans=0.125 2024-06-19 20:26:40,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55715.0, ans=0.1 2024-06-19 20:26:42,107 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.46 vs. limit=15.0 2024-06-19 20:26:45,881 INFO [train.py:1028] (1/2) Epoch 4, batch 50, loss[loss=0.4245, simple_loss=0.3961, pruned_loss=0.2264, over 12797.00 frames. ], tot_loss[loss=0.4508, simple_loss=0.4141, pruned_loss=0.2437, over 574227.48 frames. ], batch size: 29, lr: 1.49e-02, grad_scale: 0.5 2024-06-19 20:26:47,588 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.18 vs. limit=22.5 2024-06-19 20:26:49,110 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.393e+03 2.352e+03 2.933e+03 3.453e+03 4.888e+03, threshold=5.865e+03, percent-clipped=0.0 2024-06-19 20:26:50,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.75 vs. limit=22.5 2024-06-19 20:27:03,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=55770.0, ans=0.125 2024-06-19 20:27:05,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=55788.333333333336, ans=0.125 2024-06-19 20:27:11,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2024-06-19 20:27:16,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=55806.666666666664, ans=0.0 2024-06-19 20:27:17,648 INFO [train.py:1028] (1/2) Epoch 4, batch 100, loss[loss=0.4006, simple_loss=0.387, pruned_loss=0.2071, over 13285.00 frames. ], tot_loss[loss=0.4494, simple_loss=0.4126, pruned_loss=0.2431, over 1017592.49 frames. ], batch size: 46, lr: 1.49e-02, grad_scale: 1.0 2024-06-19 20:27:19,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=55825.0, ans=0.025 2024-06-19 20:27:21,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=55825.0, ans=0.0 2024-06-19 20:27:21,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.16 vs. 
limit=10.0 2024-06-19 20:27:23,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=12.0 2024-06-19 20:27:39,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=55880.0, ans=0.125 2024-06-19 20:27:40,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.68 vs. limit=10.0 2024-06-19 20:27:45,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=55898.333333333336, ans=0.0 2024-06-19 20:27:50,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=55898.333333333336, ans=0.125 2024-06-19 20:27:52,396 INFO [train.py:1028] (1/2) Epoch 4, batch 150, loss[loss=0.3864, simple_loss=0.3758, pruned_loss=0.1985, over 12685.00 frames. ], tot_loss[loss=0.4414, simple_loss=0.408, pruned_loss=0.2374, over 1365083.41 frames. ], batch size: 29, lr: 1.49e-02, grad_scale: 1.0 2024-06-19 20:27:53,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=8.0 2024-06-19 20:27:55,443 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+03 2.915e+03 3.417e+03 3.933e+03 1.034e+04, threshold=6.833e+03, percent-clipped=5.0 2024-06-19 20:27:55,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=55916.666666666664, ans=0.125 2024-06-19 20:27:56,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=55916.666666666664, ans=0.0 2024-06-19 20:27:56,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=55916.666666666664, ans=0.125 2024-06-19 20:28:17,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=55971.666666666664, ans=0.0 2024-06-19 20:28:19,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.98 vs. limit=15.0 2024-06-19 20:28:25,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=55990.0, ans=0.0 2024-06-19 20:28:27,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=55990.0, ans=0.125 2024-06-19 20:28:28,760 INFO [train.py:1028] (1/2) Epoch 4, batch 200, loss[loss=0.5203, simple_loss=0.4507, pruned_loss=0.2949, over 12517.00 frames. ], tot_loss[loss=0.4427, simple_loss=0.4087, pruned_loss=0.2384, over 1635245.11 frames. ], batch size: 202, lr: 1.49e-02, grad_scale: 2.0 2024-06-19 20:28:32,919 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.81 vs. limit=15.0 2024-06-19 20:28:38,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.37 vs. 
limit=22.5 2024-06-19 20:28:41,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=56045.0, ans=0.125 2024-06-19 20:28:44,387 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.54 vs. limit=15.0 2024-06-19 20:28:50,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=56063.333333333336, ans=0.125 2024-06-19 20:29:00,735 INFO [train.py:1028] (1/2) Epoch 4, batch 250, loss[loss=0.4314, simple_loss=0.3924, pruned_loss=0.2352, over 13001.00 frames. ], tot_loss[loss=0.4415, simple_loss=0.4083, pruned_loss=0.2374, over 1846510.36 frames. ], batch size: 144, lr: 1.49e-02, grad_scale: 2.0 2024-06-19 20:29:03,990 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+03 2.872e+03 3.361e+03 3.701e+03 5.400e+03, threshold=6.722e+03, percent-clipped=0.0 2024-06-19 20:29:08,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=56118.333333333336, ans=0.0 2024-06-19 20:29:17,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.29 vs. limit=10.0 2024-06-19 20:29:22,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=56155.0, ans=0.2 2024-06-19 20:29:32,902 INFO [train.py:1028] (1/2) Epoch 4, batch 300, loss[loss=0.4178, simple_loss=0.3845, pruned_loss=0.2256, over 13160.00 frames. ], tot_loss[loss=0.442, simple_loss=0.4092, pruned_loss=0.2374, over 2009445.85 frames. ], batch size: 112, lr: 1.49e-02, grad_scale: 2.0 2024-06-19 20:29:48,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=56228.333333333336, ans=0.125 2024-06-19 20:29:53,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=56228.333333333336, ans=0.2 2024-06-19 20:30:04,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=56246.666666666664, ans=0.125 2024-06-19 20:30:04,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=56246.666666666664, ans=0.125 2024-06-19 20:30:10,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=56265.0, ans=0.0 2024-06-19 20:30:10,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=56265.0, ans=0.2 2024-06-19 20:30:12,644 INFO [train.py:1028] (1/2) Epoch 4, batch 350, loss[loss=0.4282, simple_loss=0.405, pruned_loss=0.2258, over 12940.00 frames. ], tot_loss[loss=0.4398, simple_loss=0.4079, pruned_loss=0.2359, over 2138487.47 frames. 
], batch size: 33, lr: 1.49e-02, grad_scale: 1.0 2024-06-19 20:30:16,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=56283.333333333336, ans=0.1 2024-06-19 20:30:17,143 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.418e+03 2.108e+03 2.509e+03 2.928e+03 4.711e+03, threshold=5.017e+03, percent-clipped=0.0 2024-06-19 20:30:30,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=56320.0, ans=0.2 2024-06-19 20:30:45,887 INFO [train.py:1028] (1/2) Epoch 4, batch 400, loss[loss=0.4125, simple_loss=0.3889, pruned_loss=0.2181, over 13298.00 frames. ], tot_loss[loss=0.4412, simple_loss=0.409, pruned_loss=0.2366, over 2238067.31 frames. ], batch size: 63, lr: 1.48e-02, grad_scale: 0.5 2024-06-19 20:30:50,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=56375.0, ans=0.0 2024-06-19 20:30:50,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=56375.0, ans=0.0 2024-06-19 20:30:54,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=56393.333333333336, ans=0.035 2024-06-19 20:30:55,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=20.36 vs. limit=15.0 2024-06-19 20:30:59,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=56411.666666666664, ans=0.0 2024-06-19 20:31:00,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=56411.666666666664, ans=0.125 2024-06-19 20:31:01,169 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.76 vs. limit=22.5 2024-06-19 20:31:12,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.76 vs. limit=15.0 2024-06-19 20:31:15,698 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.77 vs. limit=10.0 2024-06-19 20:31:15,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.45 vs. limit=22.5 2024-06-19 20:31:18,864 INFO [train.py:1028] (1/2) Epoch 4, batch 450, loss[loss=0.4135, simple_loss=0.3888, pruned_loss=0.2191, over 13231.00 frames. ], tot_loss[loss=0.439, simple_loss=0.4078, pruned_loss=0.2351, over 2311814.80 frames. ], batch size: 67, lr: 1.48e-02, grad_scale: 0.5 2024-06-19 20:31:21,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=56466.666666666664, ans=0.025 2024-06-19 20:31:22,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.64 vs. 
limit=10.0 2024-06-19 20:31:23,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=56466.666666666664, ans=0.125 2024-06-19 20:31:24,789 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.143e+03 2.272e+03 2.756e+03 3.458e+03 1.257e+04, threshold=5.512e+03, percent-clipped=7.0 2024-06-19 20:31:35,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=56503.333333333336, ans=0.025 2024-06-19 20:31:36,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=56503.333333333336, ans=0.0 2024-06-19 20:31:40,290 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:31:40,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=56521.666666666664, ans=0.125 2024-06-19 20:31:43,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=56521.666666666664, ans=0.125 2024-06-19 20:31:54,881 INFO [train.py:1028] (1/2) Epoch 4, batch 500, loss[loss=0.4184, simple_loss=0.3875, pruned_loss=0.2246, over 13115.00 frames. ], tot_loss[loss=0.4374, simple_loss=0.4074, pruned_loss=0.2337, over 2374811.34 frames. ], batch size: 121, lr: 1.48e-02, grad_scale: 1.0 2024-06-19 20:31:57,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=56558.333333333336, ans=0.0 2024-06-19 20:32:02,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=56576.666666666664, ans=0.125 2024-06-19 20:32:22,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=56613.333333333336, ans=0.025 2024-06-19 20:32:23,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.05 vs. limit=10.0 2024-06-19 20:32:24,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=56631.666666666664, ans=0.0 2024-06-19 20:32:30,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.83 vs. limit=15.0 2024-06-19 20:32:31,917 INFO [train.py:1028] (1/2) Epoch 4, batch 550, loss[loss=0.447, simple_loss=0.4056, pruned_loss=0.2442, over 12932.00 frames. ], tot_loss[loss=0.4365, simple_loss=0.4069, pruned_loss=0.2331, over 2419878.77 frames. 
], batch size: 158, lr: 1.48e-02, grad_scale: 1.0 2024-06-19 20:32:37,772 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.029e+03 1.953e+03 2.313e+03 2.714e+03 8.504e+03, threshold=4.627e+03, percent-clipped=2.0 2024-06-19 20:32:38,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=56668.333333333336, ans=0.125 2024-06-19 20:32:41,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56668.333333333336, ans=0.1 2024-06-19 20:32:48,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=56686.666666666664, ans=0.0 2024-06-19 20:32:59,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.70 vs. limit=15.0 2024-06-19 20:33:00,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=56723.333333333336, ans=0.125 2024-06-19 20:33:03,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=56741.666666666664, ans=0.125 2024-06-19 20:33:03,764 INFO [train.py:1028] (1/2) Epoch 4, batch 600, loss[loss=0.4263, simple_loss=0.3881, pruned_loss=0.2322, over 13007.00 frames. ], tot_loss[loss=0.4341, simple_loss=0.4055, pruned_loss=0.2313, over 2457461.85 frames. ], batch size: 144, lr: 1.48e-02, grad_scale: 1.0 2024-06-19 20:33:03,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=56741.666666666664, ans=0.025 2024-06-19 20:33:05,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=56741.666666666664, ans=0.015 2024-06-19 20:33:11,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=56760.0, ans=0.2 2024-06-19 20:33:13,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=56760.0, ans=0.025 2024-06-19 20:33:16,159 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0 2024-06-19 20:33:20,148 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.18 vs. limit=10.0 2024-06-19 20:33:21,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=56778.333333333336, ans=0.125 2024-06-19 20:33:22,107 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.76 vs. limit=22.5 2024-06-19 20:33:22,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=56796.666666666664, ans=0.0 2024-06-19 20:33:31,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=56815.0, ans=0.125 2024-06-19 20:33:36,315 INFO [train.py:1028] (1/2) Epoch 4, batch 650, loss[loss=0.4018, simple_loss=0.3863, pruned_loss=0.2087, over 13120.00 frames. 
], tot_loss[loss=0.4336, simple_loss=0.4055, pruned_loss=0.2308, over 2487916.26 frames. ], batch size: 59, lr: 1.48e-02, grad_scale: 0.5 2024-06-19 20:33:39,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2024-06-19 20:33:43,270 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+03 2.641e+03 3.156e+03 3.707e+03 5.750e+03, threshold=6.312e+03, percent-clipped=6.0 2024-06-19 20:33:45,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.73 vs. limit=15.0 2024-06-19 20:33:47,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=56851.666666666664, ans=0.125 2024-06-19 20:33:55,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=56888.333333333336, ans=0.125 2024-06-19 20:34:03,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=56888.333333333336, ans=0.125 2024-06-19 20:34:11,326 INFO [train.py:1028] (1/2) Epoch 4, batch 700, loss[loss=0.4402, simple_loss=0.4161, pruned_loss=0.2321, over 13283.00 frames. ], tot_loss[loss=0.4322, simple_loss=0.4046, pruned_loss=0.2299, over 2511637.72 frames. ], batch size: 46, lr: 1.48e-02, grad_scale: 1.0 2024-06-19 20:34:14,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=56925.0, ans=0.125 2024-06-19 20:34:25,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=56943.333333333336, ans=0.0 2024-06-19 20:34:32,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=56980.0, ans=0.125 2024-06-19 20:34:33,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=56980.0, ans=0.035 2024-06-19 20:34:36,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=56980.0, ans=0.0 2024-06-19 20:34:37,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=56980.0, ans=0.125 2024-06-19 20:34:46,571 INFO [train.py:1028] (1/2) Epoch 4, batch 750, loss[loss=0.425, simple_loss=0.4146, pruned_loss=0.2177, over 13206.00 frames. ], tot_loss[loss=0.4313, simple_loss=0.4044, pruned_loss=0.2291, over 2527609.13 frames. ], batch size: 63, lr: 1.48e-02, grad_scale: 0.5 2024-06-19 20:34:54,064 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.251e+03 2.528e+03 2.905e+03 3.443e+03 8.093e+03, threshold=5.809e+03, percent-clipped=1.0 2024-06-19 20:34:55,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57035.0, ans=0.1 2024-06-19 20:35:00,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.17 vs. 
limit=10.0 2024-06-19 20:35:03,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57053.333333333336, ans=0.1 2024-06-19 20:35:10,197 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. limit=6.0 2024-06-19 20:35:18,791 INFO [train.py:1028] (1/2) Epoch 4, batch 800, loss[loss=0.4218, simple_loss=0.4043, pruned_loss=0.2196, over 12908.00 frames. ], tot_loss[loss=0.4305, simple_loss=0.4037, pruned_loss=0.2287, over 2541087.34 frames. ], batch size: 36, lr: 1.48e-02, grad_scale: 1.0 2024-06-19 20:35:21,207 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.01 vs. limit=15.0 2024-06-19 20:35:23,202 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0 2024-06-19 20:35:24,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=57126.666666666664, ans=0.05 2024-06-19 20:35:27,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=57126.666666666664, ans=0.125 2024-06-19 20:35:31,581 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.65 vs. limit=15.0 2024-06-19 20:35:32,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=57145.0, ans=0.125 2024-06-19 20:35:40,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.17 vs. limit=15.0 2024-06-19 20:35:52,407 INFO [train.py:1028] (1/2) Epoch 4, batch 850, loss[loss=0.4034, simple_loss=0.3821, pruned_loss=0.2123, over 13140.00 frames. ], tot_loss[loss=0.4292, simple_loss=0.403, pruned_loss=0.2277, over 2551507.14 frames. ], batch size: 95, lr: 1.47e-02, grad_scale: 1.0 2024-06-19 20:35:55,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=57200.0, ans=0.0 2024-06-19 20:35:56,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=57200.0, ans=0.125 2024-06-19 20:35:59,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.73 vs. limit=6.0 2024-06-19 20:36:00,587 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.380e+03 2.481e+03 3.066e+03 3.692e+03 4.698e+03, threshold=6.132e+03, percent-clipped=0.0 2024-06-19 20:36:02,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=57218.333333333336, ans=0.1 2024-06-19 20:36:02,261 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.81 vs. 
limit=10.0 2024-06-19 20:36:06,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=57218.333333333336, ans=0.015 2024-06-19 20:36:07,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=57236.666666666664, ans=0.07 2024-06-19 20:36:10,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.26 vs. limit=15.0 2024-06-19 20:36:11,932 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.80 vs. limit=10.0 2024-06-19 20:36:12,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=57236.666666666664, ans=0.0 2024-06-19 20:36:13,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=57236.666666666664, ans=0.0 2024-06-19 20:36:14,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=57255.0, ans=0.025 2024-06-19 20:36:25,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57273.333333333336, ans=0.1 2024-06-19 20:36:30,765 INFO [train.py:1028] (1/2) Epoch 4, batch 900, loss[loss=0.3808, simple_loss=0.3725, pruned_loss=0.1945, over 12939.00 frames. ], tot_loss[loss=0.4294, simple_loss=0.4028, pruned_loss=0.2279, over 2556831.26 frames. ], batch size: 36, lr: 1.47e-02, grad_scale: 1.0 2024-06-19 20:36:31,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=57291.666666666664, ans=0.125 2024-06-19 20:36:31,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=57291.666666666664, ans=0.0 2024-06-19 20:36:32,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=57291.666666666664, ans=0.125 2024-06-19 20:36:33,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.16 vs. limit=22.5 2024-06-19 20:36:36,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=57310.0, ans=0.0 2024-06-19 20:36:37,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=57310.0, ans=0.125 2024-06-19 20:36:41,531 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.49 vs. limit=22.5 2024-06-19 20:37:03,374 INFO [train.py:1028] (1/2) Epoch 4, batch 950, loss[loss=0.4218, simple_loss=0.4032, pruned_loss=0.2202, over 13037.00 frames. ], tot_loss[loss=0.4312, simple_loss=0.404, pruned_loss=0.2292, over 2559421.94 frames. 
], batch size: 39, lr: 1.47e-02, grad_scale: 0.5 2024-06-19 20:37:12,759 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+03 3.110e+03 3.597e+03 4.266e+03 6.315e+03, threshold=7.195e+03, percent-clipped=2.0 2024-06-19 20:37:14,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=57401.666666666664, ans=0.0 2024-06-19 20:37:23,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.38 vs. limit=22.5 2024-06-19 20:37:25,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=57438.333333333336, ans=0.125 2024-06-19 20:37:25,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.99 vs. limit=15.0 2024-06-19 20:37:35,392 INFO [train.py:1028] (1/2) Epoch 4, batch 1000, loss[loss=0.4544, simple_loss=0.422, pruned_loss=0.2434, over 13043.00 frames. ], tot_loss[loss=0.4317, simple_loss=0.4039, pruned_loss=0.2298, over 2561236.44 frames. ], batch size: 48, lr: 1.47e-02, grad_scale: 1.0 2024-06-19 20:37:47,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=15.0 2024-06-19 20:37:50,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=57511.666666666664, ans=0.125 2024-06-19 20:37:51,457 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:37:56,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=57530.0, ans=0.07 2024-06-19 20:38:11,635 INFO [train.py:1028] (1/2) Epoch 4, batch 1050, loss[loss=0.42, simple_loss=0.3998, pruned_loss=0.2201, over 13159.00 frames. ], tot_loss[loss=0.4311, simple_loss=0.4036, pruned_loss=0.2293, over 2564441.95 frames. ], batch size: 77, lr: 1.47e-02, grad_scale: 0.5 2024-06-19 20:38:14,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=57566.666666666664, ans=0.125 2024-06-19 20:38:25,095 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+03 2.753e+03 3.133e+03 3.636e+03 1.136e+04, threshold=6.267e+03, percent-clipped=1.0 2024-06-19 20:38:34,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=57621.666666666664, ans=0.2 2024-06-19 20:38:35,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.08 vs. limit=22.5 2024-06-19 20:38:35,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=57621.666666666664, ans=0.125 2024-06-19 20:38:43,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=57640.0, ans=0.0 2024-06-19 20:38:48,082 INFO [train.py:1028] (1/2) Epoch 4, batch 1100, loss[loss=0.4328, simple_loss=0.4094, pruned_loss=0.2281, over 13305.00 frames. ], tot_loss[loss=0.4313, simple_loss=0.4037, pruned_loss=0.2294, over 2569690.77 frames. 
], batch size: 52, lr: 1.47e-02, grad_scale: 1.0 2024-06-19 20:38:50,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=57658.333333333336, ans=0.0 2024-06-19 20:38:56,511 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.77 vs. limit=15.0 2024-06-19 20:39:14,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=57731.666666666664, ans=0.125 2024-06-19 20:39:18,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=57731.666666666664, ans=0.125 2024-06-19 20:39:21,066 INFO [train.py:1028] (1/2) Epoch 4, batch 1150, loss[loss=0.477, simple_loss=0.4451, pruned_loss=0.2545, over 13221.00 frames. ], tot_loss[loss=0.4312, simple_loss=0.4036, pruned_loss=0.2293, over 2570732.20 frames. ], batch size: 52, lr: 1.47e-02, grad_scale: 0.5 2024-06-19 20:39:23,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=57750.0, ans=0.025 2024-06-19 20:39:24,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.77 vs. limit=15.0 2024-06-19 20:39:31,215 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.567e+03 2.438e+03 2.816e+03 3.261e+03 1.133e+04, threshold=5.631e+03, percent-clipped=1.0 2024-06-19 20:39:45,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=57805.0, ans=0.0 2024-06-19 20:39:53,044 INFO [train.py:1028] (1/2) Epoch 4, batch 1200, loss[loss=0.3935, simple_loss=0.3865, pruned_loss=0.2003, over 13154.00 frames. ], tot_loss[loss=0.4312, simple_loss=0.4036, pruned_loss=0.2294, over 2573319.48 frames. ], batch size: 77, lr: 1.47e-02, grad_scale: 1.0 2024-06-19 20:39:54,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=57841.666666666664, ans=0.2 2024-06-19 20:39:55,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=57841.666666666664, ans=0.0 2024-06-19 20:40:05,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=15.0 2024-06-19 20:40:08,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=57878.333333333336, ans=0.2 2024-06-19 20:40:08,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=57878.333333333336, ans=0.2 2024-06-19 20:40:09,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=57878.333333333336, ans=0.125 2024-06-19 20:40:10,013 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.385e+03 2024-06-19 20:40:10,070 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=6.176e+00 2024-06-19 20:40:10,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.96 vs. 
limit=15.0 2024-06-19 20:40:10,371 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.70 vs. limit=15.0 2024-06-19 20:40:28,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=57915.0, ans=0.125 2024-06-19 20:40:29,173 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.77 vs. limit=15.0 2024-06-19 20:40:30,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=57933.333333333336, ans=0.025 2024-06-19 20:40:30,783 INFO [train.py:1028] (1/2) Epoch 4, batch 1250, loss[loss=0.4422, simple_loss=0.4114, pruned_loss=0.2364, over 13238.00 frames. ], tot_loss[loss=0.4301, simple_loss=0.403, pruned_loss=0.2286, over 2582797.68 frames. ], batch size: 112, lr: 1.47e-02, grad_scale: 1.0 2024-06-19 20:40:36,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=57933.333333333336, ans=0.125 2024-06-19 20:40:37,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=57951.666666666664, ans=0.125 2024-06-19 20:40:37,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=57951.666666666664, ans=0.125 2024-06-19 20:40:41,025 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+03 2.370e+03 2.802e+03 3.132e+03 7.023e+03, threshold=5.604e+03, percent-clipped=2.0 2024-06-19 20:40:42,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.82 vs. limit=15.0 2024-06-19 20:40:50,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=57988.333333333336, ans=0.2 2024-06-19 20:40:52,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.32 vs. limit=15.0 2024-06-19 20:40:53,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=57988.333333333336, ans=0.125 2024-06-19 20:40:57,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=58006.666666666664, ans=0.125 2024-06-19 20:41:02,758 INFO [train.py:1028] (1/2) Epoch 4, batch 1300, loss[loss=0.4654, simple_loss=0.4242, pruned_loss=0.2533, over 12844.00 frames. ], tot_loss[loss=0.4296, simple_loss=0.4028, pruned_loss=0.2282, over 2582687.43 frames. ], batch size: 177, lr: 1.46e-02, grad_scale: 2.0 2024-06-19 20:41:07,161 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.460e+01 2024-06-19 20:41:11,614 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.53 vs. limit=22.5 2024-06-19 20:41:20,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.90 vs. 
limit=12.0 2024-06-19 20:41:20,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=58080.0, ans=0.125 2024-06-19 20:41:20,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=58080.0, ans=0.2 2024-06-19 20:41:21,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=58080.0, ans=0.125 2024-06-19 20:41:29,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=58098.333333333336, ans=0.0 2024-06-19 20:41:32,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=58098.333333333336, ans=0.0 2024-06-19 20:41:34,205 INFO [train.py:1028] (1/2) Epoch 4, batch 1350, loss[loss=0.3981, simple_loss=0.3889, pruned_loss=0.2036, over 13204.00 frames. ], tot_loss[loss=0.4277, simple_loss=0.4022, pruned_loss=0.2266, over 2585899.20 frames. ], batch size: 59, lr: 1.46e-02, grad_scale: 1.0 2024-06-19 20:41:36,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.41 vs. limit=15.0 2024-06-19 20:41:41,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.09 vs. limit=15.0 2024-06-19 20:41:42,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=58135.0, ans=0.2 2024-06-19 20:41:45,310 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.704e+02 2.022e+03 2.308e+03 2.650e+03 4.836e+03, threshold=4.617e+03, percent-clipped=0.0 2024-06-19 20:41:46,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.75 vs. limit=15.0 2024-06-19 20:41:47,016 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.26 vs. limit=15.0 2024-06-19 20:41:54,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=58171.666666666664, ans=0.125 2024-06-19 20:42:02,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=58190.0, ans=0.125 2024-06-19 20:42:10,052 INFO [train.py:1028] (1/2) Epoch 4, batch 1400, loss[loss=0.4586, simple_loss=0.4267, pruned_loss=0.2452, over 12428.00 frames. ], tot_loss[loss=0.429, simple_loss=0.4027, pruned_loss=0.2276, over 2587268.73 frames. ], batch size: 25, lr: 1.46e-02, grad_scale: 1.0 2024-06-19 20:42:33,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=58263.333333333336, ans=0.125 2024-06-19 20:42:41,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=58281.666666666664, ans=0.125 2024-06-19 20:42:45,518 INFO [train.py:1028] (1/2) Epoch 4, batch 1450, loss[loss=0.4003, simple_loss=0.3789, pruned_loss=0.2109, over 13097.00 frames. ], tot_loss[loss=0.4286, simple_loss=0.4021, pruned_loss=0.2276, over 2586247.22 frames. 
], batch size: 121, lr: 1.46e-02, grad_scale: 1.0 2024-06-19 20:42:57,368 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.375e+03 2.488e+03 2.825e+03 3.343e+03 6.478e+03, threshold=5.650e+03, percent-clipped=3.0 2024-06-19 20:42:58,342 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:43:16,515 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0 2024-06-19 20:43:17,994 INFO [train.py:1028] (1/2) Epoch 4, batch 1500, loss[loss=0.4248, simple_loss=0.4005, pruned_loss=0.2246, over 13194.00 frames. ], tot_loss[loss=0.4286, simple_loss=0.4021, pruned_loss=0.2276, over 2588546.24 frames. ], batch size: 83, lr: 1.46e-02, grad_scale: 1.0 2024-06-19 20:43:19,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2024-06-19 20:43:20,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=58391.666666666664, ans=0.2 2024-06-19 20:43:22,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=58391.666666666664, ans=0.0 2024-06-19 20:43:30,488 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.24 vs. limit=15.0 2024-06-19 20:43:31,145 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.51 vs. limit=15.0 2024-06-19 20:43:37,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=58446.666666666664, ans=0.015 2024-06-19 20:43:38,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=58446.666666666664, ans=0.125 2024-06-19 20:43:38,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=58446.666666666664, ans=0.125 2024-06-19 20:43:49,831 INFO [train.py:1028] (1/2) Epoch 4, batch 1550, loss[loss=0.4383, simple_loss=0.3982, pruned_loss=0.2392, over 12994.00 frames. ], tot_loss[loss=0.4292, simple_loss=0.4023, pruned_loss=0.228, over 2584625.19 frames. ], batch size: 102, lr: 1.46e-02, grad_scale: 0.5 2024-06-19 20:43:53,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=58483.333333333336, ans=0.0 2024-06-19 20:44:03,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=58501.666666666664, ans=0.125 2024-06-19 20:44:04,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.44 vs. 
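limit=22.5

The optim.py:487 WARNING lines summarize the recent distribution of gradient norms as five values (min, 25%, median, 75%, max) together with the active clipping threshold and the percentage of recent batches that hit it. The actual threshold rule is internal to icefall's optimizer and not visible in the log; a toy monitor, assuming a window of recent norms and a threshold of twice the running median, reproduces the shape of the message:

    import torch

    # Toy gradient-norm monitor in the spirit of the optim.py WARNINGs.
    # The threshold rule (2x the running median) is an assumption made
    # for illustration, not the actual icefall logic.
    class GradNormMonitor:
        def __init__(self, window=128, scale=2.0):
            self.window, self.scale, self.history = window, scale, []

        def observe(self, model):
            grads = [p.grad.norm() for p in model.parameters()
                     if p.grad is not None]
            norm = torch.stack(grads).norm().item()
            self.history = (self.history + [norm])[-self.window:]
            ordered = sorted(self.history)
            q = [ordered[int(f * (len(ordered) - 1))]
                 for f in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.scale * q[2]
            clipped = 100.0 * sum(n > threshold for n in ordered) / len(ordered)
            print(f"Clipping_scale={self.scale}, grad-norm quartiles "
                  + " ".join(f"{v:.3e}" for v in q)
                  + f", threshold={threshold:.3e}, percent-clipped={clipped:.1f}")
            return min(1.0, threshold / max(norm, 1e-20))  # factor to clip by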
2024-06-19 20:44:05,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=58520.0, ans=0.125 2024-06-19 20:44:05,730 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.447e+03 2.182e+03 2.598e+03 3.060e+03 8.900e+03, threshold=5.197e+03, percent-clipped=2.0 2024-06-19 20:44:05,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=58520.0, ans=0.125 2024-06-19 20:44:05,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=58520.0, ans=0.125 2024-06-19 20:44:09,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=58520.0, ans=0.0 2024-06-19 20:44:10,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=58520.0, ans=10.0 2024-06-19 20:44:18,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=58538.333333333336, ans=0.0 2024-06-19 20:44:19,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=58538.333333333336, ans=0.0 2024-06-19 20:44:19,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=58538.333333333336, ans=0.2 2024-06-19 20:44:27,933 INFO [train.py:1028] (1/2) Epoch 4, batch 1600, loss[loss=0.4105, simple_loss=0.3923, pruned_loss=0.2144, over 13186.00 frames. ], tot_loss[loss=0.4278, simple_loss=0.4016, pruned_loss=0.227, over 2579575.71 frames. ], batch size: 77, lr: 1.46e-02, grad_scale: 1.0 2024-06-19 20:44:31,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=58575.0, ans=0.125 2024-06-19 20:44:37,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2024-06-19 20:44:44,628 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.16 vs. limit=15.0 2024-06-19 20:44:44,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=58611.666666666664, ans=0.125 2024-06-19 20:44:52,957 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:44:56,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=58648.333333333336, ans=0.025 2024-06-19 20:44:57,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2024-06-19 20:45:04,730 INFO [train.py:1028] (1/2) Epoch 4, batch 1650, loss[loss=0.4227, simple_loss=0.397, pruned_loss=0.2242, over 13198.00 frames. ], tot_loss[loss=0.4253, simple_loss=0.4001, pruned_loss=0.2253, over 2575935.68 frames. ], batch size: 95, lr: 1.46e-02, grad_scale: 1.0 2024-06-19 20:45:08,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.97 vs.
limit=6.0 2024-06-19 20:45:10,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=58685.0, ans=0.0 2024-06-19 20:45:17,407 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.303e+03 2.008e+03 2.319e+03 2.954e+03 7.579e+03, threshold=4.637e+03, percent-clipped=3.0 2024-06-19 20:45:22,537 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.59 vs. limit=15.0 2024-06-19 20:45:32,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=58740.0, ans=0.07 2024-06-19 20:45:33,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=58740.0, ans=0.05 2024-06-19 20:45:37,510 INFO [train.py:1028] (1/2) Epoch 4, batch 1700, loss[loss=0.3562, simple_loss=0.353, pruned_loss=0.1797, over 12314.00 frames. ], tot_loss[loss=0.4245, simple_loss=0.4002, pruned_loss=0.2244, over 2580690.89 frames. ], batch size: 25, lr: 1.46e-02, grad_scale: 2.0 2024-06-19 20:45:38,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58758.333333333336, ans=0.1 2024-06-19 20:45:40,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=58758.333333333336, ans=0.125 2024-06-19 20:45:44,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=58776.666666666664, ans=0.125 2024-06-19 20:45:46,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=58776.666666666664, ans=0.125 2024-06-19 20:45:47,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.61 vs. limit=15.0 2024-06-19 20:45:49,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=58795.0, ans=0.09899494936611666 2024-06-19 20:45:50,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=58795.0, ans=0.0 2024-06-19 20:45:54,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=17.91 vs. limit=15.0 2024-06-19 20:46:11,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0 2024-06-19 20:46:12,612 INFO [train.py:1028] (1/2) Epoch 4, batch 1750, loss[loss=0.3885, simple_loss=0.3836, pruned_loss=0.1967, over 12589.00 frames. ], tot_loss[loss=0.4243, simple_loss=0.4002, pruned_loss=0.2242, over 2581541.56 frames. 
], batch size: 22, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:46:20,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=58850.0, ans=0.125 2024-06-19 20:46:29,544 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+03 2.292e+03 2.681e+03 3.172e+03 5.645e+03, threshold=5.361e+03, percent-clipped=1.0 2024-06-19 20:46:35,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=58905.0, ans=0.0 2024-06-19 20:46:36,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=58905.0, ans=0.2 2024-06-19 20:46:38,943 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2024-06-19 20:46:47,986 INFO [train.py:1028] (1/2) Epoch 4, batch 1800, loss[loss=0.4039, simple_loss=0.3852, pruned_loss=0.2113, over 13220.00 frames. ], tot_loss[loss=0.4233, simple_loss=0.3997, pruned_loss=0.2235, over 2581835.08 frames. ], batch size: 67, lr: 1.45e-02, grad_scale: 2.0 2024-06-19 20:47:06,225 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.73 vs. limit=10.0 2024-06-19 20:47:09,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=58996.666666666664, ans=0.125 2024-06-19 20:47:10,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=58996.666666666664, ans=0.0 2024-06-19 20:47:20,240 INFO [train.py:1028] (1/2) Epoch 4, batch 1850, loss[loss=0.4098, simple_loss=0.3933, pruned_loss=0.2132, over 13158.00 frames. ], tot_loss[loss=0.4233, simple_loss=0.4001, pruned_loss=0.2232, over 2583083.89 frames. ], batch size: 83, lr: 1.45e-02, grad_scale: 0.5 2024-06-19 20:47:25,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=59033.333333333336, ans=0.025 2024-06-19 20:47:28,911 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=24.35 vs. limit=15.0 2024-06-19 20:47:31,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=59051.666666666664, ans=0.125 2024-06-19 20:47:34,821 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.836e+03 2.723e+03 3.072e+03 3.920e+03 6.542e+03, threshold=6.144e+03, percent-clipped=4.0 2024-06-19 20:47:38,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=59088.333333333336, ans=0.125 2024-06-19 20:47:51,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=59125.0, ans=0.0 2024-06-19 20:47:52,237 INFO [train.py:1028] (1/2) Epoch 4, batch 1900, loss[loss=0.4066, simple_loss=0.3804, pruned_loss=0.2165, over 13162.00 frames. ], tot_loss[loss=0.4221, simple_loss=0.3987, pruned_loss=0.2228, over 2585318.29 frames. 
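], batch size: 95, lr: 1.45e-02, grad_scale: 1.0

The scaling.py:214 lines trace ScheduledFloat values: regularization knobs (skip rates, balancer probabilities, dropout rates, even whitening limits) that are deterministic functions of the global batch_count rather than trained parameters. A minimal piecewise-linear re-implementation of the idea; the breakpoints below are invented for the example, since each zipformer module defines its own schedule:

    # Minimal piecewise-linear schedule keyed on the global batch count,
    # in the spirit of the ScheduledFloat values traced above. The
    # breakpoints are invented for illustration.
    class ScheduledFloat:
        def __init__(self, *points):            # points: (batch_count, value)
            self.points = sorted(points)
            self.batch_count = 0.0

        def value(self):
            pts = self.points
            if self.batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if self.batch_count <= x1:      # interpolate on [x0, x1]
                    w = (self.batch_count - x0) / (x1 - x0)
                    return y0 + w * (y1 - y0)
            return pts[-1][1]

    skip_rate = ScheduledFloat((0.0, 0.3), (20000.0, 0.0))
    skip_rate.batch_count = 59125.0
    print(skip_rate.value())   # 0.0: past the last breakpoint, as in ans=0.0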
2024-06-19 20:47:57,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=59125.0, ans=0.125 2024-06-19 20:48:04,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=59143.333333333336, ans=0.0 2024-06-19 20:48:07,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.33 vs. limit=15.0 2024-06-19 20:48:17,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=59180.0, ans=15.0 2024-06-19 20:48:22,406 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.40 vs. limit=15.0 2024-06-19 20:48:23,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.98 vs. limit=22.5 2024-06-19 20:48:26,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=59198.333333333336, ans=0.07 2024-06-19 20:48:29,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.48 vs. limit=10.0 2024-06-19 20:48:30,724 INFO [train.py:1028] (1/2) Epoch 4, batch 1950, loss[loss=0.3922, simple_loss=0.387, pruned_loss=0.1986, over 13230.00 frames. ], tot_loss[loss=0.4215, simple_loss=0.3979, pruned_loss=0.2225, over 2590960.07 frames. ], batch size: 52, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:48:39,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=59235.0, ans=0.125 2024-06-19 20:48:45,521 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.350e+03 2.221e+03 2.489e+03 2.873e+03 3.745e+03, threshold=4.978e+03, percent-clipped=0.0 2024-06-19 20:48:52,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=59271.666666666664, ans=0.035 2024-06-19 20:48:57,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=59290.0, ans=0.1 2024-06-19 20:48:58,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59290.0, ans=0.1 2024-06-19 20:49:02,730 INFO [train.py:1028] (1/2) Epoch 4, batch 2000, loss[loss=0.3837, simple_loss=0.3871, pruned_loss=0.1902, over 12281.00 frames. ], tot_loss[loss=0.4199, simple_loss=0.3968, pruned_loss=0.2215, over 2586654.01 frames. ], batch size: 22, lr: 1.45e-02, grad_scale: 2.0 2024-06-19 20:49:09,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=59326.666666666664, ans=0.2 2024-06-19 20:49:16,191 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.50 vs.
limit=22.5 2024-06-19 20:49:17,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59345.0, ans=0.1 2024-06-19 20:49:18,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=59345.0, ans=0.125 2024-06-19 20:49:29,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.66 vs. limit=10.0 2024-06-19 20:49:32,325 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.67 vs. limit=22.5 2024-06-19 20:49:35,705 INFO [train.py:1028] (1/2) Epoch 4, batch 2050, loss[loss=0.3916, simple_loss=0.3841, pruned_loss=0.1995, over 12764.00 frames. ], tot_loss[loss=0.4221, simple_loss=0.3986, pruned_loss=0.2229, over 2581724.54 frames. ], batch size: 29, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:49:37,182 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:49:38,532 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:49:39,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.23 vs. limit=15.0 2024-06-19 20:49:39,341 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.76 vs. limit=22.5 2024-06-19 20:49:43,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=59418.333333333336, ans=0.125 2024-06-19 20:49:45,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=59418.333333333336, ans=0.07 2024-06-19 20:49:51,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.99 vs. limit=10.0 2024-06-19 20:49:51,320 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.377e+03 2.106e+03 2.613e+03 2.993e+03 5.515e+03, threshold=5.226e+03, percent-clipped=2.0 2024-06-19 20:50:05,128 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=15.0 2024-06-19 20:50:07,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=59473.333333333336, ans=0.125 2024-06-19 20:50:11,110 INFO [train.py:1028] (1/2) Epoch 4, batch 2100, loss[loss=0.4034, simple_loss=0.3972, pruned_loss=0.2048, over 13170.00 frames. ], tot_loss[loss=0.4193, simple_loss=0.397, pruned_loss=0.2208, over 2584746.61 frames. 
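], batch size: 59, lr: 1.45e-02, grad_scale: 2.0

The grad_scale field in these batch records steps through power-of-two values (0.5, 1.0, 2.0, 4.0 in this stretch), which is characteristic of dynamic loss scaling in mixed-precision training: the scale grows after a run of finite gradients and halves when an overflow is detected. Whether train.py drives this through torch.cuda.amp.GradScaler or its own wrapper isn't visible in the log; the standard PyTorch pattern looks like this, with model and data as placeholders:

    import torch

    # Standard dynamic loss scaling for fp16; yields power-of-two scale
    # values like the grad_scale field above.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(80, 500).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1.45e-2)
    scaler = torch.cuda.amp.GradScaler(init_scale=1.0,
                                       enabled=(device == "cuda"))

    def train_step(features, targets):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=(device == "cuda")):
            loss = torch.nn.functional.mse_loss(model(features), targets)
        scaler.scale(loss).backward()   # backward through the scaled loss
        scaler.step(optimizer)          # unscales; skips the step on inf/nan
        scaler.update()                 # doubles or halves the scale
        return scaler.get_scale()       # the value logged as grad_scale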
2024-06-19 20:50:12,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=59491.666666666664, ans=0.125 2024-06-19 20:50:22,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=59510.0, ans=0.0 2024-06-19 20:50:25,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=59510.0, ans=0.0 2024-06-19 20:50:29,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59528.333333333336, ans=0.1 2024-06-19 20:50:36,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=59546.666666666664, ans=0.025 2024-06-19 20:50:48,238 INFO [train.py:1028] (1/2) Epoch 4, batch 2150, loss[loss=0.382, simple_loss=0.3774, pruned_loss=0.1933, over 13221.00 frames. ], tot_loss[loss=0.419, simple_loss=0.3971, pruned_loss=0.2205, over 2587384.20 frames. ], batch size: 52, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:51:05,240 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.673e+03 2.633e+03 2.956e+03 3.400e+03 4.939e+03, threshold=5.911e+03, percent-clipped=0.0 2024-06-19 20:51:11,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=59638.333333333336, ans=0.125 2024-06-19 20:51:17,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=59656.666666666664, ans=0.125 2024-06-19 20:51:20,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=59675.0, ans=0.09899494936611666 2024-06-19 20:51:20,658 INFO [train.py:1028] (1/2) Epoch 4, batch 2200, loss[loss=0.4052, simple_loss=0.3903, pruned_loss=0.21, over 13261.00 frames. ], tot_loss[loss=0.4197, simple_loss=0.3976, pruned_loss=0.2209, over 2588249.79 frames. ], batch size: 83, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:51:24,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=59675.0, ans=0.0 2024-06-19 20:51:39,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.01 vs. limit=22.5 2024-06-19 20:51:41,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=59730.0, ans=0.0 2024-06-19 20:51:42,316 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.380e+00 2024-06-19 20:51:42,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=59730.0, ans=0.125 2024-06-19 20:51:44,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=59730.0, ans=0.025 2024-06-19 20:51:53,467 INFO [train.py:1028] (1/2) Epoch 4, batch 2250, loss[loss=0.3844, simple_loss=0.3666, pruned_loss=0.2012, over 13306.00 frames. ], tot_loss[loss=0.4178, simple_loss=0.3961, pruned_loss=0.2198, over 2587232.09 frames.
], batch size: 63, lr: 1.44e-02, grad_scale: 1.0 2024-06-19 20:51:55,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=59766.666666666664, ans=0.2 2024-06-19 20:52:02,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=59785.0, ans=0.0 2024-06-19 20:52:12,886 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+03 2.388e+03 2.751e+03 3.153e+03 1.252e+04, threshold=5.501e+03, percent-clipped=2.0 2024-06-19 20:52:13,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59803.333333333336, ans=0.1 2024-06-19 20:52:15,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=59821.666666666664, ans=0.125 2024-06-19 20:52:16,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=59821.666666666664, ans=0.025 2024-06-19 20:52:31,440 INFO [train.py:1028] (1/2) Epoch 4, batch 2300, loss[loss=0.4201, simple_loss=0.3979, pruned_loss=0.2211, over 12942.00 frames. ], tot_loss[loss=0.4172, simple_loss=0.3961, pruned_loss=0.2192, over 2580643.69 frames. ], batch size: 33, lr: 1.44e-02, grad_scale: 2.0 2024-06-19 20:52:37,928 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.44 vs. limit=22.5 2024-06-19 20:52:52,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=59913.333333333336, ans=0.2 2024-06-19 20:52:56,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59913.333333333336, ans=0.1 2024-06-19 20:53:04,309 INFO [train.py:1028] (1/2) Epoch 4, batch 2350, loss[loss=0.4184, simple_loss=0.3968, pruned_loss=0.22, over 13248.00 frames. ], tot_loss[loss=0.4179, simple_loss=0.3964, pruned_loss=0.2197, over 2584612.36 frames. ], batch size: 67, lr: 1.44e-02, grad_scale: 0.5 2024-06-19 20:53:07,117 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 20:53:10,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=59968.333333333336, ans=0.0 2024-06-19 20:53:12,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=59968.333333333336, ans=0.125 2024-06-19 20:53:19,319 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
limit=22.5 2024-06-19 20:53:21,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=59986.666666666664, ans=0.125 2024-06-19 20:53:21,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=59986.666666666664, ans=0.05 2024-06-19 20:53:22,863 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.333e+03 2.025e+03 2.323e+03 2.891e+03 1.142e+04, threshold=4.646e+03, percent-clipped=3.0 2024-06-19 20:53:23,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=59986.666666666664, ans=0.0 2024-06-19 20:53:31,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.11 vs. limit=15.0 2024-06-19 20:53:36,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.29 vs. limit=22.5 2024-06-19 20:53:37,518 INFO [train.py:1028] (1/2) Epoch 4, batch 2400, loss[loss=0.4095, simple_loss=0.3972, pruned_loss=0.211, over 13301.00 frames. ], tot_loss[loss=0.4168, simple_loss=0.3953, pruned_loss=0.2192, over 2587430.54 frames. ], batch size: 46, lr: 1.44e-02, grad_scale: 1.0 2024-06-19 20:53:44,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.83 vs. limit=10.0 2024-06-19 20:53:46,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=60060.0, ans=0.07 2024-06-19 20:53:47,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=60060.0, ans=0.0 2024-06-19 20:53:50,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.96 vs. limit=15.0 2024-06-19 20:53:56,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60078.333333333336, ans=0.1 2024-06-19 20:53:56,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=60078.333333333336, ans=0.125 2024-06-19 20:53:59,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=60078.333333333336, ans=0.2 2024-06-19 20:54:03,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=60096.666666666664, ans=0.0 2024-06-19 20:54:15,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=60115.0, ans=0.0 2024-06-19 20:54:16,463 INFO [train.py:1028] (1/2) Epoch 4, batch 2450, loss[loss=0.3885, simple_loss=0.3751, pruned_loss=0.2009, over 13266.00 frames. ], tot_loss[loss=0.4163, simple_loss=0.3944, pruned_loss=0.2191, over 2583789.84 frames. 
], batch size: 63, lr: 1.44e-02, grad_scale: 1.0 2024-06-19 20:54:22,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=60151.666666666664, ans=0.125 2024-06-19 20:54:23,277 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-19 20:54:34,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=60170.0, ans=0.125 2024-06-19 20:54:34,579 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.335e+03 1.966e+03 2.409e+03 2.805e+03 5.249e+03, threshold=4.817e+03, percent-clipped=2.0 2024-06-19 20:54:43,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=60206.666666666664, ans=0.125 2024-06-19 20:54:43,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. limit=6.0 2024-06-19 20:54:48,701 INFO [train.py:1028] (1/2) Epoch 4, batch 2500, loss[loss=0.3652, simple_loss=0.357, pruned_loss=0.1867, over 13274.00 frames. ], tot_loss[loss=0.4145, simple_loss=0.3931, pruned_loss=0.2179, over 2586721.88 frames. ], batch size: 83, lr: 1.44e-02, grad_scale: 2.0 2024-06-19 20:54:56,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.27 vs. limit=15.0 2024-06-19 20:54:56,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.26 vs. limit=15.0 2024-06-19 20:55:00,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=60243.333333333336, ans=22.5 2024-06-19 20:55:02,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.11 vs. limit=15.0 2024-06-19 20:55:05,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=60261.666666666664, ans=0.125 2024-06-19 20:55:06,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=60261.666666666664, ans=0.2 2024-06-19 20:55:16,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=60298.333333333336, ans=0.125 2024-06-19 20:55:21,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=60316.666666666664, ans=0.025 2024-06-19 20:55:21,525 INFO [train.py:1028] (1/2) Epoch 4, batch 2550, loss[loss=0.4117, simple_loss=0.4, pruned_loss=0.2116, over 12617.00 frames. ], tot_loss[loss=0.4121, simple_loss=0.391, pruned_loss=0.2166, over 2586182.00 frames. ], batch size: 22, lr: 1.44e-02, grad_scale: 2.0 2024-06-19 20:55:25,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=60316.666666666664, ans=0.0 2024-06-19 20:55:29,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.30 vs. 
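limit=10.0

The scaling.py:1023 Whitening lines fire when a statistic of the channel covariance of some activation drifts above a limit (itself sometimes a ScheduledFloat, as the whitening_limit entries show). The precise metric is internal to scaling.py and not derivable from the log; one stand-in with the same flavour is the normalized eigenvalue spread of the covariance, which equals 1.0 for perfectly white features and grows as energy concentrates in a few directions:

    import torch

    # Illustrative whitening diagnostic for (frames, channels) activations:
    # metric = E[lambda^2] / (E[lambda])^2 over covariance eigenvalues,
    # equal to 1.0 iff all eigenvalues match. A stand-in for, not a copy
    # of, the metric reported by scaling.py.
    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        metrics = []
        for g in x.chunk(num_groups, dim=-1):       # split channel groups
            g = g - g.mean(dim=0)
            cov = (g.T @ g) / g.shape[0]
            eigs = torch.linalg.eigvalsh(cov)
            metrics.append((eigs.square().mean() / eigs.mean() ** 2).item())
        return max(metrics)

    x = torch.randn(2000, 8) @ torch.randn(8, 384)  # rank-8: far from white
    metric, limit = whitening_metric(x), 10.0
    if metric > limit:
        print(f"Whitening: num_groups=1, num_channels=384, "
              f"metric={metric:.2f} vs. limit={limit}")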
2024-06-19 20:55:39,528 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.403e+03 1.997e+03 2.318e+03 2.716e+03 3.973e+03, threshold=4.636e+03, percent-clipped=0.0 2024-06-19 20:55:41,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=60371.666666666664, ans=0.0 2024-06-19 20:55:56,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=60408.333333333336, ans=0.125 2024-06-19 20:55:57,009 INFO [train.py:1028] (1/2) Epoch 4, batch 2600, loss[loss=0.3413, simple_loss=0.3499, pruned_loss=0.1664, over 13291.00 frames. ], tot_loss[loss=0.41, simple_loss=0.3889, pruned_loss=0.2155, over 2585918.99 frames. ], batch size: 52, lr: 1.44e-02, grad_scale: 4.0 2024-06-19 20:56:06,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60426.666666666664, ans=0.1 2024-06-19 20:56:10,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=60426.666666666664, ans=0.125 2024-06-19 20:56:10,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=60426.666666666664, ans=0.1 2024-06-19 20:56:22,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=60463.333333333336, ans=0.025 2024-06-19 20:56:25,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=60481.666666666664, ans=0.0 2024-06-19 20:56:27,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60481.666666666664, ans=0.125 2024-06-19 20:56:28,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=19.12 vs. limit=15.0 2024-06-19 20:56:32,959 INFO [train.py:1028] (1/2) Epoch 4, batch 2650, loss[loss=0.4143, simple_loss=0.3851, pruned_loss=0.2217, over 13063.00 frames. ], tot_loss[loss=0.4082, simple_loss=0.3871, pruned_loss=0.2146, over 2586004.05 frames. ], batch size: 144, lr: 1.44e-02, grad_scale: 1.0 2024-06-19 20:56:36,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=60500.0, ans=0.09899494936611666 2024-06-19 20:56:51,921 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.362e+02 1.906e+03 2.316e+03 2.609e+03 6.786e+03, threshold=4.632e+03, percent-clipped=2.0 2024-06-19 20:57:04,655 INFO [train.py:1028] (1/2) Epoch 4, batch 2700, loss[loss=0.3683, simple_loss=0.3499, pruned_loss=0.1933, over 13214.00 frames. ], tot_loss[loss=0.4055, simple_loss=0.3848, pruned_loss=0.2131, over 2584427.55 frames. ], batch size: 89, lr: 1.43e-02, grad_scale: 2.0 2024-06-19 20:57:18,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=60628.333333333336, ans=0.2 2024-06-19 20:57:20,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=60628.333333333336, ans=0.125 2024-06-19 20:57:37,652 INFO [train.py:1028] (1/2) Epoch 4, batch 2750, loss[loss=0.4337, simple_loss=0.4026, pruned_loss=0.2323, over 13247.00 frames.
], tot_loss[loss=0.4026, simple_loss=0.3827, pruned_loss=0.2112, over 2581228.54 frames. ], batch size: 43, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 20:57:39,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=60683.333333333336, ans=0.035 2024-06-19 20:57:41,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=60683.333333333336, ans=0.125 2024-06-19 20:57:42,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.00 vs. limit=15.0 2024-06-19 20:57:42,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=60683.333333333336, ans=0.125 2024-06-19 20:57:56,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.05 vs. limit=22.5 2024-06-19 20:58:04,142 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.049e+03 1.709e+03 1.960e+03 2.292e+03 4.889e+03, threshold=3.920e+03, percent-clipped=2.0 2024-06-19 20:58:16,851 INFO [train.py:1028] (1/2) Epoch 4, batch 2800, loss[loss=0.4226, simple_loss=0.3852, pruned_loss=0.23, over 10848.00 frames. ], tot_loss[loss=0.403, simple_loss=0.3824, pruned_loss=0.2117, over 2579708.48 frames. ], batch size: 303, lr: 1.43e-02, grad_scale: 2.0 2024-06-19 20:58:27,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=60793.333333333336, ans=0.125 2024-06-19 20:58:28,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=60793.333333333336, ans=0.2 2024-06-19 20:58:33,308 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.08 vs. limit=10.0 2024-06-19 20:58:35,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=60830.0, ans=0.125 2024-06-19 20:58:37,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=60830.0, ans=0.125 2024-06-19 20:58:44,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.70 vs. limit=22.5 2024-06-19 20:58:48,999 INFO [train.py:1028] (1/2) Epoch 4, batch 2850, loss[loss=0.4216, simple_loss=0.4049, pruned_loss=0.2191, over 13325.00 frames. ], tot_loss[loss=0.4026, simple_loss=0.3817, pruned_loss=0.2118, over 2577836.03 frames. ], batch size: 49, lr: 1.43e-02, grad_scale: 0.5 2024-06-19 20:59:09,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=60921.666666666664, ans=0.125 2024-06-19 20:59:10,287 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.331e+03 1.890e+03 2.221e+03 2.593e+03 5.081e+03, threshold=4.442e+03, percent-clipped=2.0 2024-06-19 20:59:11,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=60921.666666666664, ans=0.125 2024-06-19 20:59:21,005 INFO [train.py:1028] (1/2) Epoch 4, batch 2900, loss[loss=0.3672, simple_loss=0.3658, pruned_loss=0.1843, over 13149.00 frames. 
], tot_loss[loss=0.3986, simple_loss=0.3785, pruned_loss=0.2094, over 2586430.38 frames. ], batch size: 55, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 20:59:25,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2024-06-19 20:59:26,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=60958.333333333336, ans=0.125 2024-06-19 20:59:30,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=60976.666666666664, ans=0.2 2024-06-19 20:59:47,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=61013.333333333336, ans=0.125 2024-06-19 20:59:47,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=61013.333333333336, ans=0.2 2024-06-19 20:59:47,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=61013.333333333336, ans=0.2 2024-06-19 21:00:00,407 INFO [train.py:1028] (1/2) Epoch 4, batch 2950, loss[loss=0.4241, simple_loss=0.3947, pruned_loss=0.2267, over 13233.00 frames. ], tot_loss[loss=0.398, simple_loss=0.378, pruned_loss=0.209, over 2579957.96 frames. ], batch size: 43, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 21:00:04,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=61050.0, ans=0.2 2024-06-19 21:00:10,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=61068.333333333336, ans=0.025 2024-06-19 21:00:19,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=61086.666666666664, ans=0.125 2024-06-19 21:00:23,171 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.460e+03 2.287e+03 2.728e+03 3.410e+03 6.168e+03, threshold=5.455e+03, percent-clipped=6.0 2024-06-19 21:00:29,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=61123.333333333336, ans=0.125 2024-06-19 21:00:35,224 INFO [train.py:1028] (1/2) Epoch 4, batch 3000, loss[loss=0.3658, simple_loss=0.3585, pruned_loss=0.1865, over 13208.00 frames. ], tot_loss[loss=0.398, simple_loss=0.378, pruned_loss=0.209, over 2578691.42 frames. ], batch size: 59, lr: 1.43e-02, grad_scale: 2.0 2024-06-19 21:00:35,224 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 21:00:43,220 INFO [train.py:1060] (1/2) Epoch 4, validation: loss=0.2904, simple_loss=0.3277, pruned_loss=0.1266, over 351949.00 frames. 2024-06-19 21:00:43,220 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB 2024-06-19 21:01:02,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=61178.333333333336, ans=0.125 2024-06-19 21:01:02,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.53 vs. 
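limit=12.0

The "Computing validation loss" block above shows the periodic dev-set pass (here at batch 3000): training pauses, the full dev set (the same 351949 frames each time) is scored without gradients, and the validation loss is logged before training resumes. A hedged sketch of that frame-weighted aggregation; the batch keys and the model call are placeholders for the real collate format:

    import torch

    # Frame-weighted validation pass behind the "Computing validation loss"
    # lines. `model` is assumed to return a per-batch loss; the interface
    # is illustrative.
    def compute_validation_loss(model, dev_loader, device):
        was_training = model.training
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                feats = batch["features"].to(device)
                num_frames = batch["num_frames"]
                tot_loss += model(feats).item() * num_frames
                tot_frames += num_frames        # sums to 351949 here
        if was_training:
            model.train()
        return tot_loss / tot_frames            # e.g. 0.2904 at batch 3000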
2024-06-19 21:01:06,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=61196.666666666664, ans=0.0 2024-06-19 21:01:08,677 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.94 vs. limit=6.0 2024-06-19 21:01:12,141 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.56 vs. limit=15.0 2024-06-19 21:01:16,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=61233.333333333336, ans=0.125 2024-06-19 21:01:17,469 INFO [train.py:1028] (1/2) Epoch 4, batch 3050, loss[loss=0.3665, simple_loss=0.3613, pruned_loss=0.1859, over 13309.00 frames. ], tot_loss[loss=0.3987, simple_loss=0.3779, pruned_loss=0.2098, over 2578744.42 frames. ], batch size: 46, lr: 1.43e-02, grad_scale: 0.5 2024-06-19 21:01:18,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.04 vs. limit=22.5 2024-06-19 21:01:21,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.42 vs. limit=22.5 2024-06-19 21:01:30,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=61270.0, ans=0.125 2024-06-19 21:01:41,246 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.58 vs. limit=15.0 2024-06-19 21:01:44,136 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+03 2.692e+03 3.128e+03 3.646e+03 9.171e+03, threshold=6.257e+03, percent-clipped=5.0 2024-06-19 21:01:57,055 INFO [train.py:1028] (1/2) Epoch 4, batch 3100, loss[loss=0.3862, simple_loss=0.3623, pruned_loss=0.2051, over 13020.00 frames. ], tot_loss[loss=0.3947, simple_loss=0.3749, pruned_loss=0.2073, over 2578942.07 frames. ], batch size: 144, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 21:02:00,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=61325.0, ans=0.0 2024-06-19 21:02:01,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=61325.0, ans=0.2 2024-06-19 21:02:04,039 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.48 vs. limit=15.0 2024-06-19 21:02:04,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=61343.333333333336, ans=0.125 2024-06-19 21:02:07,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.55 vs.
limit=22.5 2024-06-19 21:02:27,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=61398.333333333336, ans=0.1 2024-06-19 21:02:27,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=61398.333333333336, ans=0.125 2024-06-19 21:02:29,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=61398.333333333336, ans=0.125 2024-06-19 21:02:31,167 INFO [train.py:1028] (1/2) Epoch 4, batch 3150, loss[loss=0.4049, simple_loss=0.3739, pruned_loss=0.218, over 12980.00 frames. ], tot_loss[loss=0.3917, simple_loss=0.3727, pruned_loss=0.2053, over 2580309.34 frames. ], batch size: 158, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 21:02:33,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=15.0 2024-06-19 21:02:33,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=61416.666666666664, ans=0.125 2024-06-19 21:02:54,592 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.337e+03 1.865e+03 2.412e+03 2.952e+03 6.529e+03, threshold=4.823e+03, percent-clipped=1.0 2024-06-19 21:03:01,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=61490.0, ans=0.025 2024-06-19 21:03:02,241 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.40 vs. limit=15.0 2024-06-19 21:03:04,389 INFO [train.py:1028] (1/2) Epoch 4, batch 3200, loss[loss=0.4016, simple_loss=0.3841, pruned_loss=0.2096, over 13103.00 frames. ], tot_loss[loss=0.3914, simple_loss=0.3723, pruned_loss=0.2053, over 2580994.44 frames. ], batch size: 55, lr: 1.42e-02, grad_scale: 2.0 2024-06-19 21:03:07,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=61508.333333333336, ans=0.125 2024-06-19 21:03:08,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61508.333333333336, ans=0.1 2024-06-19 21:03:26,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=61563.333333333336, ans=0.125 2024-06-19 21:03:34,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0 2024-06-19 21:03:37,330 INFO [train.py:1028] (1/2) Epoch 4, batch 3250, loss[loss=0.361, simple_loss=0.3513, pruned_loss=0.1853, over 13261.00 frames. ], tot_loss[loss=0.3913, simple_loss=0.3718, pruned_loss=0.2054, over 2584321.60 frames. 
], batch size: 72, lr: 1.42e-02, grad_scale: 0.5 2024-06-19 21:03:38,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=61600.0, ans=0.125 2024-06-19 21:03:45,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=61600.0, ans=0.2 2024-06-19 21:04:08,508 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+03 2.277e+03 2.851e+03 3.448e+03 1.123e+04, threshold=5.701e+03, percent-clipped=6.0 2024-06-19 21:04:15,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=61673.333333333336, ans=0.2 2024-06-19 21:04:17,679 INFO [train.py:1028] (1/2) Epoch 4, batch 3300, loss[loss=0.3951, simple_loss=0.3652, pruned_loss=0.2125, over 12709.00 frames. ], tot_loss[loss=0.3898, simple_loss=0.371, pruned_loss=0.2043, over 2581130.99 frames. ], batch size: 176, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:04:22,140 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.48 vs. limit=15.0 2024-06-19 21:04:22,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.37 vs. limit=15.0 2024-06-19 21:04:37,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=61746.666666666664, ans=0.1 2024-06-19 21:04:42,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2024-06-19 21:04:44,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=61765.0, ans=0.0 2024-06-19 21:04:50,375 INFO [train.py:1028] (1/2) Epoch 4, batch 3350, loss[loss=0.3889, simple_loss=0.3571, pruned_loss=0.2104, over 12889.00 frames. ], tot_loss[loss=0.3894, simple_loss=0.3701, pruned_loss=0.2043, over 2576282.65 frames. ], batch size: 158, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:04:52,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=61783.333333333336, ans=0.0 2024-06-19 21:04:54,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=61783.333333333336, ans=0.2 2024-06-19 21:04:55,901 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:04:56,741 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.81 vs. limit=15.0 2024-06-19 21:05:02,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.47 vs. 
limit=22.5 2024-06-19 21:05:04,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=61820.0, ans=0.125 2024-06-19 21:05:15,272 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.133e+03 2.106e+03 2.428e+03 2.907e+03 4.989e+03, threshold=4.857e+03, percent-clipped=0.0 2024-06-19 21:05:22,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61856.666666666664, ans=0.1 2024-06-19 21:05:24,022 INFO [train.py:1028] (1/2) Epoch 4, batch 3400, loss[loss=0.4024, simple_loss=0.3907, pruned_loss=0.2071, over 12398.00 frames. ], tot_loss[loss=0.3888, simple_loss=0.3694, pruned_loss=0.2041, over 2574272.03 frames. ], batch size: 22, lr: 1.42e-02, grad_scale: 2.0 2024-06-19 21:05:26,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=61875.0, ans=0.0 2024-06-19 21:05:27,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=61875.0, ans=0.2 2024-06-19 21:05:28,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.10 vs. limit=15.0 2024-06-19 21:05:47,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=61911.666666666664, ans=0.0 2024-06-19 21:05:52,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=61930.0, ans=0.0 2024-06-19 21:05:53,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=61930.0, ans=0.1 2024-06-19 21:06:03,711 INFO [train.py:1028] (1/2) Epoch 4, batch 3450, loss[loss=0.4307, simple_loss=0.396, pruned_loss=0.2327, over 12745.00 frames. ], tot_loss[loss=0.386, simple_loss=0.3676, pruned_loss=0.2022, over 2575197.27 frames. ], batch size: 176, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:06:04,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0 2024-06-19 21:06:08,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=61966.666666666664, ans=0.2 2024-06-19 21:06:10,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=61985.0, ans=0.125 2024-06-19 21:06:16,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=62003.333333333336, ans=0.125 2024-06-19 21:06:22,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=62003.333333333336, ans=0.0 2024-06-19 21:06:28,672 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+03 2.805e+03 3.415e+03 3.858e+03 6.822e+03, threshold=6.829e+03, percent-clipped=6.0 2024-06-19 21:06:36,590 INFO [train.py:1028] (1/2) Epoch 4, batch 3500, loss[loss=0.348, simple_loss=0.3445, pruned_loss=0.1758, over 12885.00 frames. ], tot_loss[loss=0.3855, simple_loss=0.3671, pruned_loss=0.202, over 2575010.31 frames. 
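], batch size: 33, lr: 1.42e-02, grad_scale: 1.0

Batch sizes in these records swing from 22 up to 303 while the frame counts per batch stay in a narrow ~11-13k band: batches are built to a duration budget rather than a fixed sentence count, so long utterances give small batches and short ones give large batches. The real sampling is done by lhotse's DynamicBucketingSampler; a toy version of just the budgeting idea:

    # Toy duration-constrained batching: accumulate utterances until a
    # frame budget is hit. Only the budgeting idea is shown; the real
    # component (lhotse's DynamicBucketingSampler) also shuffles and
    # buckets by length.
    def duration_batches(utterance_frames, max_frames=13000):
        batch = []
        for frames in utterance_frames:
            if batch and sum(batch) + frames > max_frames:
                yield batch
                batch = []
            batch.append(frames)
        if batch:
            yield batch

    lengths = [120] * 200 + [550] * 50      # short then long utterances
    for b in list(duration_batches(lengths))[:2]:
        print(f"batch size: {len(b)}, frames: {sum(b)}")
    # first batch: "batch size: 108, frames: 12960" -- short cuts pack densely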
2024-06-19 21:06:37,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=62058.333333333336, ans=0.125 2024-06-19 21:06:46,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=62076.666666666664, ans=0.0 2024-06-19 21:06:50,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=62095.0, ans=0.125 2024-06-19 21:06:58,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.50 vs. limit=15.0 2024-06-19 21:07:01,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=62113.333333333336, ans=0.125 2024-06-19 21:07:06,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=62131.666666666664, ans=0.125 2024-06-19 21:07:09,727 INFO [train.py:1028] (1/2) Epoch 4, batch 3550, loss[loss=0.3745, simple_loss=0.3516, pruned_loss=0.1987, over 13135.00 frames. ], tot_loss[loss=0.3848, simple_loss=0.3664, pruned_loss=0.2016, over 2575583.97 frames. ], batch size: 95, lr: 1.42e-02, grad_scale: 0.5 2024-06-19 21:07:16,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62168.333333333336, ans=0.1 2024-06-19 21:07:20,404 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.04 vs. limit=22.5 2024-06-19 21:07:39,947 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.676e+03 2.877e+03 3.345e+03 3.975e+03 8.231e+03, threshold=6.690e+03, percent-clipped=2.0 2024-06-19 21:07:42,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=62223.333333333336, ans=0.0 2024-06-19 21:07:48,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=62223.333333333336, ans=0.125 2024-06-19 21:07:49,556 INFO [train.py:1028] (1/2) Epoch 4, batch 3600, loss[loss=0.3773, simple_loss=0.3682, pruned_loss=0.1932, over 13323.00 frames. ], tot_loss[loss=0.3836, simple_loss=0.3653, pruned_loss=0.201, over 2579658.02 frames. ], batch size: 49, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:07:52,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=62241.666666666664, ans=0.035 2024-06-19 21:07:59,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.61 vs.
limit=6.0 2024-06-19 21:08:00,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=62260.0, ans=0.2 2024-06-19 21:08:04,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=62278.333333333336, ans=0.1 2024-06-19 21:08:18,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=62315.0, ans=0.125 2024-06-19 21:08:22,998 INFO [train.py:1028] (1/2) Epoch 4, batch 3650, loss[loss=0.4004, simple_loss=0.3736, pruned_loss=0.2136, over 13068.00 frames. ], tot_loss[loss=0.3848, simple_loss=0.3661, pruned_loss=0.2017, over 2578673.08 frames. ], batch size: 102, lr: 1.42e-02, grad_scale: 0.5 2024-06-19 21:08:26,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=62333.333333333336, ans=15.0 2024-06-19 21:08:26,568 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.58 vs. limit=10.0 2024-06-19 21:08:39,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.76 vs. limit=12.0 2024-06-19 21:08:50,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.53 vs. limit=15.0 2024-06-19 21:08:50,404 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.241e+03 3.186e+03 4.030e+03 5.180e+03 1.373e+04, threshold=8.061e+03, percent-clipped=10.0 2024-06-19 21:08:51,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=62406.666666666664, ans=0.125 2024-06-19 21:08:55,728 INFO [train.py:1028] (1/2) Epoch 4, batch 3700, loss[loss=0.3323, simple_loss=0.3294, pruned_loss=0.1675, over 13237.00 frames. ], tot_loss[loss=0.3836, simple_loss=0.3652, pruned_loss=0.2011, over 2584123.11 frames. ], batch size: 72, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:08:55,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=62425.0, ans=0.0 2024-06-19 21:08:56,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=62425.0, ans=0.125 2024-06-19 21:08:57,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.83 vs. limit=15.0 2024-06-19 21:09:05,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.32 vs. 
limit=10.0 2024-06-19 21:09:09,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=62461.666666666664, ans=0.2 2024-06-19 21:09:11,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=62461.666666666664, ans=0.125 2024-06-19 21:09:18,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=62480.0, ans=0.125 2024-06-19 21:09:20,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=62480.0, ans=0.125 2024-06-19 21:09:21,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.98 vs. limit=6.0 2024-06-19 21:09:25,089 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:09:28,050 INFO [train.py:1028] (1/2) Epoch 4, batch 3750, loss[loss=0.3992, simple_loss=0.3803, pruned_loss=0.209, over 12798.00 frames. ], tot_loss[loss=0.3809, simple_loss=0.3632, pruned_loss=0.1993, over 2586759.13 frames. ], batch size: 22, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:09:30,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=62516.666666666664, ans=0.125 2024-06-19 21:09:56,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=62571.666666666664, ans=0.0 2024-06-19 21:10:01,316 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.259e+03 2.529e+03 3.002e+03 3.700e+03 1.036e+04, threshold=6.004e+03, percent-clipped=1.0 2024-06-19 21:10:01,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=62590.0, ans=0.5 2024-06-19 21:10:06,491 INFO [train.py:1028] (1/2) Epoch 4, batch 3800, loss[loss=0.3727, simple_loss=0.359, pruned_loss=0.1932, over 13208.00 frames. ], tot_loss[loss=0.3814, simple_loss=0.3638, pruned_loss=0.1995, over 2585415.10 frames. ], batch size: 83, lr: 1.41e-02, grad_scale: 1.0 2024-06-19 21:10:06,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2024-06-19 21:10:07,258 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=2.393e+01 2024-06-19 21:10:07,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=62608.333333333336, ans=0.0 2024-06-19 21:10:11,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=62608.333333333336, ans=0.125 2024-06-19 21:10:17,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=62626.666666666664, ans=0.0 2024-06-19 21:10:17,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=62626.666666666664, ans=0.0 2024-06-19 21:10:26,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. 
limit=6.0 2024-06-19 21:10:26,591 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=2.489e+03 2024-06-19 21:10:36,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.02 vs. limit=15.0 2024-06-19 21:10:39,437 INFO [train.py:1028] (1/2) Epoch 4, batch 3850, loss[loss=0.3952, simple_loss=0.3668, pruned_loss=0.2118, over 13027.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.3629, pruned_loss=0.1985, over 2585360.67 frames. ], batch size: 144, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:10:40,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=62700.0, ans=0.0 2024-06-19 21:10:54,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=62736.666666666664, ans=0.125 2024-06-19 21:10:54,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=62736.666666666664, ans=0.0 2024-06-19 21:10:58,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=62755.0, ans=0.0 2024-06-19 21:11:01,121 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.07 vs. limit=22.5 2024-06-19 21:11:03,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.86 vs. limit=12.0 2024-06-19 21:11:07,029 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.328e+03 2.275e+03 2.699e+03 3.244e+03 1.376e+04, threshold=5.398e+03, percent-clipped=1.0 2024-06-19 21:11:07,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=62773.333333333336, ans=0.125 2024-06-19 21:11:11,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=62791.666666666664, ans=0.09899494936611666 2024-06-19 21:11:11,582 INFO [train.py:1028] (1/2) Epoch 4, batch 3900, loss[loss=0.3511, simple_loss=0.344, pruned_loss=0.1791, over 13172.00 frames. ], tot_loss[loss=0.3789, simple_loss=0.3623, pruned_loss=0.1978, over 2588165.12 frames. ], batch size: 83, lr: 1.41e-02, grad_scale: 1.0 2024-06-19 21:11:11,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.98 vs. limit=15.0 2024-06-19 21:11:24,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62828.333333333336, ans=0.1 2024-06-19 21:11:25,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.61 vs. limit=10.0 2024-06-19 21:11:37,226 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.97 vs. limit=15.0 2024-06-19 21:11:42,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.19 vs. limit=10.0 2024-06-19 21:11:44,611 INFO [train.py:1028] (1/2) Epoch 4, batch 3950, loss[loss=0.3598, simple_loss=0.3293, pruned_loss=0.1952, over 13117.00 frames. 
], tot_loss[loss=0.3758, simple_loss=0.3601, pruned_loss=0.1958, over 2589778.69 frames. ], batch size: 132, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:12:10,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=62920.0, ans=0.025 2024-06-19 21:12:21,926 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.046e+03 1.742e+03 2.139e+03 2.640e+03 7.983e+03, threshold=4.279e+03, percent-clipped=3.0 2024-06-19 21:12:25,850 INFO [train.py:1028] (1/2) Epoch 4, batch 4000, loss[loss=0.3853, simple_loss=0.3637, pruned_loss=0.2034, over 13276.00 frames. ], tot_loss[loss=0.3743, simple_loss=0.3587, pruned_loss=0.195, over 2584092.00 frames. ], batch size: 40, lr: 1.41e-02, grad_scale: 1.0 2024-06-19 21:12:30,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=62975.0, ans=0.125 2024-06-19 21:12:32,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=62993.333333333336, ans=0.05 2024-06-19 21:12:35,846 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.06 vs. limit=15.0 2024-06-19 21:12:40,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2024-06-19 21:12:42,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.69 vs. limit=15.0 2024-06-19 21:12:46,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=63030.0, ans=0.0 2024-06-19 21:12:49,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=63030.0, ans=0.025 2024-06-19 21:12:52,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=15.0 2024-06-19 21:12:53,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=63048.333333333336, ans=0.0 2024-06-19 21:12:54,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=63048.333333333336, ans=0.125 2024-06-19 21:12:59,734 INFO [train.py:1028] (1/2) Epoch 4, batch 4050, loss[loss=0.4452, simple_loss=0.3974, pruned_loss=0.2466, over 10981.00 frames. ], tot_loss[loss=0.377, simple_loss=0.3602, pruned_loss=0.1969, over 2581408.74 frames. ], batch size: 304, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:13:03,247 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.98 vs. limit=15.0 2024-06-19 21:13:05,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=63085.0, ans=0.0 2024-06-19 21:13:07,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.46 vs. 
limit=15.0 2024-06-19 21:13:13,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=63103.333333333336, ans=0.0 2024-06-19 21:13:18,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.89 vs. limit=22.5 2024-06-19 21:13:20,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=63121.666666666664, ans=0.0 2024-06-19 21:13:21,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=63121.666666666664, ans=0.0 2024-06-19 21:13:21,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.26 vs. limit=10.0 2024-06-19 21:13:28,665 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.729e+03 2.471e+03 3.094e+03 3.730e+03 1.174e+04, threshold=6.188e+03, percent-clipped=13.0 2024-06-19 21:13:32,083 INFO [train.py:1028] (1/2) Epoch 4, batch 4100, loss[loss=0.3553, simple_loss=0.3334, pruned_loss=0.1886, over 13019.00 frames. ], tot_loss[loss=0.3777, simple_loss=0.3605, pruned_loss=0.1974, over 2577546.03 frames. ], batch size: 102, lr: 1.41e-02, grad_scale: 1.0 2024-06-19 21:13:34,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=63158.333333333336, ans=0.1 2024-06-19 21:13:37,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=63158.333333333336, ans=0.2 2024-06-19 21:13:39,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=63176.666666666664, ans=0.125 2024-06-19 21:13:40,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=63176.666666666664, ans=0.125 2024-06-19 21:13:51,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=63195.0, ans=0.2 2024-06-19 21:13:56,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=63195.0, ans=0.0 2024-06-19 21:13:59,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2024-06-19 21:14:05,653 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.54 vs. limit=15.0 2024-06-19 21:14:08,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=63231.666666666664, ans=10.0 2024-06-19 21:14:08,238 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.25 vs. limit=12.0 2024-06-19 21:14:11,931 INFO [train.py:1028] (1/2) Epoch 4, batch 4150, loss[loss=0.3529, simple_loss=0.3496, pruned_loss=0.178, over 13105.00 frames. ], tot_loss[loss=0.3766, simple_loss=0.3596, pruned_loss=0.1968, over 2574617.56 frames. 
], batch size: 55, lr: 1.41e-02, grad_scale: 0.125 2024-06-19 21:14:12,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=63250.0, ans=0.125 2024-06-19 21:14:17,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.42 vs. limit=6.0 2024-06-19 21:14:23,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.69 vs. limit=15.0 2024-06-19 21:14:35,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=63305.0, ans=0.2 2024-06-19 21:14:38,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=63323.333333333336, ans=0.025 2024-06-19 21:14:39,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=63323.333333333336, ans=0.125 2024-06-19 21:14:44,219 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+03 3.638e+03 4.303e+03 5.195e+03 1.635e+04, threshold=8.606e+03, percent-clipped=14.0 2024-06-19 21:14:45,545 INFO [train.py:1028] (1/2) Epoch 4, batch 4200, loss[loss=0.3757, simple_loss=0.3573, pruned_loss=0.1971, over 13031.00 frames. ], tot_loss[loss=0.3748, simple_loss=0.3581, pruned_loss=0.1957, over 2578299.48 frames. ], batch size: 102, lr: 1.40e-02, grad_scale: 0.25 2024-06-19 21:14:45,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=63341.666666666664, ans=0.04949747468305833 2024-06-19 21:14:46,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=63341.666666666664, ans=0.0 2024-06-19 21:14:59,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=63378.333333333336, ans=0.125 2024-06-19 21:15:02,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=63378.333333333336, ans=0.125 2024-06-19 21:15:05,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.76 vs. limit=15.0 2024-06-19 21:15:06,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=63396.666666666664, ans=0.125 2024-06-19 21:15:07,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=63396.666666666664, ans=0.125 2024-06-19 21:15:12,616 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.37 vs. limit=15.0 2024-06-19 21:15:14,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=63415.0, ans=0.0 2024-06-19 21:15:17,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.58 vs. limit=15.0 2024-06-19 21:15:17,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.71 vs. 
limit=15.0 2024-06-19 21:15:18,098 INFO [train.py:1028] (1/2) Epoch 4, batch 4250, loss[loss=0.3569, simple_loss=0.3515, pruned_loss=0.1812, over 13329.00 frames. ], tot_loss[loss=0.3748, simple_loss=0.358, pruned_loss=0.1958, over 2581334.73 frames. ], batch size: 46, lr: 1.40e-02, grad_scale: 0.25 2024-06-19 21:15:19,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.50 vs. limit=22.5 2024-06-19 21:15:23,687 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:15:30,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.24 vs. limit=15.0 2024-06-19 21:15:33,258 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.067e+01 2024-06-19 21:15:35,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.83 vs. limit=22.5 2024-06-19 21:15:35,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=63470.0, ans=0.125 2024-06-19 21:15:36,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.05 vs. limit=22.5 2024-06-19 21:15:46,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=63506.666666666664, ans=0.125 2024-06-19 21:15:47,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=63506.666666666664, ans=0.125 2024-06-19 21:15:48,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=63506.666666666664, ans=0.125 2024-06-19 21:15:49,968 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.261e+03 3.933e+03 4.476e+03 5.321e+03 1.337e+04, threshold=8.953e+03, percent-clipped=6.0 2024-06-19 21:15:51,299 INFO [train.py:1028] (1/2) Epoch 4, batch 4300, loss[loss=0.3513, simple_loss=0.3439, pruned_loss=0.1793, over 13253.00 frames. ], tot_loss[loss=0.3742, simple_loss=0.3576, pruned_loss=0.1954, over 2580650.40 frames. ], batch size: 59, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:16:11,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.94 vs. limit=22.5 2024-06-19 21:16:22,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=63580.0, ans=0.1 2024-06-19 21:16:31,463 INFO [train.py:1028] (1/2) Epoch 4, batch 4350, loss[loss=0.374, simple_loss=0.3672, pruned_loss=0.1904, over 13204.00 frames. ], tot_loss[loss=0.3723, simple_loss=0.3561, pruned_loss=0.1942, over 2585392.56 frames. 
], batch size: 59, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:16:41,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=63635.0, ans=0.125 2024-06-19 21:16:45,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=63653.333333333336, ans=0.125 2024-06-19 21:16:48,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.00 vs. limit=15.0 2024-06-19 21:16:51,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.49 vs. limit=22.5 2024-06-19 21:17:02,649 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.563e+03 2.710e+03 3.104e+03 3.775e+03 7.105e+03, threshold=6.208e+03, percent-clipped=0.0 2024-06-19 21:17:03,984 INFO [train.py:1028] (1/2) Epoch 4, batch 4400, loss[loss=0.3618, simple_loss=0.3512, pruned_loss=0.1862, over 13259.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.3553, pruned_loss=0.1933, over 2587027.67 frames. ], batch size: 83, lr: 1.40e-02, grad_scale: 1.0 2024-06-19 21:17:05,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.77 vs. limit=22.5 2024-06-19 21:17:05,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=63708.333333333336, ans=0.125 2024-06-19 21:17:09,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=15.0 2024-06-19 21:17:12,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63726.666666666664, ans=0.1 2024-06-19 21:17:22,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=63745.0, ans=0.125 2024-06-19 21:17:22,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.02 vs. limit=15.0 2024-06-19 21:17:25,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63763.333333333336, ans=0.1 2024-06-19 21:17:37,729 INFO [train.py:1028] (1/2) Epoch 4, batch 4450, loss[loss=0.3427, simple_loss=0.3374, pruned_loss=0.174, over 12856.00 frames. ], tot_loss[loss=0.3727, simple_loss=0.3564, pruned_loss=0.1945, over 2582593.17 frames. ], batch size: 33, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:17:37,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=63800.0, ans=0.125 2024-06-19 21:17:39,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=63800.0, ans=0.125 2024-06-19 21:17:45,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=63818.333333333336, ans=0.125 2024-06-19 21:17:48,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.86 vs. 
limit=6.0 2024-06-19 21:18:14,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=63873.333333333336, ans=0.025 2024-06-19 21:18:16,487 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.330e+03 2.603e+03 3.054e+03 3.614e+03 9.678e+03, threshold=6.108e+03, percent-clipped=2.0 2024-06-19 21:18:17,118 INFO [train.py:1028] (1/2) Epoch 4, batch 4500, loss[loss=0.3551, simple_loss=0.3444, pruned_loss=0.1829, over 13264.00 frames. ], tot_loss[loss=0.3715, simple_loss=0.3558, pruned_loss=0.1936, over 2586132.33 frames. ], batch size: 89, lr: 1.40e-02, grad_scale: 1.0 2024-06-19 21:18:31,385 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.71 vs. limit=22.5 2024-06-19 21:18:38,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=63946.666666666664, ans=0.125 2024-06-19 21:18:38,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=63946.666666666664, ans=0.125 2024-06-19 21:18:42,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.33 vs. limit=15.0 2024-06-19 21:18:45,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.25 vs. limit=15.0 2024-06-19 21:18:49,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63983.333333333336, ans=0.1 2024-06-19 21:18:49,709 INFO [train.py:1028] (1/2) Epoch 4, batch 4550, loss[loss=0.3321, simple_loss=0.3389, pruned_loss=0.1626, over 13250.00 frames. ], tot_loss[loss=0.371, simple_loss=0.3552, pruned_loss=0.1933, over 2589649.57 frames. ], batch size: 52, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:18:54,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=28.32 vs. limit=22.5 2024-06-19 21:18:55,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=64001.666666666664, ans=0.125 2024-06-19 21:18:57,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64001.666666666664, ans=0.1 2024-06-19 21:18:58,788 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.16 vs. limit=10.0 2024-06-19 21:18:59,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=64001.666666666664, ans=0.125 2024-06-19 21:19:08,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=64038.333333333336, ans=0.0 2024-06-19 21:19:22,403 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+03 2.867e+03 3.374e+03 3.870e+03 6.926e+03, threshold=6.747e+03, percent-clipped=1.0 2024-06-19 21:19:22,432 INFO [train.py:1028] (1/2) Epoch 4, batch 4600, loss[loss=0.4271, simple_loss=0.3892, pruned_loss=0.2325, over 12564.00 frames. ], tot_loss[loss=0.3724, simple_loss=0.3565, pruned_loss=0.1941, over 2584424.53 frames. 
], batch size: 202, lr: 1.40e-02, grad_scale: 1.0 2024-06-19 21:19:22,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=64075.0, ans=0.125 2024-06-19 21:19:28,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=64075.0, ans=0.5 2024-06-19 21:19:30,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64093.333333333336, ans=0.1 2024-06-19 21:19:31,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.89 vs. limit=6.0 2024-06-19 21:19:36,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=64111.666666666664, ans=0.125 2024-06-19 21:19:44,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=64130.0, ans=0.125 2024-06-19 21:20:03,706 INFO [train.py:1028] (1/2) Epoch 4, batch 4650, loss[loss=0.3462, simple_loss=0.3333, pruned_loss=0.1795, over 13157.00 frames. ], tot_loss[loss=0.37, simple_loss=0.3548, pruned_loss=0.1926, over 2587604.82 frames. ], batch size: 132, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:20:07,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=64166.666666666664, ans=0.0 2024-06-19 21:20:07,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=64166.666666666664, ans=0.125 2024-06-19 21:20:11,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=64185.0, ans=0.1 2024-06-19 21:20:14,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=64185.0, ans=0.125 2024-06-19 21:20:30,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=64240.0, ans=10.0 2024-06-19 21:20:30,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=64240.0, ans=0.0 2024-06-19 21:20:34,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.94 vs. limit=22.5 2024-06-19 21:20:37,060 INFO [train.py:1028] (1/2) Epoch 4, batch 4700, loss[loss=0.3289, simple_loss=0.3332, pruned_loss=0.1623, over 12453.00 frames. ], tot_loss[loss=0.37, simple_loss=0.3547, pruned_loss=0.1927, over 2582784.28 frames. ], batch size: 25, lr: 1.40e-02, grad_scale: 1.0 2024-06-19 21:20:37,690 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+03 2.629e+03 3.060e+03 3.626e+03 6.484e+03, threshold=6.121e+03, percent-clipped=0.0 2024-06-19 21:20:43,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.26 vs. 
limit=15.0 2024-06-19 21:21:07,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=64331.666666666664, ans=0.0 2024-06-19 21:21:09,694 INFO [train.py:1028] (1/2) Epoch 4, batch 4750, loss[loss=0.4006, simple_loss=0.3653, pruned_loss=0.218, over 12555.00 frames. ], tot_loss[loss=0.3695, simple_loss=0.3542, pruned_loss=0.1924, over 2579963.26 frames. ], batch size: 202, lr: 1.39e-02, grad_scale: 1.0 2024-06-19 21:21:09,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=64350.0, ans=0.125 2024-06-19 21:21:15,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=64350.0, ans=0.2 2024-06-19 21:21:43,756 INFO [train.py:1028] (1/2) Epoch 4, batch 4800, loss[loss=0.3763, simple_loss=0.3579, pruned_loss=0.1974, over 13246.00 frames. ], tot_loss[loss=0.3689, simple_loss=0.3538, pruned_loss=0.192, over 2576035.20 frames. ], batch size: 63, lr: 1.39e-02, grad_scale: 2.0 2024-06-19 21:21:44,350 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+03 2.492e+03 2.950e+03 3.618e+03 4.767e+03, threshold=5.900e+03, percent-clipped=0.0 2024-06-19 21:21:55,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=64460.0, ans=0.125 2024-06-19 21:22:08,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=64478.333333333336, ans=0.125 2024-06-19 21:22:13,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=64496.666666666664, ans=0.125 2024-06-19 21:22:15,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=64515.0, ans=0.125 2024-06-19 21:22:17,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64515.0, ans=0.1 2024-06-19 21:22:18,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=64515.0, ans=0.125 2024-06-19 21:22:23,127 INFO [train.py:1028] (1/2) Epoch 4, batch 4850, loss[loss=0.38, simple_loss=0.3576, pruned_loss=0.2012, over 13263.00 frames. ], tot_loss[loss=0.3688, simple_loss=0.3536, pruned_loss=0.192, over 2574121.09 frames. ], batch size: 89, lr: 1.39e-02, grad_scale: 1.0 2024-06-19 21:22:25,771 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.70 vs. limit=15.0 2024-06-19 21:22:34,003 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.08 vs. limit=22.5 2024-06-19 21:22:37,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=64570.0, ans=0.125 2024-06-19 21:22:38,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=64570.0, ans=0.0 2024-06-19 21:22:58,034 INFO [train.py:1028] (1/2) Epoch 4, batch 4900, loss[loss=0.4057, simple_loss=0.3875, pruned_loss=0.2119, over 13193.00 frames. ], tot_loss[loss=0.3697, simple_loss=0.3541, pruned_loss=0.1926, over 2574787.99 frames. 
], batch size: 59, lr: 1.39e-02, grad_scale: 2.0 2024-06-19 21:22:58,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.25 vs. limit=22.5 2024-06-19 21:22:59,421 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.825e+03 2.923e+03 3.236e+03 3.747e+03 8.319e+03, threshold=6.472e+03, percent-clipped=2.0 2024-06-19 21:23:02,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=64625.0, ans=0.0 2024-06-19 21:23:03,717 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.00 vs. limit=10.0 2024-06-19 21:23:21,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=64680.0, ans=0.125 2024-06-19 21:23:21,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=64680.0, ans=0.125 2024-06-19 21:23:25,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=64698.333333333336, ans=10.0 2024-06-19 21:23:31,136 INFO [train.py:1028] (1/2) Epoch 4, batch 4950, loss[loss=0.4188, simple_loss=0.3726, pruned_loss=0.2325, over 11012.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.3542, pruned_loss=0.1932, over 2569757.79 frames. ], batch size: 303, lr: 1.39e-02, grad_scale: 0.25 2024-06-19 21:23:36,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=64716.666666666664, ans=0.2 2024-06-19 21:23:37,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=64735.0, ans=0.125 2024-06-19 21:23:57,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.10 vs. limit=15.0 2024-06-19 21:24:10,908 INFO [train.py:1028] (1/2) Epoch 4, batch 5000, loss[loss=0.3745, simple_loss=0.3523, pruned_loss=0.1983, over 13126.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.354, pruned_loss=0.1931, over 2574235.60 frames. ], batch size: 95, lr: 1.39e-02, grad_scale: 0.5 2024-06-19 21:24:14,185 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+03 3.493e+03 4.281e+03 5.212e+03 1.088e+04, threshold=8.563e+03, percent-clipped=11.0 2024-06-19 21:24:16,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=64808.333333333336, ans=0.125 2024-06-19 21:24:32,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64863.333333333336, ans=0.1 2024-06-19 21:24:36,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=64863.333333333336, ans=0.2 2024-06-19 21:24:45,344 INFO [train.py:1028] (1/2) Epoch 4, batch 5050, loss[loss=0.3545, simple_loss=0.3558, pruned_loss=0.1766, over 12872.00 frames. ], tot_loss[loss=0.3691, simple_loss=0.3536, pruned_loss=0.1923, over 2571713.90 frames. 
], batch size: 36, lr: 1.39e-02, grad_scale: 0.5 2024-06-19 21:24:49,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=12.0 2024-06-19 21:24:53,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=64918.333333333336, ans=0.0 2024-06-19 21:24:57,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=64918.333333333336, ans=0.125 2024-06-19 21:25:02,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=64936.666666666664, ans=0.025 2024-06-19 21:25:07,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=64955.0, ans=0.125 2024-06-19 21:25:18,774 INFO [train.py:1028] (1/2) Epoch 4, batch 5100, loss[loss=0.3297, simple_loss=0.3344, pruned_loss=0.1624, over 13003.00 frames. ], tot_loss[loss=0.37, simple_loss=0.3539, pruned_loss=0.193, over 2567054.64 frames. ], batch size: 39, lr: 1.39e-02, grad_scale: 1.0 2024-06-19 21:25:22,137 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+03 2.934e+03 3.565e+03 4.127e+03 1.136e+04, threshold=7.129e+03, percent-clipped=2.0 2024-06-19 21:25:22,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.74 vs. limit=15.0 2024-06-19 21:25:33,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=65028.333333333336, ans=0.125 2024-06-19 21:25:39,418 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.04 vs. limit=15.0 2024-06-19 21:25:39,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=65046.666666666664, ans=0.125 2024-06-19 21:25:59,040 INFO [train.py:1028] (1/2) Epoch 4, batch 5150, loss[loss=0.3905, simple_loss=0.3541, pruned_loss=0.2134, over 13105.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.3542, pruned_loss=0.1938, over 2569492.22 frames. ], batch size: 132, lr: 1.39e-02, grad_scale: 0.25 2024-06-19 21:26:06,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=65101.666666666664, ans=0.025 2024-06-19 21:26:09,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65101.666666666664, ans=0.1 2024-06-19 21:26:22,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2024-06-19 21:26:32,451 INFO [train.py:1028] (1/2) Epoch 4, batch 5200, loss[loss=0.362, simple_loss=0.3447, pruned_loss=0.1897, over 13176.00 frames. ], tot_loss[loss=0.3724, simple_loss=0.3551, pruned_loss=0.1948, over 2574170.94 frames. 
], batch size: 95, lr: 1.39e-02, grad_scale: 0.5 2024-06-19 21:26:33,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=65175.0, ans=0.125 2024-06-19 21:26:37,113 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.071e+03 3.522e+03 4.311e+03 5.463e+03 1.297e+04, threshold=8.621e+03, percent-clipped=5.0 2024-06-19 21:26:40,748 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=5.946e-01 2024-06-19 21:26:40,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=65193.333333333336, ans=0.09899494936611666 2024-06-19 21:26:45,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=65211.666666666664, ans=0.07 2024-06-19 21:26:47,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65211.666666666664, ans=0.1 2024-06-19 21:26:47,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.90 vs. limit=15.0 2024-06-19 21:26:57,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.26 vs. limit=15.0 2024-06-19 21:26:58,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=65230.0, ans=0.0 2024-06-19 21:27:03,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=65248.333333333336, ans=0.0 2024-06-19 21:27:06,667 INFO [train.py:1028] (1/2) Epoch 4, batch 5250, loss[loss=0.3465, simple_loss=0.3472, pruned_loss=0.1729, over 13294.00 frames. ], tot_loss[loss=0.3707, simple_loss=0.354, pruned_loss=0.1938, over 2570618.48 frames. ], batch size: 52, lr: 1.38e-02, grad_scale: 0.5 2024-06-19 21:27:11,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=65266.666666666664, ans=0.0 2024-06-19 21:27:12,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=65285.0, ans=0.0 2024-06-19 21:27:18,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.44 vs. limit=15.0 2024-06-19 21:27:18,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=65285.0, ans=0.0 2024-06-19 21:27:18,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=65285.0, ans=0.0 2024-06-19 21:27:26,488 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.41 vs. limit=15.0 2024-06-19 21:27:28,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.66 vs. limit=15.0 2024-06-19 21:27:36,470 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.70 vs. 
limit=15.0 2024-06-19 21:27:40,008 INFO [train.py:1028] (1/2) Epoch 4, batch 5300, loss[loss=0.3653, simple_loss=0.3513, pruned_loss=0.1897, over 13061.00 frames. ], tot_loss[loss=0.3691, simple_loss=0.3532, pruned_loss=0.1925, over 2567316.64 frames. ], batch size: 144, lr: 1.38e-02, grad_scale: 1.0 2024-06-19 21:27:42,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=65358.333333333336, ans=0.125 2024-06-19 21:27:42,808 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=27.96 vs. limit=22.5 2024-06-19 21:27:44,468 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.307e+03 1.849e+03 2.406e+03 2.853e+03 6.653e+03, threshold=4.811e+03, percent-clipped=0.0 2024-06-19 21:28:10,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=65413.333333333336, ans=0.05 2024-06-19 21:28:11,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=65413.333333333336, ans=0.125 2024-06-19 21:28:19,599 INFO [train.py:1028] (1/2) Epoch 4, batch 5350, loss[loss=0.3884, simple_loss=0.3824, pruned_loss=0.1972, over 11341.00 frames. ], tot_loss[loss=0.3678, simple_loss=0.3523, pruned_loss=0.1917, over 2573297.56 frames. ], batch size: 16, lr: 1.38e-02, grad_scale: 1.0 2024-06-19 21:28:22,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.50 vs. limit=15.0 2024-06-19 21:28:27,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=65468.333333333336, ans=0.0 2024-06-19 21:28:28,080 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.70 vs. limit=22.5 2024-06-19 21:28:38,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=65505.0, ans=0.125 2024-06-19 21:28:42,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=65505.0, ans=0.125 2024-06-19 21:28:52,276 INFO [train.py:1028] (1/2) Epoch 4, batch 5400, loss[loss=0.431, simple_loss=0.3815, pruned_loss=0.2403, over 12213.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.3528, pruned_loss=0.1922, over 2566235.77 frames. ], batch size: 240, lr: 1.38e-02, grad_scale: 2.0 2024-06-19 21:28:56,915 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.388e+02 1.652e+03 2.033e+03 2.523e+03 6.863e+03, threshold=4.065e+03, percent-clipped=2.0 2024-06-19 21:29:10,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65578.33333333333, ans=0.1 2024-06-19 21:29:12,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=65596.66666666667, ans=0.125 2024-06-19 21:29:18,521 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.79 vs. 
limit=12.0 2024-06-19 21:29:21,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=65615.0, ans=0.125 2024-06-19 21:29:26,094 INFO [train.py:1028] (1/2) Epoch 4, batch 5450, loss[loss=0.3857, simple_loss=0.3697, pruned_loss=0.2009, over 12346.00 frames. ], tot_loss[loss=0.3673, simple_loss=0.3524, pruned_loss=0.1911, over 2570942.86 frames. ], batch size: 25, lr: 1.38e-02, grad_scale: 1.0 2024-06-19 21:29:34,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65651.66666666667, ans=0.1 2024-06-19 21:29:47,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=65670.0, ans=0.1 2024-06-19 21:29:56,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.50 vs. limit=15.0 2024-06-19 21:29:56,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=65688.33333333333, ans=0.125 2024-06-19 21:29:57,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65688.33333333333, ans=0.1 2024-06-19 21:29:59,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=65706.66666666667, ans=0.125 2024-06-19 21:30:06,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=65725.0, ans=0.025 2024-06-19 21:30:06,861 INFO [train.py:1028] (1/2) Epoch 4, batch 5500, loss[loss=0.4394, simple_loss=0.3921, pruned_loss=0.2433, over 12202.00 frames. ], tot_loss[loss=0.3662, simple_loss=0.3515, pruned_loss=0.1905, over 2563929.28 frames. ], batch size: 240, lr: 1.38e-02, grad_scale: 2.0 2024-06-19 21:30:12,244 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.189e+03 1.515e+03 1.855e+03 2.473e+03 4.819e+03, threshold=3.710e+03, percent-clipped=1.0 2024-06-19 21:30:24,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.96 vs. limit=6.0 2024-06-19 21:30:24,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=65761.66666666667, ans=0.025 2024-06-19 21:30:25,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.91 vs. limit=22.5 2024-06-19 21:30:27,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65780.0, ans=0.1 2024-06-19 21:30:31,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=65780.0, ans=0.2 2024-06-19 21:30:31,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.93 vs. limit=15.0 2024-06-19 21:30:41,464 INFO [train.py:1028] (1/2) Epoch 4, batch 5550, loss[loss=0.3395, simple_loss=0.3433, pruned_loss=0.1679, over 13331.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.3493, pruned_loss=0.1878, over 2568018.22 frames. 
], batch size: 43, lr: 1.38e-02, grad_scale: 2.0 2024-06-19 21:30:43,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=65816.66666666667, ans=0.125 2024-06-19 21:30:51,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=65835.0, ans=0.0 2024-06-19 21:30:58,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65853.33333333333, ans=0.1 2024-06-19 21:31:02,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=65871.66666666667, ans=0.0 2024-06-19 21:31:15,286 INFO [train.py:1028] (1/2) Epoch 4, batch 5600, loss[loss=0.351, simple_loss=0.3376, pruned_loss=0.1822, over 13226.00 frames. ], tot_loss[loss=0.3614, simple_loss=0.3487, pruned_loss=0.1871, over 2570763.06 frames. ], batch size: 89, lr: 1.38e-02, grad_scale: 4.0 2024-06-19 21:31:20,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=65908.33333333333, ans=0.0 2024-06-19 21:31:22,362 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.243e+03 1.951e+03 2.330e+03 2.751e+03 7.079e+03, threshold=4.660e+03, percent-clipped=6.0 2024-06-19 21:31:36,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0 2024-06-19 21:31:43,794 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:31:45,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65981.66666666667, ans=0.1 2024-06-19 21:31:58,614 INFO [train.py:1028] (1/2) Epoch 4, batch 5650, loss[loss=0.4279, simple_loss=0.3863, pruned_loss=0.2348, over 12553.00 frames. ], tot_loss[loss=0.3632, simple_loss=0.3499, pruned_loss=0.1882, over 2576370.13 frames. ], batch size: 202, lr: 1.38e-02, grad_scale: 0.5 2024-06-19 21:32:05,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.90 vs. limit=15.0 2024-06-19 21:32:08,490 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.43 vs. limit=15.0 2024-06-19 21:32:13,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=66018.33333333333, ans=0.04949747468305833 2024-06-19 21:32:35,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.59 vs. limit=6.0 2024-06-19 21:32:36,050 INFO [train.py:1028] (1/2) Epoch 4, batch 5700, loss[loss=0.3008, simple_loss=0.3094, pruned_loss=0.1461, over 13252.00 frames. ], tot_loss[loss=0.3617, simple_loss=0.3489, pruned_loss=0.1872, over 2580272.32 frames. 
2024-06-19 21:32:36,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=66091.66666666667, ans=0.125
2024-06-19 21:32:41,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=66091.66666666667, ans=0.125
2024-06-19 21:32:43,646 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+03 2.699e+03 3.054e+03 3.650e+03 9.422e+03, threshold=6.107e+03, percent-clipped=5.0
2024-06-19 21:32:50,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=66128.33333333333, ans=0.0
2024-06-19 21:32:52,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=66128.33333333333, ans=0.125
2024-06-19 21:32:58,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=66146.66666666667, ans=0.0
2024-06-19 21:33:03,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=66165.0, ans=0.0
2024-06-19 21:33:09,121 INFO [train.py:1028] (1/2) Epoch 4, batch 5750, loss[loss=0.4075, simple_loss=0.3755, pruned_loss=0.2198, over 12756.00 frames. ], tot_loss[loss=0.3632, simple_loss=0.3501, pruned_loss=0.1881, over 2580018.87 frames. ], batch size: 176, lr: 1.38e-02, grad_scale: 0.5
2024-06-19 21:33:09,376 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-19 21:33:15,088 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.78 vs. limit=15.0
2024-06-19 21:33:28,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=66220.0, ans=0.2
2024-06-19 21:33:28,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=66238.33333333333, ans=0.0
2024-06-19 21:33:42,668 INFO [train.py:1028] (1/2) Epoch 4, batch 5800, loss[loss=0.3823, simple_loss=0.3542, pruned_loss=0.2052, over 12787.00 frames. ], tot_loss[loss=0.3665, simple_loss=0.3523, pruned_loss=0.1903, over 2578467.33 frames. ], batch size: 176, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:33:47,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=66275.0, ans=0.2
2024-06-19 21:33:48,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=66275.0, ans=0.125
2024-06-19 21:33:54,410 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+03 2.716e+03 3.254e+03 3.972e+03 6.133e+03, threshold=6.508e+03, percent-clipped=1.0
2024-06-19 21:33:57,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=66293.33333333333, ans=0.2
2024-06-19 21:34:07,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.84 vs. limit=15.0
2024-06-19 21:34:12,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.40 vs. limit=22.5
2024-06-19 21:34:13,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=66330.0, ans=0.125
2024-06-19 21:34:21,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.90 vs. limit=12.0
2024-06-19 21:34:22,984 INFO [train.py:1028] (1/2) Epoch 4, batch 5850, loss[loss=0.4343, simple_loss=0.3902, pruned_loss=0.2392, over 12500.00 frames. ], tot_loss[loss=0.3704, simple_loss=0.3557, pruned_loss=0.1926, over 2577348.21 frames. ], batch size: 202, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:34:23,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=66366.66666666667, ans=0.2
2024-06-19 21:34:25,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66366.66666666667, ans=0.1
2024-06-19 21:34:28,867 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.89 vs. limit=15.0
2024-06-19 21:34:37,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=66403.33333333333, ans=0.125
2024-06-19 21:34:51,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66440.0, ans=0.1
2024-06-19 21:34:55,842 INFO [train.py:1028] (1/2) Epoch 4, batch 5900, loss[loss=0.3596, simple_loss=0.3379, pruned_loss=0.1907, over 13090.00 frames. ], tot_loss[loss=0.3747, simple_loss=0.359, pruned_loss=0.1952, over 2576901.44 frames. ], batch size: 121, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:34:58,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=66458.33333333333, ans=0.2
2024-06-19 21:35:01,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66458.33333333333, ans=0.1
2024-06-19 21:35:04,474 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+03 2.944e+03 3.366e+03 4.074e+03 8.277e+03, threshold=6.731e+03, percent-clipped=1.0
2024-06-19 21:35:08,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=66495.0, ans=0.125
2024-06-19 21:35:14,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66495.0, ans=0.1
2024-06-19 21:35:22,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=66531.66666666667, ans=0.0
2024-06-19 21:35:28,691 INFO [train.py:1028] (1/2) Epoch 4, batch 5950, loss[loss=0.3677, simple_loss=0.3557, pruned_loss=0.1898, over 13140.00 frames. ], tot_loss[loss=0.3762, simple_loss=0.3605, pruned_loss=0.196, over 2582097.07 frames. ], batch size: 121, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:35:29,211 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=15.0
2024-06-19 21:35:34,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0
2024-06-19 21:35:39,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=66568.33333333333, ans=0.125
2024-06-19 21:35:40,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=66568.33333333333, ans=0.125
2024-06-19 21:35:47,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.36 vs. limit=15.0
2024-06-19 21:35:48,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.43 vs. limit=10.0
2024-06-19 21:36:03,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=66623.33333333333, ans=0.125
2024-06-19 21:36:05,039 INFO [train.py:1028] (1/2) Epoch 4, batch 6000, loss[loss=0.4984, simple_loss=0.4367, pruned_loss=0.2801, over 12273.00 frames. ], tot_loss[loss=0.3793, simple_loss=0.3632, pruned_loss=0.1977, over 2575426.89 frames. ], batch size: 241, lr: 1.37e-02, grad_scale: 2.0
2024-06-19 21:36:05,039 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-19 21:36:14,132 INFO [train.py:1060] (1/2) Epoch 4, validation: loss=0.278, simple_loss=0.3202, pruned_loss=0.1178, over 351949.00 frames.
2024-06-19 21:36:14,132 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB
2024-06-19 21:36:16,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0
2024-06-19 21:36:23,406 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.266e+03 2.500e+03 3.044e+03 3.496e+03 9.206e+03, threshold=6.087e+03, percent-clipped=1.0
2024-06-19 21:36:25,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=15.0
2024-06-19 21:36:26,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=66660.0, ans=0.0
2024-06-19 21:36:27,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=66678.33333333333, ans=0.0
2024-06-19 21:36:41,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=66715.0, ans=0.125
2024-06-19 21:36:44,728 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.02 vs. limit=15.0
2024-06-19 21:36:46,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=66715.0, ans=0.025
2024-06-19 21:36:47,922 INFO [train.py:1028] (1/2) Epoch 4, batch 6050, loss[loss=0.3956, simple_loss=0.3754, pruned_loss=0.2079, over 12944.00 frames. ], tot_loss[loss=0.3805, simple_loss=0.3647, pruned_loss=0.1981, over 2578187.70 frames. ], batch size: 39, lr: 1.37e-02, grad_scale: 0.5
2024-06-19 21:36:49,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=66733.33333333333, ans=0.125
2024-06-19 21:36:51,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0
2024-06-19 21:36:53,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66733.33333333333, ans=0.1
2024-06-19 21:36:53,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=66733.33333333333, ans=6.0
2024-06-19 21:36:57,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.96 vs. limit=15.0
2024-06-19 21:37:17,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=66806.66666666667, ans=0.125
2024-06-19 21:37:21,166 INFO [train.py:1028] (1/2) Epoch 4, batch 6100, loss[loss=0.3623, simple_loss=0.3443, pruned_loss=0.1902, over 13142.00 frames. ], tot_loss[loss=0.3821, simple_loss=0.3663, pruned_loss=0.1989, over 2579945.22 frames. ], batch size: 121, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:37:24,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66825.0, ans=0.1
2024-06-19 21:37:31,458 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.769e+03 2.403e+03 3.010e+03 3.642e+03 7.150e+03, threshold=6.021e+03, percent-clipped=3.0
2024-06-19 21:37:32,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=15.0
2024-06-19 21:37:38,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=66861.66666666667, ans=15.0
2024-06-19 21:37:43,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=66880.0, ans=0.5
2024-06-19 21:37:48,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.73 vs. limit=22.5
2024-06-19 21:37:56,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=66898.33333333333, ans=0.0
2024-06-19 21:37:58,311 INFO [train.py:1028] (1/2) Epoch 4, batch 6150, loss[loss=0.4207, simple_loss=0.378, pruned_loss=0.2317, over 11015.00 frames. ], tot_loss[loss=0.3841, simple_loss=0.368, pruned_loss=0.2001, over 2577919.47 frames. ], batch size: 303, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:38:02,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=66916.66666666667, ans=0.0
2024-06-19 21:38:17,498 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.170e+02
2024-06-19 21:38:21,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.45 vs. limit=15.0
2024-06-19 21:38:29,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.01 vs. limit=15.0
2024-06-19 21:38:35,095 INFO [train.py:1028] (1/2) Epoch 4, batch 6200, loss[loss=0.4232, simple_loss=0.3993, pruned_loss=0.2236, over 13217.00 frames. ], tot_loss[loss=0.3877, simple_loss=0.371, pruned_loss=0.2021, over 2575790.83 frames. ], batch size: 89, lr: 1.37e-02, grad_scale: 2.0
2024-06-19 21:38:35,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.38 vs. limit=15.0
2024-06-19 21:38:43,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.24 vs. limit=22.5
2024-06-19 21:38:46,135 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.474e+03 2.565e+03 2.960e+03 3.317e+03 1.164e+04, threshold=5.920e+03, percent-clipped=3.0
2024-06-19 21:38:48,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=67045.0, ans=0.2
2024-06-19 21:38:59,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=67063.33333333333, ans=0.02
2024-06-19 21:39:04,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=67081.66666666667, ans=0.2
2024-06-19 21:39:08,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=67081.66666666667, ans=0.0
2024-06-19 21:39:09,596 INFO [train.py:1028] (1/2) Epoch 4, batch 6250, loss[loss=0.3688, simple_loss=0.3511, pruned_loss=0.1932, over 13186.00 frames. ], tot_loss[loss=0.3915, simple_loss=0.3738, pruned_loss=0.2046, over 2567885.68 frames. ], batch size: 83, lr: 1.37e-02, grad_scale: 0.5
2024-06-19 21:39:11,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=67100.0, ans=0.0
2024-06-19 21:39:20,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.85 vs. limit=22.5
2024-06-19 21:39:29,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=67155.0, ans=10.0
2024-06-19 21:39:30,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=67155.0, ans=0.125
2024-06-19 21:39:39,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=67173.33333333333, ans=0.125
2024-06-19 21:39:42,125 INFO [train.py:1028] (1/2) Epoch 4, batch 6300, loss[loss=0.3831, simple_loss=0.3748, pruned_loss=0.1957, over 11258.00 frames. ], tot_loss[loss=0.3939, simple_loss=0.3762, pruned_loss=0.2058, over 2562728.55 frames. ], batch size: 16, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:39:53,266 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=15.42 vs. limit=15.0
2024-06-19 21:39:57,489 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.533e+02 1.884e+03 2.214e+03 2.766e+03 4.435e+03, threshold=4.429e+03, percent-clipped=0.0
2024-06-19 21:40:02,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=67228.33333333333, ans=0.125
2024-06-19 21:40:09,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=67246.66666666667, ans=0.0
2024-06-19 21:40:10,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67246.66666666667, ans=0.1
2024-06-19 21:40:22,147 INFO [train.py:1028] (1/2) Epoch 4, batch 6350, loss[loss=0.4674, simple_loss=0.4212, pruned_loss=0.2568, over 12516.00 frames. ], tot_loss[loss=0.3933, simple_loss=0.3768, pruned_loss=0.2049, over 2572792.36 frames. ], batch size: 202, lr: 1.36e-02, grad_scale: 0.5
2024-06-19 21:40:23,336 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.90 vs. limit=15.0
2024-06-19 21:40:28,076 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.52 vs. limit=12.0
2024-06-19 21:40:32,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67301.66666666667, ans=0.1
2024-06-19 21:40:41,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.70 vs. limit=15.0
2024-06-19 21:40:44,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=67338.33333333333, ans=0.125
2024-06-19 21:40:55,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.01 vs. limit=15.0
2024-06-19 21:40:55,329 INFO [train.py:1028] (1/2) Epoch 4, batch 6400, loss[loss=0.3628, simple_loss=0.3618, pruned_loss=0.1819, over 13244.00 frames. ], tot_loss[loss=0.3958, simple_loss=0.3792, pruned_loss=0.2062, over 2574325.58 frames. ], batch size: 67, lr: 1.36e-02, grad_scale: 1.0
2024-06-19 21:40:56,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=12.0
2024-06-19 21:40:57,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=67375.0, ans=0.125
2024-06-19 21:40:58,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=67375.0, ans=0.025
2024-06-19 21:41:00,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=67375.0, ans=0.125
2024-06-19 21:41:02,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=67393.33333333333, ans=0.125
2024-06-19 21:41:03,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.64 vs. limit=15.0
2024-06-19 21:41:07,364 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.545e+02 1.445e+03 1.729e+03 2.045e+03 4.800e+03, threshold=3.458e+03, percent-clipped=1.0
2024-06-19 21:41:09,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=67411.66666666667, ans=0.1
2024-06-19 21:41:13,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.89 vs. limit=22.5
2024-06-19 21:41:24,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=67448.33333333333, ans=0.025
2024-06-19 21:41:27,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=67466.66666666667, ans=0.1
2024-06-19 21:41:28,511 INFO [train.py:1028] (1/2) Epoch 4, batch 6450, loss[loss=0.465, simple_loss=0.4195, pruned_loss=0.2553, over 12552.00 frames. ], tot_loss[loss=0.3974, simple_loss=0.3809, pruned_loss=0.207, over 2580244.96 frames. ], batch size: 202, lr: 1.36e-02, grad_scale: 1.0
2024-06-19 21:41:31,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=67466.66666666667, ans=0.2
2024-06-19 21:41:36,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=67485.0, ans=0.125
2024-06-19 21:41:42,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.06 vs. limit=15.0
2024-06-19 21:41:43,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=67503.33333333333, ans=0.95
2024-06-19 21:41:44,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=67503.33333333333, ans=0.0
2024-06-19 21:42:01,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=67540.0, ans=0.0
2024-06-19 21:42:03,783 INFO [train.py:1028] (1/2) Epoch 4, batch 6500, loss[loss=0.4672, simple_loss=0.4177, pruned_loss=0.2584, over 10933.00 frames. ], tot_loss[loss=0.3982, simple_loss=0.3823, pruned_loss=0.2071, over 2583401.98 frames. ], batch size: 304, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:42:05,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=67558.33333333333, ans=0.125
2024-06-19 21:42:06,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=67558.33333333333, ans=0.0
2024-06-19 21:42:07,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=67558.33333333333, ans=0.125
2024-06-19 21:42:12,110 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.39 vs. limit=22.5
2024-06-19 21:42:15,449 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.023e+03 1.514e+03 1.786e+03 2.259e+03 8.139e+03, threshold=3.572e+03, percent-clipped=2.0
2024-06-19 21:42:15,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=67576.66666666667, ans=10.0
2024-06-19 21:42:15,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=67576.66666666667, ans=0.125
2024-06-19 21:42:34,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.77 vs. limit=15.0
2024-06-19 21:42:40,128 INFO [train.py:1028] (1/2) Epoch 4, batch 6550, loss[loss=0.3438, simple_loss=0.3472, pruned_loss=0.1702, over 12621.00 frames. ], tot_loss[loss=0.3983, simple_loss=0.3829, pruned_loss=0.2069, over 2587757.36 frames. ], batch size: 22, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:42:40,635 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.17 vs. limit=10.0
2024-06-19 21:42:40,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67650.0, ans=0.1
2024-06-19 21:42:42,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=67650.0, ans=0.125
2024-06-19 21:42:44,780 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-19 21:42:50,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=67668.33333333333, ans=0.1
2024-06-19 21:42:54,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=67686.66666666667, ans=0.0
2024-06-19 21:43:00,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=67705.0, ans=0.025
2024-06-19 21:43:13,284 INFO [train.py:1028] (1/2) Epoch 4, batch 6600, loss[loss=0.3453, simple_loss=0.3462, pruned_loss=0.1722, over 13235.00 frames. ], tot_loss[loss=0.3975, simple_loss=0.3825, pruned_loss=0.2062, over 2589563.24 frames. ], batch size: 72, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:43:14,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=67741.66666666667, ans=0.125
2024-06-19 21:43:19,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.74 vs. limit=15.0
2024-06-19 21:43:22,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=67760.0, ans=0.1
2024-06-19 21:43:26,535 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.249e+03 1.975e+03 2.513e+03 3.181e+03 7.798e+03, threshold=5.026e+03, percent-clipped=18.0
2024-06-19 21:43:30,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=67778.33333333333, ans=0.2
2024-06-19 21:43:42,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=67815.0, ans=0.0
2024-06-19 21:43:42,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=67815.0, ans=0.2
2024-06-19 21:43:46,590 INFO [train.py:1028] (1/2) Epoch 4, batch 6650, loss[loss=0.4552, simple_loss=0.4227, pruned_loss=0.2438, over 12914.00 frames. ], tot_loss[loss=0.4008, simple_loss=0.3852, pruned_loss=0.2082, over 2583951.05 frames. ], batch size: 158, lr: 1.36e-02, grad_scale: 1.0
2024-06-19 21:43:49,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.53 vs. limit=22.5
2024-06-19 21:43:50,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.50 vs. limit=15.0
2024-06-19 21:43:55,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=67851.66666666667, ans=0.0
2024-06-19 21:43:59,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=67870.0, ans=0.2
2024-06-19 21:44:19,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=67906.66666666667, ans=0.125
2024-06-19 21:44:21,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=67906.66666666667, ans=10.0
2024-06-19 21:44:25,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.60 vs. limit=15.0
2024-06-19 21:44:26,850 INFO [train.py:1028] (1/2) Epoch 4, batch 6700, loss[loss=0.454, simple_loss=0.4164, pruned_loss=0.2458, over 12751.00 frames. ], tot_loss[loss=0.4014, simple_loss=0.3861, pruned_loss=0.2084, over 2583970.83 frames. ], batch size: 176, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:44:40,248 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.099e+03 1.874e+03 2.204e+03 2.515e+03 4.654e+03, threshold=4.407e+03, percent-clipped=0.0
2024-06-19 21:44:40,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=67961.66666666667, ans=0.0
2024-06-19 21:44:48,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0
2024-06-19 21:44:50,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=67980.0, ans=0.125
2024-06-19 21:44:51,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=67980.0, ans=0.07
2024-06-19 21:44:57,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=19.12 vs. limit=15.0
2024-06-19 21:44:59,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=68016.66666666667, ans=0.025
2024-06-19 21:45:00,104 INFO [train.py:1028] (1/2) Epoch 4, batch 6750, loss[loss=0.4772, simple_loss=0.4341, pruned_loss=0.2602, over 12290.00 frames. ], tot_loss[loss=0.4031, simple_loss=0.3873, pruned_loss=0.2094, over 2577963.54 frames. ], batch size: 241, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:45:01,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.75 vs. limit=15.0
2024-06-19 21:45:08,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=68035.0, ans=0.125
2024-06-19 21:45:09,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.10 vs. limit=15.0
2024-06-19 21:45:21,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.26 vs. limit=10.0
2024-06-19 21:45:26,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=68090.0, ans=10.0
2024-06-19 21:45:32,722 INFO [train.py:1028] (1/2) Epoch 4, batch 6800, loss[loss=0.3865, simple_loss=0.3832, pruned_loss=0.1949, over 13168.00 frames. ], tot_loss[loss=0.4029, simple_loss=0.3879, pruned_loss=0.2089, over 2580380.74 frames. ], batch size: 67, lr: 1.36e-02, grad_scale: 4.0
2024-06-19 21:45:38,164 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.53 vs. limit=22.5
2024-06-19 21:45:38,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68126.66666666667, ans=0.1
2024-06-19 21:45:44,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.58 vs. limit=15.0
2024-06-19 21:45:46,563 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.154e+03 1.846e+03 2.129e+03 2.592e+03 3.733e+03, threshold=4.258e+03, percent-clipped=0.0
2024-06-19 21:45:49,469 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0
2024-06-19 21:45:51,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=68163.33333333333, ans=0.1
2024-06-19 21:45:53,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=68163.33333333333, ans=0.125
2024-06-19 21:45:58,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.52 vs. limit=22.5
2024-06-19 21:46:09,005 INFO [train.py:1028] (1/2) Epoch 4, batch 6850, loss[loss=0.442, simple_loss=0.4221, pruned_loss=0.231, over 13294.00 frames. ], tot_loss[loss=0.4015, simple_loss=0.3873, pruned_loss=0.2079, over 2583951.13 frames. ], batch size: 63, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:46:14,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.14 vs. limit=22.5
2024-06-19 21:46:22,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68236.66666666667, ans=0.1
2024-06-19 21:46:41,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=68273.33333333333, ans=0.125
2024-06-19 21:46:44,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=68291.66666666667, ans=0.125
2024-06-19 21:46:44,930 INFO [train.py:1028] (1/2) Epoch 4, batch 6900, loss[loss=0.3931, simple_loss=0.3901, pruned_loss=0.1981, over 13302.00 frames. ], tot_loss[loss=0.4033, simple_loss=0.3888, pruned_loss=0.2089, over 2585219.53 frames. ], batch size: 49, lr: 1.35e-02, grad_scale: 4.0
2024-06-19 21:46:45,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=68291.66666666667, ans=0.125
2024-06-19 21:46:47,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=68291.66666666667, ans=0.0
2024-06-19 21:46:52,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=68310.0, ans=0.125
2024-06-19 21:46:56,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=68310.0, ans=0.125
2024-06-19 21:46:58,870 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.105e+03 1.822e+03 2.239e+03 2.712e+03 4.035e+03, threshold=4.478e+03, percent-clipped=0.0
2024-06-19 21:47:02,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=68328.33333333333, ans=0.125
2024-06-19 21:47:04,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=68346.66666666667, ans=0.125
2024-06-19 21:47:05,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=68346.66666666667, ans=0.125
2024-06-19 21:47:07,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.44 vs. limit=10.0
2024-06-19 21:47:09,019 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.40 vs. limit=15.0
2024-06-19 21:47:17,762 INFO [train.py:1028] (1/2) Epoch 4, batch 6950, loss[loss=0.311, simple_loss=0.315, pruned_loss=0.1535, over 11317.00 frames. ], tot_loss[loss=0.403, simple_loss=0.3889, pruned_loss=0.2085, over 2579069.42 frames. ], batch size: 16, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:47:34,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=68420.0, ans=0.5
2024-06-19 21:47:51,441 INFO [train.py:1028] (1/2) Epoch 4, batch 7000, loss[loss=0.4418, simple_loss=0.4143, pruned_loss=0.2346, over 12961.00 frames. ], tot_loss[loss=0.4014, simple_loss=0.3881, pruned_loss=0.2074, over 2575817.09 frames. ], batch size: 158, lr: 1.35e-02, grad_scale: 2.0
2024-06-19 21:47:51,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=68475.0, ans=0.125
2024-06-19 21:47:51,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.20 vs. limit=22.5
2024-06-19 21:47:52,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68475.0, ans=0.1
2024-06-19 21:47:57,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=68493.33333333333, ans=0.0
2024-06-19 21:48:07,797 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.135e+03 1.680e+03 2.027e+03 2.444e+03 5.504e+03, threshold=4.055e+03, percent-clipped=2.0
2024-06-19 21:48:25,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0
2024-06-19 21:48:27,357 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=22.28 vs. limit=15.0
2024-06-19 21:48:29,104 INFO [train.py:1028] (1/2) Epoch 4, batch 7050, loss[loss=0.4359, simple_loss=0.4099, pruned_loss=0.231, over 12790.00 frames. ], tot_loss[loss=0.4033, simple_loss=0.3902, pruned_loss=0.2082, over 2582008.01 frames. ], batch size: 177, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:48:29,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=68566.66666666667, ans=0.125
2024-06-19 21:48:52,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=68621.66666666667, ans=0.0
2024-06-19 21:48:52,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=68621.66666666667, ans=0.125
2024-06-19 21:49:05,183 INFO [train.py:1028] (1/2) Epoch 4, batch 7100, loss[loss=0.4431, simple_loss=0.4337, pruned_loss=0.2262, over 13194.00 frames. ], tot_loss[loss=0.4064, simple_loss=0.3923, pruned_loss=0.2102, over 2575771.60 frames. ], batch size: 112, lr: 1.35e-02, grad_scale: 2.0
2024-06-19 21:49:15,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.84 vs. limit=15.0
2024-06-19 21:49:21,471 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.039e+03 1.808e+03 2.162e+03 2.614e+03 7.204e+03, threshold=4.324e+03, percent-clipped=4.0
2024-06-19 21:49:28,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=68713.33333333333, ans=0.125
2024-06-19 21:49:30,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=68713.33333333333, ans=0.0
2024-06-19 21:49:38,151 INFO [train.py:1028] (1/2) Epoch 4, batch 7150, loss[loss=0.5128, simple_loss=0.4626, pruned_loss=0.2815, over 12508.00 frames. ], tot_loss[loss=0.4076, simple_loss=0.3933, pruned_loss=0.2109, over 2573973.68 frames. ], batch size: 202, lr: 1.35e-02, grad_scale: 0.5
2024-06-19 21:49:38,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=68750.0, ans=0.0
2024-06-19 21:49:46,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68768.33333333333, ans=0.1
2024-06-19 21:49:48,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=68768.33333333333, ans=0.125
2024-06-19 21:49:49,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=68768.33333333333, ans=0.2
2024-06-19 21:49:54,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=68786.66666666667, ans=0.09899494936611666
2024-06-19 21:49:57,745 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.91 vs. limit=22.5
2024-06-19 21:50:00,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=68805.0, ans=0.0
2024-06-19 21:50:00,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=68805.0, ans=0.125
2024-06-19 21:50:00,297 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.16 vs. limit=22.5
2024-06-19 21:50:10,519 INFO [train.py:1028] (1/2) Epoch 4, batch 7200, loss[loss=0.4465, simple_loss=0.4299, pruned_loss=0.2315, over 13209.00 frames. ], tot_loss[loss=0.4093, simple_loss=0.3952, pruned_loss=0.2117, over 2578209.31 frames. ], batch size: 112, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:50:10,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.80 vs. limit=22.5
2024-06-19 21:50:17,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=68841.66666666667, ans=0.2
2024-06-19 21:50:25,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=68860.0, ans=0.2
2024-06-19 21:50:30,563 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.046e+03 2.089e+03 2.454e+03 3.037e+03 5.608e+03, threshold=4.907e+03, percent-clipped=3.0
2024-06-19 21:50:37,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=68896.66666666667, ans=0.125
2024-06-19 21:50:38,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=68896.66666666667, ans=0.125
2024-06-19 21:50:40,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=68915.0, ans=0.125
2024-06-19 21:50:47,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=68915.0, ans=0.125
2024-06-19 21:50:50,101 INFO [train.py:1028] (1/2) Epoch 4, batch 7250, loss[loss=0.4022, simple_loss=0.3957, pruned_loss=0.2044, over 12893.00 frames. ], tot_loss[loss=0.4095, simple_loss=0.3956, pruned_loss=0.2117, over 2579462.72 frames. ], batch size: 36, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:50:50,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.44 vs. limit=15.0
2024-06-19 21:50:55,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.02 vs. limit=10.0
2024-06-19 21:50:59,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=68951.66666666667, ans=0.125
2024-06-19 21:51:07,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.23 vs. limit=22.5
2024-06-19 21:51:12,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=68988.33333333333, ans=0.125
2024-06-19 21:51:21,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.07 vs. limit=15.0
2024-06-19 21:51:22,276 INFO [train.py:1028] (1/2) Epoch 4, batch 7300, loss[loss=0.3771, simple_loss=0.3678, pruned_loss=0.1932, over 12976.00 frames. ], tot_loss[loss=0.4125, simple_loss=0.3978, pruned_loss=0.2136, over 2579294.45 frames. ], batch size: 36, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:51:31,737 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 21:51:38,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=69061.66666666667, ans=0.0
2024-06-19 21:51:40,154 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.500e+03 2.294e+03 2.660e+03 3.164e+03 5.742e+03, threshold=5.319e+03, percent-clipped=2.0
2024-06-19 21:51:41,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=69080.0, ans=0.0
2024-06-19 21:51:50,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=69098.33333333333, ans=0.0
2024-06-19 21:51:52,371 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=15.0
2024-06-19 21:51:53,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=69098.33333333333, ans=0.125
2024-06-19 21:51:55,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.69 vs. limit=15.0
2024-06-19 21:51:55,329 INFO [train.py:1028] (1/2) Epoch 4, batch 7350, loss[loss=0.4686, simple_loss=0.4446, pruned_loss=0.2463, over 13286.00 frames. ], tot_loss[loss=0.4128, simple_loss=0.3984, pruned_loss=0.2136, over 2581914.52 frames. ], batch size: 46, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:51:59,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69116.66666666667, ans=0.1
2024-06-19 21:52:12,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.99 vs. limit=15.0
2024-06-19 21:52:26,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=69190.0, ans=0.0
2024-06-19 21:52:29,954 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=12.0
2024-06-19 21:52:32,049 INFO [train.py:1028] (1/2) Epoch 4, batch 7400, loss[loss=0.4216, simple_loss=0.4145, pruned_loss=0.2144, over 13263.00 frames. ], tot_loss[loss=0.4114, simple_loss=0.3976, pruned_loss=0.2125, over 2586395.76 frames. ], batch size: 63, lr: 1.35e-02, grad_scale: 2.0
2024-06-19 21:52:36,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=69208.33333333333, ans=0.0
2024-06-19 21:52:45,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=69245.0, ans=0.125
2024-06-19 21:52:49,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=69245.0, ans=0.125
2024-06-19 21:52:53,287 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.663e+02 1.704e+03 2.009e+03 2.398e+03 3.760e+03, threshold=4.019e+03, percent-clipped=0.0
2024-06-19 21:52:57,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.94 vs. limit=6.0
2024-06-19 21:53:06,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=69281.66666666667, ans=0.0
2024-06-19 21:53:08,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=69300.0, ans=0.0
2024-06-19 21:53:09,101 INFO [train.py:1028] (1/2) Epoch 4, batch 7450, loss[loss=0.3721, simple_loss=0.3789, pruned_loss=0.1826, over 12662.00 frames. ], tot_loss[loss=0.4102, simple_loss=0.3977, pruned_loss=0.2114, over 2580954.83 frames. ], batch size: 29, lr: 1.35e-02, grad_scale: 2.0
2024-06-19 21:53:10,214 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=15.0
2024-06-19 21:53:10,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=69300.0, ans=0.0
2024-06-19 21:53:11,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=69300.0, ans=0.125
2024-06-19 21:53:20,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=69318.33333333333, ans=0.1
2024-06-19 21:53:21,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=69318.33333333333, ans=0.2
2024-06-19 21:53:25,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=69336.66666666667, ans=0.2
2024-06-19 21:53:29,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=69355.0, ans=0.0
2024-06-19 21:53:32,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.86 vs. limit=15.0
2024-06-19 21:53:42,139 INFO [train.py:1028] (1/2) Epoch 4, batch 7500, loss[loss=0.4289, simple_loss=0.3885, pruned_loss=0.2347, over 10634.00 frames. ], tot_loss[loss=0.4108, simple_loss=0.3984, pruned_loss=0.2117, over 2578800.71 frames. ], batch size: 303, lr: 1.34e-02, grad_scale: 4.0
2024-06-19 21:53:42,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=69391.66666666667, ans=0.2
2024-06-19 21:53:45,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=69391.66666666667, ans=0.0
2024-06-19 21:54:01,022 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.362e+02 1.565e+03 1.843e+03 2.306e+03 3.859e+03, threshold=3.686e+03, percent-clipped=0.0
2024-06-19 21:54:18,354 INFO [train.py:1028] (1/2) Epoch 4, batch 7550, loss[loss=0.4052, simple_loss=0.3851, pruned_loss=0.2127, over 12917.00 frames. ], tot_loss[loss=0.4126, simple_loss=0.3997, pruned_loss=0.2127, over 2577313.89 frames. ], batch size: 158, lr: 1.34e-02, grad_scale: 1.0
2024-06-19 21:54:36,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=69520.0, ans=0.125
2024-06-19 21:54:38,037 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.13 vs. limit=12.0
2024-06-19 21:54:42,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69538.33333333333, ans=0.1
2024-06-19 21:54:42,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=69538.33333333333, ans=0.0
2024-06-19 21:54:53,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=69556.66666666667, ans=0.0
2024-06-19 21:54:54,828 INFO [train.py:1028] (1/2) Epoch 4, batch 7600, loss[loss=0.3794, simple_loss=0.3717, pruned_loss=0.1935, over 13190.00 frames. ], tot_loss[loss=0.4119, simple_loss=0.3993, pruned_loss=0.2123, over 2576313.70 frames. ], batch size: 83, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:54:56,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=69575.0, ans=0.0
2024-06-19 21:55:04,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=69593.33333333333, ans=0.05
2024-06-19 21:55:06,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=69593.33333333333, ans=0.125
2024-06-19 21:55:06,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=69593.33333333333, ans=0.125
2024-06-19 21:55:14,541 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.917e+02 1.453e+03 1.737e+03 2.118e+03 4.377e+03, threshold=3.474e+03, percent-clipped=3.0
2024-06-19 21:55:18,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=69630.0, ans=0.0
2024-06-19 21:55:21,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=69648.33333333333, ans=15.0
2024-06-19 21:55:21,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.35 vs. limit=22.5
2024-06-19 21:55:23,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=69648.33333333333, ans=0.025
2024-06-19 21:55:27,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=69648.33333333333, ans=0.125
2024-06-19 21:55:28,327 INFO [train.py:1028] (1/2) Epoch 4, batch 7650, loss[loss=0.3906, simple_loss=0.3834, pruned_loss=0.1989, over 12839.00 frames. ], tot_loss[loss=0.411, simple_loss=0.3989, pruned_loss=0.2116, over 2572595.73 frames. ], batch size: 33, lr: 1.34e-02, grad_scale: 1.0
2024-06-19 21:55:32,191 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=22.21 vs. limit=15.0
2024-06-19 21:55:34,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=69685.0, ans=0.2
2024-06-19 21:55:35,300 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.87 vs. limit=15.0
2024-06-19 21:55:39,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=69685.0, ans=0.0
2024-06-19 21:55:54,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=69740.0, ans=0.2
2024-06-19 21:55:57,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=69740.0, ans=0.07
2024-06-19 21:55:58,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=69740.0, ans=0.125
2024-06-19 21:56:00,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.32 vs. limit=22.5
2024-06-19 21:56:01,646 INFO [train.py:1028] (1/2) Epoch 4, batch 7700, loss[loss=0.4058, simple_loss=0.411, pruned_loss=0.2003, over 13235.00 frames. ], tot_loss[loss=0.4108, simple_loss=0.399, pruned_loss=0.2113, over 2569099.77 frames. ], batch size: 63, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:56:02,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=69758.33333333333, ans=0.1
2024-06-19 21:56:06,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=69758.33333333333, ans=0.125
2024-06-19 21:56:11,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=69776.66666666667, ans=0.2
2024-06-19 21:56:17,410 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=15.0
2024-06-19 21:56:20,328 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.18 vs. limit=22.5
2024-06-19 21:56:23,842 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.436e+02 1.234e+03 1.419e+03 1.739e+03 3.552e+03, threshold=2.838e+03, percent-clipped=1.0
2024-06-19 21:56:28,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=69813.33333333333, ans=0.0
2024-06-19 21:56:29,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=69813.33333333333, ans=0.1
2024-06-19 21:56:36,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=69850.0, ans=0.0
2024-06-19 21:56:36,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.52 vs. limit=10.0
2024-06-19 21:56:36,562 INFO [train.py:1028] (1/2) Epoch 4, batch 7750, loss[loss=0.426, simple_loss=0.4146, pruned_loss=0.2187, over 13245.00 frames. ], tot_loss[loss=0.4119, simple_loss=0.3998, pruned_loss=0.212, over 2573862.13 frames. ], batch size: 72, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:56:36,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=69850.0, ans=0.1
2024-06-19 21:56:37,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.22 vs. limit=12.0
limit=12.0 2024-06-19 21:56:49,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=69868.33333333333, ans=0.125 2024-06-19 21:56:51,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=69868.33333333333, ans=0.0 2024-06-19 21:56:54,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.29 vs. limit=10.0 2024-06-19 21:57:12,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=69941.66666666667, ans=0.025 2024-06-19 21:57:13,329 INFO [train.py:1028] (1/2) Epoch 4, batch 7800, loss[loss=0.4177, simple_loss=0.4124, pruned_loss=0.2114, over 13163.00 frames. ], tot_loss[loss=0.411, simple_loss=0.3999, pruned_loss=0.211, over 2579114.25 frames. ], batch size: 95, lr: 1.34e-02, grad_scale: 4.0 2024-06-19 21:57:13,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=69941.66666666667, ans=0.125 2024-06-19 21:57:20,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=69960.0, ans=0.125 2024-06-19 21:57:21,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=69960.0, ans=0.0 2024-06-19 21:57:26,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=69978.33333333333, ans=0.1 2024-06-19 21:57:26,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=69978.33333333333, ans=15.0 2024-06-19 21:57:34,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=69996.66666666667, ans=0.04949747468305833 2024-06-19 21:57:34,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2024-06-19 21:57:34,981 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.683e+02 1.200e+03 1.462e+03 1.756e+03 3.966e+03, threshold=2.924e+03, percent-clipped=6.0 2024-06-19 21:57:39,357 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.49 vs. limit=15.0 2024-06-19 21:57:40,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70015.0, ans=0.0 2024-06-19 21:57:44,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=70015.0, ans=0.125 2024-06-19 21:57:46,588 INFO [train.py:1028] (1/2) Epoch 4, batch 7850, loss[loss=0.3714, simple_loss=0.3753, pruned_loss=0.1837, over 10911.00 frames. ], tot_loss[loss=0.4137, simple_loss=0.4017, pruned_loss=0.2128, over 2571759.83 frames. 
], batch size: 16, lr: 1.34e-02, grad_scale: 1.0 2024-06-19 21:57:48,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=70033.33333333333, ans=0.0 2024-06-19 21:57:50,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=70033.33333333333, ans=0.2 2024-06-19 21:58:01,024 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=7.299e-01 2024-06-19 21:58:01,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.92 vs. limit=22.5 2024-06-19 21:58:11,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.68 vs. limit=15.0 2024-06-19 21:58:24,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=70125.0, ans=0.0 2024-06-19 21:58:24,459 INFO [train.py:1028] (1/2) Epoch 4, batch 7900, loss[loss=0.412, simple_loss=0.4044, pruned_loss=0.2098, over 13188.00 frames. ], tot_loss[loss=0.4132, simple_loss=0.4016, pruned_loss=0.2125, over 2572516.08 frames. ], batch size: 77, lr: 1.34e-02, grad_scale: 2.0 2024-06-19 21:58:26,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=70125.0, ans=0.2 2024-06-19 21:58:27,268 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=5.238e+02 2024-06-19 21:58:28,825 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.99 vs. limit=22.5 2024-06-19 21:58:37,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70143.33333333333, ans=0.125 2024-06-19 21:58:38,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=70143.33333333333, ans=0.2 2024-06-19 21:58:48,958 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.483e+02 1.494e+03 1.721e+03 1.956e+03 3.878e+03, threshold=3.441e+03, percent-clipped=2.0 2024-06-19 21:58:56,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.79 vs. limit=15.0 2024-06-19 21:59:00,928 INFO [train.py:1028] (1/2) Epoch 4, batch 7950, loss[loss=0.4713, simple_loss=0.4313, pruned_loss=0.2556, over 10534.00 frames. ], tot_loss[loss=0.4138, simple_loss=0.4024, pruned_loss=0.2127, over 2576126.73 frames. ], batch size: 303, lr: 1.34e-02, grad_scale: 2.0 2024-06-19 21:59:01,481 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.33 vs. 
limit=15.0 2024-06-19 21:59:01,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=70216.66666666667, ans=0.125 2024-06-19 21:59:07,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=70235.0, ans=0.125 2024-06-19 21:59:11,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=70235.0, ans=0.2 2024-06-19 21:59:16,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.86 vs. limit=15.0 2024-06-19 21:59:34,556 INFO [train.py:1028] (1/2) Epoch 4, batch 8000, loss[loss=0.3747, simple_loss=0.3733, pruned_loss=0.1881, over 13027.00 frames. ], tot_loss[loss=0.4149, simple_loss=0.4033, pruned_loss=0.2132, over 2573925.38 frames. ], batch size: 30, lr: 1.34e-02, grad_scale: 2.0 2024-06-19 21:59:36,833 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.018e-03 2024-06-19 21:59:40,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.74 vs. limit=22.5 2024-06-19 21:59:40,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.62 vs. limit=22.5 2024-06-19 21:59:43,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=70326.66666666667, ans=0.125 2024-06-19 21:59:51,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=70345.0, ans=0.0 2024-06-19 21:59:56,530 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.115e+03 1.657e+03 1.928e+03 2.330e+03 3.894e+03, threshold=3.857e+03, percent-clipped=4.0 2024-06-19 22:00:02,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=70381.66666666667, ans=0.125 2024-06-19 22:00:10,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.44 vs. limit=15.0 2024-06-19 22:00:11,571 INFO [train.py:1028] (1/2) Epoch 4, batch 8050, loss[loss=0.4213, simple_loss=0.4159, pruned_loss=0.2134, over 13238.00 frames. ], tot_loss[loss=0.4127, simple_loss=0.4018, pruned_loss=0.2118, over 2574510.81 frames. ], batch size: 83, lr: 1.34e-02, grad_scale: 1.0 2024-06-19 22:00:11,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=70400.0, ans=0.0 2024-06-19 22:00:16,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=70400.0, ans=0.125 2024-06-19 22:00:23,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.55 vs. 
limit=10.0 2024-06-19 22:00:24,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70436.66666666667, ans=0.125 2024-06-19 22:00:43,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70473.33333333333, ans=0.125 2024-06-19 22:00:43,899 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.45 vs. limit=22.5 2024-06-19 22:00:45,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.31 vs. limit=15.0 2024-06-19 22:00:47,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=70473.33333333333, ans=0.0 2024-06-19 22:00:48,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=70473.33333333333, ans=0.025 2024-06-19 22:00:49,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=70473.33333333333, ans=0.125 2024-06-19 22:00:49,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70491.66666666667, ans=0.125 2024-06-19 22:00:50,380 INFO [train.py:1028] (1/2) Epoch 4, batch 8100, loss[loss=0.4273, simple_loss=0.4154, pruned_loss=0.2196, over 13129.00 frames. ], tot_loss[loss=0.4149, simple_loss=0.4034, pruned_loss=0.2132, over 2577968.60 frames. ], batch size: 112, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:01:02,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=70510.0, ans=0.125 2024-06-19 22:01:06,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.63 vs. limit=10.0 2024-06-19 22:01:08,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=70528.33333333333, ans=0.1 2024-06-19 22:01:09,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=70528.33333333333, ans=0.0 2024-06-19 22:01:15,264 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+03 2.350e+03 2.726e+03 3.270e+03 7.314e+03, threshold=5.451e+03, percent-clipped=9.0 2024-06-19 22:01:25,028 INFO [train.py:1028] (1/2) Epoch 4, batch 8150, loss[loss=0.4144, simple_loss=0.3986, pruned_loss=0.2151, over 13113.00 frames. ], tot_loss[loss=0.4137, simple_loss=0.4026, pruned_loss=0.2124, over 2580036.84 frames. 
], batch size: 121, lr: 1.33e-02, grad_scale: 0.5 2024-06-19 22:01:28,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=70583.33333333333, ans=0.125 2024-06-19 22:01:36,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70601.66666666667, ans=0.1 2024-06-19 22:01:37,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=70620.0, ans=0.125 2024-06-19 22:01:40,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=70620.0, ans=0.125 2024-06-19 22:01:41,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=70620.0, ans=0.0 2024-06-19 22:01:57,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=70675.0, ans=0.125 2024-06-19 22:01:57,828 INFO [train.py:1028] (1/2) Epoch 4, batch 8200, loss[loss=0.4208, simple_loss=0.4138, pruned_loss=0.2139, over 13205.00 frames. ], tot_loss[loss=0.4133, simple_loss=0.4028, pruned_loss=0.2119, over 2583707.22 frames. ], batch size: 112, lr: 1.33e-02, grad_scale: 1.0 2024-06-19 22:02:13,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=70693.33333333333, ans=0.125 2024-06-19 22:02:14,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=70693.33333333333, ans=0.0 2024-06-19 22:02:17,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=70711.66666666667, ans=0.125 2024-06-19 22:02:20,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.96 vs. limit=22.5 2024-06-19 22:02:27,587 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.046e+03 1.630e+03 1.953e+03 2.331e+03 4.319e+03, threshold=3.906e+03, percent-clipped=0.0 2024-06-19 22:02:28,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=70730.0, ans=0.125 2024-06-19 22:02:31,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=70748.33333333333, ans=0.125 2024-06-19 22:02:34,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=70748.33333333333, ans=0.125 2024-06-19 22:02:36,679 INFO [train.py:1028] (1/2) Epoch 4, batch 8250, loss[loss=0.3955, simple_loss=0.4011, pruned_loss=0.1949, over 13206.00 frames. ], tot_loss[loss=0.4136, simple_loss=0.4033, pruned_loss=0.212, over 2583650.68 frames. ], batch size: 52, lr: 1.33e-02, grad_scale: 1.0 2024-06-19 22:02:45,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=70766.66666666667, ans=0.0 2024-06-19 22:02:46,089 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.58 vs. 
limit=15.0 2024-06-19 22:02:48,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=70785.0, ans=0.125 2024-06-19 22:02:50,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=70785.0, ans=0.125 2024-06-19 22:02:55,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=70803.33333333333, ans=0.125 2024-06-19 22:03:00,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70821.66666666667, ans=0.1 2024-06-19 22:03:10,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=70840.0, ans=10.0 2024-06-19 22:03:12,768 INFO [train.py:1028] (1/2) Epoch 4, batch 8300, loss[loss=0.409, simple_loss=0.3986, pruned_loss=0.2098, over 13013.00 frames. ], tot_loss[loss=0.4122, simple_loss=0.402, pruned_loss=0.2112, over 2581557.63 frames. ], batch size: 102, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:03:14,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=70858.33333333333, ans=0.125 2024-06-19 22:03:31,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70913.33333333333, ans=0.125 2024-06-19 22:03:36,057 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.593e+02 1.268e+03 1.516e+03 1.853e+03 2.745e+03, threshold=3.032e+03, percent-clipped=0.0 2024-06-19 22:03:39,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2024-06-19 22:03:45,231 INFO [train.py:1028] (1/2) Epoch 4, batch 8350, loss[loss=0.4126, simple_loss=0.4047, pruned_loss=0.2102, over 13134.00 frames. ], tot_loss[loss=0.4105, simple_loss=0.4012, pruned_loss=0.2099, over 2581720.65 frames. ], batch size: 112, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:03:54,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=70968.33333333333, ans=0.0 2024-06-19 22:04:06,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=71005.0, ans=0.125 2024-06-19 22:04:15,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=71005.0, ans=0.2 2024-06-19 22:04:21,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.07 vs. limit=22.5 2024-06-19 22:04:23,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=71041.66666666667, ans=0.2 2024-06-19 22:04:23,646 INFO [train.py:1028] (1/2) Epoch 4, batch 8400, loss[loss=0.3918, simple_loss=0.3841, pruned_loss=0.1997, over 12934.00 frames. ], tot_loss[loss=0.4103, simple_loss=0.4011, pruned_loss=0.2098, over 2577262.62 frames. ], batch size: 39, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:04:31,292 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.66 vs. 
limit=15.0 2024-06-19 22:04:31,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=71060.0, ans=0.0 2024-06-19 22:04:35,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=71060.0, ans=0.125 2024-06-19 22:04:38,965 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.44 vs. limit=15.0 2024-06-19 22:04:39,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=71078.33333333333, ans=0.0 2024-06-19 22:04:46,813 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.46 vs. limit=6.0 2024-06-19 22:04:51,708 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.103e+02 1.430e+03 1.737e+03 2.245e+03 5.249e+03, threshold=3.474e+03, percent-clipped=3.0 2024-06-19 22:04:54,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.02 vs. limit=15.0 2024-06-19 22:04:59,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=71133.33333333333, ans=0.2 2024-06-19 22:04:59,500 INFO [train.py:1028] (1/2) Epoch 4, batch 8450, loss[loss=0.4382, simple_loss=0.4266, pruned_loss=0.2249, over 13168.00 frames. ], tot_loss[loss=0.4112, simple_loss=0.4022, pruned_loss=0.2101, over 2579433.07 frames. ], batch size: 112, lr: 1.33e-02, grad_scale: 1.0 2024-06-19 22:05:01,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71133.33333333333, ans=0.1 2024-06-19 22:05:04,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.08 vs. limit=15.0 2024-06-19 22:05:05,825 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.20 vs. limit=15.0 2024-06-19 22:05:06,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=71151.66666666667, ans=0.025 2024-06-19 22:05:08,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=71151.66666666667, ans=0.125 2024-06-19 22:05:13,604 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:05:20,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=71188.33333333333, ans=0.125 2024-06-19 22:05:26,557 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.19 vs. limit=15.0 2024-06-19 22:05:29,908 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.02 vs. limit=15.0 2024-06-19 22:05:32,828 INFO [train.py:1028] (1/2) Epoch 4, batch 8500, loss[loss=0.4593, simple_loss=0.4426, pruned_loss=0.238, over 12930.00 frames. 
], tot_loss[loss=0.4138, simple_loss=0.4045, pruned_loss=0.2116, over 2577771.06 frames. ], batch size: 30, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:05:35,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.54 vs. limit=15.0 2024-06-19 22:05:53,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=15.0 2024-06-19 22:05:53,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=71280.0, ans=0.0 2024-06-19 22:05:58,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=71280.0, ans=0.125 2024-06-19 22:05:58,522 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.587e+02 1.540e+03 1.992e+03 2.442e+03 6.048e+03, threshold=3.983e+03, percent-clipped=3.0 2024-06-19 22:06:02,093 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2024-06-19 22:06:03,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71298.33333333333, ans=0.1 2024-06-19 22:06:05,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=71298.33333333333, ans=0.125 2024-06-19 22:06:06,448 INFO [train.py:1028] (1/2) Epoch 4, batch 8550, loss[loss=0.4292, simple_loss=0.427, pruned_loss=0.2157, over 12537.00 frames. ], tot_loss[loss=0.4122, simple_loss=0.4032, pruned_loss=0.2106, over 2575148.87 frames. ], batch size: 22, lr: 1.33e-02, grad_scale: 1.0 2024-06-19 22:06:12,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.41 vs. limit=10.0 2024-06-19 22:06:13,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=71335.0, ans=0.125 2024-06-19 22:06:19,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=71335.0, ans=0.125 2024-06-19 22:06:25,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=71353.33333333333, ans=0.0 2024-06-19 22:06:32,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=71371.66666666667, ans=0.0 2024-06-19 22:06:36,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=71390.0, ans=0.2 2024-06-19 22:06:40,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=71390.0, ans=0.125 2024-06-19 22:06:40,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.75 vs. limit=10.0 2024-06-19 22:06:42,870 INFO [train.py:1028] (1/2) Epoch 4, batch 8600, loss[loss=0.4342, simple_loss=0.4145, pruned_loss=0.2269, over 13128.00 frames. ], tot_loss[loss=0.4134, simple_loss=0.404, pruned_loss=0.2114, over 2572581.23 frames. 
], batch size: 121, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:06:44,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71408.33333333333, ans=0.1 2024-06-19 22:06:45,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71408.33333333333, ans=0.1 2024-06-19 22:06:51,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=71426.66666666667, ans=0.125 2024-06-19 22:06:51,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=71426.66666666667, ans=0.125 2024-06-19 22:07:03,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=71445.0, ans=0.0 2024-06-19 22:07:05,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71463.33333333333, ans=0.1 2024-06-19 22:07:05,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=71463.33333333333, ans=0.0 2024-06-19 22:07:13,389 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.182e+03 2.353e+03 2.804e+03 3.263e+03 1.111e+04, threshold=5.609e+03, percent-clipped=5.0 2024-06-19 22:07:18,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=71481.66666666667, ans=0.07 2024-06-19 22:07:19,760 INFO [train.py:1028] (1/2) Epoch 4, batch 8650, loss[loss=0.3995, simple_loss=0.3841, pruned_loss=0.2075, over 13009.00 frames. ], tot_loss[loss=0.4142, simple_loss=0.4046, pruned_loss=0.212, over 2575995.36 frames. ], batch size: 102, lr: 1.33e-02, grad_scale: 0.5 2024-06-19 22:07:23,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=71500.0, ans=0.0 2024-06-19 22:07:25,629 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:07:26,557 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.19 vs. limit=22.5 2024-06-19 22:07:26,594 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.59 vs. limit=15.0 2024-06-19 22:07:31,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=71518.33333333333, ans=0.125 2024-06-19 22:07:32,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.10 vs. limit=6.0 2024-06-19 22:07:52,029 INFO [train.py:1028] (1/2) Epoch 4, batch 8700, loss[loss=0.4346, simple_loss=0.4366, pruned_loss=0.2163, over 13186.00 frames. ], tot_loss[loss=0.416, simple_loss=0.4056, pruned_loss=0.2132, over 2572433.82 frames. 
], batch size: 59, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:07:59,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=71610.0, ans=0.05 2024-06-19 22:08:06,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=71628.33333333333, ans=0.2 2024-06-19 22:08:15,986 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.53 vs. limit=15.0 2024-06-19 22:08:20,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=71646.66666666667, ans=0.125 2024-06-19 22:08:23,003 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.249e+03 1.970e+03 2.431e+03 2.908e+03 9.565e+03, threshold=4.861e+03, percent-clipped=5.0 2024-06-19 22:08:29,006 INFO [train.py:1028] (1/2) Epoch 4, batch 8750, loss[loss=0.4113, simple_loss=0.3975, pruned_loss=0.2126, over 13128.00 frames. ], tot_loss[loss=0.4155, simple_loss=0.4053, pruned_loss=0.2129, over 2568455.28 frames. ], batch size: 121, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:08:32,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.50 vs. limit=6.0 2024-06-19 22:08:41,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=71720.0, ans=0.125 2024-06-19 22:08:45,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=71720.0, ans=0.125 2024-06-19 22:09:04,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=71756.66666666667, ans=0.125 2024-06-19 22:09:05,314 INFO [train.py:1028] (1/2) Epoch 4, batch 8800, loss[loss=0.4104, simple_loss=0.4106, pruned_loss=0.2051, over 13295.00 frames. ], tot_loss[loss=0.4158, simple_loss=0.4057, pruned_loss=0.2129, over 2573875.36 frames. ], batch size: 72, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:09:06,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=71775.0, ans=0.0 2024-06-19 22:09:06,995 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.11 vs. limit=10.0 2024-06-19 22:09:11,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=71793.33333333333, ans=0.125 2024-06-19 22:09:13,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=71793.33333333333, ans=0.05 2024-06-19 22:09:25,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71830.0, ans=0.1 2024-06-19 22:09:28,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2024-06-19 22:09:30,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.94 vs. 
limit=15.0 2024-06-19 22:09:30,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.88 vs. limit=10.0 2024-06-19 22:09:33,009 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.055e+02 1.920e+03 2.207e+03 2.670e+03 3.964e+03, threshold=4.414e+03, percent-clipped=0.0 2024-06-19 22:09:39,286 INFO [train.py:1028] (1/2) Epoch 4, batch 8850, loss[loss=0.4587, simple_loss=0.4306, pruned_loss=0.2434, over 12567.00 frames. ], tot_loss[loss=0.4152, simple_loss=0.4051, pruned_loss=0.2127, over 2561580.70 frames. ], batch size: 202, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:09:43,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=71866.66666666667, ans=0.2 2024-06-19 22:09:44,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=71866.66666666667, ans=0.125 2024-06-19 22:09:51,976 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.69 vs. limit=10.0 2024-06-19 22:09:52,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=71903.33333333333, ans=0.025 2024-06-19 22:09:59,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=71921.66666666667, ans=0.0 2024-06-19 22:10:09,228 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.70 vs. limit=15.0 2024-06-19 22:10:09,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=71940.0, ans=0.125 2024-06-19 22:10:16,134 INFO [train.py:1028] (1/2) Epoch 4, batch 8900, loss[loss=0.4268, simple_loss=0.4116, pruned_loss=0.221, over 12845.00 frames. ], tot_loss[loss=0.4169, simple_loss=0.4064, pruned_loss=0.2137, over 2559466.26 frames. ], batch size: 33, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:10:16,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71958.33333333333, ans=0.1 2024-06-19 22:10:20,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=71958.33333333333, ans=0.0 2024-06-19 22:10:20,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=71958.33333333333, ans=0.0 2024-06-19 22:10:27,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=15.0 2024-06-19 22:10:29,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=71995.0, ans=0.2 2024-06-19 22:10:31,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=71995.0, ans=0.2 2024-06-19 22:10:32,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.17 vs. 
limit=22.5 2024-06-19 22:10:33,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=71995.0, ans=0.0 2024-06-19 22:10:49,419 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.200e+02 1.641e+03 2.088e+03 2.654e+03 1.266e+04, threshold=4.175e+03, percent-clipped=5.0 2024-06-19 22:10:52,064 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.92 vs. limit=10.0 2024-06-19 22:10:52,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=72031.66666666667, ans=0.0 2024-06-19 22:10:54,231 INFO [train.py:1028] (1/2) Epoch 4, batch 8950, loss[loss=0.4832, simple_loss=0.4489, pruned_loss=0.2588, over 12620.00 frames. ], tot_loss[loss=0.4165, simple_loss=0.4066, pruned_loss=0.2132, over 2561331.18 frames. ], batch size: 202, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:10:54,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72050.0, ans=0.1 2024-06-19 22:10:59,997 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.89 vs. limit=15.0 2024-06-19 22:11:10,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=72086.66666666667, ans=0.125 2024-06-19 22:11:17,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.89 vs. limit=15.0 2024-06-19 22:11:27,177 INFO [train.py:1028] (1/2) Epoch 4, batch 9000, loss[loss=0.4208, simple_loss=0.4202, pruned_loss=0.2107, over 13333.00 frames. ], tot_loss[loss=0.4148, simple_loss=0.4059, pruned_loss=0.2119, over 2567343.26 frames. ], batch size: 46, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:11:27,178 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 22:11:34,867 INFO [train.py:1060] (1/2) Epoch 4, validation: loss=0.2624, simple_loss=0.3103, pruned_loss=0.1072, over 351949.00 frames. 
2024-06-19 22:11:34,867 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB 2024-06-19 22:11:41,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=72160.0, ans=0.125 2024-06-19 22:11:44,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=72160.0, ans=0.0 2024-06-19 22:11:49,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72178.33333333333, ans=0.1 2024-06-19 22:11:53,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=72178.33333333333, ans=0.0 2024-06-19 22:12:02,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=72215.0, ans=0.125 2024-06-19 22:12:02,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=72215.0, ans=0.125 2024-06-19 22:12:03,723 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.013e+02 1.194e+03 1.439e+03 1.790e+03 3.062e+03, threshold=2.878e+03, percent-clipped=0.0 2024-06-19 22:12:06,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=72215.0, ans=0.0 2024-06-19 22:12:07,688 INFO [train.py:1028] (1/2) Epoch 4, batch 9050, loss[loss=0.4781, simple_loss=0.4387, pruned_loss=0.2588, over 10723.00 frames. ], tot_loss[loss=0.4163, simple_loss=0.4073, pruned_loss=0.2127, over 2566495.34 frames. ], batch size: 16, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:12:07,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=72233.33333333333, ans=0.0 2024-06-19 22:12:07,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72233.33333333333, ans=0.1 2024-06-19 22:12:14,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=22.5 2024-06-19 22:12:17,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.63 vs. limit=15.0 2024-06-19 22:12:18,997 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=22.5 2024-06-19 22:12:23,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72270.0, ans=0.1 2024-06-19 22:12:27,486 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.63 vs. limit=22.5 2024-06-19 22:12:27,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=72288.33333333333, ans=0.07 2024-06-19 22:12:28,188 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.96 vs. 
limit=15.0 2024-06-19 22:12:28,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72288.33333333333, ans=0.1 2024-06-19 22:12:39,977 INFO [train.py:1028] (1/2) Epoch 4, batch 9100, loss[loss=0.4263, simple_loss=0.4219, pruned_loss=0.2153, over 13226.00 frames. ], tot_loss[loss=0.414, simple_loss=0.406, pruned_loss=0.211, over 2570556.78 frames. ], batch size: 72, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:12:43,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=72325.0, ans=0.035 2024-06-19 22:12:45,625 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.25 vs. limit=22.5 2024-06-19 22:12:47,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=72343.33333333333, ans=0.5 2024-06-19 22:12:48,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.99 vs. limit=10.0 2024-06-19 22:12:57,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.77 vs. limit=15.0 2024-06-19 22:13:11,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.000e+02 1.268e+03 1.670e+03 2.168e+03 5.720e+03, threshold=3.339e+03, percent-clipped=8.0 2024-06-19 22:13:14,951 INFO [train.py:1028] (1/2) Epoch 4, batch 9150, loss[loss=0.4143, simple_loss=0.412, pruned_loss=0.2083, over 13169.00 frames. ], tot_loss[loss=0.4142, simple_loss=0.4061, pruned_loss=0.2111, over 2571098.86 frames. ], batch size: 77, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:13:18,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=72416.66666666667, ans=0.07 2024-06-19 22:13:19,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.28 vs. limit=22.5 2024-06-19 22:13:19,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.53 vs. limit=12.0 2024-06-19 22:13:20,371 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.37 vs. limit=15.0 2024-06-19 22:13:22,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=72435.0, ans=0.125 2024-06-19 22:13:23,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=72435.0, ans=0.0 2024-06-19 22:13:26,680 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.25 vs. limit=15.0 2024-06-19 22:13:33,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=72471.66666666667, ans=0.2 2024-06-19 22:13:34,429 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.46 vs. 
limit=6.0 2024-06-19 22:13:37,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=72471.66666666667, ans=0.125 2024-06-19 22:13:46,694 INFO [train.py:1028] (1/2) Epoch 4, batch 9200, loss[loss=0.4215, simple_loss=0.4193, pruned_loss=0.2118, over 13276.00 frames. ], tot_loss[loss=0.4102, simple_loss=0.4037, pruned_loss=0.2084, over 2574509.88 frames. ], batch size: 37, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:13:48,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=72508.33333333333, ans=0.125 2024-06-19 22:13:50,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=72508.33333333333, ans=0.025 2024-06-19 22:13:50,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=72508.33333333333, ans=0.0 2024-06-19 22:13:53,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=72526.66666666667, ans=0.0 2024-06-19 22:14:05,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72545.0, ans=0.1 2024-06-19 22:14:08,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=72563.33333333333, ans=0.05 2024-06-19 22:14:10,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=72563.33333333333, ans=0.125 2024-06-19 22:14:10,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.54 vs. limit=22.5 2024-06-19 22:14:12,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=72563.33333333333, ans=0.0 2024-06-19 22:14:17,847 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.700e+02 1.527e+03 1.844e+03 2.145e+03 3.216e+03, threshold=3.688e+03, percent-clipped=0.0 2024-06-19 22:14:21,228 INFO [train.py:1028] (1/2) Epoch 4, batch 9250, loss[loss=0.3925, simple_loss=0.3896, pruned_loss=0.1977, over 13218.00 frames. ], tot_loss[loss=0.4088, simple_loss=0.4027, pruned_loss=0.2075, over 2576491.14 frames. ], batch size: 67, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:14:21,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=72600.0, ans=0.125 2024-06-19 22:14:26,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=72600.0, ans=0.0 2024-06-19 22:14:27,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72618.33333333333, ans=0.1 2024-06-19 22:14:29,880 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.32 vs. 
limit=6.0 2024-06-19 22:14:36,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=72636.66666666667, ans=0.125 2024-06-19 22:14:49,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=72673.33333333333, ans=0.0 2024-06-19 22:14:52,676 INFO [train.py:1028] (1/2) Epoch 4, batch 9300, loss[loss=0.4464, simple_loss=0.4229, pruned_loss=0.2349, over 13026.00 frames. ], tot_loss[loss=0.4076, simple_loss=0.4016, pruned_loss=0.2068, over 2572983.10 frames. ], batch size: 39, lr: 1.31e-02, grad_scale: 4.0 2024-06-19 22:15:21,678 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.206e+02 1.715e+03 2.037e+03 2.565e+03 4.082e+03, threshold=4.073e+03, percent-clipped=4.0 2024-06-19 22:15:23,543 INFO [train.py:1028] (1/2) Epoch 4, batch 9350, loss[loss=0.3665, simple_loss=0.3709, pruned_loss=0.181, over 12674.00 frames. ], tot_loss[loss=0.4092, simple_loss=0.4024, pruned_loss=0.208, over 2570265.86 frames. ], batch size: 22, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:15:27,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.47 vs. limit=15.0 2024-06-19 22:15:30,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=72801.66666666667, ans=0.125 2024-06-19 22:15:33,774 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.80 vs. limit=15.0 2024-06-19 22:15:35,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=72820.0, ans=0.5 2024-06-19 22:15:38,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=72820.0, ans=0.2 2024-06-19 22:15:54,909 INFO [train.py:1028] (1/2) Epoch 4, batch 9400, loss[loss=0.4362, simple_loss=0.4278, pruned_loss=0.2223, over 13234.00 frames. ], tot_loss[loss=0.4097, simple_loss=0.4027, pruned_loss=0.2084, over 2569060.31 frames. ], batch size: 52, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:16:00,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=72893.33333333333, ans=0.125 2024-06-19 22:16:05,812 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.10 vs. limit=22.5 2024-06-19 22:16:11,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.75 vs. limit=12.0 2024-06-19 22:16:26,678 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.521e+03 2.438e+03 3.083e+03 3.703e+03 5.863e+03, threshold=6.166e+03, percent-clipped=12.0 2024-06-19 22:16:27,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.26 vs. limit=22.5 2024-06-19 22:16:27,397 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:16:28,023 INFO [train.py:1028] (1/2) Epoch 4, batch 9450, loss[loss=0.392, simple_loss=0.3913, pruned_loss=0.1963, over 12652.00 frames. 
], tot_loss[loss=0.4127, simple_loss=0.4046, pruned_loss=0.2105, over 2568503.32 frames. ], batch size: 22, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:16:30,378 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.70 vs. limit=15.0 2024-06-19 22:16:31,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=72966.66666666667, ans=0.1 2024-06-19 22:16:36,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=72985.0, ans=0.125 2024-06-19 22:16:39,583 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.36 vs. limit=15.0 2024-06-19 22:16:40,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=73003.33333333333, ans=0.125 2024-06-19 22:16:53,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2024-06-19 22:16:58,776 INFO [train.py:1028] (1/2) Epoch 4, batch 9500, loss[loss=0.4029, simple_loss=0.4003, pruned_loss=0.2028, over 13194.00 frames. ], tot_loss[loss=0.4113, simple_loss=0.4038, pruned_loss=0.2094, over 2576847.61 frames. ], batch size: 43, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:16:58,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=73058.33333333333, ans=0.025 2024-06-19 22:17:00,341 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=19.48 vs. limit=15.0 2024-06-19 22:17:09,069 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.82 vs. limit=22.5 2024-06-19 22:17:11,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=73076.66666666667, ans=0.125 2024-06-19 22:17:18,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=73113.33333333333, ans=0.125 2024-06-19 22:17:22,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=73113.33333333333, ans=0.125 2024-06-19 22:17:27,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=73131.66666666667, ans=0.125 2024-06-19 22:17:30,605 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.331e+03 1.893e+03 2.366e+03 3.061e+03 5.566e+03, threshold=4.732e+03, percent-clipped=0.0 2024-06-19 22:17:31,251 INFO [train.py:1028] (1/2) Epoch 4, batch 9550, loss[loss=0.3577, simple_loss=0.3628, pruned_loss=0.1763, over 12821.00 frames. ], tot_loss[loss=0.4116, simple_loss=0.4037, pruned_loss=0.2098, over 2572657.89 frames. 
], batch size: 39, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:17:32,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=73150.0, ans=0.125 2024-06-19 22:17:32,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=73150.0, ans=0.95 2024-06-19 22:17:42,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=73168.33333333333, ans=0.125 2024-06-19 22:17:42,903 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:17:50,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=73205.0, ans=0.1 2024-06-19 22:17:54,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0 2024-06-19 22:17:55,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=73205.0, ans=0.0 2024-06-19 22:17:58,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73223.33333333333, ans=0.1 2024-06-19 22:18:02,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=22.5 2024-06-19 22:18:02,537 INFO [train.py:1028] (1/2) Epoch 4, batch 9600, loss[loss=0.4602, simple_loss=0.4143, pruned_loss=0.2531, over 10774.00 frames. ], tot_loss[loss=0.41, simple_loss=0.4023, pruned_loss=0.2088, over 2572472.11 frames. ], batch size: 304, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:18:04,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=73241.66666666667, ans=0.125 2024-06-19 22:18:08,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=73260.0, ans=0.125 2024-06-19 22:18:13,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.87 vs. 
limit=10.0 2024-06-19 22:18:13,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=73260.0, ans=0.0 2024-06-19 22:18:13,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=73260.0, ans=0.125 2024-06-19 22:18:21,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=73296.66666666667, ans=22.5 2024-06-19 22:18:23,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=73296.66666666667, ans=0.125 2024-06-19 22:18:26,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=73296.66666666667, ans=0.07 2024-06-19 22:18:26,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73315.0, ans=0.1 2024-06-19 22:18:32,493 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.105e+03 1.763e+03 2.190e+03 2.546e+03 3.338e+03, threshold=4.380e+03, percent-clipped=0.0 2024-06-19 22:18:32,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=73333.33333333333, ans=0.125 2024-06-19 22:18:38,519 INFO [train.py:1028] (1/2) Epoch 4, batch 9650, loss[loss=0.3961, simple_loss=0.3851, pruned_loss=0.2036, over 13092.00 frames. ], tot_loss[loss=0.412, simple_loss=0.4034, pruned_loss=0.2103, over 2562552.55 frames. ], batch size: 132, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:18:38,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.51 vs. limit=15.0 2024-06-19 22:18:54,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=73370.0, ans=0.125 2024-06-19 22:18:56,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=73370.0, ans=0.2 2024-06-19 22:19:04,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=73406.66666666667, ans=0.125 2024-06-19 22:19:09,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=73406.66666666667, ans=0.0 2024-06-19 22:19:10,387 INFO [train.py:1028] (1/2) Epoch 4, batch 9700, loss[loss=0.4094, simple_loss=0.3934, pruned_loss=0.2127, over 13028.00 frames. ], tot_loss[loss=0.4119, simple_loss=0.4028, pruned_loss=0.2105, over 2556801.99 frames. ], batch size: 144, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:19:21,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=73443.33333333333, ans=0.125 2024-06-19 22:19:30,946 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.80 vs. limit=15.0 2024-06-19 22:19:41,958 INFO [train.py:1028] (1/2) Epoch 4, batch 9750, loss[loss=0.4041, simple_loss=0.3867, pruned_loss=0.2107, over 13117.00 frames. ], tot_loss[loss=0.4109, simple_loss=0.4021, pruned_loss=0.2099, over 2553968.59 frames. 
], batch size: 132, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:19:42,483 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.263e+03 2.159e+03 2.709e+03 3.187e+03 5.514e+03, threshold=5.418e+03, percent-clipped=9.0 2024-06-19 22:19:45,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=73516.66666666667, ans=0.125 2024-06-19 22:19:56,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2024-06-19 22:19:56,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73553.33333333333, ans=0.1 2024-06-19 22:20:05,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.44 vs. limit=15.0 2024-06-19 22:20:08,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=73590.0, ans=0.0 2024-06-19 22:20:11,097 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:20:11,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=73590.0, ans=0.125 2024-06-19 22:20:12,787 INFO [train.py:1028] (1/2) Epoch 4, batch 9800, loss[loss=0.3942, simple_loss=0.3917, pruned_loss=0.1983, over 12962.00 frames. ], tot_loss[loss=0.408, simple_loss=0.3999, pruned_loss=0.208, over 2546993.24 frames. ], batch size: 39, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:20:18,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=73626.66666666667, ans=0.0 2024-06-19 22:20:19,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=73626.66666666667, ans=0.0 2024-06-19 22:20:21,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=73626.66666666667, ans=0.125 2024-06-19 22:20:22,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=73626.66666666667, ans=0.125 2024-06-19 22:20:32,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=73663.33333333333, ans=0.125 2024-06-19 22:20:44,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=73700.0, ans=0.125 2024-06-19 22:20:44,955 INFO [train.py:1028] (1/2) Epoch 4, batch 9850, loss[loss=0.4275, simple_loss=0.4093, pruned_loss=0.2228, over 13065.00 frames. ], tot_loss[loss=0.4087, simple_loss=0.4005, pruned_loss=0.2085, over 2538481.15 frames. ], batch size: 102, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:20:46,075 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+03 3.182e+03 3.725e+03 4.492e+03 7.979e+03, threshold=7.451e+03, percent-clipped=7.0 2024-06-19 22:20:49,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=73700.0, ans=0.125 2024-06-19 22:20:50,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.99 vs. 
limit=15.0 2024-06-19 22:21:09,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.43 vs. limit=15.0 2024-06-19 22:21:14,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=73773.33333333333, ans=0.125 2024-06-19 22:21:16,622 INFO [train.py:1028] (1/2) Epoch 4, batch 9900, loss[loss=0.3631, simple_loss=0.3736, pruned_loss=0.1763, over 12902.00 frames. ], tot_loss[loss=0.4082, simple_loss=0.3995, pruned_loss=0.2084, over 2530530.54 frames. ], batch size: 39, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:21:18,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73791.66666666667, ans=0.1 2024-06-19 22:21:20,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=73791.66666666667, ans=0.125 2024-06-19 22:21:21,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=73791.66666666667, ans=0.125 2024-06-19 22:21:25,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.00 vs. limit=15.0 2024-06-19 22:21:34,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=73828.33333333333, ans=0.1 2024-06-19 22:21:47,313 INFO [train.py:1028] (1/2) Epoch 4, batch 9950, loss[loss=0.421, simple_loss=0.4196, pruned_loss=0.2113, over 12674.00 frames. ], tot_loss[loss=0.4085, simple_loss=0.399, pruned_loss=0.209, over 2523225.57 frames. ], batch size: 29, lr: 1.30e-02, grad_scale: 0.5 2024-06-19 22:21:47,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=73883.33333333333, ans=0.025 2024-06-19 22:21:49,655 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+03 3.513e+03 4.367e+03 5.141e+03 1.511e+04, threshold=8.734e+03, percent-clipped=9.0 2024-06-19 22:22:08,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=73938.33333333333, ans=0.025 2024-06-19 22:22:13,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=73956.66666666667, ans=0.125 2024-06-19 22:22:16,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=73956.66666666667, ans=0.5 2024-06-19 22:22:19,449 INFO [train.py:1028] (1/2) Epoch 4, batch 10000, loss[loss=0.3992, simple_loss=0.4035, pruned_loss=0.1974, over 12812.00 frames. ], tot_loss[loss=0.4121, simple_loss=0.401, pruned_loss=0.2116, over 2484994.49 frames. ], batch size: 22, lr: 1.30e-02, grad_scale: 1.0
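
[Annotation] The "Clipping_scale=2.0, grad-norm quartiles ..." warnings from optim.py come from the optimizer's gradient clipping: it keeps a history of recent total gradient norms, prints their (min, 25%, 50%, 75%, max) quartiles, and clips any batch whose norm exceeds clipping_scale times the running median. The logged numbers bear this out: the threshold=8.734e+03 above is exactly 2.0 times the 4.367e+03 median. A minimal sketch of that behaviour follows; the bookkeeping around recent_norms is an assumption for illustration, not icefall's exact implementation.

import torch

def clip_by_quartiles(params, recent_norms, clipping_scale=2.0):
    # recent_norms: assumed history of total grad norms from past batches
    norms = torch.tensor(sorted(recent_norms))
    quartiles = [norms[int(p * (len(norms) - 1))].item()
                 for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
    threshold = clipping_scale * quartiles[2]  # clipping_scale * median
    # total grad norm of this batch, over all parameters with gradients
    total = torch.norm(torch.stack(
        [p.grad.norm() for p in params if p.grad is not None]))
    if total > threshold:
        for p in params:
            if p.grad is not None:
                p.grad.mul_(threshold / total)  # scale down to the threshold
    return quartiles, threshold

percent-clipped then reports how often this branch fired over the logging window.
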
2024-06-19 22:22:20,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=73975.0, ans=0.0 2024-06-19 22:22:28,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=73993.33333333333, ans=0.125 2024-06-19 22:22:29,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=73993.33333333333, ans=0.125 2024-06-19 22:22:41,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=74030.0, ans=0.025 2024-06-19 22:22:43,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=74030.0, ans=0.0 2024-06-19 22:22:44,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=74048.33333333333, ans=0.125 2024-06-19 22:22:48,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=74048.33333333333, ans=0.0 2024-06-19 22:22:51,263 INFO [train.py:1028] (1/2) Epoch 4, batch 10050, loss[loss=0.4052, simple_loss=0.3993, pruned_loss=0.2056, over 12598.00 frames. ], tot_loss[loss=0.4159, simple_loss=0.4027, pruned_loss=0.2146, over 2442803.24 frames. ], batch size: 22, lr: 1.30e-02, grad_scale: 1.0 2024-06-19 22:22:53,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=74066.66666666667, ans=0.2 2024-06-19 22:22:54,099 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.517e+03 2.745e+03 3.637e+03 4.570e+03 1.066e+04, threshold=7.273e+03, percent-clipped=3.0 2024-06-19 22:22:56,802 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=27.41 vs. limit=15.0 2024-06-19 22:23:01,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=12.0 2024-06-19 22:23:08,036 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.618e+01 2024-06-19 22:23:21,337 INFO [train.py:1028] (1/2) Epoch 4, batch 10100, loss[loss=0.3737, simple_loss=0.3718, pruned_loss=0.1878, over 11605.00 frames. ], tot_loss[loss=0.4146, simple_loss=0.402, pruned_loss=0.2136, over 2425774.47 frames. ], batch size: 17, lr: 1.30e-02, grad_scale: 1.0 2024-06-19 22:23:22,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=29.55 vs. limit=22.5 2024-06-19 22:23:27,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=32.57 vs. limit=22.5 2024-06-19 22:23:31,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74176.66666666667, ans=0.1 2024-06-19 22:25:35,960 INFO [train.py:1028] (1/2) Epoch 5, batch 0, loss[loss=0.341, simple_loss=0.3494, pruned_loss=0.1663, over 12852.00 frames. ], tot_loss[loss=0.341, simple_loss=0.3494, pruned_loss=0.1663, over 12852.00 frames. ], batch size: 36, lr: 1.21e-02, grad_scale: 2.0
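
[Annotation] Note how "lr:" steps from 1.30e-02 down to 1.21e-02 right at the epoch-5 boundary: the scheduler decays the learning rate in both the batch and the epoch dimension, so each new epoch produces a visible drop on top of the smooth per-batch decay. A hedged sketch of the Eden-style rule used by icefall's optim.py follows; how the batch and epoch counters fed in here relate to the fractional batch_count values in this log is an assumption.

def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=3.5):
    # decays ~batch^-0.5 once batch >> lr_batches, and ~epoch^-0.5 once
    # epoch >> lr_epochs; near the origin both factors stay close to 1
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# e.g. holding batch fixed, the epoch factor alone explains a step-down
# at each epoch boundary:
# eden_lr(0.035, 74000, 4) > eden_lr(0.035, 74000, 5)
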
2024-06-19 22:25:35,961 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-19 22:25:43,157 INFO [train.py:1060] (1/2) Epoch 5, validation: loss=0.2693, simple_loss=0.3155, pruned_loss=0.1116, over 351949.00 frames. 2024-06-19 22:25:43,157 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB 2024-06-19 22:25:48,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=74191.33333333333, ans=0.0 2024-06-19 22:25:52,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74209.66666666667, ans=0.125 2024-06-19 22:25:55,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74228.0, ans=0.1 2024-06-19 22:26:02,438 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.15 vs. limit=10.0 2024-06-19 22:26:04,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=74228.0, ans=0.0 2024-06-19 22:26:10,298 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.528e+03 2.542e+03 3.177e+03 3.813e+03 7.832e+03, threshold=6.355e+03, percent-clipped=1.0 2024-06-19 22:26:18,722 INFO [train.py:1028] (1/2) Epoch 5, batch 50, loss[loss=0.3417, simple_loss=0.3574, pruned_loss=0.163, over 12613.00 frames. ], tot_loss[loss=0.383, simple_loss=0.3747, pruned_loss=0.1956, over 574384.08 frames. ], batch size: 29, lr: 1.21e-02, grad_scale: 1.0 2024-06-19 22:26:31,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2024-06-19 22:26:39,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=74338.0, ans=0.0 2024-06-19 22:26:41,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=74338.0, ans=0.2 2024-06-19 22:26:45,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=74356.33333333333, ans=0.025 2024-06-19 22:26:49,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=74356.33333333333, ans=0.125 2024-06-19 22:26:50,375 INFO [train.py:1028] (1/2) Epoch 5, batch 100, loss[loss=0.3378, simple_loss=0.3472, pruned_loss=0.1642, over 13271.00 frames. ], tot_loss[loss=0.3824, simple_loss=0.3741, pruned_loss=0.1954, over 1017374.60 frames. ], batch size: 46, lr: 1.21e-02, grad_scale: 2.0 2024-06-19 22:26:57,059 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.39 vs. limit=15.0
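
[Annotation] Every [scaling.py:214] line reports a ScheduledFloat: a scalar module hyperparameter (a dropout rate, skip rate, balancer probability, whitening limit, ...) that follows a piecewise-linear schedule in the training batch count, with "ans=" being its current value. A minimal sketch of the idea follows; the breakpoints below are placeholders for illustration, not the actual schedule of any module in this run.

def scheduled_float(batch_count, schedule=((0.0, 0.3), (20000.0, 0.1))):
    # schedule: (batch_count, value) breakpoints in increasing batch order;
    # values are interpolated linearly between breakpoints and held
    # constant outside the first and last one
    x0, y0 = schedule[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in schedule[1:]:
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
        x0, y0 = x1, y1
    return schedule[-1][1]

Most of the rates logged here are regularizers that start relatively aggressive and are annealed toward small constants as batch_count grows.
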
2024-06-19 22:27:15,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=74429.66666666667, ans=22.5 2024-06-19 22:27:19,836 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.071e+03 2.488e+03 2.913e+03 3.534e+03 5.846e+03, threshold=5.825e+03, percent-clipped=0.0 2024-06-19 22:27:23,937 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.68 vs. limit=15.0 2024-06-19 22:27:26,706 INFO [train.py:1028] (1/2) Epoch 5, batch 150, loss[loss=0.3769, simple_loss=0.38, pruned_loss=0.1869, over 12871.00 frames. ], tot_loss[loss=0.378, simple_loss=0.3724, pruned_loss=0.1918, over 1365072.94 frames. ], batch size: 29, lr: 1.21e-02, grad_scale: 1.0 2024-06-19 22:27:34,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74484.66666666667, ans=0.125 2024-06-19 22:27:42,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=74503.0, ans=0.125 2024-06-19 22:27:44,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=74503.0, ans=0.125 2024-06-19 22:27:47,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74521.33333333333, ans=0.1 2024-06-19 22:27:48,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.50 vs. limit=10.0 2024-06-19 22:28:01,927 INFO [train.py:1028] (1/2) Epoch 5, batch 200, loss[loss=0.45, simple_loss=0.4143, pruned_loss=0.2429, over 12503.00 frames. ], tot_loss[loss=0.3778, simple_loss=0.3724, pruned_loss=0.1916, over 1634200.18 frames. ], batch size: 202, lr: 1.21e-02, grad_scale: 2.0 2024-06-19 22:28:03,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.49 vs. limit=15.0 2024-06-19 22:28:04,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=74558.0, ans=0.2 2024-06-19 22:28:08,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=74576.33333333333, ans=0.125 2024-06-19 22:28:25,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=74613.0, ans=0.0 2024-06-19 22:28:26,455 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.369e+03 2.216e+03 2.599e+03 3.185e+03 5.285e+03, threshold=5.197e+03, percent-clipped=0.0 2024-06-19 22:28:33,751 INFO [train.py:1028] (1/2) Epoch 5, batch 250, loss[loss=0.3705, simple_loss=0.3582, pruned_loss=0.1914, over 13047.00 frames. ], tot_loss[loss=0.3742, simple_loss=0.3698, pruned_loss=0.1893, over 1845652.18 frames. ], batch size: 144, lr: 1.21e-02, grad_scale: 2.0
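
[Annotation] The [scaling.py:1023] "Whitening: ... metric=... vs. limit=..." lines are diagnostics from the Whiten modules, which push activations whose covariance has drifted far from white (isotropic) back toward whiteness. The metric is 1.0 for a perfectly white covariance and grows with the spread of the covariance eigenvalues; a line is logged when the metric is checked against the module's limit, typically on exceeding it. A sketch of one way to compute such a metric follows, assuming the d * tr(C^2) / tr(C)^2 form; whether scaling.py matches this exactly (e.g. its treatment of the mean) is an assumption.

import torch

def whitening_metric(x, num_groups=1):
    # x: (num_frames, num_channels); channels are split into groups and
    # the per-group metrics averaged, matching the num_groups/num_channels
    # fields in the log lines
    metrics = []
    for g in x.chunk(num_groups, dim=1):
        g = g - g.mean(dim=0, keepdim=True)
        cov = (g.T @ g) / g.shape[0]          # per-group covariance
        d = cov.shape[0]
        # equals 1.0 iff cov is proportional to the identity,
        # and grows as the eigenvalue spread grows
        metrics.append(d * torch.trace(cov @ cov) / torch.trace(cov) ** 2)
    return torch.stack(metrics).mean()

The whitening_limit values themselves are ScheduledFloats, which is why entries like self_attn1.whiten.whitening_limit also show up in the [scaling.py:214] lines.
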
2024-06-19 22:28:49,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=74686.33333333333, ans=0.0 2024-06-19 22:28:57,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=74704.66666666667, ans=0.125 2024-06-19 22:28:58,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=12.0 2024-06-19 22:29:06,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=74741.33333333333, ans=0.025 2024-06-19 22:29:06,831 INFO [train.py:1028] (1/2) Epoch 5, batch 300, loss[loss=0.3914, simple_loss=0.3837, pruned_loss=0.1996, over 13201.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.3705, pruned_loss=0.1891, over 2009454.97 frames. ], batch size: 112, lr: 1.21e-02, grad_scale: 2.0 2024-06-19 22:29:16,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=74759.66666666667, ans=0.125 2024-06-19 22:29:28,961 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2024-06-19 22:29:37,241 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.640e+02 2.026e+03 2.377e+03 2.866e+03 4.081e+03, threshold=4.755e+03, percent-clipped=0.0 2024-06-19 22:29:37,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=74814.66666666667, ans=0.0 2024-06-19 22:29:42,509 INFO [train.py:1028] (1/2) Epoch 5, batch 350, loss[loss=0.3702, simple_loss=0.3766, pruned_loss=0.1819, over 12872.00 frames. ], tot_loss[loss=0.3723, simple_loss=0.3688, pruned_loss=0.1879, over 2139225.38 frames. ], batch size: 33, lr: 1.21e-02, grad_scale: 0.5 2024-06-19 22:29:44,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=74833.0, ans=0.025 2024-06-19 22:29:46,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74833.0, ans=0.1 2024-06-19 22:29:48,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=74833.0, ans=22.5 2024-06-19 22:29:56,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=74851.33333333333, ans=0.0 2024-06-19 22:29:57,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=74851.33333333333, ans=0.125 2024-06-19 22:30:02,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=74869.66666666667, ans=0.0 2024-06-19 22:30:19,597 INFO [train.py:1028] (1/2) Epoch 5, batch 400, loss[loss=0.329, simple_loss=0.3381, pruned_loss=0.16, over 13258.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.3675, pruned_loss=0.1865, over 2239782.85 frames. ], batch size: 63, lr: 1.21e-02, grad_scale: 1.0
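
[Annotation] The "grad_scale:" field in each batch summary is the AMP loss scale (this run trains with fp16): the scaler halves the scale whenever a step overflows to inf/nan gradients and periodically grows it back after a run of clean steps, which is why it wanders between 0.5 and 4.0 in this stretch of the log. This is the standard torch.cuda.amp pattern; compute_loss below is a hypothetical helper standing in for the recipe's actual loss computation.

import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)

def fp16_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=True):
        loss = compute_loss(model, batch)  # forward pass in mixed precision
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscales grads; skips step on inf/nan
    scaler.update()                        # backoff on overflow, growth otherwise
    return scaler.get_scale()              # the value logged as grad_scale
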
2024-06-19 22:30:42,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=74979.66666666667, ans=0.125 2024-06-19 22:30:47,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=74998.0, ans=0.0 2024-06-19 22:30:48,328 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.617e+03 2.745e+03 3.195e+03 3.885e+03 8.353e+03, threshold=6.389e+03, percent-clipped=7.0 2024-06-19 22:30:52,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=15.0 2024-06-19 22:30:52,858 INFO [train.py:1028] (1/2) Epoch 5, batch 450, loss[loss=0.3391, simple_loss=0.3489, pruned_loss=0.1646, over 13198.00 frames. ], tot_loss[loss=0.3696, simple_loss=0.3672, pruned_loss=0.186, over 2314096.98 frames. ], batch size: 67, lr: 1.21e-02, grad_scale: 0.5 2024-06-19 22:31:06,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=75053.0, ans=0.0 2024-06-19 22:31:13,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=75071.33333333333, ans=0.125 2024-06-19 22:31:21,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=75089.66666666667, ans=0.0 2024-06-19 22:31:25,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=75089.66666666667, ans=0.0 2024-06-19 22:31:27,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.82 vs. limit=15.0 2024-06-19 22:31:28,696 INFO [train.py:1028] (1/2) Epoch 5, batch 500, loss[loss=0.3493, simple_loss=0.3487, pruned_loss=0.175, over 13083.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.3687, pruned_loss=0.1865, over 2376078.10 frames. ], batch size: 121, lr: 1.21e-02, grad_scale: 1.0 2024-06-19 22:31:32,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=75108.0, ans=0.07 2024-06-19 22:31:45,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=75144.66666666667, ans=0.125 2024-06-19 22:31:51,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=75163.0, ans=0.5 2024-06-19 22:31:54,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.03 vs. limit=22.5 2024-06-19 22:31:56,743 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+03 3.059e+03 3.757e+03 4.671e+03 9.028e+03, threshold=7.514e+03, percent-clipped=3.0 2024-06-19 22:32:03,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=75181.33333333333, ans=0.125 2024-06-19 22:32:04,141 INFO [train.py:1028] (1/2) Epoch 5, batch 550, loss[loss=0.3786, simple_loss=0.3667, pruned_loss=0.1953, over 12970.00 frames. ], tot_loss[loss=0.3722, simple_loss=0.3692, pruned_loss=0.1876, over 2421736.05 frames.
], batch size: 158, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:32:07,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=75199.66666666667, ans=0.09899494936611666 2024-06-19 22:32:08,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.43 vs. limit=6.0 2024-06-19 22:32:20,857 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:32:29,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.61 vs. limit=15.0 2024-06-19 22:32:34,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=75273.0, ans=0.125 2024-06-19 22:32:35,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=75273.0, ans=0.125 2024-06-19 22:32:36,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=75291.33333333333, ans=0.0 2024-06-19 22:32:36,626 INFO [train.py:1028] (1/2) Epoch 5, batch 600, loss[loss=0.3795, simple_loss=0.3591, pruned_loss=0.1999, over 13033.00 frames. ], tot_loss[loss=0.3723, simple_loss=0.3694, pruned_loss=0.1876, over 2460120.67 frames. ], batch size: 144, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:32:43,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=75309.66666666667, ans=0.05 2024-06-19 22:32:44,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.28 vs. limit=15.0 2024-06-19 22:32:54,396 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:33:00,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=75346.33333333333, ans=0.125 2024-06-19 22:33:01,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.77 vs. limit=10.0 2024-06-19 22:33:04,676 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+03 2.953e+03 3.401e+03 4.172e+03 7.114e+03, threshold=6.802e+03, percent-clipped=0.0 2024-06-19 22:33:06,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=75364.66666666667, ans=0.125 2024-06-19 22:33:08,727 INFO [train.py:1028] (1/2) Epoch 5, batch 650, loss[loss=0.3816, simple_loss=0.3766, pruned_loss=0.1933, over 13145.00 frames. ], tot_loss[loss=0.3724, simple_loss=0.3699, pruned_loss=0.1874, over 2491230.87 frames. ], batch size: 59, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:33:31,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=75438.0, ans=0.125 2024-06-19 22:33:31,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=75438.0, ans=0.125 2024-06-19 22:33:44,532 INFO [train.py:1028] (1/2) Epoch 5, batch 700, loss[loss=0.3773, simple_loss=0.3724, pruned_loss=0.1911, over 13287.00 frames. 
], tot_loss[loss=0.371, simple_loss=0.3687, pruned_loss=0.1866, over 2514552.09 frames. ], batch size: 46, lr: 1.20e-02, grad_scale: 2.0 2024-06-19 22:34:00,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=75511.33333333333, ans=0.125 2024-06-19 22:34:01,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=75511.33333333333, ans=0.0 2024-06-19 22:34:09,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=75529.66666666667, ans=0.04949747468305833 2024-06-19 22:34:16,328 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.623e+03 2.432e+03 2.915e+03 3.440e+03 5.515e+03, threshold=5.830e+03, percent-clipped=0.0 2024-06-19 22:34:16,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=75548.0, ans=0.125 2024-06-19 22:34:17,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=75548.0, ans=0.125 2024-06-19 22:34:18,111 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.82 vs. limit=22.5 2024-06-19 22:34:20,249 INFO [train.py:1028] (1/2) Epoch 5, batch 750, loss[loss=0.3443, simple_loss=0.3487, pruned_loss=0.1699, over 13285.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.368, pruned_loss=0.1853, over 2530959.20 frames. ], batch size: 63, lr: 1.20e-02, grad_scale: 2.0 2024-06-19 22:34:29,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=75584.66666666667, ans=0.0 2024-06-19 22:34:31,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=75584.66666666667, ans=0.125 2024-06-19 22:34:33,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=75603.0, ans=0.125 2024-06-19 22:34:38,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=75603.0, ans=0.07 2024-06-19 22:34:42,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=75621.33333333333, ans=0.125 2024-06-19 22:34:43,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=75621.33333333333, ans=0.1 2024-06-19 22:34:50,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=29.40 vs. limit=15.0 2024-06-19 22:34:52,916 INFO [train.py:1028] (1/2) Epoch 5, batch 800, loss[loss=0.3495, simple_loss=0.3572, pruned_loss=0.1709, over 12933.00 frames. ], tot_loss[loss=0.3694, simple_loss=0.3678, pruned_loss=0.1855, over 2542820.99 frames. ], batch size: 36, lr: 1.20e-02, grad_scale: 4.0 2024-06-19 22:35:02,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.90 vs. limit=15.0 2024-06-19 22:35:03,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.30 vs. 
limit=15.0 2024-06-19 22:35:23,801 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.379e+03 3.001e+03 3.685e+03 4.323e+03 7.197e+03, threshold=7.370e+03, percent-clipped=8.0 2024-06-19 22:35:25,864 INFO [train.py:1028] (1/2) Epoch 5, batch 850, loss[loss=0.3655, simple_loss=0.3627, pruned_loss=0.1842, over 13144.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.3678, pruned_loss=0.1853, over 2554059.37 frames. ], batch size: 95, lr: 1.20e-02, grad_scale: 0.5 2024-06-19 22:35:37,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=75768.0, ans=0.0 2024-06-19 22:35:48,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=75804.66666666667, ans=0.0 2024-06-19 22:36:00,967 INFO [train.py:1028] (1/2) Epoch 5, batch 900, loss[loss=0.363, simple_loss=0.3711, pruned_loss=0.1775, over 12941.00 frames. ], tot_loss[loss=0.3689, simple_loss=0.3673, pruned_loss=0.1853, over 2558450.65 frames. ], batch size: 36, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:36:04,497 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. limit=15.0 2024-06-19 22:36:35,153 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+03 2.761e+03 3.312e+03 4.209e+03 7.224e+03, threshold=6.625e+03, percent-clipped=0.0 2024-06-19 22:36:36,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=75933.0, ans=0.125 2024-06-19 22:36:36,534 INFO [train.py:1028] (1/2) Epoch 5, batch 950, loss[loss=0.3515, simple_loss=0.3584, pruned_loss=0.1723, over 12960.00 frames. ], tot_loss[loss=0.3685, simple_loss=0.367, pruned_loss=0.185, over 2561994.35 frames. ], batch size: 39, lr: 1.20e-02, grad_scale: 0.5 2024-06-19 22:36:39,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=75933.0, ans=0.0 2024-06-19 22:36:45,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75951.33333333333, ans=0.1 2024-06-19 22:36:52,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.72 vs. limit=15.0 2024-06-19 22:36:53,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=75969.66666666667, ans=0.2 2024-06-19 22:36:55,279 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.11 vs. 
limit=15.0 2024-06-19 22:36:59,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=75988.0, ans=0.125 2024-06-19 22:37:00,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=75988.0, ans=0.125 2024-06-19 22:37:01,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=75988.0, ans=0.5 2024-06-19 22:37:07,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=76006.33333333333, ans=0.125 2024-06-19 22:37:07,600 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.53 vs. limit=10.0 2024-06-19 22:37:07,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76006.33333333333, ans=0.1 2024-06-19 22:37:09,340 INFO [train.py:1028] (1/2) Epoch 5, batch 1000, loss[loss=0.3719, simple_loss=0.3787, pruned_loss=0.1826, over 13307.00 frames. ], tot_loss[loss=0.369, simple_loss=0.3671, pruned_loss=0.1855, over 2563387.90 frames. ], batch size: 49, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:37:16,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=76043.0, ans=0.025 2024-06-19 22:37:23,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76061.33333333333, ans=0.1 2024-06-19 22:37:39,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2024-06-19 22:37:44,327 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.337e+03 2.595e+03 2.899e+03 3.400e+03 5.036e+03, threshold=5.797e+03, percent-clipped=0.0 2024-06-19 22:37:45,008 INFO [train.py:1028] (1/2) Epoch 5, batch 1050, loss[loss=0.3431, simple_loss=0.3517, pruned_loss=0.1673, over 13173.00 frames. ], tot_loss[loss=0.3691, simple_loss=0.3674, pruned_loss=0.1854, over 2566216.85 frames. ], batch size: 77, lr: 1.20e-02, grad_scale: 0.5 2024-06-19 22:37:47,434 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.13 vs. 
limit=22.5 2024-06-19 22:37:49,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76116.33333333333, ans=0.1 2024-06-19 22:37:51,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=76134.66666666667, ans=0.125 2024-06-19 22:37:58,019 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.034e+02 2024-06-19 22:37:59,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=76153.0, ans=0.0 2024-06-19 22:38:14,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=76189.66666666667, ans=0.0 2024-06-19 22:38:20,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=76208.0, ans=0.125 2024-06-19 22:38:21,012 INFO [train.py:1028] (1/2) Epoch 5, batch 1100, loss[loss=0.3706, simple_loss=0.3694, pruned_loss=0.1859, over 13282.00 frames. ], tot_loss[loss=0.3694, simple_loss=0.3681, pruned_loss=0.1854, over 2570150.65 frames. ], batch size: 52, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:38:30,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=76226.33333333333, ans=0.0 2024-06-19 22:38:33,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=76226.33333333333, ans=0.125 2024-06-19 22:38:46,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.73 vs. limit=15.0 2024-06-19 22:38:49,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=76281.33333333333, ans=0.125 2024-06-19 22:38:50,804 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=12.0 2024-06-19 22:38:52,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=76281.33333333333, ans=0.025 2024-06-19 22:38:53,237 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.288e+03 1.994e+03 2.418e+03 2.914e+03 4.189e+03, threshold=4.835e+03, percent-clipped=0.0 2024-06-19 22:38:53,837 INFO [train.py:1028] (1/2) Epoch 5, batch 1150, loss[loss=0.3428, simple_loss=0.3511, pruned_loss=0.1673, over 13209.00 frames. ], tot_loss[loss=0.3694, simple_loss=0.368, pruned_loss=0.1854, over 2570315.96 frames. 
], batch size: 52, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:38:56,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76299.66666666667, ans=0.1 2024-06-19 22:38:56,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=76299.66666666667, ans=0.125 2024-06-19 22:38:56,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=76299.66666666667, ans=0.025 2024-06-19 22:38:59,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76299.66666666667, ans=0.125 2024-06-19 22:39:08,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=76336.33333333333, ans=0.035 2024-06-19 22:39:15,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=76354.66666666667, ans=0.125 2024-06-19 22:39:22,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.99 vs. limit=22.5 2024-06-19 22:39:29,141 INFO [train.py:1028] (1/2) Epoch 5, batch 1200, loss[loss=0.3438, simple_loss=0.3532, pruned_loss=0.1672, over 13193.00 frames. ], tot_loss[loss=0.3682, simple_loss=0.3671, pruned_loss=0.1847, over 2573161.83 frames. ], batch size: 77, lr: 1.20e-02, grad_scale: 2.0 2024-06-19 22:39:31,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=76391.33333333333, ans=0.0 2024-06-19 22:39:43,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=76428.0, ans=0.125 2024-06-19 22:39:45,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=76428.0, ans=0.07 2024-06-19 22:39:48,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=76446.33333333333, ans=0.125 2024-06-19 22:39:55,037 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.31 vs. limit=22.5 2024-06-19 22:39:59,721 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.331e+03 2.123e+03 2.518e+03 2.920e+03 6.180e+03, threshold=5.035e+03, percent-clipped=1.0 2024-06-19 22:40:00,333 INFO [train.py:1028] (1/2) Epoch 5, batch 1250, loss[loss=0.3576, simple_loss=0.357, pruned_loss=0.1791, over 13177.00 frames. ], tot_loss[loss=0.3662, simple_loss=0.3657, pruned_loss=0.1834, over 2583005.22 frames. 
], batch size: 112, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:40:08,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=76483.0, ans=10.0 2024-06-19 22:40:18,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=76519.66666666667, ans=0.125 2024-06-19 22:40:20,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=76519.66666666667, ans=0.0 2024-06-19 22:40:20,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=15.0 2024-06-19 22:40:31,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=76556.33333333333, ans=15.0 2024-06-19 22:40:31,659 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.40 vs. limit=15.0 2024-06-19 22:40:35,531 INFO [train.py:1028] (1/2) Epoch 5, batch 1300, loss[loss=0.3947, simple_loss=0.3796, pruned_loss=0.2049, over 12778.00 frames. ], tot_loss[loss=0.3658, simple_loss=0.3656, pruned_loss=0.183, over 2583654.95 frames. ], batch size: 176, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:40:38,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=76574.66666666667, ans=0.05 2024-06-19 22:40:42,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=76593.0, ans=0.125 2024-06-19 22:40:42,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=76593.0, ans=0.95 2024-06-19 22:40:44,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=76593.0, ans=0.09899494936611666 2024-06-19 22:40:50,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.37 vs. limit=22.5 2024-06-19 22:40:53,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=76611.33333333333, ans=0.125 2024-06-19 22:41:05,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=76648.0, ans=0.125 2024-06-19 22:41:07,672 INFO [train.py:1028] (1/2) Epoch 5, batch 1350, loss[loss=0.3582, simple_loss=0.3672, pruned_loss=0.1747, over 13205.00 frames. ], tot_loss[loss=0.3653, simple_loss=0.3654, pruned_loss=0.1825, over 2585440.36 frames. 
], batch size: 59, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:41:08,306 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.290e+03 2.207e+03 2.591e+03 2.953e+03 4.688e+03, threshold=5.183e+03, percent-clipped=0.0 2024-06-19 22:41:11,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=76666.33333333333, ans=0.125 2024-06-19 22:41:15,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=76684.66666666667, ans=0.2 2024-06-19 22:41:19,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.67 vs. limit=15.0 2024-06-19 22:41:20,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=76703.0, ans=0.125 2024-06-19 22:41:26,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76703.0, ans=0.1 2024-06-19 22:41:35,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=76721.33333333333, ans=0.125 2024-06-19 22:41:36,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=76739.66666666667, ans=0.125 2024-06-19 22:41:38,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=76739.66666666667, ans=0.125 2024-06-19 22:41:41,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=76739.66666666667, ans=0.125 2024-06-19 22:41:43,409 INFO [train.py:1028] (1/2) Epoch 5, batch 1400, loss[loss=0.3706, simple_loss=0.3759, pruned_loss=0.1827, over 12814.00 frames. ], tot_loss[loss=0.366, simple_loss=0.3659, pruned_loss=0.183, over 2586213.52 frames. ], batch size: 26, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:41:44,510 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.98 vs. limit=15.0 2024-06-19 22:41:47,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=12.0 2024-06-19 22:41:55,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76794.66666666667, ans=0.1 2024-06-19 22:41:59,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=76794.66666666667, ans=0.2 2024-06-19 22:42:12,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=76831.33333333333, ans=0.04949747468305833 2024-06-19 22:42:17,925 INFO [train.py:1028] (1/2) Epoch 5, batch 1450, loss[loss=0.3503, simple_loss=0.3448, pruned_loss=0.1779, over 13130.00 frames. ], tot_loss[loss=0.3668, simple_loss=0.366, pruned_loss=0.1838, over 2586571.43 frames. 
], batch size: 121, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:42:19,177 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.228e+03 2.164e+03 2.554e+03 2.866e+03 7.464e+03, threshold=5.107e+03, percent-clipped=1.0 2024-06-19 22:42:41,466 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.01 vs. limit=22.5 2024-06-19 22:42:44,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.50 vs. limit=15.0 2024-06-19 22:42:45,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=76923.0, ans=0.0 2024-06-19 22:42:48,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=76923.0, ans=0.125 2024-06-19 22:42:49,951 INFO [train.py:1028] (1/2) Epoch 5, batch 1500, loss[loss=0.3422, simple_loss=0.3459, pruned_loss=0.1693, over 13252.00 frames. ], tot_loss[loss=0.3674, simple_loss=0.3662, pruned_loss=0.1843, over 2588669.72 frames. ], batch size: 83, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:42:59,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=76959.66666666667, ans=0.125 2024-06-19 22:43:01,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=76959.66666666667, ans=0.5 2024-06-19 22:43:02,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.11 vs. limit=15.0 2024-06-19 22:43:07,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.20 vs. limit=15.0 2024-06-19 22:43:09,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.67 vs. limit=15.0 2024-06-19 22:43:24,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=77033.0, ans=0.125 2024-06-19 22:43:25,214 INFO [train.py:1028] (1/2) Epoch 5, batch 1550, loss[loss=0.346, simple_loss=0.3463, pruned_loss=0.1729, over 12982.00 frames. ], tot_loss[loss=0.3691, simple_loss=0.3672, pruned_loss=0.1855, over 2583888.81 frames. ], batch size: 102, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:43:27,135 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+03 2.711e+03 3.073e+03 3.675e+03 7.236e+03, threshold=6.147e+03, percent-clipped=5.0 2024-06-19 22:43:30,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.21 vs. limit=15.0 2024-06-19 22:43:39,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=77069.66666666667, ans=0.1 2024-06-19 22:43:46,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=77088.0, ans=0.125 2024-06-19 22:44:00,603 INFO [train.py:1028] (1/2) Epoch 5, batch 1600, loss[loss=0.3671, simple_loss=0.3702, pruned_loss=0.182, over 13165.00 frames. 
], tot_loss[loss=0.3692, simple_loss=0.3671, pruned_loss=0.1857, over 2579260.89 frames. ], batch size: 77, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:44:09,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=77143.0, ans=0.025 2024-06-19 22:44:18,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=77161.33333333333, ans=0.125 2024-06-19 22:44:20,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77179.66666666667, ans=0.1 2024-06-19 22:44:29,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=77198.0, ans=0.125 2024-06-19 22:44:31,869 INFO [train.py:1028] (1/2) Epoch 5, batch 1650, loss[loss=0.3879, simple_loss=0.3741, pruned_loss=0.2008, over 13159.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.3685, pruned_loss=0.1876, over 2576270.76 frames. ], batch size: 95, lr: 1.19e-02, grad_scale: 0.5 2024-06-19 22:44:35,111 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.002e+03 3.212e+03 3.825e+03 4.431e+03 9.177e+03, threshold=7.649e+03, percent-clipped=3.0 2024-06-19 22:44:35,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=77216.33333333333, ans=0.125 2024-06-19 22:44:42,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=77234.66666666667, ans=0.0 2024-06-19 22:44:51,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=77271.33333333333, ans=0.125 2024-06-19 22:44:53,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.05 vs. limit=6.0 2024-06-19 22:44:55,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=77271.33333333333, ans=0.125 2024-06-19 22:45:02,808 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.30 vs. limit=22.5 2024-06-19 22:45:04,849 INFO [train.py:1028] (1/2) Epoch 5, batch 1700, loss[loss=0.3741, simple_loss=0.3725, pruned_loss=0.1879, over 12523.00 frames. ], tot_loss[loss=0.3721, simple_loss=0.3689, pruned_loss=0.1877, over 2580571.25 frames. ], batch size: 25, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:45:05,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.94 vs. limit=10.0 2024-06-19 22:45:16,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.73 vs. 
2024-06-19 22:45:16,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=77326.33333333333, ans=0.125
2024-06-19 22:45:17,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=77344.66666666667, ans=0.05
2024-06-19 22:45:21,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=77344.66666666667, ans=0.125
2024-06-19 22:45:21,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=77344.66666666667, ans=0.025
2024-06-19 22:45:29,453 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.21 vs. limit=22.5
2024-06-19 22:45:36,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=77381.33333333333, ans=0.07
2024-06-19 22:45:39,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=77381.33333333333, ans=15.0
2024-06-19 22:45:40,217 INFO [train.py:1028] (1/2) Epoch 5, batch 1750, loss[loss=0.3941, simple_loss=0.3962, pruned_loss=0.196, over 12658.00 frames. ], tot_loss[loss=0.3712, simple_loss=0.3686, pruned_loss=0.1869, over 2581978.47 frames. ], batch size: 22, lr: 1.19e-02, grad_scale: 1.0
2024-06-19 22:45:40,629 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.86 vs. limit=15.0
2024-06-19 22:45:43,569 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.419e+02 1.793e+03 2.253e+03 2.706e+03 4.326e+03, threshold=4.507e+03, percent-clipped=0.0
2024-06-19 22:45:44,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=77399.66666666667, ans=0.0
2024-06-19 22:45:59,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=22.31 vs. limit=15.0
2024-06-19 22:46:15,415 INFO [train.py:1028] (1/2) Epoch 5, batch 1800, loss[loss=0.3311, simple_loss=0.3472, pruned_loss=0.1575, over 13241.00 frames. ], tot_loss[loss=0.3696, simple_loss=0.3675, pruned_loss=0.1859, over 2582022.00 frames. ], batch size: 67, lr: 1.19e-02, grad_scale: 2.0
2024-06-19 22:46:18,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=77491.33333333333, ans=0.0
2024-06-19 22:46:21,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=77509.66666666667, ans=0.95
2024-06-19 22:46:24,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.79 vs. limit=15.0
2024-06-19 22:46:30,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.31 vs. limit=15.0
2024-06-19 22:46:32,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.64 vs. limit=22.5
2024-06-19 22:46:33,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=77528.0, ans=0.125
2024-06-19 22:46:40,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.59 vs. limit=15.0
2024-06-19 22:46:44,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=77564.66666666667, ans=0.2
2024-06-19 22:46:46,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=77583.0, ans=0.025
2024-06-19 22:46:47,391 INFO [train.py:1028] (1/2) Epoch 5, batch 1850, loss[loss=0.3585, simple_loss=0.3616, pruned_loss=0.1777, over 13160.00 frames. ], tot_loss[loss=0.3685, simple_loss=0.367, pruned_loss=0.185, over 2583677.59 frames. ], batch size: 83, lr: 1.19e-02, grad_scale: 2.0
2024-06-19 22:46:47,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=77583.0, ans=0.0
2024-06-19 22:46:49,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=37.79 vs. limit=15.0
2024-06-19 22:46:50,641 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.400e+02 1.530e+03 1.810e+03 2.144e+03 3.543e+03, threshold=3.620e+03, percent-clipped=0.0
2024-06-19 22:46:59,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=77601.33333333333, ans=0.0
2024-06-19 22:47:00,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=77619.66666666667, ans=0.125
2024-06-19 22:47:03,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.75 vs. limit=15.0
2024-06-19 22:47:04,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.21 vs. limit=22.5
2024-06-19 22:47:04,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=77619.66666666667, ans=0.125
2024-06-19 22:47:06,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=77638.0, ans=0.0
2024-06-19 22:47:14,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=77656.33333333333, ans=0.2
2024-06-19 22:47:17,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.19 vs. limit=15.0
2024-06-19 22:47:19,034 INFO [train.py:1028] (1/2) Epoch 5, batch 1900, loss[loss=0.3297, simple_loss=0.3361, pruned_loss=0.1616, over 13126.00 frames. ], tot_loss[loss=0.3674, simple_loss=0.366, pruned_loss=0.1844, over 2586092.43 frames. ], batch size: 95, lr: 1.19e-02, grad_scale: 2.0
2024-06-19 22:47:22,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=77674.66666666667, ans=0.1
2024-06-19 22:47:28,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=77693.0, ans=0.0
2024-06-19 22:47:35,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=77711.33333333333, ans=0.2
2024-06-19 22:47:36,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=77711.33333333333, ans=0.125
2024-06-19 22:47:52,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77748.0, ans=0.1
2024-06-19 22:47:53,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=77748.0, ans=15.0
2024-06-19 22:47:56,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=77748.0, ans=0.125
2024-06-19 22:47:57,884 INFO [train.py:1028] (1/2) Epoch 5, batch 1950, loss[loss=0.3289, simple_loss=0.3349, pruned_loss=0.1615, over 13242.00 frames. ], tot_loss[loss=0.3668, simple_loss=0.3654, pruned_loss=0.1841, over 2592215.45 frames. ], batch size: 52, lr: 1.19e-02, grad_scale: 1.0
2024-06-19 22:48:02,251 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.187e+03 1.877e+03 2.140e+03 2.417e+03 3.562e+03, threshold=4.280e+03, percent-clipped=0.0
2024-06-19 22:48:04,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=77784.66666666667, ans=0.125
2024-06-19 22:48:05,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=77784.66666666667, ans=0.2
2024-06-19 22:48:05,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=77784.66666666667, ans=0.0
2024-06-19 22:48:10,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=77803.0, ans=0.125
2024-06-19 22:48:12,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=77803.0, ans=0.125
2024-06-19 22:48:19,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=77821.33333333333, ans=0.125
2024-06-19 22:48:21,123 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.21 vs. limit=15.0
2024-06-19 22:48:21,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.72 vs. limit=15.0
2024-06-19 22:48:30,203 INFO [train.py:1028] (1/2) Epoch 5, batch 2000, loss[loss=0.3607, simple_loss=0.3772, pruned_loss=0.1721, over 12743.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.3648, pruned_loss=0.1835, over 2588402.84 frames. ], batch size: 22, lr: 1.18e-02, grad_scale: 2.0
2024-06-19 22:48:31,932 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.41 vs. limit=6.0
2024-06-19 22:48:33,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=77858.0, ans=0.0
2024-06-19 22:48:35,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=77858.0, ans=0.125
2024-06-19 22:48:39,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=77876.33333333333, ans=0.125
2024-06-19 22:48:44,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=77894.66666666667, ans=0.125
2024-06-19 22:48:48,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=77894.66666666667, ans=0.0
2024-06-19 22:48:54,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.47 vs. limit=22.5
2024-06-19 22:48:55,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=77913.0, ans=0.0
2024-06-19 22:49:02,594 INFO [train.py:1028] (1/2) Epoch 5, batch 2050, loss[loss=0.3614, simple_loss=0.3771, pruned_loss=0.1729, over 12593.00 frames. ], tot_loss[loss=0.3665, simple_loss=0.3655, pruned_loss=0.1838, over 2582671.57 frames. ], batch size: 29, lr: 1.18e-02, grad_scale: 1.0
2024-06-19 22:49:07,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=77949.66666666667, ans=0.125
2024-06-19 22:49:07,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77949.66666666667, ans=0.1
2024-06-19 22:49:07,834 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.876e+02 1.508e+03 1.777e+03 2.111e+03 4.263e+03, threshold=3.554e+03, percent-clipped=0.0
2024-06-19 22:49:22,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=77986.33333333333, ans=0.125
2024-06-19 22:49:23,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=15.0
2024-06-19 22:49:25,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=78004.66666666667, ans=0.125
2024-06-19 22:49:29,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.52 vs. limit=22.5
2024-06-19 22:49:38,816 INFO [train.py:1028] (1/2) Epoch 5, batch 2100, loss[loss=0.3248, simple_loss=0.3384, pruned_loss=0.1556, over 13240.00 frames. ], tot_loss[loss=0.3643, simple_loss=0.3647, pruned_loss=0.1819, over 2585892.78 frames. ], batch size: 59, lr: 1.18e-02, grad_scale: 1.0
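The scaling.py:1023 "Whitening" lines compare a whiteness metric of some activation against a scheduled limit, and a message is only printed when the metric exceeds or approaches that limit. One way to define such a metric, and the one these numbers are consistent with, is the ratio of the mean squared eigenvalue of the channel covariance to the squared mean eigenvalue, which equals 1.0 for perfectly white (isotropic) features and grows as a few directions dominate; a hedged sketch, assumed to mirror the idea behind scaling.py's Whiten module rather than its exact code:

```python
# Sketch of a whiteness metric for (frames, channels) activations.
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (g, n, d)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n  # per-group covariance (g, d, d)
    d = cov.shape[-1]
    # mean(eig^2) / mean(eig)^2 == d * trace(cov @ cov) / trace(cov)^2
    num = d * torch.diagonal(cov @ cov, dim1=-2, dim2=-1).sum(-1)
    den = torch.diagonal(cov, dim1=-2, dim2=-1).sum(-1) ** 2
    return (num / den).mean()


x = torch.randn(1000, 384) @ torch.randn(384, 384)  # strongly correlated channels
print(whitening_metric(x))  # >> 1.0, i.e. far from white
```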
2024-06-19 22:49:42,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=78041.33333333333, ans=0.2
2024-06-19 22:49:42,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78041.33333333333, ans=0.1
2024-06-19 22:49:50,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78059.66666666667, ans=0.1
2024-06-19 22:49:52,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.75 vs. limit=15.0
2024-06-19 22:49:59,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=78078.0, ans=0.0
2024-06-19 22:50:06,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.39 vs. limit=22.5
2024-06-19 22:50:11,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=78114.66666666667, ans=10.0
2024-06-19 22:50:11,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=78114.66666666667, ans=0.025
2024-06-19 22:50:12,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78114.66666666667, ans=0.1
2024-06-19 22:50:14,173 INFO [train.py:1028] (1/2) Epoch 5, batch 2150, loss[loss=0.3188, simple_loss=0.342, pruned_loss=0.1478, over 13304.00 frames. ], tot_loss[loss=0.3622, simple_loss=0.3634, pruned_loss=0.1805, over 2588730.00 frames. ], batch size: 52, lr: 1.18e-02, grad_scale: 1.0
2024-06-19 22:50:17,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=78133.0, ans=0.0
2024-06-19 22:50:20,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.131e+03 1.605e+03 1.810e+03 2.282e+03 9.917e+03, threshold=3.620e+03, percent-clipped=4.0
2024-06-19 22:50:37,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=78188.0, ans=0.0
2024-06-19 22:50:46,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.77 vs. limit=12.0
2024-06-19 22:50:46,998 INFO [train.py:1028] (1/2) Epoch 5, batch 2200, loss[loss=0.3304, simple_loss=0.3386, pruned_loss=0.1611, over 13205.00 frames. ], tot_loss[loss=0.3632, simple_loss=0.3643, pruned_loss=0.1811, over 2588677.78 frames. ], batch size: 83, lr: 1.18e-02, grad_scale: 2.0
2024-06-19 22:50:50,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=78224.66666666667, ans=0.0
2024-06-19 22:50:50,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=78224.66666666667, ans=0.2
2024-06-19 22:50:53,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=78243.0, ans=0.125
2024-06-19 22:51:03,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=78261.33333333333, ans=0.0
2024-06-19 22:51:10,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.83 vs. limit=15.0
2024-06-19 22:51:14,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.50 vs. limit=15.0
2024-06-19 22:51:17,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78298.0, ans=0.1
2024-06-19 22:51:17,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=78298.0, ans=0.125
2024-06-19 22:51:19,524 INFO [train.py:1028] (1/2) Epoch 5, batch 2250, loss[loss=0.383, simple_loss=0.3848, pruned_loss=0.1906, over 13264.00 frames. ], tot_loss[loss=0.3629, simple_loss=0.3641, pruned_loss=0.1808, over 2587281.31 frames. ], batch size: 63, lr: 1.18e-02, grad_scale: 2.0
2024-06-19 22:51:22,386 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.19 vs. limit=15.0
2024-06-19 22:51:22,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.03 vs. limit=15.0
2024-06-19 22:51:22,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=78316.33333333333, ans=0.125
2024-06-19 22:51:23,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=78316.33333333333, ans=0.0
2024-06-19 22:51:24,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=78316.33333333333, ans=0.0
2024-06-19 22:51:25,206 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.003e+03 1.473e+03 1.702e+03 1.969e+03 3.738e+03, threshold=3.405e+03, percent-clipped=1.0
2024-06-19 22:51:28,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=78334.66666666667, ans=0.125
2024-06-19 22:51:36,069 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.97 vs. limit=15.0
2024-06-19 22:51:36,089 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=24.10 vs. limit=15.0
2024-06-19 22:51:36,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=78353.0, ans=0.2
2024-06-19 22:51:37,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=78353.0, ans=0.2
2024-06-19 22:51:50,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=78389.66666666667, ans=0.0
2024-06-19 22:51:53,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78389.66666666667, ans=0.1
2024-06-19 22:51:55,128 INFO [train.py:1028] (1/2) Epoch 5, batch 2300, loss[loss=0.3216, simple_loss=0.3309, pruned_loss=0.1562, over 13047.00 frames. ], tot_loss[loss=0.3628, simple_loss=0.3641, pruned_loss=0.1808, over 2580983.43 frames. ], batch size: 33, lr: 1.18e-02, grad_scale: 4.0
2024-06-19 22:52:09,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=78426.33333333333, ans=0.0
2024-06-19 22:52:15,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=78444.66666666667, ans=15.0
2024-06-19 22:52:16,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=78463.0, ans=0.125
2024-06-19 22:52:28,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.99 vs. limit=15.0
2024-06-19 22:52:30,720 INFO [train.py:1028] (1/2) Epoch 5, batch 2350, loss[loss=0.3663, simple_loss=0.3705, pruned_loss=0.1811, over 13186.00 frames. ], tot_loss[loss=0.3629, simple_loss=0.3642, pruned_loss=0.1808, over 2584699.98 frames. ], batch size: 67, lr: 1.18e-02, grad_scale: 2.0
2024-06-19 22:52:34,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=78499.66666666667, ans=0.025
2024-06-19 22:52:37,273 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.573e+02 1.753e+03 1.990e+03 2.317e+03 4.083e+03, threshold=3.979e+03, percent-clipped=3.0
2024-06-19 22:52:40,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.85 vs. limit=15.0
2024-06-19 22:52:43,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=78536.33333333333, ans=0.125
2024-06-19 22:52:52,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78554.66666666667, ans=0.1
2024-06-19 22:52:54,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=78554.66666666667, ans=0.0
2024-06-19 22:52:57,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=78573.0, ans=0.125
2024-06-19 22:53:03,095 INFO [train.py:1028] (1/2) Epoch 5, batch 2400, loss[loss=0.3284, simple_loss=0.338, pruned_loss=0.1594, over 13292.00 frames. ], tot_loss[loss=0.3622, simple_loss=0.3632, pruned_loss=0.1806, over 2587236.53 frames. ], batch size: 46, lr: 1.18e-02, grad_scale: 2.0
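The train.py:1028 lines track three numbers per batch: a "simple" loss, a "pruned" loss, and their weighted combination. The totals above are consistent with loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.3641 + 0.1808 = 0.3629 for batch 2250's running average); a sketch of that combination, with the 0.5 scale inferred from the logged numbers rather than taken from the recipe:

```python
# Sketch of combining the pruned-transducer loss terms, assuming the weighting
# implied by the logged values (simple loss scaled by 0.5, pruned loss by 1.0).
import torch


def combine_losses(simple_loss: torch.Tensor,
                   pruned_loss: torch.Tensor,
                   simple_loss_scale: float = 0.5) -> torch.Tensor:
    return simple_loss_scale * simple_loss + pruned_loss


# Reproducing the batch 2250 running average above:
print(combine_losses(torch.tensor(0.3641), torch.tensor(0.1808)))  # ~0.3629
```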
2024-06-19 22:53:20,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78628.0, ans=0.1
2024-06-19 22:53:38,665 INFO [train.py:1028] (1/2) Epoch 5, batch 2450, loss[loss=0.3448, simple_loss=0.3551, pruned_loss=0.1673, over 13237.00 frames. ], tot_loss[loss=0.3622, simple_loss=0.3626, pruned_loss=0.1809, over 2582893.97 frames. ], batch size: 63, lr: 1.18e-02, grad_scale: 1.0
2024-06-19 22:53:38,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=78683.0, ans=0.2
2024-06-19 22:53:39,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=78683.0, ans=0.125
2024-06-19 22:53:43,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=78683.0, ans=0.2
2024-06-19 22:53:46,525 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.235e+03 1.910e+03 2.342e+03 2.731e+03 5.956e+03, threshold=4.683e+03, percent-clipped=2.0
2024-06-19 22:53:46,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=78701.33333333333, ans=0.125
2024-06-19 22:53:57,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=78719.66666666667, ans=0.125
2024-06-19 22:54:01,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=78738.0, ans=0.2
2024-06-19 22:54:08,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=78756.33333333333, ans=0.125
2024-06-19 22:54:13,796 INFO [train.py:1028] (1/2) Epoch 5, batch 2500, loss[loss=0.3187, simple_loss=0.3268, pruned_loss=0.1553, over 13247.00 frames. ], tot_loss[loss=0.3594, simple_loss=0.3602, pruned_loss=0.1793, over 2585506.64 frames. ], batch size: 83, lr: 1.18e-02, grad_scale: 2.0
2024-06-19 22:54:29,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=78811.33333333333, ans=0.125
2024-06-19 22:54:35,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=78829.66666666667, ans=0.025
2024-06-19 22:54:36,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.56 vs. limit=10.0
2024-06-19 22:54:37,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=78829.66666666667, ans=0.09899494936611666
2024-06-19 22:54:46,452 INFO [train.py:1028] (1/2) Epoch 5, batch 2550, loss[loss=0.3707, simple_loss=0.3782, pruned_loss=0.1816, over 12374.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.36, pruned_loss=0.1792, over 2585389.44 frames. ], batch size: 22, lr: 1.18e-02, grad_scale: 2.0
2024-06-19 22:54:49,474 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=15.0
2024-06-19 22:54:49,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=78866.33333333333, ans=0.0
2024-06-19 22:54:54,369 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.346e+03 1.883e+03 2.192e+03 2.712e+03 4.016e+03, threshold=4.384e+03, percent-clipped=0.0
2024-06-19 22:54:58,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.99 vs. limit=15.0
2024-06-19 22:55:11,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=78921.33333333333, ans=0.0
2024-06-19 22:55:11,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.41 vs. limit=10.0
2024-06-19 22:55:18,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=78939.66666666667, ans=0.125
2024-06-19 22:55:21,941 INFO [train.py:1028] (1/2) Epoch 5, batch 2600, loss[loss=0.3768, simple_loss=0.3793, pruned_loss=0.1872, over 13286.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.3583, pruned_loss=0.1785, over 2585645.56 frames. ], batch size: 52, lr: 1.18e-02, grad_scale: 4.0
2024-06-19 22:55:29,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78976.33333333333, ans=0.1
2024-06-19 22:55:30,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=78976.33333333333, ans=0.025
2024-06-19 22:55:31,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=78976.33333333333, ans=0.125
2024-06-19 22:55:35,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78994.66666666667, ans=0.1
2024-06-19 22:55:35,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=78994.66666666667, ans=0.125
2024-06-19 22:55:42,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=78994.66666666667, ans=0.125
2024-06-19 22:55:45,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.18 vs. limit=15.0
2024-06-19 22:55:47,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.22 vs. limit=10.0
2024-06-19 22:55:57,558 INFO [train.py:1028] (1/2) Epoch 5, batch 2650, loss[loss=0.3625, simple_loss=0.3495, pruned_loss=0.1877, over 13070.00 frames. ], tot_loss[loss=0.3551, simple_loss=0.356, pruned_loss=0.1771, over 2586783.82 frames. ], batch size: 144, lr: 1.18e-02, grad_scale: 1.0
2024-06-19 22:56:06,409 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.025e+03 1.809e+03 2.066e+03 2.309e+03 3.584e+03, threshold=4.133e+03, percent-clipped=0.0
2024-06-19 22:56:11,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=79086.33333333333, ans=0.0
2024-06-19 22:56:11,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.35 vs. limit=6.0
2024-06-19 22:56:12,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=79086.33333333333, ans=0.05
2024-06-19 22:56:13,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=79086.33333333333, ans=0.125
2024-06-19 22:56:13,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=15.0
2024-06-19 22:56:17,499 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.155e+03
2024-06-19 22:56:21,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=79104.66666666667, ans=0.025
2024-06-19 22:56:25,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=79123.0, ans=0.025
2024-06-19 22:56:29,608 INFO [train.py:1028] (1/2) Epoch 5, batch 2700, loss[loss=0.3233, simple_loss=0.3226, pruned_loss=0.162, over 13275.00 frames. ], tot_loss[loss=0.3534, simple_loss=0.354, pruned_loss=0.1764, over 2584917.12 frames. ], batch size: 89, lr: 1.18e-02, grad_scale: 2.0
2024-06-19 22:57:02,273 INFO [train.py:1028] (1/2) Epoch 5, batch 2750, loss[loss=0.3901, simple_loss=0.3728, pruned_loss=0.2038, over 13239.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3524, pruned_loss=0.175, over 2581588.43 frames. ], batch size: 43, lr: 1.17e-02, grad_scale: 1.0
2024-06-19 22:57:15,308 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.076e+03 1.770e+03 2.077e+03 2.494e+03 5.562e+03, threshold=4.154e+03, percent-clipped=4.0
2024-06-19 22:57:30,380 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.044e+03
2024-06-19 22:57:36,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=79306.33333333333, ans=0.2
2024-06-19 22:57:41,185 INFO [train.py:1028] (1/2) Epoch 5, batch 2800, loss[loss=0.3513, simple_loss=0.3307, pruned_loss=0.186, over 10839.00 frames. ], tot_loss[loss=0.3508, simple_loss=0.3517, pruned_loss=0.175, over 2579255.98 frames. ], batch size: 304, lr: 1.17e-02, grad_scale: 2.0
2024-06-19 22:57:48,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=79343.0, ans=0.125
2024-06-19 22:57:52,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=79343.0, ans=0.0
2024-06-19 22:57:55,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.51 vs. limit=15.0
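The grad_scale values in the train.py:1028 entries (0.5, 1.0, 2.0, 4.0) move in powers of two, which is the signature of dynamic fp16 loss scaling: the scale is raised after a run of overflow-free steps and halved when an overflow is detected. A minimal sketch using PyTorch's stock GradScaler; `model`, `optimizer` and `batch_to_loss` are hypothetical stand-ins for the loop's actual objects, and the recipe itself may wrap this differently:

```python
# Sketch of dynamic fp16 loss scaling with torch.cuda.amp.
import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)


def training_step(model, optimizer, batch_to_loss, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=True):
        loss = batch_to_loss(model, batch)
    scaler.scale(loss).backward()   # backprop on the scaled loss
    scaler.step(optimizer)          # skips the update if grads overflowed
    scaler.update()                 # halves/raises the scale as needed
    return scaler.get_scale()       # the grad_scale value seen in the log
```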
2024-06-19 22:57:56,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=12.05 vs. limit=12.0
2024-06-19 22:57:57,002 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.45 vs. limit=15.0
2024-06-19 22:57:59,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=79361.33333333333, ans=0.035
2024-06-19 22:57:59,809 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 22:58:06,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79398.0, ans=0.1
2024-06-19 22:58:06,969 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.39 vs. limit=10.0
2024-06-19 22:58:08,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=79398.0, ans=0.125
2024-06-19 22:58:13,150 INFO [train.py:1028] (1/2) Epoch 5, batch 2850, loss[loss=0.3425, simple_loss=0.3532, pruned_loss=0.1659, over 13005.00 frames. ], tot_loss[loss=0.3501, simple_loss=0.3505, pruned_loss=0.1749, over 2577306.02 frames. ], batch size: 48, lr: 1.17e-02, grad_scale: 2.0
2024-06-19 22:58:14,141 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.76 vs. limit=15.0
2024-06-19 22:58:16,665 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.37 vs. limit=10.0
2024-06-19 22:58:22,991 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.543e+03 2.131e+03 2.550e+03 2.976e+03 4.403e+03, threshold=5.100e+03, percent-clipped=1.0
2024-06-19 22:58:24,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.80 vs. limit=15.0
2024-06-19 22:58:30,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=79453.0, ans=0.125
2024-06-19 22:58:33,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=79471.33333333333, ans=0.2
2024-06-19 22:58:38,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=79489.66666666667, ans=0.0
2024-06-19 22:58:39,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.10 vs. limit=15.0
2024-06-19 22:58:44,392 INFO [train.py:1028] (1/2) Epoch 5, batch 2900, loss[loss=0.3172, simple_loss=0.3231, pruned_loss=0.1556, over 13148.00 frames. ], tot_loss[loss=0.3477, simple_loss=0.3481, pruned_loss=0.1736, over 2586134.35 frames. ], batch size: 55, lr: 1.17e-02, grad_scale: 1.0
2024-06-19 22:58:49,090 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.21 vs. limit=15.0
2024-06-19 22:58:50,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=79508.0, ans=0.2
2024-06-19 22:58:58,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.45 vs. limit=15.0
2024-06-19 22:59:08,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=79563.0, ans=0.0
2024-06-19 22:59:09,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=79563.0, ans=0.2
2024-06-19 22:59:12,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=79563.0, ans=0.125
2024-06-19 22:59:14,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79581.33333333333, ans=0.1
2024-06-19 22:59:18,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.64 vs. limit=10.0
2024-06-19 22:59:23,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=79581.33333333333, ans=0.125
2024-06-19 22:59:24,552 INFO [train.py:1028] (1/2) Epoch 5, batch 2950, loss[loss=0.2946, simple_loss=0.3172, pruned_loss=0.136, over 13197.00 frames. ], tot_loss[loss=0.3467, simple_loss=0.3473, pruned_loss=0.1731, over 2581072.35 frames. ], batch size: 43, lr: 1.17e-02, grad_scale: 1.0
2024-06-19 22:59:25,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=79599.66666666667, ans=0.125
2024-06-19 22:59:32,954 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.220e+03
2024-06-19 22:59:32,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=79618.0, ans=0.0
2024-06-19 22:59:33,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=79618.0, ans=0.1
2024-06-19 22:59:36,316 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.181e+03 2.454e+03 2.906e+03 3.485e+03 8.729e+03, threshold=5.811e+03, percent-clipped=3.0
2024-06-19 22:59:56,322 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.80 vs. limit=15.0
2024-06-19 22:59:57,763 INFO [train.py:1028] (1/2) Epoch 5, batch 3000, loss[loss=0.2958, simple_loss=0.3178, pruned_loss=0.1369, over 13232.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3451, pruned_loss=0.1716, over 2579292.58 frames. ], batch size: 59, lr: 1.17e-02, grad_scale: 2.0
2024-06-19 22:59:57,763 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-19 23:00:05,486 INFO [train.py:1060] (1/2) Epoch 5, validation: loss=0.2538, simple_loss=0.3037, pruned_loss=0.1019, over 351949.00 frames.
2024-06-19 23:00:05,487 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB
2024-06-19 23:00:07,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.03 vs. limit=10.0
2024-06-19 23:00:10,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=79691.33333333333, ans=0.125
2024-06-19 23:00:11,852 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.65 vs. limit=15.0
2024-06-19 23:00:24,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=79728.0, ans=0.125
2024-06-19 23:00:35,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=79764.66666666667, ans=0.125
2024-06-19 23:00:38,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0
2024-06-19 23:00:39,756 INFO [train.py:1028] (1/2) Epoch 5, batch 3050, loss[loss=0.3284, simple_loss=0.3312, pruned_loss=0.1628, over 13294.00 frames. ], tot_loss[loss=0.3442, simple_loss=0.3446, pruned_loss=0.1719, over 2578604.77 frames. ], batch size: 46, lr: 1.17e-02, grad_scale: 2.0
2024-06-19 23:00:51,011 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+03 2.722e+03 3.308e+03 3.742e+03 5.908e+03, threshold=6.617e+03, percent-clipped=1.0
2024-06-19 23:00:51,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=79801.33333333333, ans=0.0
2024-06-19 23:00:53,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=79819.66666666667, ans=0.025
2024-06-19 23:00:57,521 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.23 vs. limit=22.5
2024-06-19 23:01:09,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79856.33333333333, ans=0.1
2024-06-19 23:01:10,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=79856.33333333333, ans=0.125
2024-06-19 23:01:14,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79856.33333333333, ans=0.1
2024-06-19 23:01:16,093 INFO [train.py:1028] (1/2) Epoch 5, batch 3100, loss[loss=0.3192, simple_loss=0.3185, pruned_loss=0.16, over 13020.00 frames. ], tot_loss[loss=0.3433, simple_loss=0.3439, pruned_loss=0.1714, over 2578918.10 frames. ], batch size: 144, lr: 1.17e-02, grad_scale: 2.0
2024-06-19 23:01:19,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=79874.66666666667, ans=0.2
2024-06-19 23:01:22,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=79874.66666666667, ans=0.0
2024-06-19 23:01:30,117 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=23.55 vs. limit=15.0
2024-06-19 23:01:49,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=15.0
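Every few thousand batches the loop pauses to compute a validation loss over a fixed dev set (the "Computing validation loss" / "validation: ..." pair before batch 3000 above); the total is always over the same 351949.00 frames, so successive validation numbers are directly comparable. A sketch of that pattern, with hypothetical names (compute_loss, valid_dl) standing in for the loop's actual pieces:

```python
# Sketch of interleaved validation in a training loop.
import torch


def maybe_validate(model, valid_dl, compute_loss, batch_idx, valid_interval=3000):
    if batch_idx % valid_interval != 0:
        return None
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames  # frame-weighted, over the full dev set
```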
2024-06-19 23:01:50,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=79948.0, ans=0.125
2024-06-19 23:01:52,567 INFO [train.py:1028] (1/2) Epoch 5, batch 3150, loss[loss=0.3547, simple_loss=0.3451, pruned_loss=0.1821, over 12932.00 frames. ], tot_loss[loss=0.3413, simple_loss=0.3424, pruned_loss=0.1701, over 2579841.97 frames. ], batch size: 158, lr: 1.17e-02, grad_scale: 2.0
2024-06-19 23:01:56,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=79966.33333333333, ans=0.125
2024-06-19 23:01:59,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.36 vs. limit=22.5
2024-06-19 23:02:04,590 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.371e+03 2.414e+03 2.944e+03 3.348e+03 4.771e+03, threshold=5.888e+03, percent-clipped=0.0
2024-06-19 23:02:21,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=80039.66666666667, ans=0.0
2024-06-19 23:02:26,178 INFO [train.py:1028] (1/2) Epoch 5, batch 3200, loss[loss=0.3178, simple_loss=0.3342, pruned_loss=0.1507, over 13169.00 frames. ], tot_loss[loss=0.3422, simple_loss=0.343, pruned_loss=0.1707, over 2580603.14 frames. ], batch size: 55, lr: 1.17e-02, grad_scale: 4.0
2024-06-19 23:02:28,656 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=12.0
2024-06-19 23:02:36,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=80076.33333333333, ans=0.125
2024-06-19 23:02:36,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=80076.33333333333, ans=0.1
2024-06-19 23:02:44,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=80094.66666666667, ans=0.0
2024-06-19 23:02:50,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=80113.0, ans=0.125
2024-06-19 23:03:02,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.50 vs. limit=15.0
2024-06-19 23:03:02,742 INFO [train.py:1028] (1/2) Epoch 5, batch 3250, loss[loss=0.3431, simple_loss=0.3538, pruned_loss=0.1662, over 13233.00 frames. ], tot_loss[loss=0.3425, simple_loss=0.3426, pruned_loss=0.1712, over 2585197.42 frames. ], batch size: 72, lr: 1.17e-02, grad_scale: 2.0
2024-06-19 23:03:13,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=80168.0, ans=0.95
2024-06-19 23:03:13,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=80168.0, ans=0.125
2024-06-19 23:03:16,212 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+03 2.523e+03 2.997e+03 3.669e+03 6.402e+03, threshold=5.993e+03, percent-clipped=1.0
2024-06-19 23:03:18,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=80186.33333333333, ans=0.1
2024-06-19 23:03:20,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=80186.33333333333, ans=0.0
2024-06-19 23:03:25,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=80186.33333333333, ans=0.125
2024-06-19 23:03:26,304 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.56 vs. limit=15.0
2024-06-19 23:03:29,649 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.036e+03
2024-06-19 23:03:30,625 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.31 vs. limit=15.0
2024-06-19 23:03:39,780 INFO [train.py:1028] (1/2) Epoch 5, batch 3300, loss[loss=0.3655, simple_loss=0.3513, pruned_loss=0.1898, over 12766.00 frames. ], tot_loss[loss=0.3428, simple_loss=0.3428, pruned_loss=0.1714, over 2580395.10 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 2.0
2024-06-19 23:03:42,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=80241.33333333333, ans=0.125
2024-06-19 23:03:50,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80259.66666666667, ans=0.1
2024-06-19 23:04:03,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.54 vs. limit=15.0
2024-06-19 23:04:12,841 INFO [train.py:1028] (1/2) Epoch 5, batch 3350, loss[loss=0.3533, simple_loss=0.3412, pruned_loss=0.1827, over 12987.00 frames. ], tot_loss[loss=0.3427, simple_loss=0.3424, pruned_loss=0.1715, over 2576272.62 frames. ], batch size: 158, lr: 1.17e-02, grad_scale: 0.5
2024-06-19 23:04:14,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.40 vs. limit=12.0
2024-06-19 23:04:18,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=80333.0, ans=0.125
2024-06-19 23:04:27,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=80369.66666666667, ans=0.1
2024-06-19 23:04:27,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.77 vs. limit=10.0
2024-06-19 23:04:27,472 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.950e+02 1.730e+03 2.047e+03 2.434e+03 4.918e+03, threshold=4.093e+03, percent-clipped=0.0
2024-06-19 23:04:27,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=80369.66666666667, ans=0.2
2024-06-19 23:04:27,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=15.0
2024-06-19 23:04:38,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=80388.0, ans=0.0
2024-06-19 23:04:39,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=80406.33333333333, ans=0.125
2024-06-19 23:04:51,761 INFO [train.py:1028] (1/2) Epoch 5, batch 3400, loss[loss=0.4023, simple_loss=0.3854, pruned_loss=0.2096, over 12608.00 frames. ], tot_loss[loss=0.3426, simple_loss=0.3419, pruned_loss=0.1717, over 2573401.62 frames. ], batch size: 22, lr: 1.17e-02, grad_scale: 1.0
2024-06-19 23:04:52,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=80424.66666666667, ans=0.1
2024-06-19 23:05:07,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.51 vs. limit=10.0
2024-06-19 23:05:07,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.42 vs. limit=12.0
2024-06-19 23:05:13,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=80461.33333333333, ans=0.125
2024-06-19 23:05:16,138 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 23:05:28,954 INFO [train.py:1028] (1/2) Epoch 5, batch 3450, loss[loss=0.3742, simple_loss=0.3601, pruned_loss=0.1941, over 12755.00 frames. ], tot_loss[loss=0.341, simple_loss=0.3406, pruned_loss=0.1707, over 2574912.71 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 1.0
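In the batch lines, loss[... over N frames] is the current batch (N is its frame count), while tot_loss[... over ~2.58e6 frames] is a frame-weighted average over a window of recent batches, which is why it moves slowly. A sketch of such a tracker, assuming exponential decay of the accumulated sums (hypothetical; icefall keeps these statistics in its own tracker class):

```python
# Sketch of a frame-weighted running loss with exponentially decayed sums.
class RunningLoss:
    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0    # decayed sum of loss * frames
        self.frame_sum = 0.0   # matching decayed frame count

    def update(self, batch_loss: float, num_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * num_frames
        self.frame_sum = self.decay * self.frame_sum + num_frames
        return self.loss_sum / self.frame_sum  # the tot_loss value logged


tracker = RunningLoss()
# With ~13e3 frames per batch and decay=0.995, frame_sum settles near
# 13e3 / (1 - 0.995) = 2.6e6, matching the "over ~2.58e6 frames" totals.
```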
2024-06-19 23:05:33,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=80516.33333333333, ans=0.125
2024-06-19 23:05:36,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=80534.66666666667, ans=0.0
2024-06-19 23:05:37,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=80534.66666666667, ans=0.0
2024-06-19 23:05:42,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=80553.0, ans=0.125
2024-06-19 23:05:42,867 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.572e+02 1.500e+03 1.792e+03 2.258e+03 5.058e+03, threshold=3.585e+03, percent-clipped=2.0
2024-06-19 23:05:43,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=80553.0, ans=0.125
2024-06-19 23:05:51,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=80571.33333333333, ans=10.0
2024-06-19 23:06:01,252 INFO [train.py:1028] (1/2) Epoch 5, batch 3500, loss[loss=0.3303, simple_loss=0.3369, pruned_loss=0.1618, over 12958.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3391, pruned_loss=0.1693, over 2575410.94 frames. ], batch size: 33, lr: 1.16e-02, grad_scale: 2.0
2024-06-19 23:06:03,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=80608.0, ans=0.0
2024-06-19 23:06:14,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=80644.66666666667, ans=0.0
2024-06-19 23:06:16,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=80644.66666666667, ans=0.125
2024-06-19 23:06:20,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.47 vs. limit=15.0
2024-06-19 23:06:27,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=80663.0, ans=0.025
2024-06-19 23:06:31,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=80663.0, ans=0.07
2024-06-19 23:06:33,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=12.0
2024-06-19 23:06:36,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=80681.33333333333, ans=0.0
2024-06-19 23:06:39,649 INFO [train.py:1028] (1/2) Epoch 5, batch 3550, loss[loss=0.3036, simple_loss=0.3109, pruned_loss=0.1481, over 13203.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3375, pruned_loss=0.1675, over 2577583.54 frames. ], batch size: 95, lr: 1.16e-02, grad_scale: 2.0
2024-06-19 23:06:40,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=80699.66666666667, ans=0.07
2024-06-19 23:06:41,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=80699.66666666667, ans=0.025
2024-06-19 23:06:48,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.15 vs. limit=12.0
2024-06-19 23:06:53,707 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.280e+02 1.174e+03 1.356e+03 1.634e+03 4.681e+03, threshold=2.712e+03, percent-clipped=1.0
2024-06-19 23:07:15,748 INFO [train.py:1028] (1/2) Epoch 5, batch 3600, loss[loss=0.3307, simple_loss=0.3411, pruned_loss=0.1602, over 13265.00 frames. ], tot_loss[loss=0.3356, simple_loss=0.3371, pruned_loss=0.1671, over 2580566.35 frames. ], batch size: 49, lr: 1.16e-02, grad_scale: 2.0
2024-06-19 23:07:19,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=80791.33333333333, ans=0.2
2024-06-19 23:07:22,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=80809.66666666667, ans=0.1
2024-06-19 23:07:24,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=80809.66666666667, ans=0.2
2024-06-19 23:07:29,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=80809.66666666667, ans=0.125
2024-06-19 23:07:43,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=80846.33333333333, ans=0.125
2024-06-19 23:07:52,182 INFO [train.py:1028] (1/2) Epoch 5, batch 3650, loss[loss=0.2958, simple_loss=0.3034, pruned_loss=0.1441, over 13054.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3372, pruned_loss=0.1669, over 2579858.59 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 2.0
2024-06-19 23:08:07,131 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.756e+02 1.233e+03 1.484e+03 1.778e+03 3.637e+03, threshold=2.968e+03, percent-clipped=4.0
2024-06-19 23:08:25,133 INFO [train.py:1028] (1/2) Epoch 5, batch 3700, loss[loss=0.3558, simple_loss=0.3562, pruned_loss=0.1777, over 13258.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.3344, pruned_loss=0.1645, over 2584935.13 frames. ], batch size: 72, lr: 1.16e-02, grad_scale: 2.0
2024-06-19 23:08:26,582 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 23:08:29,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=80974.66666666667, ans=0.125
2024-06-19 23:08:32,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.42 vs. limit=15.0
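Many of the scheduled values above are *_skip_rate entries (attention_skip_rate, conv_skip_rate, ff2/ff3_skip_rate, bypass.skip_rate): with that probability, a sub-module's contribution is dropped during training, a stochastic-depth-style regularizer whose rate is itself a ScheduledFloat of batch_count. A conceptual sketch of the mechanism, not the Zipformer code itself:

```python
# Conceptual sketch of a scheduled skip: drop a sub-module's residual branch
# with probability skip_rate during training. `module` is any nn.Module.
import torch
import torch.nn as nn


class SkippableResidual(nn.Module):
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module

    def forward(self, x: torch.Tensor, skip_rate: float) -> torch.Tensor:
        # skip_rate would come from a ScheduledFloat evaluated at batch_count
        if self.training and torch.rand(()) < skip_rate:
            return x                  # bypass the branch entirely
        return x + self.module(x)     # normal residual path
```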
limit=15.0 2024-06-19 23:08:42,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=81011.33333333333, ans=0.2 2024-06-19 23:08:55,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=81048.0, ans=0.125 2024-06-19 23:08:58,613 INFO [train.py:1028] (1/2) Epoch 5, batch 3750, loss[loss=0.3669, simple_loss=0.3607, pruned_loss=0.1866, over 12412.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3339, pruned_loss=0.1642, over 2586870.78 frames. ], batch size: 22, lr: 1.16e-02, grad_scale: 1.0 2024-06-19 23:09:00,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.50 vs. limit=10.0 2024-06-19 23:09:01,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=81066.33333333333, ans=0.125 2024-06-19 23:09:18,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=81103.0, ans=0.0 2024-06-19 23:09:18,510 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.153e+03 2.043e+03 2.365e+03 2.732e+03 4.895e+03, threshold=4.730e+03, percent-clipped=19.0 2024-06-19 23:09:19,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=81103.0, ans=0.125 2024-06-19 23:09:38,267 INFO [train.py:1028] (1/2) Epoch 5, batch 3800, loss[loss=0.31, simple_loss=0.319, pruned_loss=0.1505, over 13184.00 frames. ], tot_loss[loss=0.332, simple_loss=0.3346, pruned_loss=0.1648, over 2584349.46 frames. ], batch size: 83, lr: 1.16e-02, grad_scale: 1.0 2024-06-19 23:09:39,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81158.0, ans=0.1 2024-06-19 23:09:41,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=81158.0, ans=0.125 2024-06-19 23:09:46,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=81176.33333333333, ans=0.2 2024-06-19 23:09:46,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=81176.33333333333, ans=0.0 2024-06-19 23:09:50,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=81176.33333333333, ans=0.0 2024-06-19 23:10:02,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=81213.0, ans=0.025 2024-06-19 23:10:10,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.37 vs. limit=12.0 2024-06-19 23:10:10,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=81249.66666666667, ans=0.125 2024-06-19 23:10:11,407 INFO [train.py:1028] (1/2) Epoch 5, batch 3850, loss[loss=0.319, simple_loss=0.3166, pruned_loss=0.1607, over 13042.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.3344, pruned_loss=0.1645, over 2584371.84 frames. 
], batch size: 144, lr: 1.16e-02, grad_scale: 0.5 2024-06-19 23:10:11,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.86 vs. limit=15.0 2024-06-19 23:10:14,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81249.66666666667, ans=0.1 2024-06-19 23:10:19,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=81268.0, ans=0.0 2024-06-19 23:10:25,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=81286.33333333333, ans=0.04949747468305833 2024-06-19 23:10:27,935 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=5.704e+00 2024-06-19 23:10:29,100 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.493e+03 2.337e+03 2.713e+03 3.169e+03 6.664e+03, threshold=5.425e+03, percent-clipped=2.0 2024-06-19 23:10:31,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=81304.66666666667, ans=0.125 2024-06-19 23:10:33,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=81304.66666666667, ans=0.0 2024-06-19 23:10:34,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=81304.66666666667, ans=0.02 2024-06-19 23:10:40,633 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.64 vs. limit=6.0 2024-06-19 23:10:41,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=81323.0, ans=0.125 2024-06-19 23:10:43,592 INFO [train.py:1028] (1/2) Epoch 5, batch 3900, loss[loss=0.3043, simple_loss=0.3116, pruned_loss=0.1486, over 13205.00 frames. ], tot_loss[loss=0.3326, simple_loss=0.3347, pruned_loss=0.1652, over 2587738.27 frames. ], batch size: 83, lr: 1.16e-02, grad_scale: 1.0 2024-06-19 23:10:58,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=81378.0, ans=0.125 2024-06-19 23:11:09,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.27 vs. limit=15.0 2024-06-19 23:11:10,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=81414.66666666667, ans=0.07 2024-06-19 23:11:10,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=81414.66666666667, ans=0.025 2024-06-19 23:11:13,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=81414.66666666667, ans=0.125 2024-06-19 23:11:16,370 INFO [train.py:1028] (1/2) Epoch 5, batch 3950, loss[loss=0.2866, simple_loss=0.2885, pruned_loss=0.1423, over 13115.00 frames. ], tot_loss[loss=0.3282, simple_loss=0.3315, pruned_loss=0.1625, over 2589853.93 frames. 
], batch size: 132, lr: 1.16e-02, grad_scale: 1.0 2024-06-19 23:11:16,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=81433.0, ans=0.2 2024-06-19 23:11:18,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=81433.0, ans=0.2 2024-06-19 23:11:27,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.49 vs. limit=22.5 2024-06-19 23:11:34,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=81469.66666666667, ans=0.125 2024-06-19 23:11:39,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=81469.66666666667, ans=0.125 2024-06-19 23:11:39,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=81469.66666666667, ans=0.0 2024-06-19 23:11:40,271 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.221e+03 1.594e+03 1.880e+03 2.309e+03 4.851e+03, threshold=3.760e+03, percent-clipped=0.0 2024-06-19 23:11:42,721 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.01 vs. limit=10.0 2024-06-19 23:11:46,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=81488.0, ans=0.125 2024-06-19 23:11:52,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=81506.33333333333, ans=0.125 2024-06-19 23:11:55,502 INFO [train.py:1028] (1/2) Epoch 5, batch 4000, loss[loss=0.3437, simple_loss=0.3514, pruned_loss=0.168, over 12906.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3312, pruned_loss=0.1625, over 2584440.11 frames. ], batch size: 39, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:12:01,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.83 vs. limit=10.0 2024-06-19 23:12:04,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=81543.0, ans=0.125 2024-06-19 23:12:05,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.24 vs. limit=15.0 2024-06-19 23:12:07,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=81543.0, ans=0.1 2024-06-19 23:12:07,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.36 vs. limit=10.0 2024-06-19 23:12:23,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=81598.0, ans=0.0 2024-06-19 23:12:28,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.08 vs. limit=15.0 2024-06-19 23:12:28,477 INFO [train.py:1028] (1/2) Epoch 5, batch 4050, loss[loss=0.3959, simple_loss=0.3596, pruned_loss=0.2161, over 10894.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3311, pruned_loss=0.1629, over 2581520.15 frames. 
], batch size: 304, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:12:35,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=15.0 2024-06-19 23:12:45,799 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.157e+03 1.932e+03 2.305e+03 2.608e+03 6.099e+03, threshold=4.610e+03, percent-clipped=3.0 2024-06-19 23:12:49,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=81671.33333333333, ans=0.125 2024-06-19 23:12:58,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=15.0 2024-06-19 23:13:01,017 INFO [train.py:1028] (1/2) Epoch 5, batch 4100, loss[loss=0.3097, simple_loss=0.3195, pruned_loss=0.1499, over 13149.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.332, pruned_loss=0.1638, over 2576514.89 frames. ], batch size: 103, lr: 1.16e-02, grad_scale: 4.0 2024-06-19 23:13:05,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=81708.0, ans=0.125 2024-06-19 23:13:10,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=81726.33333333333, ans=0.125 2024-06-19 23:13:14,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=81744.66666666667, ans=0.0 2024-06-19 23:13:22,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.19 vs. limit=22.5 2024-06-19 23:13:31,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=81781.33333333333, ans=0.0 2024-06-19 23:13:32,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.25 vs. limit=15.0 2024-06-19 23:13:34,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=81781.33333333333, ans=0.125 2024-06-19 23:13:41,114 INFO [train.py:1028] (1/2) Epoch 5, batch 4150, loss[loss=0.2946, simple_loss=0.3091, pruned_loss=0.14, over 13146.00 frames. ], tot_loss[loss=0.3291, simple_loss=0.3316, pruned_loss=0.1632, over 2575412.67 frames. ], batch size: 55, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:13:43,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=81799.66666666667, ans=0.0 2024-06-19 23:13:45,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=81799.66666666667, ans=0.125 2024-06-19 23:13:48,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=81818.0, ans=0.125 2024-06-19 23:13:49,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=81818.0, ans=0.025 2024-06-19 23:13:50,135 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.04 vs. 
limit=6.0 2024-06-19 23:13:50,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=81818.0, ans=0.07 2024-06-19 23:13:51,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=81818.0, ans=0.0 2024-06-19 23:13:59,582 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.352e+02 1.498e+03 1.854e+03 2.110e+03 3.690e+03, threshold=3.707e+03, percent-clipped=0.0 2024-06-19 23:14:00,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=81854.66666666667, ans=0.125 2024-06-19 23:14:07,841 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.51 vs. limit=10.0 2024-06-19 23:14:14,111 INFO [train.py:1028] (1/2) Epoch 5, batch 4200, loss[loss=0.3046, simple_loss=0.311, pruned_loss=0.1491, over 12995.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3308, pruned_loss=0.1627, over 2577533.22 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 4.0 2024-06-19 23:14:18,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81891.33333333333, ans=0.1 2024-06-19 23:14:23,835 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.65 vs. limit=15.0 2024-06-19 23:14:24,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=81909.66666666667, ans=0.2 2024-06-19 23:14:24,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81909.66666666667, ans=0.1 2024-06-19 23:14:30,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=81928.0, ans=0.0 2024-06-19 23:14:35,267 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.29 vs. limit=15.0 2024-06-19 23:14:47,677 INFO [train.py:1028] (1/2) Epoch 5, batch 4250, loss[loss=0.3045, simple_loss=0.3163, pruned_loss=0.1463, over 13312.00 frames. ], tot_loss[loss=0.3268, simple_loss=0.3299, pruned_loss=0.1618, over 2579974.00 frames. ], batch size: 46, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:15:02,788 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.714e+02 2024-06-19 23:15:07,170 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.140e+03 1.770e+03 1.968e+03 2.259e+03 4.961e+03, threshold=3.936e+03, percent-clipped=1.0 2024-06-19 23:15:14,828 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2024-06-19 23:15:15,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=82056.33333333333, ans=0.125 2024-06-19 23:15:20,195 INFO [train.py:1028] (1/2) Epoch 5, batch 4300, loss[loss=0.3009, simple_loss=0.3101, pruned_loss=0.1458, over 13191.00 frames. ], tot_loss[loss=0.326, simple_loss=0.3293, pruned_loss=0.1614, over 2580352.25 frames. 
], batch size: 59, lr: 1.15e-02, grad_scale: 2.0
2024-06-19 23:15:38,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82111.33333333333, ans=0.1
2024-06-19 23:15:52,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=82148.0, ans=0.125
2024-06-19 23:15:53,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=82148.0, ans=0.5
2024-06-19 23:15:56,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=82148.0, ans=0.025
2024-06-19 23:15:57,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.52 vs. limit=22.5
2024-06-19 23:15:59,102 INFO [train.py:1028] (1/2) Epoch 5, batch 4350, loss[loss=0.3435, simple_loss=0.3497, pruned_loss=0.1687, over 13161.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3293, pruned_loss=0.1613, over 2585822.07 frames. ], batch size: 59, lr: 1.15e-02, grad_scale: 2.0
2024-06-19 23:16:08,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=82184.66666666667, ans=0.0
2024-06-19 23:16:18,870 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.348e+02 1.875e+03 2.161e+03 2.490e+03 4.207e+03, threshold=4.321e+03, percent-clipped=2.0
2024-06-19 23:16:24,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=82221.33333333333, ans=0.0
2024-06-19 23:16:26,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=82239.66666666667, ans=0.025
2024-06-19 23:16:29,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=82239.66666666667, ans=0.125
2024-06-19 23:16:30,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=82239.66666666667, ans=0.0
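
Note on the optim.py warnings: each one prints five grad-norm statistics (min, 25%, median, 75%, max), a clipping threshold, and the percentage of recent batches that were clipped. In every warning in this log the threshold is Clipping_scale times the logged median, e.g. 2.0 x 2.161e+03 ~= 4.321e+03 in the warning just above. A toy sketch of that bookkeeping, assuming a simple sliding window of recent norms; the window size and all names here are illustrative, not the optimizer's actual internals.

    import torch

    # Toy sketch: keep recent gradient norms, clip at clipping_scale * median,
    # and report quartiles plus percent-clipped like the optim.py warnings.
    class GradNormTracker:
        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.window = window
            self.norms: list[float] = []

        def update(self, grad_norm: float) -> float:
            """Record a norm; return the factor to multiply gradients by."""
            self.norms = (self.norms + [grad_norm])[-self.window:]
            median = torch.tensor(self.norms).quantile(0.5).item()
            threshold = self.clipping_scale * median
            return min(1.0, threshold / (grad_norm + 1e-20))

        def report(self) -> str:
            t = torch.tensor(self.norms)
            q = [t.quantile(p).item() for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.clipping_scale * q[2]
            clipped = 100.0 * (t > threshold).float().mean().item()
            stats = " ".join(f"{v:.3e}" for v in q)
            return (f"Clipping_scale={self.clipping_scale}, grad-norm "
                    f"quartiles {stats}, threshold={threshold:.3e}, "
                    f"percent-clipped={clipped:.1f}")
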
2024-06-19 23:16:32,147 INFO [train.py:1028] (1/2) Epoch 5, batch 4400, loss[loss=0.3222, simple_loss=0.3189, pruned_loss=0.1628, over 13238.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3283, pruned_loss=0.1604, over 2586945.47 frames. ], batch size: 83, lr: 1.15e-02, grad_scale: 4.0
2024-06-19 23:16:34,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=82258.0, ans=0.0
2024-06-19 23:16:37,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=82258.0, ans=0.125
2024-06-19 23:16:42,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=82276.33333333333, ans=0.125
2024-06-19 23:16:55,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82313.0, ans=0.1
2024-06-19 23:16:55,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=82313.0, ans=0.2
2024-06-19 23:16:58,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=82331.33333333333, ans=0.025
2024-06-19 23:17:03,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=82331.33333333333, ans=0.0
2024-06-19 23:17:04,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=82349.66666666667, ans=0.125
2024-06-19 23:17:05,056 INFO [train.py:1028] (1/2) Epoch 5, batch 4450, loss[loss=0.2912, simple_loss=0.3035, pruned_loss=0.1394, over 13020.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3283, pruned_loss=0.1605, over 2581769.99 frames. ], batch size: 33, lr: 1.15e-02, grad_scale: 1.0
2024-06-19 23:17:05,492 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.17 vs. limit=22.5
2024-06-19 23:17:07,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=82349.66666666667, ans=0.125
2024-06-19 23:17:09,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=82349.66666666667, ans=0.1
2024-06-19 23:17:20,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=82386.33333333333, ans=0.125
2024-06-19 23:17:21,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=82386.33333333333, ans=0.125
2024-06-19 23:17:26,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.78 vs. limit=22.5
2024-06-19 23:17:28,447 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.476e+02 1.365e+03 1.583e+03 1.911e+03 3.208e+03, threshold=3.166e+03, percent-clipped=0.0
2024-06-19 23:17:31,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=82404.66666666667, ans=0.125
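
Note on grad_scale: the value printed with every batch summary is the fp16 loss-scale factor, and it moves around in this stretch (4.0 at batch 4400, 1.0 at batch 4450, 2.0 at batch 4500 just below). Standard dynamic loss scaling backs the scale off when a step produces inf/nan gradients and grows it again after a run of clean steps. A sketch of that general mechanism with PyTorch's stock scaler; the init/growth settings are illustrative defaults, not values read from this run.

    import torch
    from torch.cuda.amp import GradScaler, autocast

    # Illustrative settings; not this run's actual configuration.
    scaler = GradScaler(init_scale=1.0, growth_factor=2.0,
                        backoff_factor=0.5, growth_interval=2000)

    def training_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with autocast():                  # forward/loss in mixed precision
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()     # scale up so fp16 grads stay finite
        scaler.step(optimizer)            # skipped if gradients overflowed
        scaler.update()                   # back off on overflow, grow otherwise
        return loss.detach(), scaler.get_scale()  # the value logged as grad_scale
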
2024-06-19 23:17:43,286 INFO [train.py:1028] (1/2) Epoch 5, batch 4500, loss[loss=0.3093, simple_loss=0.3147, pruned_loss=0.152, over 13244.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3263, pruned_loss=0.1587, over 2586296.97 frames. ], batch size: 89, lr: 1.15e-02, grad_scale: 2.0
2024-06-19 23:17:44,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=82441.33333333333, ans=0.2
2024-06-19 23:17:55,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=82459.66666666667, ans=0.2
2024-06-19 23:17:57,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.65 vs. limit=22.5
2024-06-19 23:17:57,882 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.66 vs. limit=15.0
2024-06-19 23:18:07,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=82496.33333333333, ans=0.125
2024-06-19 23:18:11,327 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=15.0
2024-06-19 23:18:14,960 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=6.591e+00
2024-06-19 23:18:16,745 INFO [train.py:1028] (1/2) Epoch 5, batch 4550, loss[loss=0.3144, simple_loss=0.3288, pruned_loss=0.15, over 13266.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3265, pruned_loss=0.159, over 2589659.72 frames. ], batch size: 52, lr: 1.15e-02, grad_scale: 1.0
2024-06-19 23:18:24,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=82551.33333333333, ans=0.125
2024-06-19 23:18:30,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=82569.66666666667, ans=0.025
2024-06-19 23:18:34,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.28 vs. limit=15.0
2024-06-19 23:18:38,586 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=15.0
2024-06-19 23:18:39,450 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.159e+03 1.604e+03 1.872e+03 2.106e+03 4.111e+03, threshold=3.744e+03, percent-clipped=3.0
2024-06-19 23:18:48,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82606.33333333333, ans=0.1
2024-06-19 23:18:50,224 INFO [train.py:1028] (1/2) Epoch 5, batch 4600, loss[loss=0.3613, simple_loss=0.3454, pruned_loss=0.1885, over 12572.00 frames. ], tot_loss[loss=0.3229, simple_loss=0.3272, pruned_loss=0.1593, over 2585446.24 frames. ], batch size: 202, lr: 1.15e-02, grad_scale: 1.0
2024-06-19 23:18:53,496 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.355e+02
2024-06-19 23:18:56,074 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
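
Note on the Whitening lines: each one compares a per-module statistic against a configured limit (for instance metric=16.28 vs. limit=15.0 for a feed-forward output above). One standard way to quantify how far activations are from "white" is the ratio between the mean squared eigenvalue of their covariance and the square of the mean eigenvalue: it equals 1.0 for an isotropic covariance and grows as the spectrum becomes lopsided. The sketch below computes that ratio as an illustration of the kind of statistic such a check can track; it is not a transcription of scaling.py.

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        """Illustrative whiteness statistic for x of shape (frames, channels):
        mean squared eigenvalue of the covariance divided by the squared mean
        eigenvalue; 1.0 exactly when the covariance is a multiple of identity."""
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / x.shape[0]           # (C, C) sample covariance
        d = cov.shape[0]
        mean_eig = torch.diagonal(cov).mean()  # trace(C)/d
        mean_sq_eig = (cov * cov).sum() / d    # trace(C @ C)/d, C symmetric
        return (mean_sq_eig / (mean_eig ** 2 + 1e-20)).item()

    white = torch.randn(10000, 64)             # near-isotropic features
    scales = torch.ones(64)
    scales[:4] = 10.0                          # a few dominant channels
    print(whitening_metric(white))             # close to 1.0
    print(whitening_metric(white * scales))    # far above 1.0
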
2024-06-19 23:19:00,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.69 vs. limit=15.0
2024-06-19 23:19:03,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=82661.33333333333, ans=0.125
2024-06-19 23:19:12,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=82679.66666666667, ans=0.125
2024-06-19 23:19:13,512 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 23:19:14,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=82679.66666666667, ans=0.125
2024-06-19 23:19:15,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=82698.0, ans=0.125
2024-06-19 23:19:25,557 INFO [train.py:1028] (1/2) Epoch 5, batch 4650, loss[loss=0.3326, simple_loss=0.3296, pruned_loss=0.1678, over 13091.00 frames. ], tot_loss[loss=0.3211, simple_loss=0.3257, pruned_loss=0.1582, over 2588592.93 frames. ], batch size: 132, lr: 1.15e-02, grad_scale: 1.0
2024-06-19 23:19:25,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=82716.33333333333, ans=0.0
2024-06-19 23:19:37,528 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.69 vs. limit=15.0
2024-06-19 23:19:47,629 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.28 vs. limit=15.0
2024-06-19 23:19:51,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=82771.33333333333, ans=0.125
2024-06-19 23:19:53,124 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.035e+03 1.408e+03 1.621e+03 2.006e+03 4.457e+03, threshold=3.241e+03, percent-clipped=1.0
2024-06-19 23:19:55,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=82771.33333333333, ans=0.125
2024-06-19 23:19:55,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=82771.33333333333, ans=15.0
2024-06-19 23:20:00,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.02 vs. limit=6.0
2024-06-19 23:20:00,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=82789.66666666667, ans=0.125
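
Note on the ScheduledFloat lines: each pairs a named hyperparameter with the global batch_count and its current value (ans). By this point in training (batch_count around 82.7k) most skip rates have settled at 0.0 and the balancer probabilities at 0.125, which is what a schedule that decays over batch count and then flattens out would produce. A minimal sketch of a piecewise-linear schedule on batch count follows; the breakpoints are invented for illustration and are not this run's actual schedule.

    # Minimal sketch: a float hyperparameter scheduled piecewise-linearly on
    # batch count. Breakpoints below are invented, not this run's schedule.
    class PiecewiseLinearFloat:
        def __init__(self, *points: "tuple[float, float]"):
            self.points = sorted(points)  # (batch_count, value) breakpoints

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:     # interpolate within this segment
                    return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
            return pts[-1][1]

    # e.g. a skip rate decaying from 0.5 to 0.0 over the first 20k batches:
    skip_rate = PiecewiseLinearFloat((0.0, 0.5), (20000.0, 0.0))
    print(skip_rate.value(82771.33))  # 0.0, matching entries like ans=0.0 above
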
2024-06-19 23:20:04,058 INFO [train.py:1028] (1/2) Epoch 5, batch 4700, loss[loss=0.3206, simple_loss=0.333, pruned_loss=0.1541, over 12895.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3262, pruned_loss=0.1585, over 2584348.50 frames. ], batch size: 26, lr: 1.15e-02, grad_scale: 2.0
2024-06-19 23:20:06,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=82808.0, ans=0.125
2024-06-19 23:20:12,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=82826.33333333333, ans=0.09899494936611666
2024-06-19 23:20:23,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=82863.0, ans=0.2
2024-06-19 23:20:30,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=82881.33333333333, ans=0.0
2024-06-19 23:20:34,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=82881.33333333333, ans=0.0
2024-06-19 23:20:35,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=82881.33333333333, ans=0.5
2024-06-19 23:20:37,433 INFO [train.py:1028] (1/2) Epoch 5, batch 4750, loss[loss=0.3874, simple_loss=0.3613, pruned_loss=0.2068, over 12504.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3262, pruned_loss=0.1592, over 2581916.46 frames. ], batch size: 202, lr: 1.15e-02, grad_scale: 1.0
2024-06-19 23:20:37,832 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0
2024-06-19 23:20:39,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=82899.66666666667, ans=0.0
2024-06-19 23:20:56,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.26 vs. limit=22.5
2024-06-19 23:21:00,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.27 vs. limit=15.0
2024-06-19 23:21:00,416 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.091e+02 1.211e+03 1.524e+03 1.744e+03 5.539e+03, threshold=3.047e+03, percent-clipped=1.0
2024-06-19 23:21:10,669 INFO [train.py:1028] (1/2) Epoch 5, batch 4800, loss[loss=0.3312, simple_loss=0.3349, pruned_loss=0.1637, over 13210.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3252, pruned_loss=0.1581, over 2578233.35 frames. ], batch size: 63, lr: 1.15e-02, grad_scale: 2.0
2024-06-19 23:21:12,417 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=15.0
2024-06-19 23:21:14,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.02 vs. limit=15.0
2024-06-19 23:21:49,684 INFO [train.py:1028] (1/2) Epoch 5, batch 4850, loss[loss=0.3163, simple_loss=0.3175, pruned_loss=0.1576, over 13219.00 frames. ], tot_loss[loss=0.319, simple_loss=0.3238, pruned_loss=0.1571, over 2575431.62 frames. ], batch size: 89, lr: 1.15e-02, grad_scale: 2.0
2024-06-19 23:21:51,986 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.65 vs.
limit=15.0 2024-06-19 23:21:59,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.06 vs. limit=22.5 2024-06-19 23:22:00,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=83101.33333333333, ans=0.125 2024-06-19 23:22:02,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.36 vs. limit=15.0 2024-06-19 23:22:04,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=83119.66666666667, ans=10.0 2024-06-19 23:22:09,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=83138.0, ans=0.125 2024-06-19 23:22:13,579 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.833e+02 1.308e+03 1.488e+03 1.873e+03 4.798e+03, threshold=2.977e+03, percent-clipped=7.0 2024-06-19 23:22:14,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=83138.0, ans=0.0 2024-06-19 23:22:23,172 INFO [train.py:1028] (1/2) Epoch 5, batch 4900, loss[loss=0.3102, simple_loss=0.3166, pruned_loss=0.1519, over 13213.00 frames. ], tot_loss[loss=0.319, simple_loss=0.3236, pruned_loss=0.1572, over 2575879.09 frames. ], batch size: 59, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:22:23,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=83174.66666666667, ans=0.125 2024-06-19 23:22:24,912 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.05 vs. limit=15.0 2024-06-19 23:22:28,667 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.25 vs. limit=15.0 2024-06-19 23:22:35,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=83211.33333333333, ans=0.125 2024-06-19 23:22:56,140 INFO [train.py:1028] (1/2) Epoch 5, batch 4950, loss[loss=0.3457, simple_loss=0.3301, pruned_loss=0.1806, over 11217.00 frames. ], tot_loss[loss=0.321, simple_loss=0.325, pruned_loss=0.1585, over 2570813.79 frames. 
], batch size: 304, lr: 1.15e-02, grad_scale: 1.0 2024-06-19 23:22:58,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=83266.33333333333, ans=0.125 2024-06-19 23:22:59,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=83266.33333333333, ans=0.0 2024-06-19 23:23:09,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=83303.0, ans=0.025 2024-06-19 23:23:13,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=83303.0, ans=0.0 2024-06-19 23:23:23,602 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.403e+02 1.084e+03 1.336e+03 1.537e+03 4.727e+03, threshold=2.672e+03, percent-clipped=2.0 2024-06-19 23:23:24,628 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.52 vs. limit=22.5 2024-06-19 23:23:26,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=83339.66666666667, ans=0.0 2024-06-19 23:23:32,003 INFO [train.py:1028] (1/2) Epoch 5, batch 5000, loss[loss=0.2925, simple_loss=0.3015, pruned_loss=0.1418, over 13134.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.324, pruned_loss=0.1574, over 2574612.90 frames. ], batch size: 95, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:23:42,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.86 vs. limit=22.5 2024-06-19 23:23:49,109 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.26 vs. limit=15.0 2024-06-19 23:23:51,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=83394.66666666667, ans=0.025 2024-06-19 23:23:55,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=83413.0, ans=0.2 2024-06-19 23:23:57,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=83413.0, ans=0.2 2024-06-19 23:23:57,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=83413.0, ans=0.025 2024-06-19 23:24:09,144 INFO [train.py:1028] (1/2) Epoch 5, batch 5050, loss[loss=0.3183, simple_loss=0.3367, pruned_loss=0.15, over 13151.00 frames. ], tot_loss[loss=0.3167, simple_loss=0.3224, pruned_loss=0.1555, over 2572456.42 frames. ], batch size: 37, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:24:09,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2024-06-19 23:24:18,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.02 vs. limit=10.0 2024-06-19 23:24:20,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.49 vs. 
limit=22.5 2024-06-19 23:24:21,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=83486.33333333333, ans=15.0 2024-06-19 23:24:29,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=83504.66666666667, ans=0.125 2024-06-19 23:24:30,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=83504.66666666667, ans=0.125 2024-06-19 23:24:33,157 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.560e+02 1.207e+03 1.422e+03 1.667e+03 3.146e+03, threshold=2.844e+03, percent-clipped=1.0 2024-06-19 23:24:34,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.43 vs. limit=15.0 2024-06-19 23:24:35,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=83523.0, ans=0.125 2024-06-19 23:24:42,037 INFO [train.py:1028] (1/2) Epoch 5, batch 5100, loss[loss=0.303, simple_loss=0.3207, pruned_loss=0.1426, over 12871.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3226, pruned_loss=0.1561, over 2567884.78 frames. ], batch size: 39, lr: 1.14e-02, grad_scale: 4.0 2024-06-19 23:24:42,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=83541.33333333333, ans=0.0 2024-06-19 23:24:45,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=83541.33333333333, ans=0.1 2024-06-19 23:24:49,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=83559.66666666667, ans=0.125 2024-06-19 23:24:58,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=83578.0, ans=0.125 2024-06-19 23:24:59,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.69 vs. limit=15.0 2024-06-19 23:25:05,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.54 vs. limit=10.0 2024-06-19 23:25:13,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=83614.66666666667, ans=0.0 2024-06-19 23:25:13,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=83614.66666666667, ans=0.0 2024-06-19 23:25:18,720 INFO [train.py:1028] (1/2) Epoch 5, batch 5150, loss[loss=0.3211, simple_loss=0.3135, pruned_loss=0.1644, over 13077.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3227, pruned_loss=0.1566, over 2570339.44 frames. 
], batch size: 132, lr: 1.14e-02, grad_scale: 1.0 2024-06-19 23:25:20,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=83633.0, ans=0.125 2024-06-19 23:25:21,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83633.0, ans=0.125 2024-06-19 23:25:38,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=83669.66666666667, ans=0.025 2024-06-19 23:25:38,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=15.0 2024-06-19 23:25:40,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=83669.66666666667, ans=0.125 2024-06-19 23:25:41,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=83688.0, ans=0.1 2024-06-19 23:25:47,464 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.751e+02 1.183e+03 1.402e+03 1.679e+03 3.109e+03, threshold=2.805e+03, percent-clipped=1.0 2024-06-19 23:25:49,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=83706.33333333333, ans=0.125 2024-06-19 23:25:54,423 INFO [train.py:1028] (1/2) Epoch 5, batch 5200, loss[loss=0.284, simple_loss=0.2973, pruned_loss=0.1354, over 13152.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3227, pruned_loss=0.1565, over 2573442.57 frames. ], batch size: 95, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:26:00,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=12.03 vs. limit=12.0 2024-06-19 23:26:00,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=83743.0, ans=0.125 2024-06-19 23:26:06,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=83743.0, ans=0.0 2024-06-19 23:26:07,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=83761.33333333333, ans=0.125 2024-06-19 23:26:07,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=83761.33333333333, ans=0.0 2024-06-19 23:26:08,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=83761.33333333333, ans=0.1 2024-06-19 23:26:17,242 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.19 vs. limit=15.0 2024-06-19 23:26:19,922 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.64 vs. limit=22.5 2024-06-19 23:26:23,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=83798.0, ans=0.0 2024-06-19 23:26:27,340 INFO [train.py:1028] (1/2) Epoch 5, batch 5250, loss[loss=0.28, simple_loss=0.296, pruned_loss=0.132, over 13226.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3231, pruned_loss=0.1564, over 2568990.27 frames. 
], batch size: 52, lr: 1.14e-02, grad_scale: 1.0 2024-06-19 23:26:30,315 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.73 vs. limit=22.5 2024-06-19 23:26:31,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=83816.33333333333, ans=0.125 2024-06-19 23:26:35,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=83834.66666666667, ans=0.125 2024-06-19 23:26:35,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2024-06-19 23:26:39,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2024-06-19 23:26:43,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=83853.0, ans=0.125 2024-06-19 23:26:53,982 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.357e+02 1.471e+03 1.693e+03 1.982e+03 4.407e+03, threshold=3.386e+03, percent-clipped=3.0 2024-06-19 23:27:00,616 INFO [train.py:1028] (1/2) Epoch 5, batch 5300, loss[loss=0.3135, simple_loss=0.3147, pruned_loss=0.1562, over 13053.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3223, pruned_loss=0.1557, over 2566039.80 frames. ], batch size: 144, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:27:04,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=15.0 2024-06-19 23:27:07,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=83926.33333333333, ans=0.125 2024-06-19 23:27:07,962 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=8.625e+00 2024-06-19 23:27:15,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=83944.66666666667, ans=0.125 2024-06-19 23:27:19,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=83944.66666666667, ans=0.0 2024-06-19 23:27:24,680 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.42 vs. limit=12.0 2024-06-19 23:27:25,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=83963.0, ans=0.125 2024-06-19 23:27:31,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=83963.0, ans=0.125 2024-06-19 23:27:32,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=83963.0, ans=0.2 2024-06-19 23:27:39,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=83981.33333333333, ans=0.07 2024-06-19 23:27:40,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.07 vs. 
limit=15.0 2024-06-19 23:27:41,802 INFO [train.py:1028] (1/2) Epoch 5, batch 5350, loss[loss=0.3327, simple_loss=0.341, pruned_loss=0.1622, over 11779.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3231, pruned_loss=0.1567, over 2572933.13 frames. ], batch size: 17, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:27:42,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.18 vs. limit=15.0 2024-06-19 23:27:44,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83999.66666666667, ans=0.125 2024-06-19 23:27:44,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.53 vs. limit=15.0 2024-06-19 23:27:58,117 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.76 vs. limit=10.0 2024-06-19 23:28:00,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84054.66666666667, ans=0.1 2024-06-19 23:28:06,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=84073.0, ans=0.1 2024-06-19 23:28:07,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=84073.0, ans=0.125 2024-06-19 23:28:07,988 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.023e+02 1.502e+03 1.884e+03 2.236e+03 3.309e+03, threshold=3.768e+03, percent-clipped=0.0 2024-06-19 23:28:12,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84073.0, ans=0.1 2024-06-19 23:28:13,888 INFO [train.py:1028] (1/2) Epoch 5, batch 5400, loss[loss=0.3603, simple_loss=0.3391, pruned_loss=0.1908, over 12298.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.3236, pruned_loss=0.1577, over 2565993.92 frames. ], batch size: 240, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:28:20,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=84109.66666666667, ans=0.0 2024-06-19 23:28:20,748 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.48 vs. limit=15.0 2024-06-19 23:28:21,581 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.83 vs. 
limit=15.0 2024-06-19 23:28:22,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=84109.66666666667, ans=0.125 2024-06-19 23:28:24,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=84109.66666666667, ans=22.5 2024-06-19 23:28:31,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=84128.0, ans=0.0 2024-06-19 23:28:31,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=84128.0, ans=0.2 2024-06-19 23:28:41,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=84164.66666666667, ans=0.0 2024-06-19 23:28:44,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=84164.66666666667, ans=0.125 2024-06-19 23:28:47,216 INFO [train.py:1028] (1/2) Epoch 5, batch 5450, loss[loss=0.2815, simple_loss=0.2982, pruned_loss=0.1324, over 12466.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3243, pruned_loss=0.1579, over 2569601.35 frames. ], batch size: 25, lr: 1.14e-02, grad_scale: 1.0 2024-06-19 23:28:47,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=84183.0, ans=0.125 2024-06-19 23:28:57,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84201.33333333333, ans=0.1 2024-06-19 23:29:00,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.93 vs. limit=22.5 2024-06-19 23:29:01,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=84219.66666666667, ans=0.0 2024-06-19 23:29:03,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=84219.66666666667, ans=0.2 2024-06-19 23:29:21,323 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.850e+02 1.426e+03 1.698e+03 2.025e+03 5.135e+03, threshold=3.397e+03, percent-clipped=4.0 2024-06-19 23:29:23,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=84256.33333333333, ans=0.5 2024-06-19 23:29:26,552 INFO [train.py:1028] (1/2) Epoch 5, batch 5500, loss[loss=0.3878, simple_loss=0.3595, pruned_loss=0.208, over 12312.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3241, pruned_loss=0.158, over 2563416.50 frames. ], batch size: 241, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:29:28,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=84274.66666666667, ans=0.2 2024-06-19 23:29:36,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=84293.0, ans=0.125 2024-06-19 23:29:41,965 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.29 vs. 
limit=22.5 2024-06-19 23:29:44,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=84311.33333333333, ans=0.125 2024-06-19 23:29:46,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.70 vs. limit=15.0 2024-06-19 23:29:48,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.56 vs. limit=15.0 2024-06-19 23:29:55,556 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.83 vs. limit=22.5 2024-06-19 23:29:56,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=84348.0, ans=0.125 2024-06-19 23:29:59,560 INFO [train.py:1028] (1/2) Epoch 5, batch 5550, loss[loss=0.2752, simple_loss=0.2994, pruned_loss=0.1255, over 13266.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3231, pruned_loss=0.1565, over 2567436.30 frames. ], batch size: 43, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:30:00,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=84366.33333333333, ans=10.0 2024-06-19 23:30:07,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.04 vs. limit=22.5 2024-06-19 23:30:08,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2024-06-19 23:30:21,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=84421.33333333333, ans=15.0 2024-06-19 23:30:27,854 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.331e+02 1.136e+03 1.301e+03 1.603e+03 4.722e+03, threshold=2.601e+03, percent-clipped=3.0 2024-06-19 23:30:32,999 INFO [train.py:1028] (1/2) Epoch 5, batch 5600, loss[loss=0.2933, simple_loss=0.3, pruned_loss=0.1433, over 13241.00 frames. ], tot_loss[loss=0.3166, simple_loss=0.3221, pruned_loss=0.1555, over 2569879.19 frames. ], batch size: 89, lr: 1.14e-02, grad_scale: 4.0 2024-06-19 23:30:42,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.00 vs. limit=6.0 2024-06-19 23:30:47,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=84494.66666666667, ans=0.125 2024-06-19 23:30:54,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=84513.0, ans=0.0 2024-06-19 23:30:58,121 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2024-06-19 23:31:01,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=84531.33333333333, ans=0.125 2024-06-19 23:31:07,898 INFO [train.py:1028] (1/2) Epoch 5, batch 5650, loss[loss=0.372, simple_loss=0.3578, pruned_loss=0.1931, over 12451.00 frames. 
2024-06-19 23:31:20,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=84568.0, ans=0.0
2024-06-19 23:31:27,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=84568.0, ans=0.1
2024-06-19 23:31:39,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=84604.66666666667, ans=0.125
2024-06-19 23:31:40,200 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.01 vs. limit=22.5
2024-06-19 23:31:43,831 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.063e+02 1.196e+03 1.373e+03 1.555e+03 2.667e+03, threshold=2.746e+03, percent-clipped=1.0
2024-06-19 23:31:45,006 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.69 vs. limit=15.0
2024-06-19 23:31:49,061 INFO [train.py:1028] (1/2) Epoch 5, batch 5700, loss[loss=0.338, simple_loss=0.3372, pruned_loss=0.1694, over 13265.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3204, pruned_loss=0.1541, over 2579058.91 frames. ], batch size: 63, lr: 1.14e-02, grad_scale: 8.0
2024-06-19 23:32:03,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.40 vs. limit=15.0
2024-06-19 23:32:08,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=84696.33333333333, ans=0.0
2024-06-19 23:32:22,706 INFO [train.py:1028] (1/2) Epoch 5, batch 5750, loss[loss=0.3295, simple_loss=0.3215, pruned_loss=0.1688, over 12721.00 frames. ], tot_loss[loss=0.3167, simple_loss=0.3224, pruned_loss=0.1555, over 2579599.97 frames. ], batch size: 176, lr: 1.14e-02, grad_scale: 4.0
2024-06-19 23:32:26,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=84733.0, ans=0.125
2024-06-19 23:32:36,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=84769.66666666667, ans=0.2
2024-06-19 23:32:37,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=84769.66666666667, ans=0.125
2024-06-19 23:32:45,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=84788.0, ans=0.125
2024-06-19 23:32:47,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.31 vs. limit=10.0
2024-06-19 23:32:50,721 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.33 vs. limit=15.0
2024-06-19 23:32:52,845 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.243e+02 1.120e+03 1.334e+03 1.668e+03 3.505e+03, threshold=2.668e+03, percent-clipped=2.0
2024-06-19 23:32:56,023 INFO [train.py:1028] (1/2) Epoch 5, batch 5800, loss[loss=0.3533, simple_loss=0.3411, pruned_loss=0.1827, over 12780.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3236, pruned_loss=0.1566, over 2578627.96 frames. ], batch size: 176, lr: 1.14e-02, grad_scale: 2.0
2024-06-19 23:32:57,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=84824.66666666667, ans=0.0
2024-06-19 23:33:06,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=84843.0, ans=0.125
2024-06-19 23:33:08,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=84843.0, ans=0.025
2024-06-19 23:33:24,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.05 vs. limit=22.5
2024-06-19 23:33:24,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=84879.66666666667, ans=0.0
2024-06-19 23:33:28,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0
2024-06-19 23:33:29,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84898.0, ans=0.1
2024-06-19 23:33:35,605 INFO [train.py:1028] (1/2) Epoch 5, batch 5850, loss[loss=0.3723, simple_loss=0.3547, pruned_loss=0.1949, over 12483.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3263, pruned_loss=0.1583, over 2576469.56 frames. ], batch size: 202, lr: 1.14e-02, grad_scale: 1.0
2024-06-19 23:33:39,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.02 vs. limit=22.5
2024-06-19 23:33:44,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=84934.66666666667, ans=0.125
2024-06-19 23:33:46,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=84934.66666666667, ans=0.0
2024-06-19 23:33:49,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=84953.0, ans=0.125
2024-06-19 23:33:58,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=84971.33333333333, ans=0.0
2024-06-19 23:34:06,268 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.258e+02 1.085e+03 1.305e+03 1.663e+03 4.351e+03, threshold=2.609e+03, percent-clipped=1.0
2024-06-19 23:34:09,182 INFO [train.py:1028] (1/2) Epoch 5, batch 5900, loss[loss=0.3033, simple_loss=0.3086, pruned_loss=0.149, over 13096.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3284, pruned_loss=0.1594, over 2576261.37 frames. ], batch size: 121, lr: 1.13e-02, grad_scale: 2.0
2024-06-19 23:34:09,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.78 vs. limit=15.0
2024-06-19 23:34:10,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=85008.0, ans=0.125
2024-06-19 23:34:42,268 INFO [train.py:1028] (1/2) Epoch 5, batch 5950, loss[loss=0.3029, simple_loss=0.3093, pruned_loss=0.1483, over 13084.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3306, pruned_loss=0.1606, over 2581392.16 frames. ], batch size: 121, lr: 1.13e-02, grad_scale: 2.0
2024-06-19 23:34:46,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=85099.66666666667, ans=0.1
2024-06-19 23:34:49,575 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.95 vs. limit=15.0
2024-06-19 23:34:54,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=85118.0, ans=0.0
2024-06-19 23:35:05,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=85154.66666666667, ans=0.1
2024-06-19 23:35:09,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=85173.0, ans=0.1
2024-06-19 23:35:09,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=85173.0, ans=0.125
2024-06-19 23:35:13,112 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.870e+02 7.865e+02 9.410e+02 1.036e+03 2.310e+03, threshold=1.882e+03, percent-clipped=0.0
2024-06-19 23:35:15,780 INFO [train.py:1028] (1/2) Epoch 5, batch 6000, loss[loss=0.4093, simple_loss=0.3799, pruned_loss=0.2193, over 12193.00 frames. ], tot_loss[loss=0.3271, simple_loss=0.3317, pruned_loss=0.1613, over 2574179.37 frames. ], batch size: 240, lr: 1.13e-02, grad_scale: 4.0
2024-06-19 23:35:15,781 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-19 23:35:22,041 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([3.8078, 3.2671, 1.8571, 3.4422], device='cuda:1')
2024-06-19 23:35:25,924 INFO [train.py:1060] (1/2) Epoch 5, validation: loss=0.2463, simple_loss=0.2986, pruned_loss=0.09699, over 351949.00 frames.
2024-06-19 23:35:25,925 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB
2024-06-19 23:35:31,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85191.33333333333, ans=0.1
2024-06-19 23:35:35,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.52 vs. limit=15.0
2024-06-19 23:35:43,713 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.96 vs. limit=6.0
2024-06-19 23:35:51,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=85246.33333333333, ans=0.035
2024-06-19 23:35:57,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=85264.66666666667, ans=0.025
2024-06-19 23:35:59,524 INFO [train.py:1028] (1/2) Epoch 5, batch 6050, loss[loss=0.3106, simple_loss=0.328, pruned_loss=0.1466, over 12966.00 frames. ], tot_loss[loss=0.3288, simple_loss=0.3338, pruned_loss=0.1619, over 2577650.00 frames. ], batch size: 39, lr: 1.13e-02, grad_scale: 2.0
2024-06-19 23:36:02,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=85283.0, ans=0.04949747468305833
2024-06-19 23:36:05,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85301.33333333333, ans=0.1
2024-06-19 23:36:07,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=85301.33333333333, ans=0.125
2024-06-19 23:36:17,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=85319.66666666667, ans=0.125
2024-06-19 23:36:20,413 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.77 vs. limit=22.5
2024-06-19 23:36:27,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0
2024-06-19 23:36:30,750 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.703e+02 7.655e+02 8.977e+02 1.084e+03 2.525e+03, threshold=1.795e+03, percent-clipped=3.0
2024-06-19 23:36:32,789 INFO [train.py:1028] (1/2) Epoch 5, batch 6100, loss[loss=0.3209, simple_loss=0.3255, pruned_loss=0.1582, over 13177.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3345, pruned_loss=0.162, over 2580389.55 frames. ], batch size: 121, lr: 1.13e-02, grad_scale: 4.0
2024-06-19 23:36:40,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=85393.0, ans=0.125
2024-06-19 23:36:46,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85411.33333333333, ans=0.1
2024-06-19 23:36:47,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=85411.33333333333, ans=0.2
2024-06-19 23:36:55,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=12.0
2024-06-19 23:36:56,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5
2024-06-19 23:36:58,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=85429.66666666667, ans=0.125
2024-06-19 23:37:05,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85448.0, ans=0.0
2024-06-19 23:37:07,054 INFO [train.py:1028] (1/2) Epoch 5, batch 6150, loss[loss=0.3773, simple_loss=0.3551, pruned_loss=0.1997, over 11014.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3372, pruned_loss=0.1637, over 2578539.18 frames. ], batch size: 303, lr: 1.13e-02, grad_scale: 1.0
2024-06-19 23:37:11,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=85466.33333333333, ans=0.125
2024-06-19 23:37:12,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=85466.33333333333, ans=0.5
2024-06-19 23:37:21,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=15.0
2024-06-19 23:37:38,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=85521.33333333333, ans=0.025
2024-06-19 23:37:46,919 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.288e+02 1.186e+03 1.451e+03 1.754e+03 3.034e+03, threshold=2.901e+03, percent-clipped=21.0
2024-06-19 23:37:47,677 INFO [train.py:1028] (1/2) Epoch 5, batch 6200, loss[loss=0.3595, simple_loss=0.3643, pruned_loss=0.1774, over 13229.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3389, pruned_loss=0.1649, over 2577191.73 frames. ], batch size: 89, lr: 1.13e-02, grad_scale: 2.0
2024-06-19 23:37:48,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=85558.0, ans=0.125
2024-06-19 23:37:49,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.08 vs. limit=22.5
2024-06-19 23:37:50,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=85558.0, ans=0.5
2024-06-19 23:37:56,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=85576.33333333333, ans=0.05
2024-06-19 23:38:08,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=85613.0, ans=0.0
2024-06-19 23:38:13,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=85613.0, ans=0.125
2024-06-19 23:38:14,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=85631.33333333333, ans=0.125
2024-06-19 23:38:22,417 INFO [train.py:1028] (1/2) Epoch 5, batch 6250, loss[loss=0.3239, simple_loss=0.3319, pruned_loss=0.1579, over 13266.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.3412, pruned_loss=0.1662, over 2571717.12 frames. ], batch size: 83, lr: 1.13e-02, grad_scale: 2.0
2024-06-19 23:38:34,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=85668.0, ans=0.125
2024-06-19 23:38:41,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.18 vs. limit=22.5
2024-06-19 23:38:42,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.02 vs. limit=12.0
2024-06-19 23:38:49,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=85723.0, ans=0.09899494936611666
2024-06-19 23:38:54,142 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.928e+02 1.007e+03 1.251e+03 1.504e+03 2.374e+03, threshold=2.503e+03, percent-clipped=0.0
2024-06-19 23:38:54,931 INFO [train.py:1028] (1/2) Epoch 5, batch 6300, loss[loss=0.3345, simple_loss=0.3395, pruned_loss=0.1648, over 11706.00 frames. ], tot_loss[loss=0.3392, simple_loss=0.3433, pruned_loss=0.1675, over 2567085.09 frames. ], batch size: 17, lr: 1.13e-02, grad_scale: 4.0
2024-06-19 23:39:00,564 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.72 vs. limit=12.0
2024-06-19 23:39:03,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=85759.66666666667, ans=0.0
2024-06-19 23:39:05,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=85759.66666666667, ans=0.125
2024-06-19 23:39:13,692 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.58 vs. limit=15.0
2024-06-19 23:39:18,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=85796.33333333333, ans=0.025
2024-06-19 23:39:32,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=85814.66666666667, ans=15.0
2024-06-19 23:39:34,571 INFO [train.py:1028] (1/2) Epoch 5, batch 6350, loss[loss=0.3938, simple_loss=0.3805, pruned_loss=0.2036, over 12587.00 frames. ], tot_loss[loss=0.3392, simple_loss=0.3446, pruned_loss=0.1669, over 2576117.08 frames. ], batch size: 202, lr: 1.13e-02, grad_scale: 4.0
2024-06-19 23:39:34,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0
2024-06-19 23:39:43,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85851.33333333333, ans=0.0
2024-06-19 23:39:48,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85869.66666666667, ans=0.1
2024-06-19 23:39:54,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=85888.0, ans=0.025
2024-06-19 23:39:58,942 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=4.207e+00
2024-06-19 23:40:04,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=85906.33333333333, ans=0.125
2024-06-19 23:40:07,128 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.677e+02 9.941e+02 1.140e+03 1.373e+03 2.975e+03, threshold=2.281e+03, percent-clipped=1.0
2024-06-19 23:40:07,156 INFO [train.py:1028] (1/2) Epoch 5, batch 6400, loss[loss=0.3417, simple_loss=0.3538, pruned_loss=0.1648, over 13225.00 frames. ], tot_loss[loss=0.3421, simple_loss=0.3473, pruned_loss=0.1685, over 2576495.72 frames. ], batch size: 67, lr: 1.13e-02, grad_scale: 4.0
2024-06-19 23:40:17,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=22.5
2024-06-19 23:40:20,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=85961.33333333333, ans=0.0
2024-06-19 23:40:24,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85961.33333333333, ans=0.1
2024-06-19 23:40:27,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.43 vs. limit=10.0
2024-06-19 23:40:29,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=85979.66666666667, ans=0.125
2024-06-19 23:40:39,238 INFO [train.py:1028] (1/2) Epoch 5, batch 6450, loss[loss=0.4143, simple_loss=0.3942, pruned_loss=0.2172, over 12540.00 frames. ], tot_loss[loss=0.3446, simple_loss=0.3496, pruned_loss=0.1697, over 2582146.48 frames. ], batch size: 202, lr: 1.13e-02, grad_scale: 4.0
2024-06-19 23:40:43,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=86016.33333333333, ans=0.2
2024-06-19 23:40:47,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=86034.66666666667, ans=0.125
2024-06-19 23:40:52,103 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-19 23:40:58,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=86071.33333333333, ans=0.125
2024-06-19 23:41:10,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=86108.0, ans=0.0
2024-06-19 23:41:10,818 INFO [train.py:1028] (1/2) Epoch 5, batch 6500, loss[loss=0.392, simple_loss=0.3679, pruned_loss=0.208, over 10788.00 frames. ], tot_loss[loss=0.3462, simple_loss=0.3515, pruned_loss=0.1704, over 2585283.56 frames. ], batch size: 304, lr: 1.13e-02, grad_scale: 4.0
2024-06-19 23:41:11,430 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.886e+02 1.068e+03 1.234e+03 1.425e+03 2.067e+03, threshold=2.467e+03, percent-clipped=0.0
2024-06-19 23:41:14,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=86108.0, ans=0.0
2024-06-19 23:41:17,987 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.998e+01
2024-06-19 23:41:34,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.84 vs. limit=22.5
2024-06-19 23:41:38,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=86163.0, ans=0.125
2024-06-19 23:41:44,211 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.51 vs. limit=10.0
2024-06-19 23:41:50,407 INFO [train.py:1028] (1/2) Epoch 5, batch 6550, loss[loss=0.3404, simple_loss=0.3625, pruned_loss=0.1591, over 12751.00 frames. ], tot_loss[loss=0.3468, simple_loss=0.3524, pruned_loss=0.1706, over 2589025.64 frames. ], batch size: 22, lr: 1.13e-02, grad_scale: 1.0
2024-06-19 23:41:58,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=86218.0, ans=0.0
2024-06-19 23:42:00,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.51 vs. limit=10.0
2024-06-19 23:42:00,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.55 vs. limit=12.0
2024-06-19 23:42:17,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=86273.0, ans=0.0
2024-06-19 23:42:22,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=86273.0, ans=0.125
2024-06-19 23:42:23,542 INFO [train.py:1028] (1/2) Epoch 5, batch 6600, loss[loss=0.3723, simple_loss=0.3699, pruned_loss=0.1873, over 13237.00 frames. ], tot_loss[loss=0.3457, simple_loss=0.3516, pruned_loss=0.1699, over 2590833.96 frames. ], batch size: 72, lr: 1.13e-02, grad_scale: 2.0
2024-06-19 23:42:24,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=86291.33333333333, ans=0.025
2024-06-19 23:42:25,655 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.611e+02 8.385e+02 1.022e+03 1.228e+03 2.669e+03, threshold=2.045e+03, percent-clipped=1.0
2024-06-19 23:42:34,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=86309.66666666667, ans=6.0
2024-06-19 23:42:56,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=86383.0, ans=0.0
2024-06-19 23:42:57,416 INFO [train.py:1028] (1/2) Epoch 5, batch 6650, loss[loss=0.3917, simple_loss=0.3845, pruned_loss=0.1995, over 12959.00 frames. ], tot_loss[loss=0.3504, simple_loss=0.3556, pruned_loss=0.1725, over 2585861.82 frames. ], batch size: 158, lr: 1.13e-02, grad_scale: 1.0
2024-06-19 23:43:00,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=86383.0, ans=0.125
2024-06-19 23:43:05,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=86401.33333333333, ans=0.125
2024-06-19 23:43:19,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=86438.0, ans=0.0
2024-06-19 23:43:20,176 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.49 vs. limit=6.0
2024-06-19 23:43:20,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86438.0, ans=0.1
2024-06-19 23:43:24,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.95 vs. limit=12.0
2024-06-19 23:43:25,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.27 vs. limit=10.0
2024-06-19 23:43:38,963 INFO [train.py:1028] (1/2) Epoch 5, batch 6700, loss[loss=0.4035, simple_loss=0.3905, pruned_loss=0.2082, over 12733.00 frames. ], tot_loss[loss=0.3518, simple_loss=0.3572, pruned_loss=0.1732, over 2584824.99 frames. ], batch size: 176, lr: 1.13e-02, grad_scale: 2.0
2024-06-19 23:43:41,607 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.296e+02 9.031e+02 1.069e+03 1.271e+03 2.365e+03, threshold=2.138e+03, percent-clipped=1.0
2024-06-19 23:43:50,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=86493.0, ans=0.125
2024-06-19 23:44:06,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=86548.0, ans=0.125
2024-06-19 23:44:13,293 INFO [train.py:1028] (1/2) Epoch 5, batch 6750, loss[loss=0.4664, simple_loss=0.433, pruned_loss=0.2499, over 12166.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.3578, pruned_loss=0.1738, over 2577436.25 frames. ], batch size: 240, lr: 1.12e-02, grad_scale: 2.0
2024-06-19 23:44:26,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=86603.0, ans=0.125
2024-06-19 23:44:27,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.83 vs. limit=15.0
2024-06-19 23:44:33,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=86621.33333333333, ans=0.0
2024-06-19 23:44:34,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=86621.33333333333, ans=0.2
2024-06-19 23:44:37,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=86621.33333333333, ans=0.125
2024-06-19 23:44:42,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=86639.66666666667, ans=0.0
2024-06-19 23:44:42,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=86639.66666666667, ans=0.125
2024-06-19 23:44:45,920 INFO [train.py:1028] (1/2) Epoch 5, batch 6800, loss[loss=0.3512, simple_loss=0.3583, pruned_loss=0.172, over 13251.00 frames. ], tot_loss[loss=0.354, simple_loss=0.3594, pruned_loss=0.1743, over 2579297.65 frames. ], batch size: 67, lr: 1.12e-02, grad_scale: 4.0
2024-06-19 23:44:48,354 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.581e+02 9.542e+02 1.164e+03 1.366e+03 1.896e+03, threshold=2.327e+03, percent-clipped=0.0
2024-06-19 23:44:49,399 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0
2024-06-19 23:44:54,616 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.30 vs. limit=15.0
2024-06-19 23:44:56,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86676.33333333333, ans=0.1
2024-06-19 23:45:01,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=86694.66666666667, ans=0.125
2024-06-19 23:45:12,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=86731.33333333333, ans=0.125
2024-06-19 23:45:18,488 INFO [train.py:1028] (1/2) Epoch 5, batch 6850, loss[loss=0.3594, simple_loss=0.3762, pruned_loss=0.1713, over 13218.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.3601, pruned_loss=0.1743, over 2582312.30 frames. ], batch size: 63, lr: 1.12e-02, grad_scale: 4.0
2024-06-19 23:45:27,843 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.33 vs. limit=15.0
2024-06-19 23:45:39,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.00 vs. limit=10.0
2024-06-19 23:45:51,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=86823.0, ans=0.125
2024-06-19 23:45:57,469 INFO [train.py:1028] (1/2) Epoch 5, batch 6900, loss[loss=0.3789, simple_loss=0.3806, pruned_loss=0.1886, over 13325.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.3605, pruned_loss=0.1742, over 2584469.19 frames. ], batch size: 49, lr: 1.12e-02, grad_scale: 2.0
2024-06-19 23:45:59,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.77 vs. limit=15.0
2024-06-19 23:46:02,029 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 9.084e+02 1.101e+03 1.310e+03 2.113e+03, threshold=2.202e+03, percent-clipped=0.0
2024-06-19 23:46:04,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=86859.66666666667, ans=0.0
2024-06-19 23:46:11,684 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=2.487e+02
2024-06-19 23:46:12,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.21 vs. limit=15.0
2024-06-19 23:46:28,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=86914.66666666667, ans=0.1
2024-06-19 23:46:31,038 INFO [train.py:1028] (1/2) Epoch 5, batch 6950, loss[loss=0.3199, simple_loss=0.3457, pruned_loss=0.1471, over 11060.00 frames. ], tot_loss[loss=0.3536, simple_loss=0.3603, pruned_loss=0.1735, over 2579078.99 frames. ], batch size: 16, lr: 1.12e-02, grad_scale: 1.0
2024-06-19 23:46:32,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.42 vs. limit=15.0
2024-06-19 23:46:46,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=86969.66666666667, ans=0.2
2024-06-19 23:46:48,061 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.85 vs. limit=12.0
2024-06-19 23:46:48,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=86969.66666666667, ans=0.2
2024-06-19 23:46:49,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=86969.66666666667, ans=0.125
2024-06-19 23:46:58,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=87006.33333333333, ans=0.125
2024-06-19 23:47:02,590 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.75 vs. limit=10.0
2024-06-19 23:47:04,293 INFO [train.py:1028] (1/2) Epoch 5, batch 7000, loss[loss=0.3742, simple_loss=0.3664, pruned_loss=0.191, over 12951.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.3598, pruned_loss=0.1728, over 2576109.41 frames. ], batch size: 158, lr: 1.12e-02, grad_scale: 2.0
2024-06-19 23:47:07,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=87024.66666666667, ans=0.0
2024-06-19 23:47:08,663 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.456e+02 9.861e+02 1.162e+03 1.346e+03 3.973e+03, threshold=2.324e+03, percent-clipped=6.0
2024-06-19 23:47:10,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=87043.0, ans=0.125
2024-06-19 23:47:23,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.37 vs. limit=22.5
2024-06-19 23:47:30,315 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.54 vs. limit=15.0
2024-06-19 23:47:31,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.70 vs. limit=12.0
2024-06-19 23:47:42,109 INFO [train.py:1028] (1/2) Epoch 5, batch 7050, loss[loss=0.3683, simple_loss=0.372, pruned_loss=0.1823, over 12768.00 frames. ], tot_loss[loss=0.354, simple_loss=0.3612, pruned_loss=0.1734, over 2582899.71 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 2.0
2024-06-19 23:47:42,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.36 vs. limit=12.0
2024-06-19 23:47:42,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=87116.33333333333, ans=0.0
2024-06-19 23:47:51,560 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.62 vs. limit=10.0
2024-06-19 23:47:57,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=87134.66666666667, ans=0.0
2024-06-19 23:48:03,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=87153.0, ans=0.125
2024-06-19 23:48:03,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87153.0, ans=0.1
2024-06-19 23:48:07,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=87171.33333333333, ans=0.2
2024-06-19 23:48:08,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=87171.33333333333, ans=0.0
2024-06-19 23:48:09,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=87171.33333333333, ans=0.125
2024-06-19 23:48:11,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=87171.33333333333, ans=0.0
2024-06-19 23:48:18,569 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.30 vs. limit=15.0
2024-06-19 23:48:19,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=87208.0, ans=0.0
2024-06-19 23:48:19,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=87208.0, ans=0.125
2024-06-19 23:48:20,052 INFO [train.py:1028] (1/2) Epoch 5, batch 7100, loss[loss=0.3865, simple_loss=0.3892, pruned_loss=0.1919, over 13136.00 frames. ], tot_loss[loss=0.355, simple_loss=0.3621, pruned_loss=0.174, over 2575411.63 frames. ], batch size: 112, lr: 1.12e-02, grad_scale: 4.0
2024-06-19 23:48:24,792 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.122e+02 8.838e+02 1.024e+03 1.247e+03 3.216e+03, threshold=2.049e+03, percent-clipped=1.0
2024-06-19 23:48:25,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.56 vs. limit=10.0
2024-06-19 23:48:25,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0
2024-06-19 23:48:26,705 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.07 vs. limit=6.0
2024-06-19 23:48:40,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=15.0
2024-06-19 23:48:46,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=87281.33333333333, ans=10.0
2024-06-19 23:48:53,274 INFO [train.py:1028] (1/2) Epoch 5, batch 7150, loss[loss=0.4157, simple_loss=0.4023, pruned_loss=0.2146, over 12480.00 frames. ], tot_loss[loss=0.3568, simple_loss=0.3638, pruned_loss=0.1749, over 2573162.80 frames. ], batch size: 202, lr: 1.12e-02, grad_scale: 4.0
2024-06-19 23:49:15,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=87354.66666666667, ans=0.125
2024-06-19 23:49:18,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=87354.66666666667, ans=0.0
2024-06-19 23:49:25,876 INFO [train.py:1028] (1/2) Epoch 5, batch 7200, loss[loss=0.3624, simple_loss=0.3729, pruned_loss=0.1759, over 13171.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.365, pruned_loss=0.1751, over 2577911.81 frames. ], batch size: 112, lr: 1.12e-02, grad_scale: 8.0
2024-06-19 23:49:29,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87391.33333333333, ans=0.1
2024-06-19 23:49:30,281 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=12.0
2024-06-19 23:49:30,391 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.863e+02 8.283e+02 9.668e+02 1.107e+03 1.833e+03, threshold=1.934e+03, percent-clipped=0.0
2024-06-19 23:49:31,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.26 vs. limit=15.0
2024-06-19 23:49:36,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=87409.66666666667, ans=0.2
2024-06-19 23:49:50,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=87446.33333333333, ans=0.07
2024-06-19 23:49:52,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=87446.33333333333, ans=0.125
2024-06-19 23:49:59,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=87464.66666666667, ans=0.125
2024-06-19 23:49:59,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=15.0
2024-06-19 23:50:04,607 INFO [train.py:1028] (1/2) Epoch 5, batch 7250, loss[loss=0.3312, simple_loss=0.3541, pruned_loss=0.1541, over 12974.00 frames. ], tot_loss[loss=0.357, simple_loss=0.3652, pruned_loss=0.1744, over 2578144.03 frames. ], batch size: 36, lr: 1.12e-02, grad_scale: 4.0
2024-06-19 23:50:05,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=87483.0, ans=0.125
2024-06-19 23:50:15,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87501.33333333333, ans=0.1
2024-06-19 23:50:17,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=87519.66666666667, ans=0.2
2024-06-19 23:50:17,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.23 vs. limit=15.0
2024-06-19 23:50:23,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=87538.0, ans=0.0
2024-06-19 23:50:25,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=87538.0, ans=0.05
2024-06-19 23:50:27,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=87538.0, ans=0.0
2024-06-19 23:50:33,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=87556.33333333333, ans=0.125
2024-06-19 23:50:36,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=87574.66666666667, ans=0.125
2024-06-19 23:50:37,417 INFO [train.py:1028] (1/2) Epoch 5, batch 7300, loss[loss=0.3349, simple_loss=0.3445, pruned_loss=0.1627, over 12917.00 frames. ], tot_loss[loss=0.3575, simple_loss=0.3657, pruned_loss=0.1747, over 2577612.79 frames. ], batch size: 36, lr: 1.12e-02, grad_scale: 4.0
2024-06-19 23:50:42,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=87574.66666666667, ans=0.0
2024-06-19 23:50:44,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.496e+02 8.648e+02 1.055e+03 1.180e+03 2.542e+03, threshold=2.110e+03, percent-clipped=4.0
2024-06-19 23:50:45,960 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.92 vs. limit=15.0
2024-06-19 23:50:46,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=87593.0, ans=0.125
2024-06-19 23:51:11,006 INFO [train.py:1028] (1/2) Epoch 5, batch 7350, loss[loss=0.3679, simple_loss=0.3727, pruned_loss=0.1815, over 13286.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.3664, pruned_loss=0.1753, over 2578453.33 frames. ], batch size: 46, lr: 1.12e-02, grad_scale: 1.0
2024-06-19 23:51:13,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=87666.33333333333, ans=0.125
2024-06-19 23:51:17,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=87684.66666666667, ans=0.025
2024-06-19 23:51:21,314 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.80 vs. limit=15.0
2024-06-19 23:51:30,899 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.61 vs. limit=15.0
2024-06-19 23:51:31,754 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=5.888e+01
2024-06-19 23:51:44,401 INFO [train.py:1028] (1/2) Epoch 5, batch 7400, loss[loss=0.3755, simple_loss=0.3837, pruned_loss=0.1836, over 13246.00 frames. ], tot_loss[loss=0.3572, simple_loss=0.3657, pruned_loss=0.1744, over 2583608.87 frames. ], batch size: 63, lr: 1.12e-02, grad_scale: 2.0
2024-06-19 23:51:47,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=87758.0, ans=0.0
2024-06-19 23:51:57,582 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.130e+02 9.127e+02 1.084e+03 1.229e+03 2.601e+03, threshold=2.169e+03, percent-clipped=2.0
2024-06-19 23:52:07,497 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.92 vs. limit=15.0
2024-06-19 23:52:15,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.12 vs. limit=15.0
2024-06-19 23:52:17,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=87813.0, ans=0.125
2024-06-19 23:52:27,086 INFO [train.py:1028] (1/2) Epoch 5, batch 7450, loss[loss=0.325, simple_loss=0.346, pruned_loss=0.152, over 12964.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.3652, pruned_loss=0.1737, over 2577866.14 frames. ], batch size: 30, lr: 1.12e-02, grad_scale: 1.0
2024-06-19 23:52:38,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87868.0, ans=0.1
2024-06-19 23:52:40,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0
2024-06-19 23:52:58,350 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=2.738e+02
2024-06-19 23:53:00,900 INFO [train.py:1028] (1/2) Epoch 5, batch 7500, loss[loss=0.387, simple_loss=0.3684, pruned_loss=0.2028, over 10715.00 frames. ], tot_loss[loss=0.3585, simple_loss=0.3669, pruned_loss=0.1751, over 2575156.73 frames. ], batch size: 303, lr: 1.12e-02, grad_scale: 2.0
2024-06-19 23:53:01,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=87941.33333333333, ans=0.0
2024-06-19 23:53:09,206 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.630e+02 1.095e+03 1.381e+03 1.649e+03 5.842e+03, threshold=2.762e+03, percent-clipped=6.0
2024-06-19 23:53:11,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5
2024-06-19 23:53:15,645 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.80 vs. limit=15.0
2024-06-19 23:53:18,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87978.0, ans=0.1
2024-06-19 23:53:21,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=87996.33333333333, ans=0.125
2024-06-19 23:53:38,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.34 vs. limit=22.5
2024-06-19 23:53:39,324 INFO [train.py:1028] (1/2) Epoch 5, batch 7550, loss[loss=0.3711, simple_loss=0.3677, pruned_loss=0.1873, over 12927.00 frames. ], tot_loss[loss=0.3605, simple_loss=0.3682, pruned_loss=0.1765, over 2574707.06 frames. ], batch size: 158, lr: 1.12e-02, grad_scale: 1.0
2024-06-19 23:53:39,440 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.075e+01
2024-06-19 23:53:41,926 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=4.399e+02
2024-06-19 23:53:47,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=88051.33333333333, ans=0.09899494936611666
2024-06-19 23:54:04,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=88088.0, ans=0.125
2024-06-19 23:54:11,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=88106.33333333333, ans=0.015
2024-06-19 23:54:18,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=22.5
2024-06-19 23:54:18,935 INFO [train.py:1028] (1/2) Epoch 5, batch 7600, loss[loss=0.3582, simple_loss=0.3707, pruned_loss=0.1728, over 13269.00 frames. ], tot_loss[loss=0.3622, simple_loss=0.3696, pruned_loss=0.1774, over 2575152.30 frames. ], batch size: 83, lr: 1.12e-02, grad_scale: 2.0
2024-06-19 23:54:27,852 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.439e+02 1.498e+03 1.714e+03 2.001e+03 4.296e+03, threshold=3.428e+03, percent-clipped=5.0
2024-06-19 23:54:28,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=88143.0, ans=0.125
2024-06-19 23:54:32,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=88161.33333333333, ans=0.125
2024-06-19 23:54:36,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.42 vs. limit=10.0
2024-06-19 23:54:38,719 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.04 vs. limit=6.0
2024-06-19 23:54:41,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=88179.66666666667, ans=0.125
2024-06-19 23:54:41,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=88179.66666666667, ans=0.2
2024-06-19 23:54:43,777 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=6.623e+02
2024-06-19 23:54:44,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=88179.66666666667, ans=0.125
2024-06-19 23:54:51,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=88198.0, ans=0.95
2024-06-19 23:54:52,515 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=23.63 vs. limit=15.0
2024-06-19 23:54:52,743 INFO [train.py:1028] (1/2) Epoch 5, batch 7650, loss[loss=0.3145, simple_loss=0.3455, pruned_loss=0.1418, over 12996.00 frames. ], tot_loss[loss=0.3616, simple_loss=0.3692, pruned_loss=0.177, over 2571218.54 frames. ], batch size: 33, lr: 1.11e-02, grad_scale: 2.0
2024-06-19 23:55:02,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=88234.66666666667, ans=0.0
2024-06-19 23:55:02,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=88234.66666666667, ans=0.0
2024-06-19 23:55:03,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=88234.66666666667, ans=0.1
2024-06-19 23:55:04,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88234.66666666667, ans=0.1
2024-06-19 23:55:06,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=88253.0, ans=0.1
2024-06-19 23:55:06,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=88253.0, ans=0.125
2024-06-19 23:55:26,101 INFO [train.py:1028] (1/2) Epoch 5, batch 7700, loss[loss=0.3803, simple_loss=0.3948, pruned_loss=0.1829, over 13303.00 frames. ], tot_loss[loss=0.3624, simple_loss=0.3698, pruned_loss=0.1775, over 2569099.84 frames. ], batch size: 63, lr: 1.11e-02, grad_scale: 4.0
2024-06-19 23:55:34,153 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.572e+02 1.418e+03 1.625e+03 2.054e+03 5.058e+03, threshold=3.250e+03, percent-clipped=2.0
2024-06-19 23:55:38,057 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.76 vs. limit=15.0
2024-06-19 23:55:39,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88344.66666666667, ans=0.1
2024-06-19 23:55:39,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=88344.66666666667, ans=0.0
2024-06-19 23:56:04,740 INFO [train.py:1028] (1/2) Epoch 5, batch 7750, loss[loss=0.3447, simple_loss=0.3625, pruned_loss=0.1635, over 13149.00 frames. ], tot_loss[loss=0.365, simple_loss=0.3716, pruned_loss=0.1792, over 2572975.10 frames. ], batch size: 72, lr: 1.11e-02, grad_scale: 2.0
2024-06-19 23:56:15,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=88418.0, ans=22.5
2024-06-19 23:56:18,990 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0
2024-06-19 23:56:21,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.82 vs. limit=22.5
2024-06-19 23:56:24,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=88454.66666666667, ans=0.125
2024-06-19 23:56:29,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88454.66666666667, ans=0.1
2024-06-19 23:56:31,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=88473.0, ans=0.125
2024-06-19 23:56:37,873 INFO [train.py:1028] (1/2) Epoch 5, batch 7800, loss[loss=0.3865, simple_loss=0.3855, pruned_loss=0.1938, over 13110.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.3721, pruned_loss=0.1791, over 2578238.85 frames. ], batch size: 95, lr: 1.11e-02, grad_scale: 4.0
2024-06-19 23:56:41,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=88491.33333333333, ans=0.125
2024-06-19 23:56:46,666 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.082e+03 1.940e+03 2.273e+03 2.600e+03 4.681e+03, threshold=4.546e+03, percent-clipped=5.0
2024-06-19 23:57:00,811 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=9.287e+02
2024-06-19 23:57:05,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=88564.66666666667, ans=0.125
2024-06-19 23:57:06,572 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0
2024-06-19 23:57:10,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=88564.66666666667, ans=0.0
2024-06-19 23:57:11,339 INFO [train.py:1028] (1/2) Epoch 5, batch 7850, loss[loss=0.2909, simple_loss=0.3222, pruned_loss=0.1298, over 11449.00 frames. ], tot_loss[loss=0.3667, simple_loss=0.3732, pruned_loss=0.1801, over 2571306.68 frames. ], batch size: 16, lr: 1.11e-02, grad_scale: 1.0
2024-06-19 23:57:12,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=88583.0, ans=0.125
2024-06-19 23:57:27,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=88619.66666666667, ans=0.0
2024-06-19 23:57:32,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=15.0
2024-06-19 23:57:37,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=88656.33333333333, ans=0.05
2024-06-19 23:57:44,055 INFO [train.py:1028] (1/2) Epoch 5, batch 7900, loss[loss=0.33, simple_loss=0.3502, pruned_loss=0.1549, over 13175.00 frames. ], tot_loss[loss=0.3674, simple_loss=0.3738, pruned_loss=0.1805, over 2570312.94 frames. ], batch size: 77, lr: 1.11e-02, grad_scale: 2.0
], batch size: 77, lr: 1.11e-02, grad_scale: 2.0 2024-06-19 23:57:44,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=88674.66666666667, ans=0.07 2024-06-19 23:57:49,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=88674.66666666667, ans=0.2 2024-06-19 23:58:01,127 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.112e+03 2.322e+03 2.611e+03 2.986e+03 5.679e+03, threshold=5.223e+03, percent-clipped=0.0 2024-06-19 23:58:02,885 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.06 vs. limit=15.0 2024-06-19 23:58:06,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=88711.33333333333, ans=0.5 2024-06-19 23:58:09,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=88729.66666666667, ans=0.0 2024-06-19 23:58:11,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=88729.66666666667, ans=0.125 2024-06-19 23:58:14,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.39 vs. limit=22.5 2024-06-19 23:58:21,298 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.23 vs. limit=15.0 2024-06-19 23:58:23,501 INFO [train.py:1028] (1/2) Epoch 5, batch 7950, loss[loss=0.3838, simple_loss=0.3679, pruned_loss=0.1998, over 10536.00 frames. ], tot_loss[loss=0.369, simple_loss=0.375, pruned_loss=0.1816, over 2574073.38 frames. ], batch size: 303, lr: 1.11e-02, grad_scale: 0.5 2024-06-19 23:58:23,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.41 vs. 
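A consistency check on the loss readouts: every train.py:1028 line in this section satisfies loss = 0.5 * simple_loss + pruned_loss to within rounding (for batch 7950 above, 0.5 * 0.3679 + 0.1998 = 0.38375, which prints as 0.3838). The 0.5 weight is inferred from the logged numbers, not taken from the training code:

    # (loss, simple_loss, pruned_loss) triples copied from the log above.
    samples = [
        (0.3803, 0.3948, 0.1829),  # Epoch 5, batch 7700, loss[...]
        (0.3624, 0.3698, 0.1775),  # Epoch 5, batch 7700, tot_loss[...]
        (0.3838, 0.3679, 0.1998),  # Epoch 5, batch 7950, loss[...]
    ]
    for loss, simple, pruned in samples:
        assert abs(0.5 * simple + pruned - loss) < 5e-4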
2024-06-19 23:58:24,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=88766.33333333333, ans=0.125
2024-06-19 23:58:29,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=88784.66666666667, ans=0.02
2024-06-19 23:58:33,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=88784.66666666667, ans=0.0
2024-06-19 23:58:34,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=88784.66666666667, ans=0.125
2024-06-19 23:58:35,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=88784.66666666667, ans=0.0
2024-06-19 23:58:40,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=88803.0, ans=0.0
2024-06-19 23:58:40,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=88803.0, ans=0.07
2024-06-19 23:58:43,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88821.33333333333, ans=0.1
2024-06-19 23:58:43,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=88821.33333333333, ans=0.035
2024-06-19 23:58:49,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.52 vs. limit=22.5
2024-06-19 23:58:56,960 INFO [train.py:1028] (1/2) Epoch 5, batch 8000, loss[loss=0.3353, simple_loss=0.3501, pruned_loss=0.1603, over 12769.00 frames. ], tot_loss[loss=0.3696, simple_loss=0.3753, pruned_loss=0.1819, over 2570946.41 frames. ], batch size: 29, lr: 1.11e-02, grad_scale: 1.0
2024-06-19 23:59:03,707 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-19 23:59:08,654 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.385e+03 2.201e+03 2.738e+03 3.249e+03 1.176e+04, threshold=5.475e+03, percent-clipped=4.0
2024-06-19 23:59:08,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=88876.33333333333, ans=0.0
2024-06-19 23:59:15,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=88894.66666666667, ans=0.125
2024-06-19 23:59:18,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88913.0, ans=0.1
2024-06-19 23:59:29,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=88931.33333333333, ans=0.0
2024-06-19 23:59:29,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=88931.33333333333, ans=0.125
2024-06-19 23:59:31,146 INFO [train.py:1028] (1/2) Epoch 5, batch 8050, loss[loss=0.3591, simple_loss=0.3735, pruned_loss=0.1724, over 13201.00 frames. ], tot_loss[loss=0.3687, simple_loss=0.3749, pruned_loss=0.1812, over 2572210.82 frames. ], batch size: 83, lr: 1.11e-02, grad_scale: 1.0
2024-06-19 23:59:34,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=88949.66666666667, ans=0.0
2024-06-19 23:59:48,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.08 vs. limit=22.5
2024-06-19 23:59:53,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=89004.66666666667, ans=0.0
2024-06-20 00:00:04,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.48 vs. limit=12.0
2024-06-20 00:00:07,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.29 vs. limit=15.0
2024-06-20 00:00:07,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=89023.0, ans=0.125
2024-06-20 00:00:09,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=89023.0, ans=0.025
2024-06-20 00:00:10,540 INFO [train.py:1028] (1/2) Epoch 5, batch 8100, loss[loss=0.3564, simple_loss=0.3695, pruned_loss=0.1717, over 13186.00 frames. ], tot_loss[loss=0.3698, simple_loss=0.3758, pruned_loss=0.182, over 2576500.57 frames. ], batch size: 112, lr: 1.11e-02, grad_scale: 2.0
2024-06-20 00:00:14,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=89041.33333333333, ans=0.0
2024-06-20 00:00:19,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=89059.66666666667, ans=0.125
2024-06-20 00:00:23,178 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.036e+03 2.085e+03 2.552e+03 2.957e+03 4.737e+03, threshold=5.105e+03, percent-clipped=0.0
2024-06-20 00:00:29,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89078.0, ans=0.1
2024-06-20 00:00:41,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=89114.66666666667, ans=0.2
2024-06-20 00:00:44,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=89114.66666666667, ans=0.125
2024-06-20 00:00:45,930 INFO [train.py:1028] (1/2) Epoch 5, batch 8150, loss[loss=0.3459, simple_loss=0.3514, pruned_loss=0.1703, over 13090.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.3755, pruned_loss=0.1815, over 2579611.64 frames. ], batch size: 121, lr: 1.11e-02, grad_scale: 1.0
2024-06-20 00:00:48,545 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.45 vs. limit=22.5
2024-06-20 00:00:54,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89151.33333333333, ans=0.1
2024-06-20 00:00:57,279 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.91 vs. limit=15.0
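The scaling.py:214 lines report module hyperparameters (dropout probabilities, skip rates, balancer targets) whose current value ("ans") is a function of batch_count. A minimal piecewise-linear schedule of that shape is sketched below; the real ScheduledFloat in icefall's scaling.py carries more machinery, and the breakpoints shown here are made up:

    def scheduled_float(batch_count, points):
        # points: sorted (batch_count, value) pairs, clamped at both ends.
        if batch_count <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return points[-1][1]

    # A dropout_p annealing from 0.3 to 0.1 over the first 20k batches would
    # read ans=0.1 at batch_count=88784.67, as in the lines above.
    assert scheduled_float(88784.67, [(0.0, 0.3), (20000.0, 0.1)]) == 0.1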
2024-06-20 00:00:57,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=89151.33333333333, ans=0.0
2024-06-20 00:01:00,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=89169.66666666667, ans=0.0
2024-06-20 00:01:05,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=89169.66666666667, ans=0.125
2024-06-20 00:01:06,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89188.0, ans=0.1
2024-06-20 00:01:19,352 INFO [train.py:1028] (1/2) Epoch 5, batch 8200, loss[loss=0.4025, simple_loss=0.4005, pruned_loss=0.2022, over 13131.00 frames. ], tot_loss[loss=0.3683, simple_loss=0.375, pruned_loss=0.1808, over 2583426.14 frames. ], batch size: 112, lr: 1.11e-02, grad_scale: 2.0
2024-06-20 00:01:27,336 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:01:30,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=89243.0, ans=0.5
2024-06-20 00:01:32,581 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.148e+03 1.594e+03 1.937e+03 2.171e+03 4.125e+03, threshold=3.874e+03, percent-clipped=0.0
2024-06-20 00:01:34,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=89261.33333333333, ans=0.125
2024-06-20 00:01:43,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=89279.66666666667, ans=0.125
2024-06-20 00:01:53,266 INFO [train.py:1028] (1/2) Epoch 5, batch 8250, loss[loss=0.3373, simple_loss=0.3662, pruned_loss=0.1542, over 13246.00 frames. ], tot_loss[loss=0.3681, simple_loss=0.3749, pruned_loss=0.1807, over 2583643.43 frames. ], batch size: 52, lr: 1.11e-02, grad_scale: 2.0
2024-06-20 00:02:11,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=89353.0, ans=0.125
2024-06-20 00:02:16,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=89371.33333333333, ans=0.0
2024-06-20 00:02:29,107 INFO [train.py:1028] (1/2) Epoch 5, batch 8300, loss[loss=0.4075, simple_loss=0.4014, pruned_loss=0.2069, over 12988.00 frames. ], tot_loss[loss=0.368, simple_loss=0.3748, pruned_loss=0.1806, over 2580858.60 frames. ], batch size: 102, lr: 1.11e-02, grad_scale: 2.0
2024-06-20 00:02:34,055 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.86 vs. limit=15.0
2024-06-20 00:02:35,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=89426.33333333333, ans=0.125
2024-06-20 00:02:39,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=89426.33333333333, ans=0.125
2024-06-20 00:02:40,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=89426.33333333333, ans=0.035
2024-06-20 00:02:41,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=1.98 vs. limit=15.0
2024-06-20 00:02:41,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0
2024-06-20 00:02:41,977 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.296e+02 1.441e+03 1.644e+03 1.921e+03 4.770e+03, threshold=3.288e+03, percent-clipped=2.0
2024-06-20 00:02:44,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=89444.66666666667, ans=0.125
2024-06-20 00:02:46,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=89444.66666666667, ans=0.0
2024-06-20 00:02:46,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=89444.66666666667, ans=0.2
2024-06-20 00:02:50,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.39 vs. limit=15.0
2024-06-20 00:02:54,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89463.0, ans=0.1
2024-06-20 00:03:01,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=89481.33333333333, ans=0.0
2024-06-20 00:03:02,447 INFO [train.py:1028] (1/2) Epoch 5, batch 8350, loss[loss=0.3838, simple_loss=0.3947, pruned_loss=0.1864, over 13170.00 frames. ], tot_loss[loss=0.3682, simple_loss=0.3755, pruned_loss=0.1805, over 2581300.94 frames. ], batch size: 112, lr: 1.11e-02, grad_scale: 2.0
2024-06-20 00:03:03,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=89499.66666666667, ans=0.0
2024-06-20 00:03:04,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=89499.66666666667, ans=0.02
2024-06-20 00:03:19,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=89536.33333333333, ans=0.2
2024-06-20 00:03:20,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=89536.33333333333, ans=0.2
2024-06-20 00:03:21,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89536.33333333333, ans=0.1
2024-06-20 00:03:29,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.14 vs. limit=15.0
2024-06-20 00:03:35,958 INFO [train.py:1028] (1/2) Epoch 5, batch 8400, loss[loss=0.345, simple_loss=0.349, pruned_loss=0.1706, over 12939.00 frames. ], tot_loss[loss=0.367, simple_loss=0.3744, pruned_loss=0.1798, over 2576723.72 frames. ], batch size: 39, lr: 1.11e-02, grad_scale: 4.0
2024-06-20 00:03:37,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=89591.33333333333, ans=0.125
2024-06-20 00:03:53,173 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.857e+02 1.754e+03 2.142e+03 2.602e+03 5.238e+03, threshold=4.285e+03, percent-clipped=5.0
2024-06-20 00:04:02,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.84 vs. limit=15.0
2024-06-20 00:04:02,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.24 vs. limit=15.0
2024-06-20 00:04:13,874 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:04:14,923 INFO [train.py:1028] (1/2) Epoch 5, batch 8450, loss[loss=0.3861, simple_loss=0.3893, pruned_loss=0.1914, over 13162.00 frames. ], tot_loss[loss=0.3696, simple_loss=0.3766, pruned_loss=0.1812, over 2579249.81 frames. ], batch size: 112, lr: 1.11e-02, grad_scale: 1.0
2024-06-20 00:04:34,616 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.48 vs. limit=15.0
2024-06-20 00:04:43,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=89756.33333333333, ans=0.025
2024-06-20 00:04:47,648 INFO [train.py:1028] (1/2) Epoch 5, batch 8500, loss[loss=0.3299, simple_loss=0.3451, pruned_loss=0.1573, over 12795.00 frames. ], tot_loss[loss=0.3706, simple_loss=0.3775, pruned_loss=0.1818, over 2577966.64 frames. ], batch size: 29, lr: 1.11e-02, grad_scale: 2.0
2024-06-20 00:04:51,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89774.66666666667, ans=0.1
2024-06-20 00:05:01,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=89811.33333333333, ans=0.125
2024-06-20 00:05:02,448 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.611e+02 1.623e+03 1.909e+03 2.315e+03 3.191e+03, threshold=3.818e+03, percent-clipped=0.0
2024-06-20 00:05:03,465 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=6.420e-01
2024-06-20 00:05:13,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89829.66666666667, ans=0.1
2024-06-20 00:05:13,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.17 vs. limit=15.0
2024-06-20 00:05:21,145 INFO [train.py:1028] (1/2) Epoch 5, batch 8550, loss[loss=0.3476, simple_loss=0.3597, pruned_loss=0.1678, over 12615.00 frames. ], tot_loss[loss=0.3685, simple_loss=0.376, pruned_loss=0.1805, over 2576359.60 frames. ], batch size: 22, lr: 1.10e-02, grad_scale: 2.0
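Each scaling.py:1023 line compares a whitening metric against a scheduled limit (metric=15.14 vs. limit=15.0 just above). A natural metric of this kind, and to my understanding roughly what the Whiten modules measure (the exact definition should be checked in scaling.py), is the eigenvalue-dispersion ratio E[lambda^2] / E[lambda]^2 of the channel covariance, which equals 1.0 for perfectly whitened features and grows as variance concentrates in a few directions:

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (num_frames, num_channels). Hedged sketch of a covariance-
        # anisotropy ratio; it is 1.0 iff the covariance is isotropic.
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / x.shape[0]           # (C, C) channel covariance
        d = cov.shape[0]
        mean_eig = torch.diagonal(cov).mean()  # trace(C) / d   = E[lambda]
        mean_eig_sq = (cov * cov).sum() / d    # trace(C @ C) / d = E[lambda^2]
        return (mean_eig_sq / mean_eig ** 2).item()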
2024-06-20 00:05:21,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=89866.33333333333, ans=0.0
2024-06-20 00:05:32,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=89884.66666666667, ans=0.125
2024-06-20 00:05:34,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=89903.0, ans=0.0
2024-06-20 00:05:38,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.65 vs. limit=15.0
2024-06-20 00:05:38,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=89903.0, ans=0.025
2024-06-20 00:05:42,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=89921.33333333333, ans=0.125
2024-06-20 00:05:50,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=89939.66666666667, ans=0.125
2024-06-20 00:05:58,204 INFO [train.py:1028] (1/2) Epoch 5, batch 8600, loss[loss=0.3603, simple_loss=0.366, pruned_loss=0.1773, over 13087.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.3772, pruned_loss=0.1815, over 2574275.16 frames. ], batch size: 121, lr: 1.10e-02, grad_scale: 4.0
2024-06-20 00:06:11,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.34 vs. limit=15.0
2024-06-20 00:06:13,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=89976.33333333333, ans=0.0
2024-06-20 00:06:17,111 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.359e+02 1.529e+03 1.779e+03 2.066e+03 2.906e+03, threshold=3.559e+03, percent-clipped=0.0
2024-06-20 00:06:18,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=89994.66666666667, ans=0.125
2024-06-20 00:06:19,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=89994.66666666667, ans=0.0
2024-06-20 00:06:19,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.85 vs. limit=15.0
2024-06-20 00:06:19,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=89994.66666666667, ans=0.125
2024-06-20 00:06:24,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=90013.0, ans=0.125
2024-06-20 00:06:24,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.37 vs. limit=15.0
2024-06-20 00:06:31,598 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.47 vs. limit=15.0
2024-06-20 00:06:35,085 INFO [train.py:1028] (1/2) Epoch 5, batch 8650, loss[loss=0.3493, simple_loss=0.3585, pruned_loss=0.17, over 13037.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.3769, pruned_loss=0.1809, over 2577136.22 frames. ], batch size: 102, lr: 1.10e-02, grad_scale: 1.0
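Batch sizes in this stretch range from 16 up to 303 while the per-batch frame counts stay in a narrow band (22 cuts over 12615 frames at batch 8550, 303 cuts over 10536 frames back at batch 7950): the sampler packs many short cuts or a few long ones into a roughly constant duration budget. A toy packing loop in that spirit; this is not lhotse's DynamicBucketingSampler, and the budget value is arbitrary:

    def pack_by_duration(durations, max_total=500.0):
        # Greedily group utterance durations (seconds) under a total budget;
        # sorting first approximates bucketing cuts of similar length.
        batches, batch, total = [], [], 0.0
        for d in sorted(durations):
            if batch and total + d > max_total:
                batches.append(batch)
                batch, total = [], 0.0
            batch.append(d)
            total += d
        if batch:
            batches.append(batch)
        return batches

    # Short cuts -> large batches, long cuts -> small batches, matching the
    # "batch size: 16 ... batch size: 303" spread seen in this log.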
2024-06-20 00:06:36,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=90049.66666666667, ans=0.0
2024-06-20 00:06:38,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.52 vs. limit=10.0
2024-06-20 00:06:49,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=90086.33333333333, ans=0.0
2024-06-20 00:06:50,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=90086.33333333333, ans=0.1
2024-06-20 00:06:51,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=90086.33333333333, ans=0.0
2024-06-20 00:06:51,118 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:06:53,611 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:06:56,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=90104.66666666667, ans=0.125
2024-06-20 00:07:03,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90123.0, ans=0.1
2024-06-20 00:07:08,025 INFO [train.py:1028] (1/2) Epoch 5, batch 8700, loss[loss=0.3684, simple_loss=0.3825, pruned_loss=0.1772, over 13152.00 frames. ], tot_loss[loss=0.3712, simple_loss=0.3782, pruned_loss=0.1821, over 2573327.99 frames. ], batch size: 59, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:07:11,039 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.54 vs. limit=10.0
2024-06-20 00:07:18,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=90159.66666666667, ans=0.09899494936611666
2024-06-20 00:07:20,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=90178.0, ans=0.125
2024-06-20 00:07:23,798 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.033e+03 1.559e+03 1.828e+03 2.182e+03 3.915e+03, threshold=3.656e+03, percent-clipped=3.0
2024-06-20 00:07:24,282 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.01 vs. limit=15.0
2024-06-20 00:07:26,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=90178.0, ans=0.0
2024-06-20 00:07:34,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90214.66666666667, ans=0.1
2024-06-20 00:07:41,063 INFO [train.py:1028] (1/2) Epoch 5, batch 8750, loss[loss=0.3463, simple_loss=0.3521, pruned_loss=0.1703, over 13043.00 frames. ], tot_loss[loss=0.3716, simple_loss=0.3784, pruned_loss=0.1824, over 2569816.73 frames. ], batch size: 121, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:07:42,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.88 vs. limit=10.0
2024-06-20 00:07:50,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=90233.0, ans=0.2
2024-06-20 00:08:03,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=90269.66666666667, ans=0.125
2024-06-20 00:08:04,606 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0
2024-06-20 00:08:09,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=90288.0, ans=0.1
2024-06-20 00:08:10,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.74 vs. limit=15.0
2024-06-20 00:08:11,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90288.0, ans=0.1
2024-06-20 00:08:11,289 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:08:14,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90306.33333333333, ans=0.1
2024-06-20 00:08:17,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=90306.33333333333, ans=0.0
2024-06-20 00:08:22,508 INFO [train.py:1028] (1/2) Epoch 5, batch 8800, loss[loss=0.3399, simple_loss=0.3556, pruned_loss=0.1621, over 13246.00 frames. ], tot_loss[loss=0.373, simple_loss=0.3795, pruned_loss=0.1832, over 2574143.06 frames. ], batch size: 72, lr: 1.10e-02, grad_scale: 4.0
2024-06-20 00:08:24,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=90324.66666666667, ans=0.125
2024-06-20 00:08:25,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=90324.66666666667, ans=0.0
2024-06-20 00:08:34,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=15.0
2024-06-20 00:08:35,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=90361.33333333333, ans=0.2
2024-06-20 00:08:38,404 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.591e+02 1.580e+03 1.812e+03 2.178e+03 3.237e+03, threshold=3.624e+03, percent-clipped=0.0
2024-06-20 00:08:46,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=90379.66666666667, ans=0.125
2024-06-20 00:08:54,736 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:08:55,967 INFO [train.py:1028] (1/2) Epoch 5, batch 8850, loss[loss=0.436, simple_loss=0.4189, pruned_loss=0.2266, over 12483.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.3802, pruned_loss=0.1843, over 2562758.67 frames. ], batch size: 202, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:09:00,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=90416.33333333333, ans=0.125
2024-06-20 00:09:01,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.93 vs. limit=15.0
2024-06-20 00:09:02,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=90434.66666666667, ans=0.05
2024-06-20 00:09:14,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.59 vs. limit=15.0
2024-06-20 00:09:19,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=90471.33333333333, ans=0.2
2024-06-20 00:09:29,496 INFO [train.py:1028] (1/2) Epoch 5, batch 8900, loss[loss=0.3906, simple_loss=0.3972, pruned_loss=0.192, over 12874.00 frames. ], tot_loss[loss=0.3751, simple_loss=0.3808, pruned_loss=0.1848, over 2561164.17 frames. ], batch size: 33, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:09:47,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=90544.66666666667, ans=0.0
2024-06-20 00:09:49,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.38 vs. limit=15.0
2024-06-20 00:09:49,569 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.875e+02 1.655e+03 1.930e+03 2.365e+03 3.999e+03, threshold=3.860e+03, percent-clipped=2.0
2024-06-20 00:09:50,068 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.88 vs. limit=15.0
2024-06-20 00:10:08,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=90581.33333333333, ans=0.025
2024-06-20 00:10:08,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=90599.66666666667, ans=0.05
2024-06-20 00:10:09,174 INFO [train.py:1028] (1/2) Epoch 5, batch 8950, loss[loss=0.3985, simple_loss=0.384, pruned_loss=0.2064, over 12561.00 frames. ], tot_loss[loss=0.3741, simple_loss=0.3804, pruned_loss=0.1839, over 2562998.94 frames. ], batch size: 202, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:10:32,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.93 vs. limit=15.0
2024-06-20 00:10:33,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=90654.66666666667, ans=0.125
2024-06-20 00:10:43,087 INFO [train.py:1028] (1/2) Epoch 5, batch 9000, loss[loss=0.35, simple_loss=0.3635, pruned_loss=0.1683, over 13328.00 frames. ], tot_loss[loss=0.3721, simple_loss=0.3796, pruned_loss=0.1823, over 2569694.43 frames. ], batch size: 46, lr: 1.10e-02, grad_scale: 4.0
2024-06-20 00:10:43,088 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 00:10:50,908 INFO [train.py:1060] (1/2) Epoch 5, validation: loss=0.2399, simple_loss=0.2944, pruned_loss=0.0927, over 351949.00 frames.
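The validation pass above (train.py:1051/1060) reports its loss over the same 351949.00 frames each time it runs, so successive validation numbers are directly comparable. A sketch of frame-weighted averaging in that spirit; forward_one_batch is a hypothetical helper, not a function from train.py:

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, dev_loader, device):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        for batch in dev_loader:
            loss, num_frames = forward_one_batch(model, batch, device)  # hypothetical
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
        model.train()
        return tot_loss / tot_frames  # frame-weighted, as in "over 351949.00 frames"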
2024-06-20 00:10:50,909 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17340MB
2024-06-20 00:10:55,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=90691.33333333333, ans=0.2
2024-06-20 00:11:08,282 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.025e+03 1.496e+03 1.864e+03 2.244e+03 3.253e+03, threshold=3.728e+03, percent-clipped=0.0
2024-06-20 00:11:15,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.18 vs. limit=10.0
2024-06-20 00:11:21,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.27 vs. limit=22.5
2024-06-20 00:11:24,196 INFO [train.py:1028] (1/2) Epoch 5, batch 9050, loss[loss=0.3395, simple_loss=0.3555, pruned_loss=0.1617, over 11572.00 frames. ], tot_loss[loss=0.3732, simple_loss=0.3807, pruned_loss=0.1828, over 2569063.35 frames. ], batch size: 17, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:11:32,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=90801.33333333333, ans=0.0
2024-06-20 00:11:32,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=90801.33333333333, ans=0.125
2024-06-20 00:11:39,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90819.66666666667, ans=0.1
2024-06-20 00:11:39,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=90819.66666666667, ans=0.0
2024-06-20 00:11:39,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=90819.66666666667, ans=0.125
2024-06-20 00:11:45,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=90838.0, ans=10.0
2024-06-20 00:11:56,776 INFO [train.py:1028] (1/2) Epoch 5, batch 9100, loss[loss=0.3665, simple_loss=0.3819, pruned_loss=0.1755, over 13228.00 frames. ], tot_loss[loss=0.373, simple_loss=0.3808, pruned_loss=0.1826, over 2569330.26 frames. ], batch size: 72, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:11:57,821 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.14 vs. limit=15.0
2024-06-20 00:12:00,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=90874.66666666667, ans=0.125
2024-06-20 00:12:14,864 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.275e+02 1.278e+03 1.542e+03 1.838e+03 2.613e+03, threshold=3.085e+03, percent-clipped=0.0
2024-06-20 00:12:18,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=90929.66666666667, ans=0.0
2024-06-20 00:12:24,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=90948.0, ans=0.07
2024-06-20 00:12:27,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=90948.0, ans=0.2
2024-06-20 00:12:29,020 INFO [train.py:1028] (1/2) Epoch 5, batch 9150, loss[loss=0.3314, simple_loss=0.3555, pruned_loss=0.1536, over 13171.00 frames. ], tot_loss[loss=0.3724, simple_loss=0.3804, pruned_loss=0.1822, over 2570663.37 frames. ], batch size: 77, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:12:35,029 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.25 vs. limit=15.0
2024-06-20 00:12:40,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=90984.66666666667, ans=0.125
2024-06-20 00:12:41,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.21 vs. limit=15.0
2024-06-20 00:12:54,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=91021.33333333333, ans=0.125
2024-06-20 00:13:03,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=91058.0, ans=0.035
2024-06-20 00:13:04,345 INFO [train.py:1028] (1/2) Epoch 5, batch 9200, loss[loss=0.3297, simple_loss=0.3562, pruned_loss=0.1516, over 12920.00 frames. ], tot_loss[loss=0.3698, simple_loss=0.3789, pruned_loss=0.1803, over 2573600.74 frames. ], batch size: 36, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:13:14,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=91076.33333333333, ans=0.125
2024-06-20 00:13:19,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=91076.33333333333, ans=0.0
2024-06-20 00:13:24,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=91094.66666666667, ans=0.1
2024-06-20 00:13:25,642 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.805e+02 1.447e+03 1.625e+03 1.984e+03 3.493e+03, threshold=3.250e+03, percent-clipped=1.0
2024-06-20 00:13:35,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91131.33333333333, ans=0.0
2024-06-20 00:13:39,181 INFO [train.py:1028] (1/2) Epoch 5, batch 9250, loss[loss=0.377, simple_loss=0.3914, pruned_loss=0.1813, over 13217.00 frames. ], tot_loss[loss=0.3682, simple_loss=0.3779, pruned_loss=0.1792, over 2575674.60 frames. ], batch size: 67, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:13:42,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=91149.66666666667, ans=0.0
2024-06-20 00:13:47,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=91168.0, ans=0.0
2024-06-20 00:13:48,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=91168.0, ans=0.125
2024-06-20 00:13:50,815 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.184e+02
2024-06-20 00:14:04,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=91223.0, ans=0.0
2024-06-20 00:14:10,989 INFO [train.py:1028] (1/2) Epoch 5, batch 9300, loss[loss=0.3895, simple_loss=0.3918, pruned_loss=0.1936, over 12953.00 frames. ], tot_loss[loss=0.3681, simple_loss=0.3779, pruned_loss=0.1792, over 2572176.09 frames. ], batch size: 39, lr: 1.10e-02, grad_scale: 4.0
2024-06-20 00:14:13,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=91241.33333333333, ans=0.125
2024-06-20 00:14:20,310 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.66 vs. limit=15.0
2024-06-20 00:14:22,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=91259.66666666667, ans=0.125
2024-06-20 00:14:24,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=91278.0, ans=0.0
2024-06-20 00:14:29,961 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.124e+03 1.794e+03 2.134e+03 2.597e+03 3.646e+03, threshold=4.268e+03, percent-clipped=3.0
2024-06-20 00:14:30,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=91296.33333333333, ans=0.0
2024-06-20 00:14:34,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=91296.33333333333, ans=0.125
2024-06-20 00:14:42,301 INFO [train.py:1028] (1/2) Epoch 5, batch 9350, loss[loss=0.3392, simple_loss=0.3594, pruned_loss=0.1595, over 12462.00 frames. ], tot_loss[loss=0.3691, simple_loss=0.3786, pruned_loss=0.1798, over 2568711.91 frames. ], batch size: 22, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:14:45,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=91333.0, ans=0.125
2024-06-20 00:14:59,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=91369.66666666667, ans=0.125
2024-06-20 00:15:10,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=91406.33333333333, ans=0.125
2024-06-20 00:15:12,444 INFO [train.py:1028] (1/2) Epoch 5, batch 9400, loss[loss=0.3724, simple_loss=0.3886, pruned_loss=0.1781, over 13274.00 frames. ], tot_loss[loss=0.3698, simple_loss=0.3788, pruned_loss=0.1804, over 2567908.95 frames. ], batch size: 52, lr: 1.10e-02, grad_scale: 2.0
2024-06-20 00:15:13,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=91424.66666666667, ans=0.2
2024-06-20 00:15:21,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=91443.0, ans=0.05
2024-06-20 00:15:31,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.06 vs. limit=22.5
2024-06-20 00:15:32,116 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.090e+03 1.739e+03 2.062e+03 2.444e+03 4.574e+03, threshold=4.125e+03, percent-clipped=1.0
2024-06-20 00:15:34,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.00 vs. limit=15.0
2024-06-20 00:15:43,369 INFO [train.py:1028] (1/2) Epoch 5, batch 9450, loss[loss=0.3737, simple_loss=0.3802, pruned_loss=0.1836, over 12642.00 frames. ], tot_loss[loss=0.3714, simple_loss=0.3799, pruned_loss=0.1815, over 2568053.51 frames. ], batch size: 22, lr: 1.09e-02, grad_scale: 1.0
2024-06-20 00:15:45,968 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:15:47,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91516.33333333333, ans=0.1
2024-06-20 00:15:49,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=91534.66666666667, ans=0.125
2024-06-20 00:15:53,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=91534.66666666667, ans=0.0
2024-06-20 00:15:59,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=91553.0, ans=0.2
2024-06-20 00:15:59,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=91553.0, ans=0.125
2024-06-20 00:16:00,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.64 vs. limit=22.5
2024-06-20 00:16:01,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=91571.33333333333, ans=0.125
2024-06-20 00:16:02,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=91571.33333333333, ans=0.95
2024-06-20 00:16:03,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=91571.33333333333, ans=0.0
2024-06-20 00:16:03,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=91571.33333333333, ans=0.125
2024-06-20 00:16:13,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.63 vs. limit=22.5
2024-06-20 00:16:15,478 INFO [train.py:1028] (1/2) Epoch 5, batch 9500, loss[loss=0.3689, simple_loss=0.3844, pruned_loss=0.1767, over 13203.00 frames. ], tot_loss[loss=0.3695, simple_loss=0.3787, pruned_loss=0.1802, over 2575694.70 frames. ], batch size: 43, lr: 1.09e-02, grad_scale: 2.0
2024-06-20 00:16:16,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=91608.0, ans=0.04949747468305833
2024-06-20 00:16:16,675 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.405e+02
2024-06-20 00:16:17,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=91608.0, ans=0.0
2024-06-20 00:16:19,283 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.08 vs. limit=15.0
2024-06-20 00:16:21,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=91626.33333333333, ans=0.125
2024-06-20 00:16:26,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=91626.33333333333, ans=0.0
2024-06-20 00:16:32,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.41 vs. limit=15.0
2024-06-20 00:16:34,508 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.03 vs. limit=15.0
2024-06-20 00:16:35,599 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:16:37,261 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.047e+03 1.499e+03 1.754e+03 2.022e+03 3.467e+03, threshold=3.507e+03, percent-clipped=0.0
2024-06-20 00:16:37,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=91663.0, ans=0.125
2024-06-20 00:16:40,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91663.0, ans=0.1
2024-06-20 00:16:48,804 INFO [train.py:1028] (1/2) Epoch 5, batch 9550, loss[loss=0.3516, simple_loss=0.3643, pruned_loss=0.1695, over 13043.00 frames. ], tot_loss[loss=0.3707, simple_loss=0.3794, pruned_loss=0.181, over 2571509.51 frames. ], batch size: 39, lr: 1.09e-02, grad_scale: 2.0
2024-06-20 00:16:49,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=91699.66666666667, ans=0.125
2024-06-20 00:16:55,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.53 vs. limit=10.0
2024-06-20 00:16:58,610 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.91 vs. limit=15.0
2024-06-20 00:17:03,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91736.33333333333, ans=0.0
2024-06-20 00:17:15,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=91773.0, ans=0.125
2024-06-20 00:17:19,690 INFO [train.py:1028] (1/2) Epoch 5, batch 9600, loss[loss=0.4289, simple_loss=0.4102, pruned_loss=0.2238, over 10711.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.3792, pruned_loss=0.1813, over 2570388.11 frames. ], batch size: 304, lr: 1.09e-02, grad_scale: 4.0
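The grad_scale printed with each training line (bouncing between 0.5 and 4.0 in this stretch) is the fp16 loss-scaling factor: it is cut when gradients overflow and grown back after a run of successful steps, which is why it drifts over time. A generic torch.cuda.amp pattern of that kind; the actual trainer wiring may differ:

    import torch

    scaler = torch.cuda.amp.GradScaler()  # maintains the grad_scale value

    def training_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()  # backprop with the scaled loss
        scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
        scaler.update()                # shrinks the scale on overflow, regrows later
        return loss.detach(), scaler.get_scale()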
2024-06-20 00:17:21,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=91791.33333333333, ans=0.125
2024-06-20 00:17:21,813 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:17:22,490 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:17:34,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5
2024-06-20 00:17:38,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=91846.33333333333, ans=0.2
2024-06-20 00:17:39,560 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.798e+02 1.541e+03 1.790e+03 2.144e+03 3.440e+03, threshold=3.580e+03, percent-clipped=0.0
2024-06-20 00:17:43,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=91864.66666666667, ans=0.125
2024-06-20 00:17:46,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=91864.66666666667, ans=0.0
2024-06-20 00:17:48,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.67 vs. limit=15.0
2024-06-20 00:17:50,495 INFO [train.py:1028] (1/2) Epoch 5, batch 9650, loss[loss=0.3726, simple_loss=0.3705, pruned_loss=0.1873, over 13156.00 frames. ], tot_loss[loss=0.3734, simple_loss=0.3806, pruned_loss=0.1831, over 2560193.80 frames. ], batch size: 132, lr: 1.09e-02, grad_scale: 4.0
2024-06-20 00:17:58,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=91901.33333333333, ans=0.2
2024-06-20 00:17:59,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=91901.33333333333, ans=0.025
2024-06-20 00:18:01,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=91901.33333333333, ans=0.1
2024-06-20 00:18:11,343 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=15.0
2024-06-20 00:18:21,655 INFO [train.py:1028] (1/2) Epoch 5, batch 9700, loss[loss=0.3737, simple_loss=0.3704, pruned_loss=0.1885, over 12982.00 frames. ], tot_loss[loss=0.373, simple_loss=0.3797, pruned_loss=0.1831, over 2555645.44 frames. ], batch size: 144, lr: 1.09e-02, grad_scale: 2.0
2024-06-20 00:18:21,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.54 vs. limit=10.0
2024-06-20 00:18:36,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.65 vs. limit=6.0
2024-06-20 00:18:44,934 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.066e+03 1.514e+03 1.825e+03 2.118e+03 3.438e+03, threshold=3.650e+03, percent-clipped=0.0
2024-06-20 00:18:48,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=92048.0, ans=0.0
2024-06-20 00:18:54,088 INFO [train.py:1028] (1/2) Epoch 5, batch 9750, loss[loss=0.3511, simple_loss=0.3607, pruned_loss=0.1707, over 13147.00 frames. ], tot_loss[loss=0.3707, simple_loss=0.378, pruned_loss=0.1817, over 2551731.11 frames. ], batch size: 132, lr: 1.09e-02, grad_scale: 0.5
2024-06-20 00:18:59,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=92066.33333333333, ans=0.2
2024-06-20 00:19:00,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.78 vs. limit=10.0
2024-06-20 00:19:06,453 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.10 vs. limit=6.0
2024-06-20 00:19:09,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92103.0, ans=0.1
2024-06-20 00:19:18,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=92121.33333333333, ans=0.125
2024-06-20 00:19:22,108 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.02 vs. limit=15.0
2024-06-20 00:19:26,415 INFO [train.py:1028] (1/2) Epoch 5, batch 9800, loss[loss=0.3811, simple_loss=0.3988, pruned_loss=0.1817, over 12926.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.3775, pruned_loss=0.1814, over 2545446.77 frames. ], batch size: 39, lr: 1.09e-02, grad_scale: 1.0
2024-06-20 00:19:28,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=92158.0, ans=0.2
2024-06-20 00:19:31,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.99 vs. limit=15.0
2024-06-20 00:19:34,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=92176.33333333333, ans=0.125
2024-06-20 00:19:38,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=92194.66666666667, ans=0.0
2024-06-20 00:19:39,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=15.0
2024-06-20 00:19:41,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=92194.66666666667, ans=0.125
2024-06-20 00:19:48,107 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.31 vs. limit=6.0
2024-06-20 00:19:48,366 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.579e+02 1.085e+03 1.322e+03 1.560e+03 2.640e+03, threshold=2.645e+03, percent-clipped=0.0
2024-06-20 00:19:56,121 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.96 vs. limit=6.0
2024-06-20 00:19:56,900 INFO [train.py:1028] (1/2) Epoch 5, batch 9850, loss[loss=0.3811, simple_loss=0.3849, pruned_loss=0.1886, over 13035.00 frames. ], tot_loss[loss=0.3684, simple_loss=0.3762, pruned_loss=0.1803, over 2538722.42 frames. ], batch size: 102, lr: 1.09e-02, grad_scale: 1.0
2024-06-20 00:19:57,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=92249.66666666667, ans=0.0
2024-06-20 00:19:57,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=92249.66666666667, ans=0.125
2024-06-20 00:19:59,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=92249.66666666667, ans=15.0
2024-06-20 00:20:14,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=92286.33333333333, ans=0.0
2024-06-20 00:20:17,476 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=15.0
2024-06-20 00:20:29,360 INFO [train.py:1028] (1/2) Epoch 5, batch 9900, loss[loss=0.3709, simple_loss=0.3782, pruned_loss=0.1819, over 12939.00 frames. ], tot_loss[loss=0.3675, simple_loss=0.3749, pruned_loss=0.1801, over 2530534.00 frames. ], batch size: 39, lr: 1.09e-02, grad_scale: 2.0
2024-06-20 00:20:32,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=92341.33333333333, ans=0.125
2024-06-20 00:20:47,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=92378.0, ans=0.125
2024-06-20 00:20:47,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=92378.0, ans=0.2
2024-06-20 00:20:51,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0
2024-06-20 00:20:51,971 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.070e+02 1.144e+03 1.320e+03 1.530e+03 5.096e+03, threshold=2.641e+03, percent-clipped=5.0
2024-06-20 00:20:54,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=92414.66666666667, ans=0.125
2024-06-20 00:20:57,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.26 vs. limit=15.0
2024-06-20 00:21:00,487 INFO [train.py:1028] (1/2) Epoch 5, batch 9950, loss[loss=0.3474, simple_loss=0.3647, pruned_loss=0.1651, over 12758.00 frames. ], tot_loss[loss=0.367, simple_loss=0.3738, pruned_loss=0.1801, over 2523741.00 frames. ], batch size: 29, lr: 1.09e-02, grad_scale: 2.0
2024-06-20 00:21:10,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=92451.33333333333, ans=10.0
2024-06-20 00:21:10,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=92451.33333333333, ans=0.125
2024-06-20 00:21:15,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=92469.66666666667, ans=0.125
2024-06-20 00:21:18,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=92469.66666666667, ans=0.125
2024-06-20 00:21:19,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=92488.0, ans=0.0
2024-06-20 00:21:32,785 INFO [train.py:1028] (1/2) Epoch 5, batch 10000, loss[loss=0.3493, simple_loss=0.3813, pruned_loss=0.1586, over 12493.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.3753, pruned_loss=0.1815, over 2487322.81 frames. ], batch size: 22, lr: 1.09e-02, grad_scale: 2.0
2024-06-20 00:21:33,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=92524.66666666667, ans=0.125
2024-06-20 00:21:38,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=92524.66666666667, ans=0.5
2024-06-20 00:21:43,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92543.0, ans=0.1
2024-06-20 00:21:56,598 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.006e+02 1.104e+03 1.322e+03 1.552e+03 3.840e+03, threshold=2.643e+03, percent-clipped=5.0
2024-06-20 00:21:59,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92598.0, ans=0.1
2024-06-20 00:21:59,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=92598.0, ans=0.125
2024-06-20 00:22:04,831 INFO [train.py:1028] (1/2) Epoch 5, batch 10050, loss[loss=0.3561, simple_loss=0.3738, pruned_loss=0.1691, over 12609.00 frames. ], tot_loss[loss=0.3711, simple_loss=0.376, pruned_loss=0.1831, over 2444583.55 frames. ], batch size: 22, lr: 1.09e-02, grad_scale: 2.0
2024-06-20 00:22:18,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.57 vs. limit=22.5
2024-06-20 00:22:19,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=92653.0, ans=22.5
2024-06-20 00:22:29,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.89 vs. limit=15.0
2024-06-20 00:22:34,602 INFO [train.py:1028] (1/2) Epoch 5, batch 10100, loss[loss=0.3157, simple_loss=0.3331, pruned_loss=0.1492, over 10771.00 frames. ], tot_loss[loss=0.3685, simple_loss=0.3747, pruned_loss=0.1812, over 2427065.25 frames. ], batch size: 16, lr: 1.09e-02, grad_scale: 4.0
], batch size: 16, lr: 1.09e-02, grad_scale: 4.0 2024-06-20 00:22:43,396 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.69 vs. limit=15.0 2024-06-20 00:24:51,480 INFO [train.py:1028] (1/2) Epoch 6, batch 0, loss[loss=0.3321, simple_loss=0.3509, pruned_loss=0.1566, over 12901.00 frames. ], tot_loss[loss=0.3321, simple_loss=0.3509, pruned_loss=0.1566, over 12901.00 frames. ], batch size: 36, lr: 1.02e-02, grad_scale: 8.0 2024-06-20 00:24:51,481 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 00:24:58,683 INFO [train.py:1060] (1/2) Epoch 6, validation: loss=0.2433, simple_loss=0.2974, pruned_loss=0.09461, over 351949.00 frames. 2024-06-20 00:24:58,683 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 00:25:01,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.34 vs. limit=22.5 2024-06-20 00:25:07,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=92759.33333333333, ans=0.125 2024-06-20 00:25:09,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=92759.33333333333, ans=0.125 2024-06-20 00:25:12,345 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.989e+02 1.139e+03 1.266e+03 1.488e+03 4.107e+03, threshold=2.532e+03, percent-clipped=1.0 2024-06-20 00:25:17,655 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.84 vs. limit=15.0 2024-06-20 00:25:17,979 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:25:23,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=92796.0, ans=0.2 2024-06-20 00:25:25,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=92814.33333333333, ans=0.0 2024-06-20 00:25:26,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=92814.33333333333, ans=0.025 2024-06-20 00:25:27,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=92814.33333333333, ans=0.0 2024-06-20 00:25:34,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92832.66666666667, ans=0.1 2024-06-20 00:25:35,478 INFO [train.py:1028] (1/2) Epoch 6, batch 50, loss[loss=0.3456, simple_loss=0.3587, pruned_loss=0.1663, over 12683.00 frames. ], tot_loss[loss=0.3472, simple_loss=0.355, pruned_loss=0.1696, over 573797.35 frames. 
], batch size: 29, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:25:46,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=92851.0, ans=6.0 2024-06-20 00:25:53,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=92869.33333333333, ans=6.0 2024-06-20 00:25:55,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.75 vs. limit=15.0 2024-06-20 00:26:10,164 INFO [train.py:1028] (1/2) Epoch 6, batch 100, loss[loss=0.3141, simple_loss=0.3335, pruned_loss=0.1473, over 13291.00 frames. ], tot_loss[loss=0.3405, simple_loss=0.3498, pruned_loss=0.1656, over 1016698.60 frames. ], batch size: 46, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:26:17,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=92942.66666666667, ans=0.0 2024-06-20 00:26:23,361 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.162e+03 1.710e+03 2.043e+03 2.456e+03 4.156e+03, threshold=4.087e+03, percent-clipped=23.0 2024-06-20 00:26:30,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=92979.33333333333, ans=0.0 2024-06-20 00:26:36,173 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.24 vs. limit=15.0 2024-06-20 00:26:40,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=92997.66666666667, ans=0.125 2024-06-20 00:26:41,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=93016.0, ans=0.0 2024-06-20 00:26:42,087 INFO [train.py:1028] (1/2) Epoch 6, batch 150, loss[loss=0.2997, simple_loss=0.3306, pruned_loss=0.1343, over 12637.00 frames. ], tot_loss[loss=0.3378, simple_loss=0.3482, pruned_loss=0.1637, over 1364499.08 frames. ], batch size: 29, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:26:54,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=93052.66666666667, ans=0.0 2024-06-20 00:27:06,530 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. limit=6.0 2024-06-20 00:27:08,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=93089.33333333333, ans=0.2 2024-06-20 00:27:13,817 INFO [train.py:1028] (1/2) Epoch 6, batch 200, loss[loss=0.3727, simple_loss=0.3623, pruned_loss=0.1916, over 12512.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3469, pruned_loss=0.1625, over 1634746.78 frames. 
], batch size: 202, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:27:15,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=93107.66666666667, ans=0.0 2024-06-20 00:27:23,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=93126.0, ans=0.125 2024-06-20 00:27:24,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=93126.0, ans=0.0 2024-06-20 00:27:27,695 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.052e+03 1.606e+03 1.902e+03 2.292e+03 3.851e+03, threshold=3.805e+03, percent-clipped=0.0 2024-06-20 00:27:27,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=93144.33333333333, ans=0.05 2024-06-20 00:27:31,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=93144.33333333333, ans=0.5 2024-06-20 00:27:39,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=93162.66666666667, ans=0.0 2024-06-20 00:27:39,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=93162.66666666667, ans=0.0 2024-06-20 00:27:40,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=93162.66666666667, ans=0.95 2024-06-20 00:27:43,726 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.60 vs. limit=6.0 2024-06-20 00:27:47,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=93181.0, ans=0.2 2024-06-20 00:27:48,508 INFO [train.py:1028] (1/2) Epoch 6, batch 250, loss[loss=0.3158, simple_loss=0.3193, pruned_loss=0.1562, over 13022.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3465, pruned_loss=0.1624, over 1846525.93 frames. ], batch size: 144, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:27:52,128 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.39 vs. limit=15.0 2024-06-20 00:27:52,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.43 vs. 
limit=15.0 2024-06-20 00:27:55,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=93217.66666666667, ans=0.09899494936611666 2024-06-20 00:27:59,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=93217.66666666667, ans=10.0 2024-06-20 00:28:01,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=93236.0, ans=0.0 2024-06-20 00:28:03,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=93236.0, ans=0.125 2024-06-20 00:28:14,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=93254.33333333333, ans=0.025 2024-06-20 00:28:18,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=93272.66666666667, ans=0.2 2024-06-20 00:28:23,554 INFO [train.py:1028] (1/2) Epoch 6, batch 300, loss[loss=0.3488, simple_loss=0.3485, pruned_loss=0.1745, over 13174.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3466, pruned_loss=0.1626, over 2008926.30 frames. ], batch size: 112, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:28:39,216 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.877e+02 2.032e+03 2.416e+03 2.827e+03 4.360e+03, threshold=4.831e+03, percent-clipped=3.0 2024-06-20 00:28:39,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93327.66666666667, ans=0.1 2024-06-20 00:28:54,888 INFO [train.py:1028] (1/2) Epoch 6, batch 350, loss[loss=0.3262, simple_loss=0.3445, pruned_loss=0.154, over 12932.00 frames. ], tot_loss[loss=0.3341, simple_loss=0.3452, pruned_loss=0.1615, over 2138547.64 frames. ], batch size: 33, lr: 1.01e-02, grad_scale: 1.0 2024-06-20 00:28:56,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=93382.66666666667, ans=0.0 2024-06-20 00:29:05,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=93401.0, ans=0.2 2024-06-20 00:29:05,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=93401.0, ans=0.0 2024-06-20 00:29:10,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93419.33333333333, ans=0.1 2024-06-20 00:29:13,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=93437.66666666667, ans=0.125 2024-06-20 00:29:16,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=93437.66666666667, ans=0.125 2024-06-20 00:29:18,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=93437.66666666667, ans=0.125 2024-06-20 00:29:29,367 INFO [train.py:1028] (1/2) Epoch 6, batch 400, loss[loss=0.3214, simple_loss=0.3356, pruned_loss=0.1536, over 13273.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.3443, pruned_loss=0.1606, over 2239222.74 frames. 
], batch size: 63, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:29:35,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=93492.66666666667, ans=0.125 2024-06-20 00:29:44,932 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.309e+02 1.605e+03 1.845e+03 2.027e+03 2.972e+03, threshold=3.690e+03, percent-clipped=0.0 2024-06-20 00:29:52,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=12.0 2024-06-20 00:30:00,381 INFO [train.py:1028] (1/2) Epoch 6, batch 450, loss[loss=0.3005, simple_loss=0.3271, pruned_loss=0.1369, over 13282.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3434, pruned_loss=0.1594, over 2313301.80 frames. ], batch size: 67, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:30:07,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=93584.33333333333, ans=0.125 2024-06-20 00:30:08,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=93584.33333333333, ans=0.125 2024-06-20 00:30:21,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=93602.66666666667, ans=0.125 2024-06-20 00:30:22,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=93602.66666666667, ans=0.125 2024-06-20 00:30:25,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.90 vs. limit=6.0 2024-06-20 00:30:28,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93621.0, ans=0.1 2024-06-20 00:30:29,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93621.0, ans=0.1 2024-06-20 00:30:37,200 INFO [train.py:1028] (1/2) Epoch 6, batch 500, loss[loss=0.294, simple_loss=0.3099, pruned_loss=0.1391, over 13138.00 frames. ], tot_loss[loss=0.3308, simple_loss=0.3435, pruned_loss=0.159, over 2376880.24 frames. ], batch size: 121, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:30:39,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.34 vs. limit=15.0 2024-06-20 00:30:49,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=93676.0, ans=0.0 2024-06-20 00:30:51,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93694.33333333333, ans=0.1 2024-06-20 00:30:53,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.34 vs. 
limit=22.5 2024-06-20 00:30:53,979 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.459e+02 1.224e+03 1.470e+03 1.758e+03 2.835e+03, threshold=2.940e+03, percent-clipped=0.0 2024-06-20 00:30:54,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=93694.33333333333, ans=0.125 2024-06-20 00:30:58,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0 2024-06-20 00:31:09,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=93749.33333333333, ans=0.0 2024-06-20 00:31:10,263 INFO [train.py:1028] (1/2) Epoch 6, batch 550, loss[loss=0.3492, simple_loss=0.356, pruned_loss=0.1712, over 12912.00 frames. ], tot_loss[loss=0.3301, simple_loss=0.3432, pruned_loss=0.1585, over 2420686.07 frames. ], batch size: 158, lr: 1.01e-02, grad_scale: 1.0 2024-06-20 00:31:18,428 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=22.5 2024-06-20 00:31:19,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=93767.66666666667, ans=0.025 2024-06-20 00:31:21,065 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.08 vs. limit=15.0 2024-06-20 00:31:22,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=93786.0, ans=0.125 2024-06-20 00:31:30,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=93804.33333333333, ans=0.2 2024-06-20 00:31:39,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=93822.66666666667, ans=0.0 2024-06-20 00:31:40,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=93822.66666666667, ans=0.125 2024-06-20 00:31:45,009 INFO [train.py:1028] (1/2) Epoch 6, batch 600, loss[loss=0.3189, simple_loss=0.3273, pruned_loss=0.1553, over 13052.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.3424, pruned_loss=0.1577, over 2459677.94 frames. 
], batch size: 144, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:31:50,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=93841.0, ans=0.125 2024-06-20 00:31:55,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93859.33333333333, ans=0.1 2024-06-20 00:32:00,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=93877.66666666667, ans=0.0 2024-06-20 00:32:02,626 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.196e+02 1.254e+03 1.386e+03 1.678e+03 4.024e+03, threshold=2.772e+03, percent-clipped=1.0 2024-06-20 00:32:05,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=93896.0, ans=0.0 2024-06-20 00:32:11,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.76 vs. limit=15.0 2024-06-20 00:32:17,822 INFO [train.py:1028] (1/2) Epoch 6, batch 650, loss[loss=0.3318, simple_loss=0.3517, pruned_loss=0.1559, over 13226.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3425, pruned_loss=0.1571, over 2491052.61 frames. ], batch size: 59, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:32:25,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=93932.66666666667, ans=0.125 2024-06-20 00:32:25,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=93932.66666666667, ans=0.125 2024-06-20 00:32:32,404 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.33 vs. limit=15.0 2024-06-20 00:32:39,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=93969.33333333333, ans=0.125 2024-06-20 00:32:44,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.52 vs. limit=6.0 2024-06-20 00:32:52,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=94006.0, ans=0.0 2024-06-20 00:32:52,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=94006.0, ans=0.0 2024-06-20 00:32:53,789 INFO [train.py:1028] (1/2) Epoch 6, batch 700, loss[loss=0.341, simple_loss=0.3523, pruned_loss=0.1649, over 13292.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3408, pruned_loss=0.1562, over 2513846.81 frames. ], batch size: 46, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:32:59,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.47 vs. 
limit=22.5 2024-06-20 00:33:05,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=94061.0, ans=0.2 2024-06-20 00:33:09,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=94061.0, ans=0.125 2024-06-20 00:33:11,281 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.574e+02 1.248e+03 1.504e+03 1.761e+03 3.876e+03, threshold=3.008e+03, percent-clipped=3.0 2024-06-20 00:33:24,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2024-06-20 00:33:25,634 INFO [train.py:1028] (1/2) Epoch 6, batch 750, loss[loss=0.3146, simple_loss=0.3374, pruned_loss=0.1459, over 13259.00 frames. ], tot_loss[loss=0.3267, simple_loss=0.3411, pruned_loss=0.1561, over 2528728.74 frames. ], batch size: 63, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:33:26,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94116.0, ans=0.1 2024-06-20 00:33:39,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=94152.66666666667, ans=0.0 2024-06-20 00:33:48,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=94171.0, ans=0.125 2024-06-20 00:34:00,688 INFO [train.py:1028] (1/2) Epoch 6, batch 800, loss[loss=0.3115, simple_loss=0.3302, pruned_loss=0.1464, over 12896.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3416, pruned_loss=0.1567, over 2541493.47 frames. ], batch size: 36, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:34:16,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=94244.33333333333, ans=0.0 2024-06-20 00:34:17,266 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.87 vs. limit=15.0 2024-06-20 00:34:19,353 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.132e+02 1.242e+03 1.526e+03 1.877e+03 2.567e+03, threshold=3.051e+03, percent-clipped=0.0 2024-06-20 00:34:34,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=94281.0, ans=0.0 2024-06-20 00:34:37,238 INFO [train.py:1028] (1/2) Epoch 6, batch 850, loss[loss=0.3128, simple_loss=0.3247, pruned_loss=0.1504, over 13185.00 frames. ], tot_loss[loss=0.3256, simple_loss=0.3403, pruned_loss=0.1555, over 2551603.50 frames. ], batch size: 95, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:34:44,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94317.66666666667, ans=0.125 2024-06-20 00:34:47,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=94317.66666666667, ans=0.025 2024-06-20 00:34:50,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.53 vs. 
limit=15.0 2024-06-20 00:34:51,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=94336.0, ans=0.0 2024-06-20 00:34:58,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=94354.33333333333, ans=0.0 2024-06-20 00:35:04,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=94372.66666666667, ans=0.125 2024-06-20 00:35:06,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=94372.66666666667, ans=0.125 2024-06-20 00:35:08,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=94391.0, ans=0.0 2024-06-20 00:35:09,438 INFO [train.py:1028] (1/2) Epoch 6, batch 900, loss[loss=0.2968, simple_loss=0.3217, pruned_loss=0.1359, over 12961.00 frames. ], tot_loss[loss=0.326, simple_loss=0.3404, pruned_loss=0.1557, over 2557242.08 frames. ], batch size: 36, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:35:09,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.50 vs. limit=10.0 2024-06-20 00:35:14,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=94391.0, ans=0.1 2024-06-20 00:35:19,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94409.33333333333, ans=0.125 2024-06-20 00:35:26,296 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.00 vs. limit=15.0 2024-06-20 00:35:28,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94446.0, ans=0.1 2024-06-20 00:35:29,246 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.399e+02 9.491e+02 1.186e+03 1.373e+03 2.094e+03, threshold=2.372e+03, percent-clipped=0.0 2024-06-20 00:35:34,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=94446.0, ans=0.1 2024-06-20 00:35:35,068 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.42 vs. limit=15.0 2024-06-20 00:35:37,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=94464.33333333333, ans=0.125 2024-06-20 00:35:40,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94464.33333333333, ans=0.125 2024-06-20 00:35:45,309 INFO [train.py:1028] (1/2) Epoch 6, batch 950, loss[loss=0.3319, simple_loss=0.352, pruned_loss=0.1559, over 12961.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.3414, pruned_loss=0.1566, over 2559701.60 frames. ], batch size: 39, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:35:49,599 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.52 vs. 
limit=10.0 2024-06-20 00:35:51,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=94501.0, ans=0.125 2024-06-20 00:35:57,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=94519.33333333333, ans=0.1 2024-06-20 00:36:03,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=94519.33333333333, ans=0.125 2024-06-20 00:36:08,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=94537.66666666667, ans=0.125 2024-06-20 00:36:10,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=94556.0, ans=0.05 2024-06-20 00:36:13,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.47 vs. limit=15.0 2024-06-20 00:36:16,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94574.33333333333, ans=0.1 2024-06-20 00:36:17,065 INFO [train.py:1028] (1/2) Epoch 6, batch 1000, loss[loss=0.3784, simple_loss=0.382, pruned_loss=0.1874, over 13310.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3417, pruned_loss=0.1572, over 2562922.14 frames. ], batch size: 49, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:36:17,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=94574.33333333333, ans=0.125 2024-06-20 00:36:18,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.94 vs. limit=15.0 2024-06-20 00:36:29,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=94592.66666666667, ans=0.1 2024-06-20 00:36:29,970 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.20 vs. limit=12.0 2024-06-20 00:36:32,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=94611.0, ans=0.0 2024-06-20 00:36:33,977 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.03 vs. 
limit=15.0 2024-06-20 00:36:35,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=94611.0, ans=0.125 2024-06-20 00:36:35,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=94611.0, ans=0.125 2024-06-20 00:36:36,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=94611.0, ans=0.125 2024-06-20 00:36:39,518 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.808e+02 8.900e+02 9.853e+02 1.075e+03 2.399e+03, threshold=1.971e+03, percent-clipped=1.0 2024-06-20 00:36:50,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=94647.66666666667, ans=0.0 2024-06-20 00:36:52,441 INFO [train.py:1028] (1/2) Epoch 6, batch 1050, loss[loss=0.2972, simple_loss=0.3232, pruned_loss=0.1356, over 13184.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3422, pruned_loss=0.1573, over 2565024.51 frames. ], batch size: 77, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:37:00,225 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:37:02,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=94684.33333333333, ans=0.0 2024-06-20 00:37:08,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=94702.66666666667, ans=0.125 2024-06-20 00:37:24,386 INFO [train.py:1028] (1/2) Epoch 6, batch 1100, loss[loss=0.3512, simple_loss=0.3638, pruned_loss=0.1693, over 13292.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3426, pruned_loss=0.1572, over 2570011.02 frames. ], batch size: 52, lr: 1.00e-02, grad_scale: 8.0 2024-06-20 00:37:27,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=94757.66666666667, ans=0.04949747468305833 2024-06-20 00:37:27,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=94757.66666666667, ans=0.125 2024-06-20 00:37:27,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=94757.66666666667, ans=0.0 2024-06-20 00:37:46,737 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.709e+02 9.361e+02 1.069e+03 1.325e+03 2.009e+03, threshold=2.138e+03, percent-clipped=1.0 2024-06-20 00:37:49,253 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.52 vs. limit=10.0 2024-06-20 00:37:59,702 INFO [train.py:1028] (1/2) Epoch 6, batch 1150, loss[loss=0.3011, simple_loss=0.3326, pruned_loss=0.1349, over 13306.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.3435, pruned_loss=0.1578, over 2571136.33 frames. ], batch size: 52, lr: 1.00e-02, grad_scale: 2.0 2024-06-20 00:38:02,861 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.13 vs. limit=12.0 2024-06-20 00:38:21,292 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.49 vs. 
limit=22.5 2024-06-20 00:38:22,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=94904.33333333333, ans=0.07 2024-06-20 00:38:27,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=94922.66666666667, ans=0.125 2024-06-20 00:38:32,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94922.66666666667, ans=0.125 2024-06-20 00:38:33,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=94922.66666666667, ans=0.125 2024-06-20 00:38:34,848 INFO [train.py:1028] (1/2) Epoch 6, batch 1200, loss[loss=0.3061, simple_loss=0.3269, pruned_loss=0.1427, over 13183.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3436, pruned_loss=0.1581, over 2572813.21 frames. ], batch size: 77, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:38:38,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94941.0, ans=0.1 2024-06-20 00:38:44,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=94959.33333333333, ans=0.125 2024-06-20 00:38:55,920 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.126e+02 1.133e+03 1.284e+03 1.495e+03 2.434e+03, threshold=2.569e+03, percent-clipped=1.0 2024-06-20 00:38:56,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94996.0, ans=0.1 2024-06-20 00:39:00,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2024-06-20 00:39:06,715 INFO [train.py:1028] (1/2) Epoch 6, batch 1250, loss[loss=0.3104, simple_loss=0.3258, pruned_loss=0.1475, over 13153.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3427, pruned_loss=0.1571, over 2582949.08 frames. ], batch size: 112, lr: 1.00e-02, grad_scale: 2.0 2024-06-20 00:39:16,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=95051.0, ans=15.0 2024-06-20 00:39:22,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95069.33333333333, ans=0.1 2024-06-20 00:39:30,502 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.54 vs. limit=22.5 2024-06-20 00:39:35,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.72 vs. limit=10.0 2024-06-20 00:39:41,962 INFO [train.py:1028] (1/2) Epoch 6, batch 1300, loss[loss=0.3476, simple_loss=0.3491, pruned_loss=0.1731, over 12731.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3426, pruned_loss=0.1571, over 2582620.06 frames. 
], batch size: 176, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:39:42,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=95124.33333333333, ans=0.2 2024-06-20 00:39:47,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=95124.33333333333, ans=0.125 2024-06-20 00:39:50,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.35 vs. limit=22.5 2024-06-20 00:40:01,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95179.33333333333, ans=0.1 2024-06-20 00:40:02,739 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.915e+02 1.024e+03 1.223e+03 1.462e+03 2.953e+03, threshold=2.447e+03, percent-clipped=1.0 2024-06-20 00:40:03,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.46 vs. limit=15.0 2024-06-20 00:40:12,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=95197.66666666667, ans=0.025 2024-06-20 00:40:13,827 INFO [train.py:1028] (1/2) Epoch 6, batch 1350, loss[loss=0.3305, simple_loss=0.35, pruned_loss=0.1555, over 13226.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3429, pruned_loss=0.157, over 2583992.04 frames. ], batch size: 59, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:40:41,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=95271.0, ans=0.2 2024-06-20 00:40:45,811 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:40:46,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.93 vs. limit=15.0 2024-06-20 00:40:48,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=95289.33333333333, ans=0.125 2024-06-20 00:40:49,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.66 vs. limit=15.0 2024-06-20 00:40:50,142 INFO [train.py:1028] (1/2) Epoch 6, batch 1400, loss[loss=0.3485, simple_loss=0.3524, pruned_loss=0.1723, over 12388.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3427, pruned_loss=0.1572, over 2585136.18 frames. 
], batch size: 25, lr: 1.00e-02, grad_scale: 8.0 2024-06-20 00:40:52,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=95307.66666666667, ans=0.125 2024-06-20 00:40:54,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=95307.66666666667, ans=0.125 2024-06-20 00:41:06,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=95326.0, ans=0.2 2024-06-20 00:41:12,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=95344.33333333333, ans=0.0 2024-06-20 00:41:16,579 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.646e+02 1.005e+03 1.172e+03 1.351e+03 2.205e+03, threshold=2.344e+03, percent-clipped=0.0 2024-06-20 00:41:20,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=95381.0, ans=0.125 2024-06-20 00:41:22,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=95381.0, ans=0.0 2024-06-20 00:41:26,983 INFO [train.py:1028] (1/2) Epoch 6, batch 1450, loss[loss=0.3335, simple_loss=0.3436, pruned_loss=0.1617, over 13116.00 frames. ], tot_loss[loss=0.3276, simple_loss=0.3419, pruned_loss=0.1567, over 2585210.94 frames. ], batch size: 121, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:41:39,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=95436.0, ans=0.2 2024-06-20 00:41:52,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=15.0 2024-06-20 00:41:54,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95454.33333333333, ans=0.1 2024-06-20 00:41:57,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=95472.66666666667, ans=0.125 2024-06-20 00:42:02,057 INFO [train.py:1028] (1/2) Epoch 6, batch 1500, loss[loss=0.3146, simple_loss=0.326, pruned_loss=0.1516, over 13188.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3423, pruned_loss=0.1572, over 2587531.49 frames. ], batch size: 83, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:42:14,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=95527.66666666667, ans=0.125 2024-06-20 00:42:14,808 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.783e+00 2024-06-20 00:42:18,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=95527.66666666667, ans=0.0 2024-06-20 00:42:21,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=95546.0, ans=0.1 2024-06-20 00:42:24,239 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.613e+02 1.105e+03 1.268e+03 1.432e+03 2.252e+03, threshold=2.535e+03, percent-clipped=0.0 2024-06-20 00:42:26,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.59 vs. 
limit=22.5 2024-06-20 00:42:35,714 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:42:36,789 INFO [train.py:1028] (1/2) Epoch 6, batch 1550, loss[loss=0.3263, simple_loss=0.3364, pruned_loss=0.158, over 13052.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3421, pruned_loss=0.1573, over 2583267.08 frames. ], batch size: 102, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:42:37,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=95582.66666666667, ans=0.125 2024-06-20 00:42:40,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=95582.66666666667, ans=0.125 2024-06-20 00:42:40,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=95582.66666666667, ans=0.125 2024-06-20 00:42:42,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=95582.66666666667, ans=0.125 2024-06-20 00:42:48,456 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.99 vs. limit=15.0 2024-06-20 00:43:03,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=95656.0, ans=0.125 2024-06-20 00:43:03,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=95656.0, ans=0.0 2024-06-20 00:43:07,377 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.54 vs. limit=15.0 2024-06-20 00:43:09,514 INFO [train.py:1028] (1/2) Epoch 6, batch 1600, loss[loss=0.3397, simple_loss=0.3538, pruned_loss=0.1627, over 13127.00 frames. ], tot_loss[loss=0.327, simple_loss=0.3415, pruned_loss=0.1563, over 2579389.83 frames. ], batch size: 77, lr: 1.00e-02, grad_scale: 8.0 2024-06-20 00:43:15,034 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.14 vs. limit=15.0 2024-06-20 00:43:34,858 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 9.513e+02 1.099e+03 1.225e+03 2.280e+03, threshold=2.199e+03, percent-clipped=0.0 2024-06-20 00:43:36,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=95729.33333333333, ans=0.125 2024-06-20 00:43:44,661 INFO [train.py:1028] (1/2) Epoch 6, batch 1650, loss[loss=0.3292, simple_loss=0.3402, pruned_loss=0.1591, over 13158.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3417, pruned_loss=0.1567, over 2575275.27 frames. ], batch size: 95, lr: 9.99e-03, grad_scale: 4.0 2024-06-20 00:43:45,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2024-06-20 00:43:51,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95784.33333333333, ans=0.1 2024-06-20 00:43:52,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.05 vs. 
limit=22.5 2024-06-20 00:44:03,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=95821.0, ans=0.125 2024-06-20 00:44:06,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=95821.0, ans=10.0 2024-06-20 00:44:11,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=95839.33333333333, ans=0.1 2024-06-20 00:44:16,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=95839.33333333333, ans=0.2 2024-06-20 00:44:17,639 INFO [train.py:1028] (1/2) Epoch 6, batch 1700, loss[loss=0.3547, simple_loss=0.3749, pruned_loss=0.1673, over 12395.00 frames. ], tot_loss[loss=0.3264, simple_loss=0.3413, pruned_loss=0.1558, over 2579987.83 frames. ], batch size: 25, lr: 9.99e-03, grad_scale: 2.0 2024-06-20 00:44:19,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=95857.66666666667, ans=0.0 2024-06-20 00:44:24,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.34 vs. limit=15.0 2024-06-20 00:44:37,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95894.33333333333, ans=0.1 2024-06-20 00:44:45,099 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.963e+02 9.664e+02 1.209e+03 1.481e+03 7.673e+03, threshold=2.417e+03, percent-clipped=3.0 2024-06-20 00:44:52,401 INFO [train.py:1028] (1/2) Epoch 6, batch 1750, loss[loss=0.3361, simple_loss=0.3533, pruned_loss=0.1595, over 12473.00 frames. ], tot_loss[loss=0.3272, simple_loss=0.342, pruned_loss=0.1562, over 2580145.31 frames. ], batch size: 22, lr: 9.98e-03, grad_scale: 1.0 2024-06-20 00:44:58,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.28 vs. limit=15.0 2024-06-20 00:45:19,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=96022.66666666667, ans=0.2 2024-06-20 00:45:24,207 INFO [train.py:1028] (1/2) Epoch 6, batch 1800, loss[loss=0.339, simple_loss=0.3484, pruned_loss=0.1648, over 13195.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3422, pruned_loss=0.1567, over 2580923.96 frames. ], batch size: 67, lr: 9.98e-03, grad_scale: 2.0 2024-06-20 00:45:25,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=96041.0, ans=0.025 2024-06-20 00:45:32,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.53 vs. 
limit=22.5 2024-06-20 00:45:45,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96096.0, ans=0.125 2024-06-20 00:45:51,901 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.788e+02 1.255e+03 1.422e+03 1.636e+03 2.353e+03, threshold=2.844e+03, percent-clipped=0.0 2024-06-20 00:45:53,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=96114.33333333333, ans=0.025 2024-06-20 00:45:58,985 INFO [train.py:1028] (1/2) Epoch 6, batch 1850, loss[loss=0.3365, simple_loss=0.3489, pruned_loss=0.162, over 13208.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3429, pruned_loss=0.157, over 2582677.23 frames. ], batch size: 83, lr: 9.97e-03, grad_scale: 2.0 2024-06-20 00:46:01,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=96132.66666666667, ans=0.125 2024-06-20 00:46:34,261 INFO [train.py:1028] (1/2) Epoch 6, batch 1900, loss[loss=0.2911, simple_loss=0.3072, pruned_loss=0.1375, over 13138.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3427, pruned_loss=0.1572, over 2585789.09 frames. ], batch size: 95, lr: 9.97e-03, grad_scale: 4.0 2024-06-20 00:46:35,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=96224.33333333333, ans=0.125 2024-06-20 00:46:39,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=96224.33333333333, ans=0.125 2024-06-20 00:46:40,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=96242.66666666667, ans=0.2 2024-06-20 00:46:41,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=96242.66666666667, ans=0.125 2024-06-20 00:47:00,485 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.301e+02 1.223e+03 1.406e+03 1.603e+03 3.193e+03, threshold=2.812e+03, percent-clipped=1.0 2024-06-20 00:47:05,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=96297.66666666667, ans=0.125 2024-06-20 00:47:06,613 INFO [train.py:1028] (1/2) Epoch 6, batch 1950, loss[loss=0.3269, simple_loss=0.351, pruned_loss=0.1514, over 13281.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.342, pruned_loss=0.1568, over 2591370.70 frames. ], batch size: 52, lr: 9.96e-03, grad_scale: 2.0 2024-06-20 00:47:07,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.61 vs. limit=10.0 2024-06-20 00:47:12,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=96316.0, ans=0.2 2024-06-20 00:47:26,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=96352.66666666667, ans=0.125 2024-06-20 00:47:28,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.86 vs. limit=15.0 2024-06-20 00:47:30,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.59 vs. 
limit=15.0 2024-06-20 00:47:31,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=96371.0, ans=0.2 2024-06-20 00:47:35,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=96389.33333333333, ans=0.0 2024-06-20 00:47:42,545 INFO [train.py:1028] (1/2) Epoch 6, batch 2000, loss[loss=0.3308, simple_loss=0.3531, pruned_loss=0.1542, over 12419.00 frames. ], tot_loss[loss=0.3287, simple_loss=0.3424, pruned_loss=0.1576, over 2587277.81 frames. ], batch size: 22, lr: 9.96e-03, grad_scale: 4.0 2024-06-20 00:47:48,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96426.0, ans=0.1 2024-06-20 00:47:51,350 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.189e+00 2024-06-20 00:48:03,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.75 vs. limit=22.5 2024-06-20 00:48:04,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=96462.66666666667, ans=0.125 2024-06-20 00:48:08,791 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.157e+02 1.131e+03 1.258e+03 1.415e+03 2.587e+03, threshold=2.515e+03, percent-clipped=0.0 2024-06-20 00:48:10,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=96481.0, ans=0.0 2024-06-20 00:48:10,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=96481.0, ans=0.125 2024-06-20 00:48:14,263 INFO [train.py:1028] (1/2) Epoch 6, batch 2050, loss[loss=0.32, simple_loss=0.337, pruned_loss=0.1515, over 12695.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3422, pruned_loss=0.1575, over 2581863.96 frames. ], batch size: 29, lr: 9.95e-03, grad_scale: 2.0 2024-06-20 00:48:20,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=96517.66666666667, ans=0.0 2024-06-20 00:48:24,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=96517.66666666667, ans=0.0 2024-06-20 00:48:31,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=96536.0, ans=0.0 2024-06-20 00:48:34,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=96536.0, ans=0.5 2024-06-20 00:48:42,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=96572.66666666667, ans=0.025 2024-06-20 00:48:48,833 INFO [train.py:1028] (1/2) Epoch 6, batch 2100, loss[loss=0.3192, simple_loss=0.3415, pruned_loss=0.1484, over 13179.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.343, pruned_loss=0.1574, over 2585585.15 frames. ], batch size: 59, lr: 9.95e-03, grad_scale: 4.0 2024-06-20 00:48:49,921 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.97 vs. 
limit=15.0 2024-06-20 00:48:52,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96591.0, ans=0.1 2024-06-20 00:48:55,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.66 vs. limit=15.0 2024-06-20 00:48:58,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=1.98 vs. limit=15.0 2024-06-20 00:48:59,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=96609.33333333333, ans=0.05 2024-06-20 00:49:05,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=96627.66666666667, ans=0.125 2024-06-20 00:49:13,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=96646.0, ans=0.0 2024-06-20 00:49:15,409 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=1.98 vs. limit=15.0 2024-06-20 00:49:15,607 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.879e+02 1.045e+03 1.221e+03 1.457e+03 2.375e+03, threshold=2.443e+03, percent-clipped=0.0 2024-06-20 00:49:16,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=96664.33333333333, ans=0.0 2024-06-20 00:49:18,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.23 vs. limit=15.0 2024-06-20 00:49:20,611 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:49:21,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=96682.66666666667, ans=0.125 2024-06-20 00:49:21,648 INFO [train.py:1028] (1/2) Epoch 6, batch 2150, loss[loss=0.3184, simple_loss=0.3402, pruned_loss=0.1483, over 13241.00 frames. ], tot_loss[loss=0.3276, simple_loss=0.3422, pruned_loss=0.1565, over 2587892.25 frames. ], batch size: 52, lr: 9.95e-03, grad_scale: 4.0 2024-06-20 00:49:42,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.02 vs. limit=15.0 2024-06-20 00:49:44,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=96737.66666666667, ans=0.0 2024-06-20 00:49:47,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.79 vs. limit=10.0 2024-06-20 00:49:49,658 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.09 vs. limit=15.0 2024-06-20 00:49:51,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=96756.0, ans=0.125 2024-06-20 00:49:56,970 INFO [train.py:1028] (1/2) Epoch 6, batch 2200, loss[loss=0.3424, simple_loss=0.3458, pruned_loss=0.1695, over 13230.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.3418, pruned_loss=0.1564, over 2587485.41 frames. 
], batch size: 83, lr: 9.94e-03, grad_scale: 8.0 2024-06-20 00:49:58,398 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.260e+02 2024-06-20 00:50:11,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.93 vs. limit=15.0 2024-06-20 00:50:12,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.99 vs. limit=15.0 2024-06-20 00:50:19,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.77 vs. limit=15.0 2024-06-20 00:50:23,748 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.685e+02 9.284e+02 1.080e+03 1.260e+03 2.663e+03, threshold=2.161e+03, percent-clipped=1.0 2024-06-20 00:50:31,913 INFO [train.py:1028] (1/2) Epoch 6, batch 2250, loss[loss=0.3311, simple_loss=0.3597, pruned_loss=0.1512, over 13288.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3413, pruned_loss=0.1555, over 2585657.84 frames. ], batch size: 63, lr: 9.94e-03, grad_scale: 4.0 2024-06-20 00:50:38,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=96884.33333333333, ans=0.0 2024-06-20 00:50:46,101 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.66 vs. limit=5.0 2024-06-20 00:50:49,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=96902.66666666667, ans=0.125 2024-06-20 00:50:52,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=96921.0, ans=0.125 2024-06-20 00:50:58,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=96939.33333333333, ans=0.125 2024-06-20 00:51:00,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=96939.33333333333, ans=0.125 2024-06-20 00:51:03,899 INFO [train.py:1028] (1/2) Epoch 6, batch 2300, loss[loss=0.3202, simple_loss=0.3415, pruned_loss=0.1495, over 12931.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3404, pruned_loss=0.1545, over 2580352.69 frames. ], batch size: 33, lr: 9.93e-03, grad_scale: 4.0 2024-06-20 00:51:05,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=96957.66666666667, ans=0.125 2024-06-20 00:51:11,983 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.56 vs. limit=10.0 2024-06-20 00:51:17,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=96994.33333333333, ans=0.125 2024-06-20 00:51:21,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.89 vs. 
limit=15.0 2024-06-20 00:51:33,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=97031.0, ans=0.125 2024-06-20 00:51:33,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.47 vs. limit=6.0 2024-06-20 00:51:34,527 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.140e+02 9.588e+02 1.125e+03 1.295e+03 2.878e+03, threshold=2.249e+03, percent-clipped=1.0 2024-06-20 00:51:38,709 INFO [train.py:1028] (1/2) Epoch 6, batch 2350, loss[loss=0.3088, simple_loss=0.3299, pruned_loss=0.1439, over 13189.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3407, pruned_loss=0.155, over 2584180.00 frames. ], batch size: 67, lr: 9.93e-03, grad_scale: 4.0 2024-06-20 00:51:40,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=97049.33333333333, ans=0.0 2024-06-20 00:51:42,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=97049.33333333333, ans=0.125 2024-06-20 00:51:48,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.82 vs. limit=10.0 2024-06-20 00:51:50,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.39 vs. limit=15.0 2024-06-20 00:51:52,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=15.0 2024-06-20 00:51:54,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=97086.0, ans=0.125 2024-06-20 00:51:55,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97086.0, ans=0.125 2024-06-20 00:52:02,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=97104.33333333333, ans=0.0 2024-06-20 00:52:03,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=97104.33333333333, ans=0.125 2024-06-20 00:52:10,964 INFO [train.py:1028] (1/2) Epoch 6, batch 2400, loss[loss=0.3378, simple_loss=0.3549, pruned_loss=0.1604, over 13279.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3399, pruned_loss=0.1548, over 2586929.03 frames. 
], batch size: 46, lr: 9.92e-03, grad_scale: 2.0 2024-06-20 00:52:11,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=97141.0, ans=0.125 2024-06-20 00:52:27,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=97177.66666666667, ans=0.0 2024-06-20 00:52:28,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=97177.66666666667, ans=0.125 2024-06-20 00:52:33,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=97196.0, ans=0.125 2024-06-20 00:52:35,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=97196.0, ans=0.125 2024-06-20 00:52:38,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97196.0, ans=0.125 2024-06-20 00:52:40,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=97214.33333333333, ans=0.0 2024-06-20 00:52:43,044 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.036e+02 1.197e+03 1.416e+03 1.654e+03 3.018e+03, threshold=2.832e+03, percent-clipped=7.0 2024-06-20 00:52:45,989 INFO [train.py:1028] (1/2) Epoch 6, batch 2450, loss[loss=0.3044, simple_loss=0.3245, pruned_loss=0.1421, over 13241.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.3393, pruned_loss=0.1552, over 2583014.21 frames. ], batch size: 63, lr: 9.92e-03, grad_scale: 2.0 2024-06-20 00:53:00,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=97269.33333333333, ans=0.125 2024-06-20 00:53:10,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=97287.66666666667, ans=0.1 2024-06-20 00:53:16,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.95 vs. limit=10.0 2024-06-20 00:53:16,089 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.45 vs. limit=15.0 2024-06-20 00:53:18,074 INFO [train.py:1028] (1/2) Epoch 6, batch 2500, loss[loss=0.2992, simple_loss=0.3184, pruned_loss=0.14, over 13237.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3387, pruned_loss=0.1548, over 2586886.75 frames. ], batch size: 83, lr: 9.91e-03, grad_scale: 2.0 2024-06-20 00:53:19,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=97324.33333333333, ans=0.05 2024-06-20 00:53:27,627 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.635e+00 2024-06-20 00:53:30,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=97342.66666666667, ans=0.1 2024-06-20 00:53:38,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=97361.0, ans=0.0 2024-06-20 00:53:38,460 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. 
limit=10.0 2024-06-20 00:53:42,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97379.33333333333, ans=0.1 2024-06-20 00:53:43,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97379.33333333333, ans=0.1 2024-06-20 00:53:51,063 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.813e+02 1.158e+03 1.296e+03 1.557e+03 3.109e+03, threshold=2.592e+03, percent-clipped=1.0 2024-06-20 00:53:53,725 INFO [train.py:1028] (1/2) Epoch 6, batch 2550, loss[loss=0.3022, simple_loss=0.335, pruned_loss=0.1347, over 12685.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3364, pruned_loss=0.1534, over 2585640.48 frames. ], batch size: 22, lr: 9.91e-03, grad_scale: 2.0 2024-06-20 00:53:58,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=97416.0, ans=0.125 2024-06-20 00:53:59,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=97416.0, ans=0.125 2024-06-20 00:54:04,914 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.33 vs. limit=22.5 2024-06-20 00:54:07,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=97452.66666666667, ans=0.125 2024-06-20 00:54:12,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=97471.0, ans=0.04949747468305833 2024-06-20 00:54:20,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.29 vs. limit=22.5 2024-06-20 00:54:24,957 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=8.435e-01 2024-06-20 00:54:28,695 INFO [train.py:1028] (1/2) Epoch 6, batch 2600, loss[loss=0.2979, simple_loss=0.3207, pruned_loss=0.1375, over 13297.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.3353, pruned_loss=0.1532, over 2586607.13 frames. ], batch size: 52, lr: 9.90e-03, grad_scale: 4.0 2024-06-20 00:54:30,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=97507.66666666667, ans=0.09899494936611666 2024-06-20 00:54:35,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=97526.0, ans=0.2 2024-06-20 00:54:52,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=97562.66666666667, ans=0.125 2024-06-20 00:54:58,767 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.595e+02 9.546e+02 1.128e+03 1.358e+03 2.681e+03, threshold=2.255e+03, percent-clipped=1.0 2024-06-20 00:54:58,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=97581.0, ans=0.0 2024-06-20 00:55:01,361 INFO [train.py:1028] (1/2) Epoch 6, batch 2650, loss[loss=0.3287, simple_loss=0.3312, pruned_loss=0.1631, over 12977.00 frames. ], tot_loss[loss=0.3187, simple_loss=0.3332, pruned_loss=0.1521, over 2586352.13 frames. 
], batch size: 144, lr: 9.90e-03, grad_scale: 4.0 2024-06-20 00:55:22,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=97654.33333333333, ans=0.125 2024-06-20 00:55:30,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=12.71 vs. limit=12.0 2024-06-20 00:55:35,597 INFO [train.py:1028] (1/2) Epoch 6, batch 2700, loss[loss=0.2779, simple_loss=0.2979, pruned_loss=0.1289, over 13209.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.332, pruned_loss=0.1521, over 2585378.06 frames. ], batch size: 89, lr: 9.89e-03, grad_scale: 4.0 2024-06-20 00:55:36,589 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.19 vs. limit=15.0 2024-06-20 00:55:40,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=97691.0, ans=0.2 2024-06-20 00:55:44,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97709.33333333333, ans=0.125 2024-06-20 00:55:49,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0 2024-06-20 00:55:53,117 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=16.52 vs. limit=15.0 2024-06-20 00:55:57,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97746.0, ans=0.1 2024-06-20 00:56:04,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=97764.33333333333, ans=0.0 2024-06-20 00:56:09,172 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.884e+02 1.128e+03 1.362e+03 1.723e+03 2.860e+03, threshold=2.723e+03, percent-clipped=7.0 2024-06-20 00:56:11,275 INFO [train.py:1028] (1/2) Epoch 6, batch 2750, loss[loss=0.3053, simple_loss=0.3194, pruned_loss=0.1456, over 13229.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3301, pruned_loss=0.1506, over 2583756.31 frames. ], batch size: 43, lr: 9.89e-03, grad_scale: 4.0 2024-06-20 00:56:12,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=97782.66666666667, ans=0.125 2024-06-20 00:56:16,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=97782.66666666667, ans=0.125 2024-06-20 00:56:34,530 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-20 00:56:36,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=97837.66666666667, ans=10.0 2024-06-20 00:56:37,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. 
limit=15.0 2024-06-20 00:56:37,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=97856.0, ans=0.2 2024-06-20 00:56:39,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=97856.0, ans=0.1 2024-06-20 00:56:44,699 INFO [train.py:1028] (1/2) Epoch 6, batch 2800, loss[loss=0.3539, simple_loss=0.3469, pruned_loss=0.1805, over 10986.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3295, pruned_loss=0.1506, over 2580472.39 frames. ], batch size: 304, lr: 9.89e-03, grad_scale: 8.0 2024-06-20 00:56:44,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=97874.33333333333, ans=0.125 2024-06-20 00:56:52,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=97892.66666666667, ans=0.125 2024-06-20 00:56:58,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=97892.66666666667, ans=0.07 2024-06-20 00:57:11,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0 2024-06-20 00:57:18,849 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.360e+02 1.298e+03 1.508e+03 1.762e+03 2.773e+03, threshold=3.017e+03, percent-clipped=1.0 2024-06-20 00:57:19,599 INFO [train.py:1028] (1/2) Epoch 6, batch 2850, loss[loss=0.312, simple_loss=0.3325, pruned_loss=0.1458, over 13286.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3293, pruned_loss=0.151, over 2577429.19 frames. ], batch size: 49, lr: 9.88e-03, grad_scale: 2.0 2024-06-20 00:57:20,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.90 vs. limit=15.0 2024-06-20 00:57:29,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=97984.33333333333, ans=0.125 2024-06-20 00:57:34,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=98002.66666666667, ans=0.0 2024-06-20 00:57:46,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=98039.33333333333, ans=0.0 2024-06-20 00:57:50,709 INFO [train.py:1028] (1/2) Epoch 6, batch 2900, loss[loss=0.3076, simple_loss=0.3266, pruned_loss=0.1443, over 13143.00 frames. ], tot_loss[loss=0.314, simple_loss=0.3277, pruned_loss=0.1501, over 2585487.43 frames. ], batch size: 55, lr: 9.88e-03, grad_scale: 4.0 2024-06-20 00:58:02,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=98076.0, ans=0.125 2024-06-20 00:58:11,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=98094.33333333333, ans=0.2 2024-06-20 00:58:18,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.25 vs. 
limit=15.0 2024-06-20 00:58:22,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=98131.0, ans=0.0 2024-06-20 00:58:23,882 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.07 vs. limit=10.0 2024-06-20 00:58:25,913 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.122e+02 9.233e+02 1.132e+03 1.336e+03 1.963e+03, threshold=2.264e+03, percent-clipped=0.0 2024-06-20 00:58:26,766 INFO [train.py:1028] (1/2) Epoch 6, batch 2950, loss[loss=0.2898, simple_loss=0.3082, pruned_loss=0.1357, over 13280.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3266, pruned_loss=0.149, over 2578873.18 frames. ], batch size: 43, lr: 9.87e-03, grad_scale: 4.0 2024-06-20 00:58:28,125 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:58:34,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=98167.66666666667, ans=0.125 2024-06-20 00:58:40,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=98186.0, ans=0.0 2024-06-20 00:58:54,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=98204.33333333333, ans=0.2 2024-06-20 00:58:58,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=98222.66666666667, ans=0.0 2024-06-20 00:59:03,024 INFO [train.py:1028] (1/2) Epoch 6, batch 3000, loss[loss=0.3161, simple_loss=0.3315, pruned_loss=0.1503, over 13187.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3251, pruned_loss=0.1481, over 2578544.89 frames. ], batch size: 59, lr: 9.87e-03, grad_scale: 8.0 2024-06-20 00:59:03,025 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 00:59:10,981 INFO [train.py:1060] (1/2) Epoch 6, validation: loss=0.232, simple_loss=0.2887, pruned_loss=0.08766, over 351949.00 frames. 2024-06-20 00:59:10,982 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 00:59:15,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=98241.0, ans=0.125 2024-06-20 00:59:18,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=98259.33333333333, ans=0.125 2024-06-20 00:59:26,835 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.643e+01 2024-06-20 00:59:32,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=98296.0, ans=0.125 2024-06-20 00:59:34,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=98296.0, ans=0.025 2024-06-20 00:59:44,094 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.452e+02 7.112e+02 8.266e+02 9.922e+02 2.231e+03, threshold=1.653e+03, percent-clipped=0.0 2024-06-20 00:59:44,130 INFO [train.py:1028] (1/2) Epoch 6, batch 3050, loss[loss=0.3259, simple_loss=0.3405, pruned_loss=0.1556, over 13288.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3247, pruned_loss=0.1484, over 2577743.21 frames. 
], batch size: 46, lr: 9.86e-03, grad_scale: 4.0 2024-06-20 00:59:45,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=98332.66666666667, ans=0.025 2024-06-20 00:59:47,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=98332.66666666667, ans=0.125 2024-06-20 00:59:50,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=98351.0, ans=0.09899494936611666 2024-06-20 01:00:03,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.70 vs. limit=22.5 2024-06-20 01:00:04,237 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.82 vs. limit=22.5 2024-06-20 01:00:06,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=98369.33333333333, ans=0.0 2024-06-20 01:00:08,253 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.19 vs. limit=15.0 2024-06-20 01:00:18,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.40 vs. limit=12.0 2024-06-20 01:00:19,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=98406.0, ans=0.125 2024-06-20 01:00:20,689 INFO [train.py:1028] (1/2) Epoch 6, batch 3100, loss[loss=0.2886, simple_loss=0.3017, pruned_loss=0.1377, over 13046.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3235, pruned_loss=0.1473, over 2579887.83 frames. ], batch size: 144, lr: 9.86e-03, grad_scale: 8.0 2024-06-20 01:00:21,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=98424.33333333333, ans=0.0 2024-06-20 01:00:33,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.71 vs. limit=15.0 2024-06-20 01:00:34,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=98461.0, ans=0.125 2024-06-20 01:00:39,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=98461.0, ans=0.125 2024-06-20 01:00:42,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=98479.33333333333, ans=0.125 2024-06-20 01:00:44,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=98479.33333333333, ans=0.125 2024-06-20 01:00:48,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.48 vs. limit=15.0 2024-06-20 01:00:53,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=98497.66666666667, ans=0.0 2024-06-20 01:00:57,342 INFO [train.py:1028] (1/2) Epoch 6, batch 3150, loss[loss=0.323, simple_loss=0.3295, pruned_loss=0.1582, over 12928.00 frames. 
], tot_loss[loss=0.3079, simple_loss=0.3225, pruned_loss=0.1467, over 2581967.46 frames. ], batch size: 158, lr: 9.85e-03, grad_scale: 2.0 2024-06-20 01:00:58,537 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.873e+02 8.652e+02 1.048e+03 1.238e+03 2.093e+03, threshold=2.096e+03, percent-clipped=1.0 2024-06-20 01:01:00,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=98516.0, ans=0.2 2024-06-20 01:01:01,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.44 vs. limit=15.0 2024-06-20 01:01:06,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=98534.33333333333, ans=0.125 2024-06-20 01:01:17,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.10 vs. limit=15.0 2024-06-20 01:01:24,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=98589.33333333333, ans=0.0 2024-06-20 01:01:30,666 INFO [train.py:1028] (1/2) Epoch 6, batch 3200, loss[loss=0.3001, simple_loss=0.3183, pruned_loss=0.1409, over 13156.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3218, pruned_loss=0.1462, over 2583143.57 frames. ], batch size: 55, lr: 9.85e-03, grad_scale: 2.0 2024-06-20 01:01:32,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=98607.66666666667, ans=0.2 2024-06-20 01:01:43,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=98644.33333333333, ans=0.0 2024-06-20 01:02:01,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=98681.0, ans=0.125 2024-06-20 01:02:06,567 INFO [train.py:1028] (1/2) Epoch 6, batch 3250, loss[loss=0.2854, simple_loss=0.3074, pruned_loss=0.1316, over 13264.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3211, pruned_loss=0.1461, over 2586989.37 frames. ], batch size: 72, lr: 9.85e-03, grad_scale: 2.0 2024-06-20 01:02:07,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=98699.33333333333, ans=0.0 2024-06-20 01:02:08,517 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.328e+02 1.125e+03 1.346e+03 1.559e+03 2.738e+03, threshold=2.692e+03, percent-clipped=3.0 2024-06-20 01:02:13,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=98717.66666666667, ans=0.125 2024-06-20 01:02:16,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.42 vs. 
limit=15.0 2024-06-20 01:02:18,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=98717.66666666667, ans=0.125 2024-06-20 01:02:30,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=98754.33333333333, ans=0.125 2024-06-20 01:02:37,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98772.66666666667, ans=0.1 2024-06-20 01:02:41,027 INFO [train.py:1028] (1/2) Epoch 6, batch 3300, loss[loss=0.3573, simple_loss=0.3517, pruned_loss=0.1814, over 12740.00 frames. ], tot_loss[loss=0.3061, simple_loss=0.3206, pruned_loss=0.1458, over 2582944.76 frames. ], batch size: 176, lr: 9.84e-03, grad_scale: 2.0 2024-06-20 01:02:43,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=98791.0, ans=0.0 2024-06-20 01:02:43,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=98791.0, ans=0.125 2024-06-20 01:03:05,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98846.0, ans=0.1 2024-06-20 01:03:10,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=98864.33333333333, ans=0.0 2024-06-20 01:03:12,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=98864.33333333333, ans=0.0 2024-06-20 01:03:13,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=98864.33333333333, ans=0.2 2024-06-20 01:03:14,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=33.21 vs. limit=15.0 2024-06-20 01:03:17,080 INFO [train.py:1028] (1/2) Epoch 6, batch 3350, loss[loss=0.3257, simple_loss=0.3254, pruned_loss=0.163, over 12920.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3209, pruned_loss=0.1467, over 2578262.51 frames. 
], batch size: 158, lr: 9.84e-03, grad_scale: 2.0 2024-06-20 01:03:19,712 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.617e+02 9.980e+02 1.171e+03 1.408e+03 2.435e+03, threshold=2.343e+03, percent-clipped=0.0 2024-06-20 01:03:19,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=98882.66666666667, ans=0.125 2024-06-20 01:03:22,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=98882.66666666667, ans=0.125 2024-06-20 01:03:26,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=98901.0, ans=0.125 2024-06-20 01:03:32,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=98919.33333333333, ans=0.0 2024-06-20 01:03:35,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=98919.33333333333, ans=0.125 2024-06-20 01:03:35,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=98919.33333333333, ans=0.125 2024-06-20 01:03:46,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98956.0, ans=0.1 2024-06-20 01:03:51,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=98956.0, ans=0.0 2024-06-20 01:03:52,409 INFO [train.py:1028] (1/2) Epoch 6, batch 3400, loss[loss=0.3142, simple_loss=0.3284, pruned_loss=0.15, over 12399.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3206, pruned_loss=0.1468, over 2576353.75 frames. ], batch size: 22, lr: 9.83e-03, grad_scale: 4.0 2024-06-20 01:04:00,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.49 vs. limit=15.0 2024-06-20 01:04:03,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=98992.66666666667, ans=0.0 2024-06-20 01:04:09,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.58 vs. limit=6.0 2024-06-20 01:04:13,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.22 vs. limit=10.0 2024-06-20 01:04:15,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=99029.33333333333, ans=0.0 2024-06-20 01:04:18,302 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.35 vs. limit=22.5 2024-06-20 01:04:21,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.65 vs. limit=15.0 2024-06-20 01:04:22,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=99047.66666666667, ans=0.035 2024-06-20 01:04:25,311 INFO [train.py:1028] (1/2) Epoch 6, batch 3450, loss[loss=0.352, simple_loss=0.3499, pruned_loss=0.1771, over 12758.00 frames. 
], tot_loss[loss=0.3058, simple_loss=0.3198, pruned_loss=0.146, over 2576275.03 frames. ], batch size: 176, lr: 9.83e-03, grad_scale: 4.0 2024-06-20 01:04:27,886 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.766e+02 1.070e+03 1.267e+03 1.523e+03 2.267e+03, threshold=2.535e+03, percent-clipped=0.0 2024-06-20 01:04:29,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=99066.0, ans=0.2 2024-06-20 01:04:32,991 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.308e+00 2024-06-20 01:04:33,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=99084.33333333333, ans=0.04949747468305833 2024-06-20 01:04:49,804 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.94 vs. limit=6.0 2024-06-20 01:04:56,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=99139.33333333333, ans=0.1 2024-06-20 01:05:01,033 INFO [train.py:1028] (1/2) Epoch 6, batch 3500, loss[loss=0.3044, simple_loss=0.3265, pruned_loss=0.1412, over 12956.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3182, pruned_loss=0.1447, over 2574781.37 frames. ], batch size: 33, lr: 9.82e-03, grad_scale: 8.0 2024-06-20 01:05:03,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=99157.66666666667, ans=0.125 2024-06-20 01:05:04,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=99157.66666666667, ans=0.125 2024-06-20 01:05:08,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.94 vs. limit=10.0 2024-06-20 01:05:09,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=99176.0, ans=0.125 2024-06-20 01:05:13,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=99176.0, ans=0.0 2024-06-20 01:05:20,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=99212.66666666667, ans=0.0 2024-06-20 01:05:27,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=99231.0, ans=0.125 2024-06-20 01:05:34,131 INFO [train.py:1028] (1/2) Epoch 6, batch 3550, loss[loss=0.3001, simple_loss=0.318, pruned_loss=0.1411, over 13153.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.318, pruned_loss=0.1442, over 2575853.03 frames. 
], batch size: 95, lr: 9.82e-03, grad_scale: 8.0 2024-06-20 01:05:36,626 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.414e+02 7.417e+02 8.185e+02 9.702e+02 1.522e+03, threshold=1.637e+03, percent-clipped=0.0 2024-06-20 01:05:39,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=99267.66666666667, ans=0.125 2024-06-20 01:05:40,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=99267.66666666667, ans=15.0 2024-06-20 01:05:46,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=99286.0, ans=0.1 2024-06-20 01:05:56,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=99304.33333333333, ans=0.0 2024-06-20 01:06:06,398 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.79 vs. limit=22.5 2024-06-20 01:06:09,890 INFO [train.py:1028] (1/2) Epoch 6, batch 3600, loss[loss=0.2912, simple_loss=0.3134, pruned_loss=0.1345, over 13227.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3173, pruned_loss=0.1438, over 2579875.90 frames. ], batch size: 49, lr: 9.81e-03, grad_scale: 8.0 2024-06-20 01:06:15,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=99341.0, ans=0.125 2024-06-20 01:06:15,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=99341.0, ans=0.125 2024-06-20 01:06:27,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=99377.66666666667, ans=0.0 2024-06-20 01:06:40,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=99414.33333333333, ans=0.125 2024-06-20 01:06:41,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=99414.33333333333, ans=6.0 2024-06-20 01:06:42,946 INFO [train.py:1028] (1/2) Epoch 6, batch 3650, loss[loss=0.3175, simple_loss=0.3309, pruned_loss=0.1521, over 13012.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3167, pruned_loss=0.1433, over 2577791.29 frames. ], batch size: 102, lr: 9.81e-03, grad_scale: 8.0 2024-06-20 01:06:43,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=99432.66666666667, ans=0.2 2024-06-20 01:06:49,026 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.186e+02 7.368e+02 8.742e+02 1.006e+03 1.542e+03, threshold=1.748e+03, percent-clipped=0.0 2024-06-20 01:07:00,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.01 vs. 
limit=15.0 2024-06-20 01:07:05,196 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.396e+02 2024-06-20 01:07:17,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=99506.0, ans=0.125 2024-06-20 01:07:18,204 INFO [train.py:1028] (1/2) Epoch 6, batch 3700, loss[loss=0.286, simple_loss=0.3091, pruned_loss=0.1314, over 13281.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3151, pruned_loss=0.1422, over 2582742.00 frames. ], batch size: 72, lr: 9.81e-03, grad_scale: 8.0 2024-06-20 01:07:22,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=99524.33333333333, ans=0.5 2024-06-20 01:07:23,805 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2024-06-20 01:07:31,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=99561.0, ans=0.125 2024-06-20 01:07:33,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=99561.0, ans=0.0 2024-06-20 01:07:38,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=99579.33333333333, ans=0.125 2024-06-20 01:07:43,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=99597.66666666667, ans=0.125 2024-06-20 01:07:48,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=99597.66666666667, ans=0.2 2024-06-20 01:07:50,426 INFO [train.py:1028] (1/2) Epoch 6, batch 3750, loss[loss=0.3485, simple_loss=0.3552, pruned_loss=0.1708, over 12437.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3144, pruned_loss=0.1417, over 2585368.78 frames. ], batch size: 22, lr: 9.80e-03, grad_scale: 4.0 2024-06-20 01:07:57,438 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.954e+02 6.943e+02 8.097e+02 9.691e+02 1.370e+03, threshold=1.619e+03, percent-clipped=0.0 2024-06-20 01:07:59,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=99634.33333333333, ans=0.0 2024-06-20 01:08:00,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=99634.33333333333, ans=0.125 2024-06-20 01:08:00,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=99634.33333333333, ans=0.0 2024-06-20 01:08:01,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=99634.33333333333, ans=0.0 2024-06-20 01:08:07,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.03 vs. 
limit=22.5 2024-06-20 01:08:16,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=99671.0, ans=0.95 2024-06-20 01:08:20,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=99689.33333333333, ans=0.125 2024-06-20 01:08:22,372 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 01:08:24,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=99689.33333333333, ans=0.0 2024-06-20 01:08:26,784 INFO [train.py:1028] (1/2) Epoch 6, batch 3800, loss[loss=0.275, simple_loss=0.2991, pruned_loss=0.1254, over 13242.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3141, pruned_loss=0.1413, over 2584006.91 frames. ], batch size: 83, lr: 9.80e-03, grad_scale: 8.0 2024-06-20 01:08:42,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=99744.33333333333, ans=0.1 2024-06-20 01:09:02,377 INFO [train.py:1028] (1/2) Epoch 6, batch 3850, loss[loss=0.3046, simple_loss=0.3131, pruned_loss=0.148, over 13069.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3131, pruned_loss=0.1403, over 2584573.37 frames. ], batch size: 144, lr: 9.79e-03, grad_scale: 8.0 2024-06-20 01:09:06,249 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.717e+02 7.554e+02 8.331e+02 9.832e+02 1.639e+03, threshold=1.666e+03, percent-clipped=2.0 2024-06-20 01:09:06,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=99799.33333333333, ans=0.0 2024-06-20 01:09:07,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99799.33333333333, ans=0.1 2024-06-20 01:09:07,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=22.24 vs. limit=15.0 2024-06-20 01:09:08,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.10 vs. limit=10.0 2024-06-20 01:09:12,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=99817.66666666667, ans=0.125 2024-06-20 01:09:14,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=99817.66666666667, ans=22.5 2024-06-20 01:09:23,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=99854.33333333333, ans=0.05 2024-06-20 01:09:25,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=99854.33333333333, ans=0.125 2024-06-20 01:09:34,666 INFO [train.py:1028] (1/2) Epoch 6, batch 3900, loss[loss=0.3037, simple_loss=0.3105, pruned_loss=0.1485, over 13234.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3127, pruned_loss=0.1402, over 2587882.83 frames. ], batch size: 83, lr: 9.79e-03, grad_scale: 8.0 2024-06-20 01:09:52,558 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=18.57 vs. 
limit=15.0 2024-06-20 01:10:00,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=99964.33333333333, ans=0.035 2024-06-20 01:10:06,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=99964.33333333333, ans=0.1 2024-06-20 01:10:08,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=99964.33333333333, ans=0.0 2024-06-20 01:10:09,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.12 vs. limit=10.0 2024-06-20 01:10:09,908 INFO [train.py:1028] (1/2) Epoch 6, batch 3950, loss[loss=0.3175, simple_loss=0.3185, pruned_loss=0.1583, over 13132.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3112, pruned_loss=0.1393, over 2589237.54 frames. ], batch size: 132, lr: 9.78e-03, grad_scale: 8.0 2024-06-20 01:10:13,609 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.561e+02 8.249e+02 9.979e+02 1.174e+03 2.298e+03, threshold=1.996e+03, percent-clipped=2.0 2024-06-20 01:10:17,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=100001.0, ans=0.125 2024-06-20 01:10:41,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=100074.33333333333, ans=0.0 2024-06-20 01:10:42,438 INFO [train.py:1028] (1/2) Epoch 6, batch 4000, loss[loss=0.2963, simple_loss=0.3144, pruned_loss=0.1391, over 12936.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3108, pruned_loss=0.1394, over 2583243.47 frames. ], batch size: 39, lr: 9.78e-03, grad_scale: 4.0 2024-06-20 01:11:09,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=100129.33333333333, ans=0.0 2024-06-20 01:11:13,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.86 vs. limit=15.0 2024-06-20 01:11:18,309 INFO [train.py:1028] (1/2) Epoch 6, batch 4050, loss[loss=0.3333, simple_loss=0.3219, pruned_loss=0.1723, over 10942.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3112, pruned_loss=0.14, over 2580363.61 frames. ], batch size: 304, lr: 9.77e-03, grad_scale: 4.0 2024-06-20 01:11:23,349 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.151e+02 7.981e+02 9.352e+02 1.132e+03 2.256e+03, threshold=1.870e+03, percent-clipped=1.0 2024-06-20 01:11:33,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=100202.66666666667, ans=0.125 2024-06-20 01:11:51,284 INFO [train.py:1028] (1/2) Epoch 6, batch 4100, loss[loss=0.3009, simple_loss=0.3165, pruned_loss=0.1427, over 12964.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.311, pruned_loss=0.1402, over 2576473.69 frames. ], batch size: 102, lr: 9.77e-03, grad_scale: 8.0 2024-06-20 01:12:23,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=100331.0, ans=0.125 2024-06-20 01:12:28,166 INFO [train.py:1028] (1/2) Epoch 6, batch 4150, loss[loss=0.2729, simple_loss=0.2989, pruned_loss=0.1235, over 13088.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3095, pruned_loss=0.1392, over 2576066.63 frames. 
], batch size: 55, lr: 9.77e-03, grad_scale: 8.0 2024-06-20 01:12:31,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=100349.33333333333, ans=0.2 2024-06-20 01:12:32,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=100349.33333333333, ans=0.125 2024-06-20 01:12:32,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=100349.33333333333, ans=0.09899494936611666 2024-06-20 01:12:33,407 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 6.239e+02 7.485e+02 8.636e+02 1.277e+03, threshold=1.497e+03, percent-clipped=0.0 2024-06-20 01:12:41,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2024-06-20 01:12:44,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=100386.0, ans=0.2 2024-06-20 01:12:46,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=100386.0, ans=0.125 2024-06-20 01:12:48,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=100404.33333333333, ans=0.025 2024-06-20 01:13:04,530 INFO [train.py:1028] (1/2) Epoch 6, batch 4200, loss[loss=0.2729, simple_loss=0.2828, pruned_loss=0.1315, over 13011.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3088, pruned_loss=0.1388, over 2579812.21 frames. ], batch size: 102, lr: 9.76e-03, grad_scale: 8.0 2024-06-20 01:13:10,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=100441.0, ans=0.125 2024-06-20 01:13:11,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.55 vs. limit=22.5 2024-06-20 01:13:13,494 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.51 vs. limit=10.0 2024-06-20 01:13:14,047 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 01:13:36,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=100514.33333333333, ans=0.0 2024-06-20 01:13:37,936 INFO [train.py:1028] (1/2) Epoch 6, batch 4250, loss[loss=0.2591, simple_loss=0.2903, pruned_loss=0.114, over 13346.00 frames. ], tot_loss[loss=0.293, simple_loss=0.309, pruned_loss=0.1385, over 2581343.86 frames. ], batch size: 46, lr: 9.76e-03, grad_scale: 4.0 2024-06-20 01:13:43,828 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.087e+02 7.802e+02 8.816e+02 9.960e+02 1.586e+03, threshold=1.763e+03, percent-clipped=1.0 2024-06-20 01:13:47,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.13 vs. 
limit=15.0 2024-06-20 01:13:48,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=100551.0, ans=0.125 2024-06-20 01:13:54,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=100569.33333333333, ans=0.1 2024-06-20 01:13:55,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=12.0 2024-06-20 01:14:12,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=100606.0, ans=0.0 2024-06-20 01:14:13,783 INFO [train.py:1028] (1/2) Epoch 6, batch 4300, loss[loss=0.2832, simple_loss=0.3074, pruned_loss=0.1295, over 13188.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.309, pruned_loss=0.1382, over 2580723.02 frames. ], batch size: 59, lr: 9.75e-03, grad_scale: 8.0 2024-06-20 01:14:15,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=100624.33333333333, ans=0.025 2024-06-20 01:14:17,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.48 vs. limit=10.0 2024-06-20 01:14:42,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=100697.66666666667, ans=0.125 2024-06-20 01:14:46,252 INFO [train.py:1028] (1/2) Epoch 6, batch 4350, loss[loss=0.2867, simple_loss=0.3158, pruned_loss=0.1288, over 13165.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3084, pruned_loss=0.1381, over 2584856.29 frames. ], batch size: 59, lr: 9.75e-03, grad_scale: 2.0 2024-06-20 01:14:55,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=100734.33333333333, ans=0.0 2024-06-20 01:14:56,900 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.685e+02 9.767e+02 1.180e+03 1.401e+03 3.558e+03, threshold=2.361e+03, percent-clipped=7.0 2024-06-20 01:14:57,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=100734.33333333333, ans=0.0 2024-06-20 01:14:58,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=100734.33333333333, ans=0.125 2024-06-20 01:14:59,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.98 vs. limit=15.0 2024-06-20 01:15:01,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100734.33333333333, ans=0.1 2024-06-20 01:15:04,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=100752.66666666667, ans=0.05 2024-06-20 01:15:08,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. 
limit=12.0 2024-06-20 01:15:13,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=100771.0, ans=0.125 2024-06-20 01:15:18,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100789.33333333333, ans=0.0 2024-06-20 01:15:22,798 INFO [train.py:1028] (1/2) Epoch 6, batch 4400, loss[loss=0.284, simple_loss=0.2979, pruned_loss=0.1351, over 13230.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3087, pruned_loss=0.1384, over 2585084.89 frames. ], batch size: 83, lr: 9.74e-03, grad_scale: 4.0 2024-06-20 01:15:32,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=100826.0, ans=0.2 2024-06-20 01:15:35,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100844.33333333333, ans=0.1 2024-06-20 01:15:39,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=100844.33333333333, ans=0.125 2024-06-20 01:15:44,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=100862.66666666667, ans=0.125 2024-06-20 01:15:44,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=100862.66666666667, ans=0.2 2024-06-20 01:15:50,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=100881.0, ans=0.125 2024-06-20 01:15:55,800 INFO [train.py:1028] (1/2) Epoch 6, batch 4450, loss[loss=0.3003, simple_loss=0.3255, pruned_loss=0.1375, over 13053.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.309, pruned_loss=0.1387, over 2580662.29 frames. ], batch size: 33, lr: 9.74e-03, grad_scale: 4.0 2024-06-20 01:16:00,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=100899.33333333333, ans=0.125 2024-06-20 01:16:03,118 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.177e+02 8.814e+02 1.069e+03 1.192e+03 3.153e+03, threshold=2.139e+03, percent-clipped=1.0 2024-06-20 01:16:04,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.44 vs. limit=22.5 2024-06-20 01:16:17,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100936.0, ans=0.1 2024-06-20 01:16:21,709 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=3.169e-01 2024-06-20 01:16:28,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2024-06-20 01:16:30,970 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.19 vs. limit=22.5 2024-06-20 01:16:31,841 INFO [train.py:1028] (1/2) Epoch 6, batch 4500, loss[loss=0.2648, simple_loss=0.2949, pruned_loss=0.1173, over 13198.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3077, pruned_loss=0.1377, over 2585285.59 frames. 
], batch size: 89, lr: 9.73e-03, grad_scale: 8.0 2024-06-20 01:16:38,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=101009.33333333333, ans=0.125 2024-06-20 01:16:40,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=101009.33333333333, ans=0.2 2024-06-20 01:16:40,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=101009.33333333333, ans=0.0 2024-06-20 01:16:47,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2024-06-20 01:16:52,201 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.29 vs. limit=15.0 2024-06-20 01:17:05,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2024-06-20 01:17:07,948 INFO [train.py:1028] (1/2) Epoch 6, batch 4550, loss[loss=0.2624, simple_loss=0.2913, pruned_loss=0.1168, over 13255.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3063, pruned_loss=0.1368, over 2589080.90 frames. ], batch size: 52, lr: 9.73e-03, grad_scale: 8.0 2024-06-20 01:17:12,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=101082.66666666667, ans=0.0 2024-06-20 01:17:14,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.372e+02 1.112e+03 1.273e+03 1.464e+03 2.363e+03, threshold=2.546e+03, percent-clipped=1.0 2024-06-20 01:17:15,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=101101.0, ans=0.2 2024-06-20 01:17:23,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=101119.33333333333, ans=0.125 2024-06-20 01:17:26,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.57 vs. limit=10.0 2024-06-20 01:17:30,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=101137.66666666667, ans=0.125 2024-06-20 01:17:35,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=101156.0, ans=0.0 2024-06-20 01:17:39,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=101156.0, ans=0.125 2024-06-20 01:17:40,149 INFO [train.py:1028] (1/2) Epoch 6, batch 4600, loss[loss=0.3543, simple_loss=0.3495, pruned_loss=0.1796, over 12547.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.307, pruned_loss=0.1372, over 2584663.27 frames. ], batch size: 202, lr: 9.73e-03, grad_scale: 8.0 2024-06-20 01:17:41,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=101174.33333333333, ans=15.0 2024-06-20 01:17:42,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.09 vs. 
limit=22.5 2024-06-20 01:17:48,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=101192.66666666667, ans=0.1 2024-06-20 01:17:57,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=101211.0, ans=0.125 2024-06-20 01:18:00,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=101229.33333333333, ans=0.125 2024-06-20 01:18:08,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=101229.33333333333, ans=0.0 2024-06-20 01:18:13,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.33 vs. limit=10.0 2024-06-20 01:18:14,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=101247.66666666667, ans=0.125 2024-06-20 01:18:16,535 INFO [train.py:1028] (1/2) Epoch 6, batch 4650, loss[loss=0.3027, simple_loss=0.3076, pruned_loss=0.1489, over 13111.00 frames. ], tot_loss[loss=0.29, simple_loss=0.306, pruned_loss=0.137, over 2587653.85 frames. ], batch size: 132, lr: 9.72e-03, grad_scale: 2.0 2024-06-20 01:18:17,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101266.0, ans=0.1 2024-06-20 01:18:25,546 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.841e+02 9.136e+02 1.077e+03 1.233e+03 1.735e+03, threshold=2.153e+03, percent-clipped=0.0 2024-06-20 01:18:26,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=101284.33333333333, ans=0.1 2024-06-20 01:18:32,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=101302.66666666667, ans=0.1 2024-06-20 01:18:33,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=101302.66666666667, ans=0.125 2024-06-20 01:18:46,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=101339.33333333333, ans=0.2 2024-06-20 01:18:52,997 INFO [train.py:1028] (1/2) Epoch 6, batch 4700, loss[loss=0.2955, simple_loss=0.3273, pruned_loss=0.1318, over 12334.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3061, pruned_loss=0.1368, over 2583435.88 frames. ], batch size: 25, lr: 9.72e-03, grad_scale: 4.0 2024-06-20 01:19:23,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=101431.0, ans=0.025 2024-06-20 01:19:24,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101431.0, ans=0.1 2024-06-20 01:19:25,970 INFO [train.py:1028] (1/2) Epoch 6, batch 4750, loss[loss=0.3158, simple_loss=0.3195, pruned_loss=0.156, over 12522.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3061, pruned_loss=0.1373, over 2579973.60 frames. ], batch size: 202, lr: 9.71e-03, grad_scale: 2.0 2024-06-20 01:19:34,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.99 vs. 
limit=15.0 2024-06-20 01:19:35,845 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.428e+02 8.777e+02 1.007e+03 1.182e+03 2.243e+03, threshold=2.013e+03, percent-clipped=1.0 2024-06-20 01:19:46,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=101504.33333333333, ans=10.0 2024-06-20 01:19:49,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=101504.33333333333, ans=0.0 2024-06-20 01:19:50,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.45 vs. limit=15.0 2024-06-20 01:19:52,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=101522.66666666667, ans=0.0 2024-06-20 01:19:58,997 INFO [train.py:1028] (1/2) Epoch 6, batch 4800, loss[loss=0.2893, simple_loss=0.3091, pruned_loss=0.1348, over 13259.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3059, pruned_loss=0.137, over 2577495.27 frames. ], batch size: 63, lr: 9.71e-03, grad_scale: 4.0 2024-06-20 01:19:59,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=101541.0, ans=0.125 2024-06-20 01:20:07,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=101541.0, ans=0.0 2024-06-20 01:20:07,957 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=28.12 vs. limit=22.5 2024-06-20 01:20:17,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.46 vs. limit=15.0 2024-06-20 01:20:19,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=101577.66666666667, ans=0.0 2024-06-20 01:20:20,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.08 vs. limit=22.5 2024-06-20 01:20:25,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=12.0 2024-06-20 01:20:34,758 INFO [train.py:1028] (1/2) Epoch 6, batch 4850, loss[loss=0.2788, simple_loss=0.2931, pruned_loss=0.1323, over 13272.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3063, pruned_loss=0.1372, over 2575876.36 frames. ], batch size: 89, lr: 9.70e-03, grad_scale: 2.0 2024-06-20 01:20:39,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=101632.66666666667, ans=0.125 2024-06-20 01:20:48,451 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.975e+02 9.063e+02 1.146e+03 1.329e+03 2.365e+03, threshold=2.292e+03, percent-clipped=3.0 2024-06-20 01:21:02,402 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2024-06-20 01:21:11,502 INFO [train.py:1028] (1/2) Epoch 6, batch 4900, loss[loss=0.3183, simple_loss=0.3278, pruned_loss=0.1544, over 13252.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3064, pruned_loss=0.1373, over 2576253.05 frames. 
], batch size: 59, lr: 9.70e-03, grad_scale: 4.0 2024-06-20 01:21:22,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=101742.66666666667, ans=0.125 2024-06-20 01:21:36,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=101779.33333333333, ans=0.09899494936611666 2024-06-20 01:21:41,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101797.66666666667, ans=0.1 2024-06-20 01:21:42,031 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.39 vs. limit=15.0 2024-06-20 01:21:44,232 INFO [train.py:1028] (1/2) Epoch 6, batch 4950, loss[loss=0.3105, simple_loss=0.3067, pruned_loss=0.1571, over 11072.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3074, pruned_loss=0.1386, over 2570159.59 frames. ], batch size: 304, lr: 9.70e-03, grad_scale: 4.0 2024-06-20 01:21:45,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=101816.0, ans=0.05 2024-06-20 01:21:47,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.44 vs. limit=22.5 2024-06-20 01:21:55,128 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 8.228e+02 1.276e+03 1.546e+03 1.810e+03 3.035e+03, threshold=3.093e+03, percent-clipped=7.0 2024-06-20 01:22:05,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=101852.66666666667, ans=0.2 2024-06-20 01:22:05,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=101852.66666666667, ans=0.125 2024-06-20 01:22:06,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=101871.0, ans=0.1 2024-06-20 01:22:07,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=101871.0, ans=0.125 2024-06-20 01:22:13,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.65 vs. limit=22.5 2024-06-20 01:22:14,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=101889.33333333333, ans=0.125 2024-06-20 01:22:19,898 INFO [train.py:1028] (1/2) Epoch 6, batch 5000, loss[loss=0.2942, simple_loss=0.3072, pruned_loss=0.1406, over 13173.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3062, pruned_loss=0.1374, over 2574443.17 frames. ], batch size: 95, lr: 9.69e-03, grad_scale: 4.0 2024-06-20 01:22:23,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=101907.66666666667, ans=0.1 2024-06-20 01:22:40,298 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.15 vs. limit=15.0 2024-06-20 01:22:56,218 INFO [train.py:1028] (1/2) Epoch 6, batch 5050, loss[loss=0.2945, simple_loss=0.3142, pruned_loss=0.1374, over 12964.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3068, pruned_loss=0.1373, over 2574209.60 frames. 
], batch size: 36, lr: 9.69e-03, grad_scale: 2.0 2024-06-20 01:22:57,022 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 01:22:57,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=101999.33333333333, ans=0.125 2024-06-20 01:23:07,518 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.998e+02 1.271e+03 1.495e+03 1.762e+03 2.864e+03, threshold=2.991e+03, percent-clipped=0.0 2024-06-20 01:23:28,474 INFO [train.py:1028] (1/2) Epoch 6, batch 5100, loss[loss=0.2952, simple_loss=0.3163, pruned_loss=0.137, over 12916.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.307, pruned_loss=0.1381, over 2569370.09 frames. ], batch size: 39, lr: 9.68e-03, grad_scale: 2.0 2024-06-20 01:23:42,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.63 vs. limit=5.0 2024-06-20 01:23:50,925 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.66 vs. limit=10.0 2024-06-20 01:23:58,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=102164.33333333333, ans=0.1 2024-06-20 01:23:59,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=102164.33333333333, ans=0.125 2024-06-20 01:24:03,700 INFO [train.py:1028] (1/2) Epoch 6, batch 5150, loss[loss=0.2887, simple_loss=0.2955, pruned_loss=0.1409, over 13058.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3059, pruned_loss=0.1377, over 2571658.11 frames. ], batch size: 132, lr: 9.68e-03, grad_scale: 2.0 2024-06-20 01:24:04,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.91 vs. limit=15.0 2024-06-20 01:24:06,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102182.66666666667, ans=0.1 2024-06-20 01:24:06,300 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0 2024-06-20 01:24:13,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102201.0, ans=0.1 2024-06-20 01:24:16,275 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 6.962e+02 1.438e+03 1.698e+03 1.975e+03 4.116e+03, threshold=3.396e+03, percent-clipped=1.0 2024-06-20 01:24:21,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=102219.33333333333, ans=0.0 2024-06-20 01:24:29,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=102256.0, ans=0.125 2024-06-20 01:24:30,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=102256.0, ans=0.09899494936611666 2024-06-20 01:24:35,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.17 vs. 
limit=15.0 2024-06-20 01:24:35,905 INFO [train.py:1028] (1/2) Epoch 6, batch 5200, loss[loss=0.2697, simple_loss=0.2894, pruned_loss=0.125, over 13157.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3058, pruned_loss=0.1375, over 2574099.14 frames. ], batch size: 95, lr: 9.67e-03, grad_scale: 4.0 2024-06-20 01:24:40,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=102274.33333333333, ans=0.0 2024-06-20 01:24:44,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=102274.33333333333, ans=0.0 2024-06-20 01:25:00,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.69 vs. limit=15.0 2024-06-20 01:25:01,252 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2024-06-20 01:25:11,107 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.81 vs. limit=15.0 2024-06-20 01:25:12,029 INFO [train.py:1028] (1/2) Epoch 6, batch 5250, loss[loss=0.2776, simple_loss=0.2935, pruned_loss=0.1308, over 13260.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3073, pruned_loss=0.1385, over 2570137.16 frames. ], batch size: 52, lr: 9.67e-03, grad_scale: 2.0 2024-06-20 01:25:12,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=102366.0, ans=0.125 2024-06-20 01:25:16,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=102366.0, ans=0.05 2024-06-20 01:25:16,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=102366.0, ans=0.2 2024-06-20 01:25:17,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=102366.0, ans=0.07 2024-06-20 01:25:25,464 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.309e+02 1.320e+03 1.559e+03 1.852e+03 2.798e+03, threshold=3.117e+03, percent-clipped=0.0 2024-06-20 01:25:38,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=102439.33333333333, ans=0.0 2024-06-20 01:25:42,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=102439.33333333333, ans=0.125 2024-06-20 01:25:44,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.74 vs. limit=22.5 2024-06-20 01:25:45,473 INFO [train.py:1028] (1/2) Epoch 6, batch 5300, loss[loss=0.3113, simple_loss=0.3189, pruned_loss=0.1518, over 13012.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3071, pruned_loss=0.1385, over 2566807.54 frames. ], batch size: 144, lr: 9.67e-03, grad_scale: 4.0 2024-06-20 01:25:49,715 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.44 vs. 
limit=10.0 2024-06-20 01:26:01,140 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 01:26:04,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=102494.33333333333, ans=0.125 2024-06-20 01:26:15,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=102531.0, ans=0.2 2024-06-20 01:26:15,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=102531.0, ans=0.2 2024-06-20 01:26:22,313 INFO [train.py:1028] (1/2) Epoch 6, batch 5350, loss[loss=0.3074, simple_loss=0.3228, pruned_loss=0.146, over 11134.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.3066, pruned_loss=0.138, over 2572654.19 frames. ], batch size: 16, lr: 9.66e-03, grad_scale: 2.0 2024-06-20 01:26:40,283 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 9.071e+02 1.349e+03 1.608e+03 1.884e+03 2.769e+03, threshold=3.216e+03, percent-clipped=0.0 2024-06-20 01:26:50,989 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2024-06-20 01:26:55,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=102622.66666666667, ans=0.0 2024-06-20 01:26:57,306 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.75 vs. limit=6.0 2024-06-20 01:26:58,309 INFO [train.py:1028] (1/2) Epoch 6, batch 5400, loss[loss=0.3401, simple_loss=0.3271, pruned_loss=0.1766, over 12272.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3076, pruned_loss=0.1393, over 2565720.44 frames. ], batch size: 240, lr: 9.66e-03, grad_scale: 2.0 2024-06-20 01:27:02,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=102641.0, ans=0.2 2024-06-20 01:27:06,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=15.0 2024-06-20 01:27:13,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=102659.33333333333, ans=0.125 2024-06-20 01:27:14,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=102659.33333333333, ans=0.125 2024-06-20 01:27:15,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.19 vs. limit=22.5 2024-06-20 01:27:19,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.95 vs. limit=15.0 2024-06-20 01:27:23,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=102696.0, ans=0.125 2024-06-20 01:27:27,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=102696.0, ans=0.04949747468305833 2024-06-20 01:27:36,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=25.33 vs. 
limit=15.0 2024-06-20 01:27:36,690 INFO [train.py:1028] (1/2) Epoch 6, batch 5450, loss[loss=0.3036, simple_loss=0.323, pruned_loss=0.1421, over 12482.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3068, pruned_loss=0.1382, over 2570189.84 frames. ], batch size: 25, lr: 9.65e-03, grad_scale: 1.0 2024-06-20 01:27:44,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=102751.0, ans=0.0 2024-06-20 01:27:46,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=102751.0, ans=0.125 2024-06-20 01:27:55,968 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.928e+02 9.785e+02 1.164e+03 1.397e+03 5.624e+03, threshold=2.327e+03, percent-clipped=2.0 2024-06-20 01:27:56,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=102769.33333333333, ans=0.2 2024-06-20 01:28:10,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=102806.0, ans=0.125 2024-06-20 01:28:13,643 INFO [train.py:1028] (1/2) Epoch 6, batch 5500, loss[loss=0.3388, simple_loss=0.3326, pruned_loss=0.1724, over 12129.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3058, pruned_loss=0.1375, over 2562854.67 frames. ], batch size: 240, lr: 9.65e-03, grad_scale: 2.0 2024-06-20 01:28:24,653 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.47 vs. limit=22.5 2024-06-20 01:28:33,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=102861.0, ans=0.025 2024-06-20 01:28:34,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=102861.0, ans=0.125 2024-06-20 01:28:35,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=102861.0, ans=0.0 2024-06-20 01:28:35,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=102861.0, ans=0.0 2024-06-20 01:28:45,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=102897.66666666667, ans=0.0 2024-06-20 01:28:48,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.97 vs. limit=15.0 2024-06-20 01:28:49,902 INFO [train.py:1028] (1/2) Epoch 6, batch 5550, loss[loss=0.2887, simple_loss=0.3018, pruned_loss=0.1378, over 13186.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3051, pruned_loss=0.1363, over 2566553.28 frames. ], batch size: 43, lr: 9.65e-03, grad_scale: 2.0 2024-06-20 01:28:50,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=102916.0, ans=0.2 2024-06-20 01:28:54,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.96 vs. 
limit=15.0 2024-06-20 01:28:58,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=102934.33333333333, ans=0.125 2024-06-20 01:28:58,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=102934.33333333333, ans=0.125 2024-06-20 01:28:58,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.55 vs. limit=15.0 2024-06-20 01:29:05,519 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 7.860e+02 1.026e+03 1.226e+03 1.535e+03 6.286e+03, threshold=2.452e+03, percent-clipped=5.0 2024-06-20 01:29:12,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=102971.0, ans=0.0 2024-06-20 01:29:16,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=102989.33333333333, ans=0.125 2024-06-20 01:29:17,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102989.33333333333, ans=0.1 2024-06-20 01:29:19,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=102989.33333333333, ans=0.04949747468305833 2024-06-20 01:29:20,532 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=5.311e+02 2024-06-20 01:29:22,239 INFO [train.py:1028] (1/2) Epoch 6, batch 5600, loss[loss=0.2674, simple_loss=0.2897, pruned_loss=0.1225, over 13277.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3046, pruned_loss=0.1362, over 2568602.21 frames. ], batch size: 89, lr: 9.64e-03, grad_scale: 2.0 2024-06-20 01:29:22,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=15.0 2024-06-20 01:29:25,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=103007.66666666667, ans=0.0 2024-06-20 01:29:44,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.71 vs. limit=10.0 2024-06-20 01:29:59,000 INFO [train.py:1028] (1/2) Epoch 6, batch 5650, loss[loss=0.3411, simple_loss=0.3322, pruned_loss=0.175, over 12501.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3038, pruned_loss=0.1355, over 2573516.15 frames. 
], batch size: 202, lr: 9.64e-03, grad_scale: 2.0 2024-06-20 01:30:05,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103117.66666666667, ans=0.1 2024-06-20 01:30:12,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=103136.0, ans=0.125 2024-06-20 01:30:13,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=103136.0, ans=0.0 2024-06-20 01:30:15,278 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.115e+02 8.754e+02 9.782e+02 1.242e+03 3.604e+03, threshold=1.956e+03, percent-clipped=2.0 2024-06-20 01:30:16,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=103136.0, ans=0.125 2024-06-20 01:30:26,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.98 vs. limit=15.0 2024-06-20 01:30:28,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=103172.66666666667, ans=0.0 2024-06-20 01:30:31,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=103172.66666666667, ans=0.125 2024-06-20 01:30:32,845 INFO [train.py:1028] (1/2) Epoch 6, batch 5700, loss[loss=0.2578, simple_loss=0.2863, pruned_loss=0.1146, over 13300.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3038, pruned_loss=0.1358, over 2576719.05 frames. ], batch size: 63, lr: 9.63e-03, grad_scale: 4.0 2024-06-20 01:30:32,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=103191.0, ans=0.0 2024-06-20 01:30:53,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=103227.66666666667, ans=0.2 2024-06-20 01:30:57,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=103246.0, ans=0.125 2024-06-20 01:30:58,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.47 vs. limit=10.0 2024-06-20 01:31:09,039 INFO [train.py:1028] (1/2) Epoch 6, batch 5750, loss[loss=0.3233, simple_loss=0.3269, pruned_loss=0.1598, over 12727.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3053, pruned_loss=0.1368, over 2577983.70 frames. ], batch size: 176, lr: 9.63e-03, grad_scale: 1.0 2024-06-20 01:31:11,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=103282.66666666667, ans=0.125 2024-06-20 01:31:13,860 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.41 vs. 
limit=15.0 2024-06-20 01:31:26,249 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.927e+02 9.370e+02 1.129e+03 1.328e+03 3.062e+03, threshold=2.258e+03, percent-clipped=3.0 2024-06-20 01:31:32,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=103337.66666666667, ans=0.125 2024-06-20 01:31:41,792 INFO [train.py:1028] (1/2) Epoch 6, batch 5800, loss[loss=0.3005, simple_loss=0.3063, pruned_loss=0.1473, over 12799.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3078, pruned_loss=0.1388, over 2577329.05 frames. ], batch size: 176, lr: 9.62e-03, grad_scale: 2.0 2024-06-20 01:31:52,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=103392.66666666667, ans=0.0 2024-06-20 01:31:53,815 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.00 vs. limit=15.0 2024-06-20 01:32:02,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0 2024-06-20 01:32:12,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=103447.66666666667, ans=0.0 2024-06-20 01:32:12,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103447.66666666667, ans=0.1 2024-06-20 01:32:12,660 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.15 vs. limit=15.0 2024-06-20 01:32:18,808 INFO [train.py:1028] (1/2) Epoch 6, batch 5850, loss[loss=0.3253, simple_loss=0.3277, pruned_loss=0.1615, over 12570.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3107, pruned_loss=0.1407, over 2576823.38 frames. ], batch size: 202, lr: 9.62e-03, grad_scale: 2.0 2024-06-20 01:32:21,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=103466.0, ans=0.125 2024-06-20 01:32:24,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=103466.0, ans=0.2 2024-06-20 01:32:25,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=103484.33333333333, ans=0.0 2024-06-20 01:32:37,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.16 vs. limit=22.5 2024-06-20 01:32:39,330 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.489e+02 1.111e+03 1.278e+03 1.486e+03 2.956e+03, threshold=2.557e+03, percent-clipped=1.0 2024-06-20 01:32:39,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=103502.66666666667, ans=0.2 2024-06-20 01:32:43,206 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.84 vs. 
limit=15.0 2024-06-20 01:32:46,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=103521.0, ans=0.0 2024-06-20 01:32:55,379 INFO [train.py:1028] (1/2) Epoch 6, batch 5900, loss[loss=0.2604, simple_loss=0.2751, pruned_loss=0.1229, over 13107.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3128, pruned_loss=0.1415, over 2577493.67 frames. ], batch size: 121, lr: 9.62e-03, grad_scale: 4.0 2024-06-20 01:33:03,408 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.23 vs. limit=15.0 2024-06-20 01:33:05,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=15.0 2024-06-20 01:33:06,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.75 vs. limit=15.0 2024-06-20 01:33:09,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.75 vs. limit=15.0 2024-06-20 01:33:28,995 INFO [train.py:1028] (1/2) Epoch 6, batch 5950, loss[loss=0.3083, simple_loss=0.3177, pruned_loss=0.1494, over 13108.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3136, pruned_loss=0.1416, over 2581576.03 frames. ], batch size: 121, lr: 9.61e-03, grad_scale: 4.0 2024-06-20 01:33:30,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.53 vs. limit=12.0 2024-06-20 01:33:37,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=103667.66666666667, ans=0.0 2024-06-20 01:33:41,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.70 vs. limit=22.5 2024-06-20 01:33:42,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=103686.0, ans=0.0 2024-06-20 01:33:45,660 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.269e+02 8.909e+02 1.064e+03 1.243e+03 1.858e+03, threshold=2.127e+03, percent-clipped=0.0 2024-06-20 01:33:48,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=103704.33333333333, ans=0.0 2024-06-20 01:33:51,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=103704.33333333333, ans=0.125 2024-06-20 01:34:04,619 INFO [train.py:1028] (1/2) Epoch 6, batch 6000, loss[loss=0.368, simple_loss=0.3531, pruned_loss=0.1914, over 12264.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.315, pruned_loss=0.1426, over 2574837.90 frames. 
], batch size: 240, lr: 9.61e-03, grad_scale: 8.0 2024-06-20 01:34:04,620 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 01:34:10,853 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([5.0308, 4.0093, 2.8646, 4.5217], device='cuda:1') 2024-06-20 01:34:11,170 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.1402, 1.0450, 1.3596, 2.3964], device='cuda:1') 2024-06-20 01:34:12,613 INFO [train.py:1060] (1/2) Epoch 6, validation: loss=0.2312, simple_loss=0.2879, pruned_loss=0.08728, over 351949.00 frames. 2024-06-20 01:34:12,614 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 01:34:16,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.64 vs. limit=15.0 2024-06-20 01:34:19,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=103759.33333333333, ans=0.2 2024-06-20 01:34:25,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=103777.66666666667, ans=0.2 2024-06-20 01:34:32,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=103796.0, ans=0.0 2024-06-20 01:34:33,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=103796.0, ans=0.125 2024-06-20 01:34:33,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.85 vs. limit=22.5 2024-06-20 01:34:46,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103814.33333333333, ans=0.1 2024-06-20 01:34:51,372 INFO [train.py:1028] (1/2) Epoch 6, batch 6050, loss[loss=0.283, simple_loss=0.3123, pruned_loss=0.1268, over 12953.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3165, pruned_loss=0.1429, over 2577794.29 frames. ], batch size: 39, lr: 9.60e-03, grad_scale: 8.0 2024-06-20 01:35:08,724 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.774e+02 7.397e+02 8.831e+02 1.017e+03 1.469e+03, threshold=1.766e+03, percent-clipped=0.0 2024-06-20 01:35:20,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=103906.0, ans=0.125 2024-06-20 01:35:24,966 INFO [train.py:1028] (1/2) Epoch 6, batch 6100, loss[loss=0.2915, simple_loss=0.3055, pruned_loss=0.1387, over 13112.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3181, pruned_loss=0.1436, over 2579822.29 frames. ], batch size: 121, lr: 9.60e-03, grad_scale: 8.0 2024-06-20 01:35:41,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.43 vs. 
limit=22.5 2024-06-20 01:35:44,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=103961.0, ans=0.2 2024-06-20 01:35:48,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=103979.33333333333, ans=0.05 2024-06-20 01:35:53,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=103997.66666666667, ans=0.1 2024-06-20 01:35:58,623 INFO [train.py:1028] (1/2) Epoch 6, batch 6150, loss[loss=0.3388, simple_loss=0.3327, pruned_loss=0.1724, over 10827.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3204, pruned_loss=0.1447, over 2578518.33 frames. ], batch size: 303, lr: 9.59e-03, grad_scale: 8.0 2024-06-20 01:36:01,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.69 vs. limit=15.0 2024-06-20 01:36:15,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=104052.66666666667, ans=0.09899494936611666 2024-06-20 01:36:18,811 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.575e+02 7.844e+02 8.898e+02 1.040e+03 1.564e+03, threshold=1.780e+03, percent-clipped=0.0 2024-06-20 01:36:24,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104071.0, ans=0.1 2024-06-20 01:36:29,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=104089.33333333333, ans=0.025 2024-06-20 01:36:30,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.95 vs. limit=15.0 2024-06-20 01:36:35,077 INFO [train.py:1028] (1/2) Epoch 6, batch 6200, loss[loss=0.34, simple_loss=0.3549, pruned_loss=0.1626, over 13233.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3231, pruned_loss=0.1463, over 2575254.58 frames. ], batch size: 89, lr: 9.59e-03, grad_scale: 8.0 2024-06-20 01:36:35,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=104107.66666666667, ans=0.05 2024-06-20 01:36:42,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=104107.66666666667, ans=0.0 2024-06-20 01:36:59,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=104162.66666666667, ans=0.1 2024-06-20 01:37:02,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=104162.66666666667, ans=0.0 2024-06-20 01:37:06,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=104181.0, ans=0.125 2024-06-20 01:37:12,645 INFO [train.py:1028] (1/2) Epoch 6, batch 6250, loss[loss=0.3083, simple_loss=0.3268, pruned_loss=0.1448, over 13177.00 frames. ], tot_loss[loss=0.3099, simple_loss=0.3247, pruned_loss=0.1476, over 2567369.40 frames. 
], batch size: 83, lr: 9.59e-03, grad_scale: 4.0
2024-06-20 01:37:14,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=104199.33333333333, ans=0.025
2024-06-20 01:37:18,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=104199.33333333333, ans=0.0
2024-06-20 01:37:27,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=104236.0, ans=0.2
2024-06-20 01:37:30,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=22.5
2024-06-20 01:37:30,884 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.952e+02 6.215e+02 7.331e+02 8.348e+02 1.298e+03, threshold=1.466e+03, percent-clipped=0.0
2024-06-20 01:37:33,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=104254.33333333333, ans=0.125
2024-06-20 01:37:36,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.18 vs. limit=15.0
2024-06-20 01:37:37,079 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.29 vs. limit=15.0
2024-06-20 01:37:41,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=104272.66666666667, ans=0.0
2024-06-20 01:37:45,009 INFO [train.py:1028] (1/2) Epoch 6, batch 6300, loss[loss=0.2675, simple_loss=0.3003, pruned_loss=0.1174, over 11964.00 frames. ], tot_loss[loss=0.3115, simple_loss=0.3263, pruned_loss=0.1483, over 2563134.60 frames. ], batch size: 17, lr: 9.58e-03, grad_scale: 8.0
2024-06-20 01:37:47,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104291.0, ans=0.1
2024-06-20 01:37:49,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=104291.0, ans=0.1
2024-06-20 01:38:05,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=104327.66666666667, ans=0.0
2024-06-20 01:38:12,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=104346.0, ans=0.0
2024-06-20 01:38:17,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=104364.33333333333, ans=0.0
2024-06-20 01:38:18,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.35 vs. limit=15.0
2024-06-20 01:38:21,052 INFO [train.py:1028] (1/2) Epoch 6, batch 6350, loss[loss=0.3811, simple_loss=0.3766, pruned_loss=0.1928, over 12523.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.328, pruned_loss=0.1485, over 2572133.70 frames. ], batch size: 202, lr: 9.58e-03, grad_scale: 4.0
2024-06-20 01:38:24,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=104382.66666666667, ans=0.035
2024-06-20 01:38:38,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104419.33333333333, ans=0.1
2024-06-20 01:38:39,576 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.984e+02 6.094e+02 7.184e+02 8.120e+02 1.749e+03, threshold=1.437e+03, percent-clipped=3.0
2024-06-20 01:38:44,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=104437.66666666667, ans=0.125
2024-06-20 01:38:46,821 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=15.0
2024-06-20 01:38:47,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=104437.66666666667, ans=0.125
2024-06-20 01:38:50,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=104456.0, ans=0.125
2024-06-20 01:38:56,894 INFO [train.py:1028] (1/2) Epoch 6, batch 6400, loss[loss=0.3126, simple_loss=0.3369, pruned_loss=0.1442, over 13220.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.3303, pruned_loss=0.1498, over 2574427.70 frames. ], batch size: 67, lr: 9.57e-03, grad_scale: 8.0
2024-06-20 01:39:04,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=104492.66666666667, ans=0.125
2024-06-20 01:39:07,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=104492.66666666667, ans=0.0
2024-06-20 01:39:13,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=104511.0, ans=0.0
2024-06-20 01:39:16,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=104529.33333333333, ans=0.05
2024-06-20 01:39:29,523 INFO [train.py:1028] (1/2) Epoch 6, batch 6450, loss[loss=0.3915, simple_loss=0.3787, pruned_loss=0.2021, over 12542.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3335, pruned_loss=0.1517, over 2580376.66 frames. ], batch size: 202, lr: 9.57e-03, grad_scale: 8.0
2024-06-20 01:39:33,028 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.158e+02
2024-06-20 01:39:33,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.63 vs. limit=15.0
2024-06-20 01:39:33,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=104566.0, ans=0.125
2024-06-20 01:39:48,696 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.79 vs. limit=10.0
2024-06-20 01:39:48,873 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.998e+02 7.295e+02 8.514e+02 9.694e+02 1.702e+03, threshold=1.703e+03, percent-clipped=2.0
2024-06-20 01:39:50,258 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 01:39:55,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.92 vs. limit=15.0
2024-06-20 01:40:01,390 INFO [train.py:1028] (1/2) Epoch 6, batch 6500, loss[loss=0.3278, simple_loss=0.3264, pruned_loss=0.1645, over 10920.00 frames. ], tot_loss[loss=0.3198, simple_loss=0.3351, pruned_loss=0.1522, over 2584212.09 frames. ], batch size: 303, lr: 9.57e-03, grad_scale: 4.0
2024-06-20 01:40:13,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=104676.0, ans=0.125
2024-06-20 01:40:14,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=104676.0, ans=0.125
2024-06-20 01:40:34,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=104731.0, ans=22.5
2024-06-20 01:40:35,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.51 vs. limit=22.5
2024-06-20 01:40:35,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=104731.0, ans=0.125
2024-06-20 01:40:37,001 INFO [train.py:1028] (1/2) Epoch 6, batch 6550, loss[loss=0.2508, simple_loss=0.2876, pruned_loss=0.1069, over 12558.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.3363, pruned_loss=0.1524, over 2588378.12 frames. ], batch size: 22, lr: 9.56e-03, grad_scale: 2.0
2024-06-20 01:40:37,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=104749.33333333333, ans=0.0
2024-06-20 01:40:38,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=104749.33333333333, ans=0.125
2024-06-20 01:40:39,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=104749.33333333333, ans=0.125
2024-06-20 01:41:01,502 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.581e+02 7.885e+02 9.108e+02 1.072e+03 2.590e+03, threshold=1.822e+03, percent-clipped=4.0
2024-06-20 01:41:05,297 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.81 vs. limit=10.0
2024-06-20 01:41:13,242 INFO [train.py:1028] (1/2) Epoch 6, batch 6600, loss[loss=0.2787, simple_loss=0.3117, pruned_loss=0.1229, over 13197.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3363, pruned_loss=0.1522, over 2590660.22 frames. ], batch size: 72, lr: 9.56e-03, grad_scale: 4.0
2024-06-20 01:41:14,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=104841.0, ans=0.125
2024-06-20 01:41:14,352 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.29 vs. limit=15.0
2024-06-20 01:41:18,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=15.0
2024-06-20 01:41:22,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.89 vs. limit=22.5
2024-06-20 01:41:27,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=104877.66666666667, ans=0.0
2024-06-20 01:41:27,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=104877.66666666667, ans=10.0
2024-06-20 01:41:31,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=104877.66666666667, ans=0.0
2024-06-20 01:41:34,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=104896.0, ans=0.125
2024-06-20 01:41:36,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104896.0, ans=0.1
2024-06-20 01:41:46,044 INFO [train.py:1028] (1/2) Epoch 6, batch 6650, loss[loss=0.3688, simple_loss=0.3653, pruned_loss=0.1862, over 13000.00 frames. ], tot_loss[loss=0.3224, simple_loss=0.3383, pruned_loss=0.1533, over 2583805.09 frames. ], batch size: 158, lr: 9.55e-03, grad_scale: 4.0
2024-06-20 01:41:52,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=104951.0, ans=0.125
2024-06-20 01:42:00,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=104969.33333333333, ans=0.1
2024-06-20 01:42:00,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=104969.33333333333, ans=0.0
2024-06-20 01:42:06,884 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.050e+02 7.308e+02 8.435e+02 9.713e+02 1.453e+03, threshold=1.687e+03, percent-clipped=0.0
2024-06-20 01:42:18,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=105006.0, ans=0.2
2024-06-20 01:42:22,575 INFO [train.py:1028] (1/2) Epoch 6, batch 6700, loss[loss=0.346, simple_loss=0.3529, pruned_loss=0.1696, over 12700.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3397, pruned_loss=0.1539, over 2582669.14 frames. ], batch size: 176, lr: 9.55e-03, grad_scale: 8.0
2024-06-20 01:42:37,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=105061.0, ans=0.0
2024-06-20 01:42:56,117 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.18 vs. limit=15.0
2024-06-20 01:42:57,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=105097.66666666667, ans=0.2
2024-06-20 01:43:00,689 INFO [train.py:1028] (1/2) Epoch 6, batch 6750, loss[loss=0.4362, simple_loss=0.4119, pruned_loss=0.2302, over 12252.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3409, pruned_loss=0.155, over 2575952.44 frames. ], batch size: 241, lr: 9.55e-03, grad_scale: 4.0
2024-06-20 01:43:08,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=105134.33333333333, ans=0.2
2024-06-20 01:43:09,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.27 vs. limit=15.0
2024-06-20 01:43:12,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.81 vs. limit=22.5
2024-06-20 01:43:14,970 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.82 vs. limit=22.5
2024-06-20 01:43:16,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=105152.66666666667, ans=0.125
2024-06-20 01:43:22,895 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.541e+02 6.858e+02 8.793e+02 1.036e+03 4.493e+03, threshold=1.759e+03, percent-clipped=1.0
2024-06-20 01:43:33,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=105207.66666666667, ans=0.125
2024-06-20 01:43:33,466 INFO [train.py:1028] (1/2) Epoch 6, batch 6800, loss[loss=0.3217, simple_loss=0.3446, pruned_loss=0.1494, over 13263.00 frames. ], tot_loss[loss=0.3264, simple_loss=0.3423, pruned_loss=0.1552, over 2577905.30 frames. ], batch size: 67, lr: 9.54e-03, grad_scale: 4.0
2024-06-20 01:43:57,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=105262.66666666667, ans=0.0
2024-06-20 01:44:00,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=105281.0, ans=0.2
2024-06-20 01:44:03,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=15.0
2024-06-20 01:44:05,920 INFO [train.py:1028] (1/2) Epoch 6, batch 6850, loss[loss=0.3505, simple_loss=0.3727, pruned_loss=0.1642, over 13276.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3423, pruned_loss=0.1547, over 2581908.70 frames. ], batch size: 63, lr: 9.54e-03, grad_scale: 4.0
2024-06-20 01:44:12,387 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.71 vs. limit=6.0
2024-06-20 01:44:24,914 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.79 vs. limit=22.5
2024-06-20 01:44:31,554 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.802e+02 7.452e+02 8.768e+02 1.027e+03 3.457e+03, threshold=1.754e+03, percent-clipped=4.0
2024-06-20 01:44:31,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=105354.33333333333, ans=0.05
2024-06-20 01:44:39,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105372.66666666667, ans=0.1
2024-06-20 01:44:41,748 INFO [train.py:1028] (1/2) Epoch 6, batch 6900, loss[loss=0.3384, simple_loss=0.3594, pruned_loss=0.1587, over 13277.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3439, pruned_loss=0.1558, over 2584111.59 frames. ], batch size: 49, lr: 9.53e-03, grad_scale: 8.0
2024-06-20 01:44:45,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=105391.0, ans=0.125
2024-06-20 01:45:03,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=105427.66666666667, ans=0.125
2024-06-20 01:45:14,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.99 vs. limit=22.5
2024-06-20 01:45:17,191 INFO [train.py:1028] (1/2) Epoch 6, batch 6950, loss[loss=0.3007, simple_loss=0.3287, pruned_loss=0.1363, over 11806.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3448, pruned_loss=0.1561, over 2579301.49 frames. ], batch size: 17, lr: 9.53e-03, grad_scale: 4.0
2024-06-20 01:45:25,955 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.62 vs. limit=12.0
2024-06-20 01:45:28,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=105501.0, ans=0.0
2024-06-20 01:45:29,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=105519.33333333333, ans=0.2
2024-06-20 01:45:29,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=105519.33333333333, ans=0.1
2024-06-20 01:45:39,873 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.460e+02 1.025e+03 1.227e+03 1.439e+03 2.544e+03, threshold=2.454e+03, percent-clipped=9.0
2024-06-20 01:45:41,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=105537.66666666667, ans=0.125
2024-06-20 01:45:45,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=105556.0, ans=0.0
2024-06-20 01:45:46,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=105556.0, ans=0.0
2024-06-20 01:45:46,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.70 vs. limit=15.0
2024-06-20 01:45:49,423 INFO [train.py:1028] (1/2) Epoch 6, batch 7000, loss[loss=0.3571, simple_loss=0.3651, pruned_loss=0.1746, over 12887.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3441, pruned_loss=0.1554, over 2574971.43 frames. ], batch size: 158, lr: 9.52e-03, grad_scale: 8.0
2024-06-20 01:45:51,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0
2024-06-20 01:45:52,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.69 vs. limit=15.0
2024-06-20 01:45:53,088 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.14 vs. limit=15.0
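The grad_scale value printed at the end of each tot_loss line above is the dynamic fp16 loss scale, and it wanders between 1.0 and 16.0 over this stretch of the epoch. The log never shows the loop that produces it, so the following is only a generic PyTorch AMP sketch (an assumed analogue, not code from this train.py): the scaler halves its scale whenever a step overflows and grows it back after a run of clean steps.

import torch

# Minimal AMP sketch; model/optimizer are illustrative stand-ins.
model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

for step in range(3):
    x = torch.randn(8, 10, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # halves on overflow, grows after clean runs
    print(step, scaler.get_scale())  # the kind of value logged as grad_scale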
2024-06-20 01:45:56,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105592.66666666667, ans=0.1
2024-06-20 01:45:59,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=105592.66666666667, ans=0.125
2024-06-20 01:46:03,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=105611.0, ans=0.2
2024-06-20 01:46:10,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105629.33333333333, ans=0.1
2024-06-20 01:46:17,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105647.66666666667, ans=0.1
2024-06-20 01:46:26,606 INFO [train.py:1028] (1/2) Epoch 6, batch 7050, loss[loss=0.3542, simple_loss=0.3624, pruned_loss=0.173, over 12778.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3451, pruned_loss=0.1557, over 2581728.93 frames. ], batch size: 176, lr: 9.52e-03, grad_scale: 1.0
2024-06-20 01:46:29,958 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.67 vs. limit=15.0
2024-06-20 01:46:35,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=105684.33333333333, ans=0.0
2024-06-20 01:46:41,045 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.232e+02
2024-06-20 01:46:43,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=105702.66666666667, ans=0.125
2024-06-20 01:46:48,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=105721.0, ans=0.5
2024-06-20 01:46:53,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=105721.0, ans=0.125
2024-06-20 01:46:54,524 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.469e+02 1.226e+03 1.456e+03 1.740e+03 3.307e+03, threshold=2.912e+03, percent-clipped=4.0
2024-06-20 01:46:56,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=105739.33333333333, ans=0.0
2024-06-20 01:46:59,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105739.33333333333, ans=0.1
2024-06-20 01:47:00,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=105739.33333333333, ans=0.125
2024-06-20 01:47:01,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105757.66666666667, ans=0.1
2024-06-20 01:47:02,469 INFO [train.py:1028] (1/2) Epoch 6, batch 7100, loss[loss=0.3632, simple_loss=0.3727, pruned_loss=0.1768, over 13211.00 frames. ], tot_loss[loss=0.3307, simple_loss=0.3467, pruned_loss=0.1573, over 2575067.19 frames. ], batch size: 112, lr: 9.52e-03, grad_scale: 2.0
2024-06-20 01:47:09,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=105776.0, ans=0.05
2024-06-20 01:47:13,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105776.0, ans=0.1
2024-06-20 01:47:14,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=105776.0, ans=0.125
2024-06-20 01:47:16,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0
2024-06-20 01:47:20,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.90 vs. limit=22.5
2024-06-20 01:47:35,684 INFO [train.py:1028] (1/2) Epoch 6, batch 7150, loss[loss=0.4024, simple_loss=0.3962, pruned_loss=0.2043, over 12559.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3473, pruned_loss=0.1575, over 2573447.78 frames. ], batch size: 202, lr: 9.51e-03, grad_scale: 2.0
2024-06-20 01:47:38,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.98 vs. limit=15.0
2024-06-20 01:47:42,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=105867.66666666667, ans=0.0
2024-06-20 01:47:44,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=105867.66666666667, ans=0.125
2024-06-20 01:47:45,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=105867.66666666667, ans=0.125
2024-06-20 01:47:48,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=105886.0, ans=0.025
2024-06-20 01:47:54,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=105886.0, ans=0.2
2024-06-20 01:47:57,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=105904.33333333333, ans=0.1
2024-06-20 01:48:00,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.638e+02 9.300e+02 1.129e+03 1.295e+03 2.373e+03, threshold=2.257e+03, percent-clipped=0.0
2024-06-20 01:48:01,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=105922.66666666667, ans=0.07
2024-06-20 01:48:08,471 INFO [train.py:1028] (1/2) Epoch 6, batch 7200, loss[loss=0.3666, simple_loss=0.3785, pruned_loss=0.1774, over 13142.00 frames. ], tot_loss[loss=0.3312, simple_loss=0.3477, pruned_loss=0.1573, over 2578264.48 frames. ], batch size: 112, lr: 9.51e-03, grad_scale: 4.0
2024-06-20 01:48:10,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=105941.0, ans=0.0
2024-06-20 01:48:13,348 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.38 vs. limit=15.0
2024-06-20 01:48:17,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=105959.33333333333, ans=0.0
2024-06-20 01:48:38,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.96 vs. limit=15.0
2024-06-20 01:48:40,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106014.33333333333, ans=0.1
2024-06-20 01:48:42,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=106014.33333333333, ans=0.125
2024-06-20 01:48:43,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=106014.33333333333, ans=0.2
2024-06-20 01:48:44,595 INFO [train.py:1028] (1/2) Epoch 6, batch 7250, loss[loss=0.3121, simple_loss=0.3388, pruned_loss=0.1427, over 12911.00 frames. ], tot_loss[loss=0.3319, simple_loss=0.3487, pruned_loss=0.1576, over 2579360.42 frames. ], batch size: 36, lr: 9.50e-03, grad_scale: 4.0
2024-06-20 01:48:51,214 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=7.338e+01
2024-06-20 01:48:51,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=106051.0, ans=0.125
2024-06-20 01:48:52,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=106051.0, ans=0.1
2024-06-20 01:48:53,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=106051.0, ans=0.09899494936611666
2024-06-20 01:49:04,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=106069.33333333333, ans=0.125
2024-06-20 01:49:09,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=106087.66666666667, ans=0.125
2024-06-20 01:49:13,544 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 8.240e+02 9.829e+02 1.196e+03 1.932e+03, threshold=1.966e+03, percent-clipped=0.0
2024-06-20 01:49:18,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.16 vs. limit=15.0
2024-06-20 01:49:21,326 INFO [train.py:1028] (1/2) Epoch 6, batch 7300, loss[loss=0.3378, simple_loss=0.3552, pruned_loss=0.1603, over 12837.00 frames. ], tot_loss[loss=0.3332, simple_loss=0.35, pruned_loss=0.1582, over 2579256.93 frames. ], batch size: 36, lr: 9.50e-03, grad_scale: 4.0
2024-06-20 01:49:26,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=15.0
2024-06-20 01:49:36,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=15.0
2024-06-20 01:49:36,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=106161.0, ans=0.125
2024-06-20 01:49:38,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=106161.0, ans=0.0
2024-06-20 01:49:43,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=106179.33333333333, ans=0.125
2024-06-20 01:49:53,703 INFO [train.py:1028] (1/2) Epoch 6, batch 7350, loss[loss=0.3746, simple_loss=0.3843, pruned_loss=0.1825, over 13288.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3508, pruned_loss=0.1586, over 2580269.23 frames. ], batch size: 46, lr: 9.50e-03, grad_scale: 2.0
2024-06-20 01:50:06,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106252.66666666667, ans=0.1
2024-06-20 01:50:06,310 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.09 vs. limit=15.0
2024-06-20 01:50:19,368 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0
2024-06-20 01:50:19,651 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.852e+02 8.095e+02 9.523e+02 1.084e+03 2.294e+03, threshold=1.905e+03, percent-clipped=1.0
2024-06-20 01:50:26,117 INFO [train.py:1028] (1/2) Epoch 6, batch 7400, loss[loss=0.3466, simple_loss=0.3726, pruned_loss=0.1603, over 13235.00 frames. ], tot_loss[loss=0.3336, simple_loss=0.3508, pruned_loss=0.1582, over 2586390.26 frames. ], batch size: 63, lr: 9.49e-03, grad_scale: 4.0
2024-06-20 01:50:32,843 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.30 vs. limit=15.0
2024-06-20 01:50:50,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=106362.66666666667, ans=0.0
2024-06-20 01:50:51,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=106362.66666666667, ans=0.1
2024-06-20 01:50:53,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106362.66666666667, ans=0.1
2024-06-20 01:50:53,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.04 vs. limit=15.0
2024-06-20 01:50:54,595 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.36 vs. limit=15.0
2024-06-20 01:50:57,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=106381.0, ans=0.125
2024-06-20 01:51:06,277 INFO [train.py:1028] (1/2) Epoch 6, batch 7450, loss[loss=0.2894, simple_loss=0.317, pruned_loss=0.1309, over 12626.00 frames. ], tot_loss[loss=0.3331, simple_loss=0.3504, pruned_loss=0.1579, over 2580169.48 frames. ], batch size: 29, lr: 9.49e-03, grad_scale: 2.0
2024-06-20 01:51:15,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=106417.66666666667, ans=0.0
2024-06-20 01:51:21,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=106436.0, ans=0.035
2024-06-20 01:51:27,724 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.473e+01
2024-06-20 01:51:32,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=106454.33333333333, ans=22.5
2024-06-20 01:51:33,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.93 vs. limit=10.0
2024-06-20 01:51:34,169 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.406e+02 8.515e+02 9.635e+02 1.171e+03 3.444e+03, threshold=1.927e+03, percent-clipped=3.0
2024-06-20 01:51:40,261 INFO [train.py:1028] (1/2) Epoch 6, batch 7500, loss[loss=0.3427, simple_loss=0.3437, pruned_loss=0.1709, over 10584.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3523, pruned_loss=0.1592, over 2577078.46 frames. ], batch size: 304, lr: 9.48e-03, grad_scale: 4.0
2024-06-20 01:51:44,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=106491.0, ans=0.025
2024-06-20 01:51:45,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=106491.0, ans=0.125
2024-06-20 01:51:53,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.63 vs. limit=10.0
2024-06-20 01:52:04,894 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.08 vs. limit=15.0
2024-06-20 01:52:05,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=106546.0, ans=0.2
2024-06-20 01:52:12,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=106564.33333333333, ans=0.125
2024-06-20 01:52:13,426 INFO [train.py:1028] (1/2) Epoch 6, batch 7550, loss[loss=0.3329, simple_loss=0.3393, pruned_loss=0.1632, over 12934.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.3531, pruned_loss=0.1603, over 2576979.57 frames. ], batch size: 158, lr: 9.48e-03, grad_scale: 4.0
2024-06-20 01:52:35,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.94 vs. limit=10.0
2024-06-20 01:52:38,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=106637.66666666667, ans=0.125
2024-06-20 01:52:43,569 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 5.460e+02 7.838e+02 8.819e+02 9.949e+02 1.629e+03, threshold=1.764e+03, percent-clipped=0.0
2024-06-20 01:52:49,166 INFO [train.py:1028] (1/2) Epoch 6, batch 7600, loss[loss=0.3665, simple_loss=0.3747, pruned_loss=0.1791, over 13185.00 frames. ], tot_loss[loss=0.3371, simple_loss=0.3535, pruned_loss=0.1604, over 2576388.59 frames. ], batch size: 83, lr: 9.48e-03, grad_scale: 8.0
2024-06-20 01:53:04,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=106692.66666666667, ans=0.125
2024-06-20 01:53:05,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=106711.0, ans=0.0
2024-06-20 01:53:07,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=106711.0, ans=0.125
2024-06-20 01:53:13,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=106729.33333333333, ans=0.0
2024-06-20 01:53:17,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=106729.33333333333, ans=0.125
2024-06-20 01:53:17,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106729.33333333333, ans=0.1
2024-06-20 01:53:19,343 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.54 vs. limit=6.0
2024-06-20 01:53:25,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106766.0, ans=0.1
2024-06-20 01:53:25,857 INFO [train.py:1028] (1/2) Epoch 6, batch 7650, loss[loss=0.3105, simple_loss=0.3312, pruned_loss=0.1449, over 12920.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3543, pruned_loss=0.1605, over 2572829.01 frames. ], batch size: 33, lr: 9.47e-03, grad_scale: 4.0
2024-06-20 01:53:30,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=106766.0, ans=0.025
2024-06-20 01:53:34,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.81 vs. limit=22.5
2024-06-20 01:53:47,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=106821.0, ans=0.125
2024-06-20 01:53:52,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.87 vs. limit=22.5
2024-06-20 01:53:53,490 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.245e+02 8.799e+02 1.036e+03 1.566e+03, threshold=1.760e+03, percent-clipped=0.0
2024-06-20 01:53:58,977 INFO [train.py:1028] (1/2) Epoch 6, batch 7700, loss[loss=0.3509, simple_loss=0.3743, pruned_loss=0.1638, over 13265.00 frames. ], tot_loss[loss=0.3375, simple_loss=0.3543, pruned_loss=0.1603, over 2569436.09 frames. ], batch size: 63, lr: 9.47e-03, grad_scale: 8.0
2024-06-20 01:54:20,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.34 vs. limit=15.0
2024-06-20 01:54:28,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106931.0, ans=0.1
2024-06-20 01:54:32,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=106931.0, ans=0.0
2024-06-20 01:54:33,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=106931.0, ans=0.0
2024-06-20 01:54:35,137 INFO [train.py:1028] (1/2) Epoch 6, batch 7750, loss[loss=0.3283, simple_loss=0.3466, pruned_loss=0.155, over 13070.00 frames. ], tot_loss[loss=0.3398, simple_loss=0.3559, pruned_loss=0.1619, over 2573549.04 frames. ], batch size: 71, lr: 9.46e-03, grad_scale: 8.0
2024-06-20 01:54:42,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=106967.66666666667, ans=0.125
2024-06-20 01:54:43,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=106967.66666666667, ans=0.125
2024-06-20 01:54:55,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=106986.0, ans=0.0
2024-06-20 01:54:59,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.22 vs. limit=15.0
2024-06-20 01:54:59,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=107004.33333333333, ans=0.125
2024-06-20 01:55:06,117 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.436e+02 6.608e+02 7.848e+02 9.254e+02 1.503e+03, threshold=1.570e+03, percent-clipped=0.0
2024-06-20 01:55:11,327 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 01:55:11,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=107041.0, ans=0.0
2024-06-20 01:55:11,763 INFO [train.py:1028] (1/2) Epoch 6, batch 7800, loss[loss=0.3135, simple_loss=0.3351, pruned_loss=0.1459, over 13186.00 frames. ], tot_loss[loss=0.3382, simple_loss=0.3552, pruned_loss=0.1606, over 2579123.25 frames. ], batch size: 95, lr: 9.46e-03, grad_scale: 8.0
2024-06-20 01:55:19,384 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.90 vs. limit=15.0
2024-06-20 01:55:22,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=107059.33333333333, ans=0.125
2024-06-20 01:55:22,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=107059.33333333333, ans=10.0
2024-06-20 01:55:37,643 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0
2024-06-20 01:55:45,002 INFO [train.py:1028] (1/2) Epoch 6, batch 7850, loss[loss=0.3045, simple_loss=0.3272, pruned_loss=0.1409, over 11262.00 frames. ], tot_loss[loss=0.339, simple_loss=0.3558, pruned_loss=0.1611, over 2573203.47 frames. ], batch size: 16, lr: 9.46e-03, grad_scale: 4.0
2024-06-20 01:55:45,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=107132.66666666667, ans=0.125
2024-06-20 01:55:45,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=107132.66666666667, ans=0.0
2024-06-20 01:55:51,337 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0
2024-06-20 01:55:54,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=107151.0, ans=0.125
2024-06-20 01:55:56,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=107151.0, ans=0.125
2024-06-20 01:55:56,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=107151.0, ans=0.09899494936611666
2024-06-20 01:55:58,598 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.58 vs. limit=12.0
2024-06-20 01:55:58,625 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.10 vs. limit=22.5
2024-06-20 01:56:03,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=107169.33333333333, ans=0.0
2024-06-20 01:56:04,142 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.95 vs. limit=12.0
2024-06-20 01:56:12,839 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.744e+02 5.089e+02 6.002e+02 7.106e+02 1.254e+03, threshold=1.200e+03, percent-clipped=0.0
2024-06-20 01:56:20,822 INFO [train.py:1028] (1/2) Epoch 6, batch 7900, loss[loss=0.3039, simple_loss=0.3307, pruned_loss=0.1386, over 13209.00 frames. ], tot_loss[loss=0.338, simple_loss=0.3551, pruned_loss=0.1604, over 2572299.60 frames. ], batch size: 77, lr: 9.45e-03, grad_scale: 8.0
2024-06-20 01:56:23,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=107224.33333333333, ans=0.0
2024-06-20 01:56:26,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=107224.33333333333, ans=0.125
2024-06-20 01:56:31,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.04 vs. limit=10.0
2024-06-20 01:56:34,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.96 vs. limit=15.0
2024-06-20 01:56:41,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=107279.33333333333, ans=0.125
2024-06-20 01:56:44,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=107279.33333333333, ans=0.125
2024-06-20 01:56:49,322 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.28 vs. limit=15.0
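A pattern worth noting in the Clipping_scale warnings above: the five numbers labelled grad-norm quartiles read as the (min, 25%, 50%, 75%, max) of recently observed gradient norms, and the printed threshold consistently equals Clipping_scale times the median, e.g. 2.0 * 6.002e+02 = 1.200e+03 in the 01:56:12,839 warning just above. A small check (the norm values are copied from that warning; everything else is illustrative, not icefall's optim.py):

import numpy as np

clipping_scale = 2.0
# Gradient-norm quartiles copied from the 01:56:12,839 WARNING above.
grad_norms = np.array([374.4, 508.9, 600.2, 710.6, 1254.0])

quartiles = np.percentile(grad_norms, [0, 25, 50, 75, 100])
threshold = clipping_scale * quartiles[2]  # scale times the median
print(quartiles)   # [ 374.4  508.9  600.2  710.6 1254. ]
print(threshold)   # 1200.4, matching the logged threshold=1.200e+03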
limit=15.0 2024-06-20 01:56:55,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=107297.66666666667, ans=0.0 2024-06-20 01:56:56,787 INFO [train.py:1028] (1/2) Epoch 6, batch 7950, loss[loss=0.3295, simple_loss=0.3394, pruned_loss=0.1598, over 10583.00 frames. ], tot_loss[loss=0.338, simple_loss=0.3552, pruned_loss=0.1604, over 2575331.04 frames. ], batch size: 303, lr: 9.45e-03, grad_scale: 8.0 2024-06-20 01:57:09,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=107352.66666666667, ans=0.2 2024-06-20 01:57:11,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2024-06-20 01:57:13,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=107352.66666666667, ans=0.125 2024-06-20 01:57:14,556 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0 2024-06-20 01:57:15,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=107352.66666666667, ans=0.0 2024-06-20 01:57:19,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=107371.0, ans=0.0 2024-06-20 01:57:21,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=107371.0, ans=0.125 2024-06-20 01:57:23,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.66 vs. limit=22.5 2024-06-20 01:57:24,896 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 5.199e+02 5.932e+02 7.051e+02 1.141e+03, threshold=1.186e+03, percent-clipped=0.0 2024-06-20 01:57:25,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.72 vs. limit=15.0 2024-06-20 01:57:29,769 INFO [train.py:1028] (1/2) Epoch 6, batch 8000, loss[loss=0.2788, simple_loss=0.3112, pruned_loss=0.1232, over 12745.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3563, pruned_loss=0.1606, over 2572574.98 frames. ], batch size: 29, lr: 9.44e-03, grad_scale: 16.0 2024-06-20 01:57:45,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=107444.33333333333, ans=0.0 2024-06-20 01:57:50,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=107462.66666666667, ans=0.04949747468305833 2024-06-20 01:57:58,056 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.658e-01 2024-06-20 01:57:58,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=107481.0, ans=0.0 2024-06-20 01:58:03,384 INFO [train.py:1028] (1/2) Epoch 6, batch 8050, loss[loss=0.3269, simple_loss=0.3494, pruned_loss=0.1522, over 13201.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3555, pruned_loss=0.1599, over 2572042.62 frames. 
], batch size: 83, lr: 9.44e-03, grad_scale: 8.0 2024-06-20 01:58:05,056 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=15.0 2024-06-20 01:58:05,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=107499.33333333333, ans=0.125 2024-06-20 01:58:07,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=107499.33333333333, ans=0.0 2024-06-20 01:58:24,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=107536.0, ans=0.2 2024-06-20 01:58:35,048 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.233e+02 6.502e+02 7.648e+02 8.723e+02 1.499e+03, threshold=1.530e+03, percent-clipped=4.0 2024-06-20 01:58:38,872 INFO [train.py:1028] (1/2) Epoch 6, batch 8100, loss[loss=0.3173, simple_loss=0.3462, pruned_loss=0.1442, over 13150.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3557, pruned_loss=0.1599, over 2576223.05 frames. ], batch size: 112, lr: 9.44e-03, grad_scale: 8.0 2024-06-20 01:59:00,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=107627.66666666667, ans=0.07 2024-06-20 01:59:02,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=107646.0, ans=0.125 2024-06-20 01:59:03,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=107646.0, ans=10.0 2024-06-20 01:59:12,011 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.33 vs. limit=10.0 2024-06-20 01:59:15,358 INFO [train.py:1028] (1/2) Epoch 6, batch 8150, loss[loss=0.3075, simple_loss=0.3291, pruned_loss=0.1429, over 13146.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3554, pruned_loss=0.159, over 2579079.33 frames. ], batch size: 121, lr: 9.43e-03, grad_scale: 2.0 2024-06-20 01:59:40,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=107737.66666666667, ans=0.125 2024-06-20 01:59:40,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=107737.66666666667, ans=0.0 2024-06-20 01:59:45,695 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 4.222e+02 5.928e+02 6.980e+02 8.576e+02 2.219e+03, threshold=1.396e+03, percent-clipped=5.0 2024-06-20 01:59:46,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=107756.0, ans=0.125 2024-06-20 01:59:48,438 INFO [train.py:1028] (1/2) Epoch 6, batch 8200, loss[loss=0.3643, simple_loss=0.3758, pruned_loss=0.1764, over 13152.00 frames. ], tot_loss[loss=0.3381, simple_loss=0.3564, pruned_loss=0.1598, over 2582841.90 frames. 
], batch size: 112, lr: 9.43e-03, grad_scale: 4.0 2024-06-20 01:59:54,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107774.33333333333, ans=0.1 2024-06-20 02:00:22,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=107847.66666666667, ans=0.0 2024-06-20 02:00:25,905 INFO [train.py:1028] (1/2) Epoch 6, batch 8250, loss[loss=0.2964, simple_loss=0.3223, pruned_loss=0.1353, over 13238.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3555, pruned_loss=0.159, over 2583694.11 frames. ], batch size: 52, lr: 9.42e-03, grad_scale: 2.0 2024-06-20 02:00:27,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=107866.0, ans=0.2 2024-06-20 02:00:28,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=107866.0, ans=0.125 2024-06-20 02:00:35,056 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=12.0 2024-06-20 02:00:42,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.18 vs. limit=12.0 2024-06-20 02:01:00,039 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.857e+02 5.541e+02 6.226e+02 7.589e+02 1.070e+03, threshold=1.245e+03, percent-clipped=0.0 2024-06-20 02:01:00,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=12.0 2024-06-20 02:01:02,104 INFO [train.py:1028] (1/2) Epoch 6, batch 8300, loss[loss=0.3179, simple_loss=0.3381, pruned_loss=0.1489, over 13043.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3543, pruned_loss=0.1582, over 2581635.61 frames. ], batch size: 102, lr: 9.42e-03, grad_scale: 4.0 2024-06-20 02:01:02,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=107957.66666666667, ans=0.125 2024-06-20 02:01:04,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=107957.66666666667, ans=0.125 2024-06-20 02:01:27,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=30.19 vs. limit=22.5 2024-06-20 02:01:33,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=108031.0, ans=0.07 2024-06-20 02:01:34,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.98 vs. limit=15.0 2024-06-20 02:01:35,720 INFO [train.py:1028] (1/2) Epoch 6, batch 8350, loss[loss=0.3228, simple_loss=0.3422, pruned_loss=0.1517, over 13135.00 frames. ], tot_loss[loss=0.3342, simple_loss=0.3538, pruned_loss=0.1573, over 2581060.69 frames. 
], batch size: 112, lr: 9.42e-03, grad_scale: 4.0 2024-06-20 02:01:36,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=108049.33333333333, ans=0.125 2024-06-20 02:01:40,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=108049.33333333333, ans=0.2 2024-06-20 02:01:50,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=108086.0, ans=0.035 2024-06-20 02:01:51,042 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.86 vs. limit=15.0 2024-06-20 02:02:07,085 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.170e+02 4.932e+02 5.610e+02 6.475e+02 9.613e+02, threshold=1.122e+03, percent-clipped=0.0 2024-06-20 02:02:08,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=108141.0, ans=0.2 2024-06-20 02:02:09,201 INFO [train.py:1028] (1/2) Epoch 6, batch 8400, loss[loss=0.2885, simple_loss=0.3201, pruned_loss=0.1285, over 12965.00 frames. ], tot_loss[loss=0.3355, simple_loss=0.3546, pruned_loss=0.1582, over 2577132.10 frames. ], batch size: 39, lr: 9.41e-03, grad_scale: 8.0 2024-06-20 02:02:17,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=108159.33333333333, ans=0.1 2024-06-20 02:02:29,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=108177.66666666667, ans=0.025 2024-06-20 02:02:39,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=108214.33333333333, ans=0.2 2024-06-20 02:02:45,743 INFO [train.py:1028] (1/2) Epoch 6, batch 8450, loss[loss=0.3486, simple_loss=0.3738, pruned_loss=0.1617, over 13156.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3555, pruned_loss=0.1583, over 2578538.21 frames. ], batch size: 112, lr: 9.41e-03, grad_scale: 8.0 2024-06-20 02:02:50,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=108232.66666666667, ans=0.125 2024-06-20 02:02:57,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=108251.0, ans=0.125 2024-06-20 02:03:01,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=108269.33333333333, ans=0.125 2024-06-20 02:03:02,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=108269.33333333333, ans=0.125 2024-06-20 02:03:04,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.22 vs. 
limit=10.0 2024-06-20 02:03:07,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=108269.33333333333, ans=0.2 2024-06-20 02:03:08,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=108287.66666666667, ans=0.125 2024-06-20 02:03:11,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=108287.66666666667, ans=0.125 2024-06-20 02:03:20,876 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.895e+02 5.939e+02 6.806e+02 7.908e+02 1.112e+03, threshold=1.361e+03, percent-clipped=0.0 2024-06-20 02:03:22,862 INFO [train.py:1028] (1/2) Epoch 6, batch 8500, loss[loss=0.2976, simple_loss=0.3245, pruned_loss=0.1353, over 12652.00 frames. ], tot_loss[loss=0.3378, simple_loss=0.3571, pruned_loss=0.1593, over 2577096.66 frames. ], batch size: 29, lr: 9.41e-03, grad_scale: 8.0 2024-06-20 02:03:22,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=108324.33333333333, ans=0.125 2024-06-20 02:03:26,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=108324.33333333333, ans=0.125 2024-06-20 02:03:50,244 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.642e+01 2024-06-20 02:03:56,153 INFO [train.py:1028] (1/2) Epoch 6, batch 8550, loss[loss=0.3046, simple_loss=0.3358, pruned_loss=0.1367, over 12594.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3564, pruned_loss=0.1585, over 2576178.98 frames. ], batch size: 22, lr: 9.40e-03, grad_scale: 8.0 2024-06-20 02:04:02,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=108434.33333333333, ans=0.0 2024-06-20 02:04:06,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=108434.33333333333, ans=0.0 2024-06-20 02:04:14,275 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.94 vs. limit=22.5 2024-06-20 02:04:24,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=108489.33333333333, ans=0.2 2024-06-20 02:04:26,997 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.304e+02 4.970e+02 5.873e+02 6.995e+02 1.256e+03, threshold=1.175e+03, percent-clipped=0.0 2024-06-20 02:04:32,529 INFO [train.py:1028] (1/2) Epoch 6, batch 8600, loss[loss=0.3048, simple_loss=0.3236, pruned_loss=0.143, over 13062.00 frames. ], tot_loss[loss=0.3375, simple_loss=0.3574, pruned_loss=0.1588, over 2574232.77 frames. ], batch size: 121, lr: 9.40e-03, grad_scale: 8.0 2024-06-20 02:04:52,277 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:04:56,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=6.0 2024-06-20 02:04:58,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.52 vs. 
limit=10.0 2024-06-20 02:05:00,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=108562.66666666667, ans=0.125 2024-06-20 02:05:07,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108581.0, ans=0.1 2024-06-20 02:05:08,940 INFO [train.py:1028] (1/2) Epoch 6, batch 8650, loss[loss=0.2987, simple_loss=0.3284, pruned_loss=0.1345, over 13014.00 frames. ], tot_loss[loss=0.3386, simple_loss=0.3588, pruned_loss=0.1592, over 2576678.52 frames. ], batch size: 102, lr: 9.39e-03, grad_scale: 8.0 2024-06-20 02:05:09,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=108599.33333333333, ans=0.125 2024-06-20 02:05:09,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=108599.33333333333, ans=0.125 2024-06-20 02:05:15,055 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.56 vs. limit=15.0 2024-06-20 02:05:26,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=108636.0, ans=0.05 2024-06-20 02:05:29,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=108654.33333333333, ans=0.09899494936611666 2024-06-20 02:05:33,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=108654.33333333333, ans=0.125 2024-06-20 02:05:39,569 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 3.415e+02 4.584e+02 5.124e+02 5.861e+02 9.283e+02, threshold=1.025e+03, percent-clipped=0.0 2024-06-20 02:05:41,806 INFO [train.py:1028] (1/2) Epoch 6, batch 8700, loss[loss=0.3459, simple_loss=0.3642, pruned_loss=0.1638, over 13201.00 frames. ], tot_loss[loss=0.3406, simple_loss=0.3603, pruned_loss=0.1605, over 2574605.62 frames. ], batch size: 59, lr: 9.39e-03, grad_scale: 8.0 2024-06-20 02:05:45,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108691.0, ans=0.1 2024-06-20 02:05:58,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108727.66666666667, ans=0.1 2024-06-20 02:06:07,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=108746.0, ans=0.0 2024-06-20 02:06:10,314 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. limit=6.0 2024-06-20 02:06:13,174 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.85 vs. limit=22.5 2024-06-20 02:06:15,336 INFO [train.py:1028] (1/2) Epoch 6, batch 8750, loss[loss=0.3535, simple_loss=0.3639, pruned_loss=0.1716, over 13080.00 frames. ], tot_loss[loss=0.3404, simple_loss=0.36, pruned_loss=0.1604, over 2570392.60 frames. 
2024-06-20 02:06:15,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=108782.66666666667, ans=0.125
2024-06-20 02:06:17,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=108782.66666666667, ans=0.125
2024-06-20 02:06:20,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0
2024-06-20 02:06:20,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.62 vs. limit=22.5
2024-06-20 02:06:37,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=108837.66666666667, ans=0.125
2024-06-20 02:06:54,417 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.692e+02 3.906e+02 4.666e+02 5.529e+02 8.216e+02, threshold=9.332e+02, percent-clipped=0.0
2024-06-20 02:06:56,327 INFO [train.py:1028] (1/2) Epoch 6, batch 8800, loss[loss=0.3212, simple_loss=0.3456, pruned_loss=0.1484, over 13233.00 frames. ], tot_loss[loss=0.3392, simple_loss=0.3591, pruned_loss=0.1597, over 2575534.07 frames. ], batch size: 72, lr: 9.38e-03, grad_scale: 16.0
2024-06-20 02:06:59,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=108874.33333333333, ans=0.125
2024-06-20 02:07:07,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.39 vs. limit=15.0
2024-06-20 02:07:16,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=108929.33333333333, ans=0.0
2024-06-20 02:07:16,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=108929.33333333333, ans=0.125
2024-06-20 02:07:27,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=108947.66666666667, ans=0.2
2024-06-20 02:07:29,768 INFO [train.py:1028] (1/2) Epoch 6, batch 8850, loss[loss=0.3778, simple_loss=0.3781, pruned_loss=0.1888, over 12562.00 frames. ], tot_loss[loss=0.3389, simple_loss=0.3584, pruned_loss=0.1597, over 2566090.80 frames. ], batch size: 202, lr: 9.38e-03, grad_scale: 16.0
2024-06-20 02:07:31,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=108966.0, ans=0.2
2024-06-20 02:07:31,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=108966.0, ans=0.2
2024-06-20 02:07:31,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=108966.0, ans=0.0
2024-06-20 02:07:41,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=108984.33333333333, ans=0.125
2024-06-20 02:07:41,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0
2024-06-20 02:07:42,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=108984.33333333333, ans=0.0
2024-06-20 02:07:44,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109002.66666666667, ans=0.1
2024-06-20 02:07:51,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.07 vs. limit=6.0
2024-06-20 02:07:52,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=19.60 vs. limit=15.0
2024-06-20 02:07:54,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=109021.0, ans=0.0
2024-06-20 02:08:00,553 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.671e+02 3.803e+02 4.332e+02 4.868e+02 1.024e+03, threshold=8.664e+02, percent-clipped=1.0
2024-06-20 02:08:02,450 INFO [train.py:1028] (1/2) Epoch 6, batch 8900, loss[loss=0.3422, simple_loss=0.3513, pruned_loss=0.1666, over 12908.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.3589, pruned_loss=0.1604, over 2564773.67 frames. ], batch size: 33, lr: 9.37e-03, grad_scale: 16.0
2024-06-20 02:08:03,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=109057.66666666667, ans=0.125
2024-06-20 02:08:05,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=109057.66666666667, ans=0.0
2024-06-20 02:08:23,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=109094.33333333333, ans=0.09899494936611666
2024-06-20 02:08:24,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=109112.66666666667, ans=0.125
2024-06-20 02:08:25,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=109112.66666666667, ans=0.2
2024-06-20 02:08:37,834 INFO [train.py:1028] (1/2) Epoch 6, batch 8950, loss[loss=0.3928, simple_loss=0.3978, pruned_loss=0.1938, over 12537.00 frames. ], tot_loss[loss=0.3392, simple_loss=0.3587, pruned_loss=0.1599, over 2565166.17 frames. ], batch size: 202, lr: 9.37e-03, grad_scale: 8.0
2024-06-20 02:08:46,665 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.71 vs. limit=22.5
2024-06-20 02:08:47,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=109167.66666666667, ans=0.015
2024-06-20 02:08:54,551 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:08:56,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.16 vs. limit=15.0
2024-06-20 02:09:03,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=109204.33333333333, ans=0.125
2024-06-20 02:09:05,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109204.33333333333, ans=0.1
2024-06-20 02:09:12,649 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.779e+02 3.551e+02 3.960e+02 4.545e+02 7.311e+02, threshold=7.920e+02, percent-clipped=0.0
2024-06-20 02:09:14,082 INFO [train.py:1028] (1/2) Epoch 6, batch 9000, loss[loss=0.3255, simple_loss=0.3498, pruned_loss=0.1506, over 13287.00 frames. ], tot_loss[loss=0.338, simple_loss=0.3583, pruned_loss=0.1588, over 2570830.22 frames. ], batch size: 46, lr: 9.37e-03, grad_scale: 8.0
2024-06-20 02:09:14,082 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 02:09:21,754 INFO [train.py:1060] (1/2) Epoch 6, validation: loss=0.223, simple_loss=0.2821, pruned_loss=0.08192, over 351949.00 frames.
2024-06-20 02:09:21,755 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB
2024-06-20 02:09:33,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109259.33333333333, ans=0.1
2024-06-20 02:09:35,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.20 vs. limit=15.0
2024-06-20 02:09:37,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.01 vs. limit=5.0
2024-06-20 02:09:44,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=109296.0, ans=0.2
2024-06-20 02:09:47,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.93 vs. limit=12.0
2024-06-20 02:09:53,793 INFO [train.py:1028] (1/2) Epoch 6, batch 9050, loss[loss=0.2937, simple_loss=0.3295, pruned_loss=0.1289, over 11418.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3584, pruned_loss=0.1585, over 2569266.62 frames. ], batch size: 17, lr: 9.36e-03, grad_scale: 8.0
2024-06-20 02:09:55,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109332.66666666667, ans=0.1
2024-06-20 02:10:02,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=109351.0, ans=0.0
2024-06-20 02:10:04,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=109351.0, ans=0.125
2024-06-20 02:10:13,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=109387.66666666667, ans=0.025
2024-06-20 02:10:23,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=13.07 vs. limit=12.0
2024-06-20 02:10:24,436 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.085e+02 3.564e+02 4.071e+02 1.282e+03, threshold=7.128e+02, percent-clipped=1.0
2024-06-20 02:10:25,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=109424.33333333333, ans=0.0
2024-06-20 02:10:25,806 INFO [train.py:1028] (1/2) Epoch 6, batch 9100, loss[loss=0.3339, simple_loss=0.3573, pruned_loss=0.1552, over 13228.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3576, pruned_loss=0.1574, over 2569870.78 frames. ], batch size: 72, lr: 9.36e-03, grad_scale: 8.0
2024-06-20 02:10:25,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=109424.33333333333, ans=0.2
2024-06-20 02:10:41,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=109461.0, ans=0.025
2024-06-20 02:10:41,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.32 vs. limit=12.0
2024-06-20 02:10:41,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=109461.0, ans=0.1
2024-06-20 02:10:43,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=109461.0, ans=0.125
2024-06-20 02:10:44,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=109479.33333333333, ans=0.0
2024-06-20 02:10:57,924 INFO [train.py:1028] (1/2) Epoch 6, batch 9150, loss[loss=0.3483, simple_loss=0.3684, pruned_loss=0.1641, over 13096.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3585, pruned_loss=0.158, over 2570355.65 frames. ], batch size: 77, lr: 9.35e-03, grad_scale: 8.0
2024-06-20 02:10:59,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=109516.0, ans=0.125
2024-06-20 02:11:02,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=109516.0, ans=0.025
2024-06-20 02:11:07,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=109534.33333333333, ans=0.125
2024-06-20 02:11:21,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=109571.0, ans=0.04949747468305833
2024-06-20 02:11:21,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=109571.0, ans=0.125
2024-06-20 02:11:26,806 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.83 vs. limit=15.0
2024-06-20 02:11:28,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=109589.33333333333, ans=0.0
2024-06-20 02:11:30,535 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0
2024-06-20 02:11:31,425 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.582e+02 4.005e+02 4.544e+02 9.909e+02, threshold=8.010e+02, percent-clipped=1.0
2024-06-20 02:11:32,801 INFO [train.py:1028] (1/2) Epoch 6, batch 9200, loss[loss=0.3286, simple_loss=0.3533, pruned_loss=0.152, over 12889.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3579, pruned_loss=0.1569, over 2572858.04 frames. ], batch size: 36, lr: 9.35e-03, grad_scale: 16.0
2024-06-20 02:11:39,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=109626.0, ans=0.125
2024-06-20 02:11:39,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.51 vs. limit=15.0
2024-06-20 02:11:46,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=109644.33333333333, ans=0.2
2024-06-20 02:11:57,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=109681.0, ans=0.125
2024-06-20 02:12:07,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.80 vs. limit=5.0
2024-06-20 02:12:08,408 INFO [train.py:1028] (1/2) Epoch 6, batch 9250, loss[loss=0.3329, simple_loss=0.3597, pruned_loss=0.153, over 13252.00 frames. ], tot_loss[loss=0.3342, simple_loss=0.357, pruned_loss=0.1557, over 2575313.00 frames. ], batch size: 67, lr: 9.35e-03, grad_scale: 16.0
2024-06-20 02:12:20,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=109736.0, ans=0.125
2024-06-20 02:12:25,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=109736.0, ans=0.125
2024-06-20 02:12:35,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.02 vs. limit=15.0
2024-06-20 02:12:39,315 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.310e+02 3.758e+02 4.156e+02 5.924e+02, threshold=7.517e+02, percent-clipped=0.0
2024-06-20 02:12:40,645 INFO [train.py:1028] (1/2) Epoch 6, batch 9300, loss[loss=0.2917, simple_loss=0.3218, pruned_loss=0.1309, over 12926.00 frames. ], tot_loss[loss=0.3341, simple_loss=0.3569, pruned_loss=0.1556, over 2571071.28 frames. ], batch size: 39, lr: 9.34e-03, grad_scale: 16.0
2024-06-20 02:12:44,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=109791.0, ans=0.0
2024-06-20 02:12:50,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=109809.33333333333, ans=0.2
2024-06-20 02:13:04,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.66 vs. limit=22.5
2024-06-20 02:13:09,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=109864.33333333333, ans=0.0
2024-06-20 02:13:11,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=109882.66666666667, ans=0.025
2024-06-20 02:13:12,115 INFO [train.py:1028] (1/2) Epoch 6, batch 9350, loss[loss=0.3429, simple_loss=0.3665, pruned_loss=0.1596, over 12505.00 frames. ], tot_loss[loss=0.3347, simple_loss=0.3573, pruned_loss=0.1561, over 2567445.73 frames. ], batch size: 22, lr: 9.34e-03, grad_scale: 8.0
2024-06-20 02:13:19,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=109901.0, ans=0.2
2024-06-20 02:13:19,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=109901.0, ans=10.0
2024-06-20 02:13:19,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=109901.0, ans=0.02
2024-06-20 02:13:36,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=109956.0, ans=0.125
2024-06-20 02:13:37,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=109956.0, ans=0.0
2024-06-20 02:13:40,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=109956.0, ans=0.125
2024-06-20 02:13:42,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.58 vs. limit=15.0
2024-06-20 02:13:42,410 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 4.042e+02 4.663e+02 5.220e+02 7.888e+02, threshold=9.325e+02, percent-clipped=1.0
2024-06-20 02:13:42,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=109974.33333333333, ans=0.07
2024-06-20 02:13:43,077 INFO [train.py:1028] (1/2) Epoch 6, batch 9400, loss[loss=0.367, simple_loss=0.3874, pruned_loss=0.1733, over 13271.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3576, pruned_loss=0.1566, over 2566643.49 frames. ], batch size: 52, lr: 9.34e-03, grad_scale: 8.0
2024-06-20 02:13:48,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=109992.66666666667, ans=0.125
2024-06-20 02:14:02,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=110011.0, ans=0.125
2024-06-20 02:14:03,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=110011.0, ans=10.0
2024-06-20 02:14:04,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=110011.0, ans=0.0
2024-06-20 02:14:12,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=110047.66666666667, ans=0.125
2024-06-20 02:14:18,955 INFO [train.py:1028] (1/2) Epoch 6, batch 9450, loss[loss=0.3181, simple_loss=0.352, pruned_loss=0.1421, over 12527.00 frames. ], tot_loss[loss=0.3363, simple_loss=0.3582, pruned_loss=0.1572, over 2568062.02 frames. ], batch size: 22, lr: 9.33e-03, grad_scale: 8.0
2024-06-20 02:14:19,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=110066.0, ans=0.125
2024-06-20 02:14:26,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=110084.33333333333, ans=0.1
2024-06-20 02:14:27,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=110084.33333333333, ans=0.0
2024-06-20 02:14:38,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=110121.0, ans=0.125
2024-06-20 02:14:40,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.62 vs. limit=15.0
2024-06-20 02:14:44,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.21 vs. limit=6.0
2024-06-20 02:14:48,775 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.528e+02 3.809e+02 4.296e+02 5.234e+02 7.435e+02, threshold=8.591e+02, percent-clipped=0.0
2024-06-20 02:14:49,657 INFO [train.py:1028] (1/2) Epoch 6, batch 9500, loss[loss=0.3195, simple_loss=0.3454, pruned_loss=0.1468, over 13223.00 frames. ], tot_loss[loss=0.3348, simple_loss=0.3571, pruned_loss=0.1562, over 2576941.84 frames. ], batch size: 43, lr: 9.33e-03, grad_scale: 8.0
2024-06-20 02:14:50,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=110157.66666666667, ans=0.2
2024-06-20 02:15:06,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=110194.33333333333, ans=0.0
2024-06-20 02:15:17,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=110212.66666666667, ans=0.025
2024-06-20 02:15:24,914 INFO [train.py:1028] (1/2) Epoch 6, batch 9550, loss[loss=0.3054, simple_loss=0.3305, pruned_loss=0.1401, over 13253.00 frames. ], tot_loss[loss=0.335, simple_loss=0.3572, pruned_loss=0.1564, over 2573137.51 frames. ], batch size: 40, lr: 9.32e-03, grad_scale: 8.0
2024-06-20 02:15:25,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.74 vs. limit=22.5
2024-06-20 02:15:28,174 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.70 vs. limit=15.0
2024-06-20 02:15:29,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=110249.33333333333, ans=0.125
2024-06-20 02:15:29,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=110249.33333333333, ans=0.125
2024-06-20 02:15:29,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=110249.33333333333, ans=0.025
2024-06-20 02:15:36,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0
2024-06-20 02:15:37,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=110286.0, ans=0.0
2024-06-20 02:15:39,141 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.84 vs. limit=10.0
2024-06-20 02:15:42,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110286.0, ans=0.1
2024-06-20 02:15:42,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=110286.0, ans=10.0
2024-06-20 02:15:51,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=110322.66666666667, ans=0.0
2024-06-20 02:15:55,690 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.605e+02 4.134e+02 4.891e+02 1.184e+03, threshold=8.268e+02, percent-clipped=1.0
2024-06-20 02:15:56,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.52 vs. limit=15.0
2024-06-20 02:15:56,424 INFO [train.py:1028] (1/2) Epoch 6, batch 9600, loss[loss=0.3547, simple_loss=0.3521, pruned_loss=0.1787, over 10723.00 frames. ], tot_loss[loss=0.3345, simple_loss=0.3564, pruned_loss=0.1563, over 2572776.94 frames. ], batch size: 303, lr: 9.32e-03, grad_scale: 16.0
2024-06-20 02:15:56,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=110341.0, ans=0.0
2024-06-20 02:16:18,318 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=15.0
2024-06-20 02:16:27,781 INFO [train.py:1028] (1/2) Epoch 6, batch 9650, loss[loss=0.3227, simple_loss=0.3405, pruned_loss=0.1525, over 13130.00 frames. ], tot_loss[loss=0.3345, simple_loss=0.3556, pruned_loss=0.1567, over 2564491.17 frames. ], batch size: 132, lr: 9.32e-03, grad_scale: 16.0
2024-06-20 02:16:28,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=110432.66666666667, ans=0.2
2024-06-20 02:16:40,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110469.33333333333, ans=0.1
2024-06-20 02:16:51,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=110487.66666666667, ans=0.0
2024-06-20 02:16:53,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=110506.0, ans=0.025
2024-06-20 02:16:59,488 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.340e+02 3.943e+02 4.395e+02 7.325e+02, threshold=7.886e+02, percent-clipped=0.0
2024-06-20 02:17:00,236 INFO [train.py:1028] (1/2) Epoch 6, batch 9700, loss[loss=0.3345, simple_loss=0.3424, pruned_loss=0.1633, over 13031.00 frames. ], tot_loss[loss=0.3337, simple_loss=0.3546, pruned_loss=0.1564, over 2559559.26 frames. ], batch size: 144, lr: 9.31e-03, grad_scale: 16.0
2024-06-20 02:17:05,497 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.67 vs. limit=15.0
2024-06-20 02:17:08,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=110542.66666666667, ans=0.025
2024-06-20 02:17:11,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0
2024-06-20 02:17:26,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110597.66666666667, ans=0.1
2024-06-20 02:17:32,739 INFO [train.py:1028] (1/2) Epoch 6, batch 9750, loss[loss=0.3167, simple_loss=0.331, pruned_loss=0.1512, over 13103.00 frames. ], tot_loss[loss=0.3327, simple_loss=0.3538, pruned_loss=0.1559, over 2555876.26 frames. ], batch size: 132, lr: 9.31e-03, grad_scale: 16.0
2024-06-20 02:17:34,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=110616.0, ans=0.0
2024-06-20 02:17:44,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=110652.66666666667, ans=0.0
2024-06-20 02:17:46,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=110652.66666666667, ans=0.125
2024-06-20 02:17:48,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=110652.66666666667, ans=0.2
2024-06-20 02:18:02,935 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.463e+02 3.995e+02 4.578e+02 6.727e+02, threshold=7.990e+02, percent-clipped=0.0
2024-06-20 02:18:02,967 INFO [train.py:1028] (1/2) Epoch 6, batch 9800, loss[loss=0.325, simple_loss=0.3481, pruned_loss=0.151, over 12919.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3537, pruned_loss=0.1555, over 2548485.66 frames. ], batch size: 39, lr: 9.31e-03, grad_scale: 8.0
2024-06-20 02:18:03,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110707.66666666667, ans=0.1
2024-06-20 02:18:11,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=110726.0, ans=0.5
2024-06-20 02:18:11,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=110726.0, ans=0.125
2024-06-20 02:18:19,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=110744.33333333333, ans=0.125
2024-06-20 02:18:33,748 INFO [train.py:1028] (1/2) Epoch 6, batch 9850, loss[loss=0.3362, simple_loss=0.3539, pruned_loss=0.1593, over 13000.00 frames. ], tot_loss[loss=0.3325, simple_loss=0.3539, pruned_loss=0.1555, over 2540924.10 frames. ], batch size: 102, lr: 9.30e-03, grad_scale: 8.0
2024-06-20 02:18:37,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=110799.33333333333, ans=0.0
2024-06-20 02:18:43,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=110817.66666666667, ans=0.125
2024-06-20 02:18:45,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.48 vs. limit=15.0
2024-06-20 02:18:55,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=15.0
2024-06-20 02:19:05,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=110872.66666666667, ans=0.125
2024-06-20 02:19:06,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=110891.0, ans=0.05
2024-06-20 02:19:06,700 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.882e+02 3.754e+02 4.247e+02 4.914e+02 1.020e+03, threshold=8.494e+02, percent-clipped=2.0
2024-06-20 02:19:06,734 INFO [train.py:1028] (1/2) Epoch 6, batch 9900, loss[loss=0.3189, simple_loss=0.3483, pruned_loss=0.1447, over 13224.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.3528, pruned_loss=0.1553, over 2532391.14 frames. ], batch size: 40, lr: 9.30e-03, grad_scale: 8.0
2024-06-20 02:19:08,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=110891.0, ans=0.0
2024-06-20 02:19:09,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110891.0, ans=0.1
2024-06-20 02:19:17,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=110909.33333333333, ans=0.125
2024-06-20 02:19:21,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=110927.66666666667, ans=0.125
2024-06-20 02:19:31,643 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.167e+02
2024-06-20 02:19:35,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.18 vs. limit=22.5
2024-06-20 02:19:36,807 INFO [train.py:1028] (1/2) Epoch 6, batch 9950, loss[loss=0.3476, simple_loss=0.3784, pruned_loss=0.1584, over 12739.00 frames. ], tot_loss[loss=0.3315, simple_loss=0.3521, pruned_loss=0.1554, over 2523634.00 frames. ], batch size: 29, lr: 9.29e-03, grad_scale: 8.0
2024-06-20 02:19:37,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=110982.66666666667, ans=0.04949747468305833
2024-06-20 02:19:40,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=110982.66666666667, ans=0.125
2024-06-20 02:19:44,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111001.0, ans=0.1
2024-06-20 02:19:44,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=111001.0, ans=0.2
2024-06-20 02:19:47,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.05 vs. limit=15.0
2024-06-20 02:19:49,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=111019.33333333333, ans=0.025
2024-06-20 02:19:51,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=111019.33333333333, ans=0.125
2024-06-20 02:20:05,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=111056.0, ans=0.0
2024-06-20 02:20:07,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=111056.0, ans=0.125
2024-06-20 02:20:09,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=111056.0, ans=0.0
2024-06-20 02:20:10,177 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.724e+02 3.391e+02 3.752e+02 4.643e+02 7.647e+02, threshold=7.505e+02, percent-clipped=0.0
2024-06-20 02:20:10,207 INFO [train.py:1028] (1/2) Epoch 6, batch 10000, loss[loss=0.3181, simple_loss=0.358, pruned_loss=0.1391, over 12510.00 frames. ], tot_loss[loss=0.332, simple_loss=0.3522, pruned_loss=0.1559, over 2486734.74 frames. ], batch size: 22, lr: 9.29e-03, grad_scale: 16.0
2024-06-20 02:20:11,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=111074.33333333333, ans=0.95
2024-06-20 02:20:14,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=111074.33333333333, ans=0.025
2024-06-20 02:20:35,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=111147.66666666667, ans=0.04949747468305833
2024-06-20 02:20:35,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0
2024-06-20 02:20:39,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=111147.66666666667, ans=0.0
2024-06-20 02:20:41,880 INFO [train.py:1028] (1/2) Epoch 6, batch 10050, loss[loss=0.3459, simple_loss=0.3727, pruned_loss=0.1595, over 12625.00 frames. ], tot_loss[loss=0.3337, simple_loss=0.3528, pruned_loss=0.1572, over 2444628.64 frames. ], batch size: 22, lr: 9.29e-03, grad_scale: 16.0
2024-06-20 02:20:43,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=111166.0, ans=0.0
2024-06-20 02:20:49,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=111184.33333333333, ans=0.1
2024-06-20 02:20:58,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=111221.0, ans=0.04949747468305833
2024-06-20 02:21:11,716 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.198e+02 3.690e+02 4.365e+02 5.854e+02, threshold=7.381e+02, percent-clipped=0.0
2024-06-20 02:21:11,745 INFO [train.py:1028] (1/2) Epoch 6, batch 10100, loss[loss=0.2491, simple_loss=0.2846, pruned_loss=0.1068, over 11327.00 frames. ], tot_loss[loss=0.33, simple_loss=0.3501, pruned_loss=0.155, over 2426562.73 frames. ], batch size: 16, lr: 9.28e-03, grad_scale: 16.0
2024-06-20 02:21:13,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=111257.66666666667, ans=0.0
2024-06-20 02:21:13,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=111257.66666666667, ans=0.0
2024-06-20 02:23:25,998 INFO [train.py:1028] (1/2) Epoch 7, batch 0, loss[loss=0.28, simple_loss=0.3122, pruned_loss=0.1239, over 12933.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3122, pruned_loss=0.1239, over 12933.00 frames. ], batch size: 36, lr: 8.70e-03, grad_scale: 32.0
2024-06-20 02:23:25,999 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 02:23:33,069 INFO [train.py:1060] (1/2) Epoch 7, validation: loss=0.2255, simple_loss=0.2848, pruned_loss=0.08313, over 351949.00 frames.
2024-06-20 02:23:33,070 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB
2024-06-20 02:23:55,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=111343.83333333333, ans=0.125
2024-06-20 02:24:06,390 INFO [train.py:1028] (1/2) Epoch 7, batch 50, loss[loss=0.2865, simple_loss=0.3214, pruned_loss=0.1258, over 12577.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3296, pruned_loss=0.1443, over 574153.22 frames. ], batch size: 29, lr: 8.70e-03, grad_scale: 32.0
2024-06-20 02:24:06,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=111380.5, ans=0.125
2024-06-20 02:24:27,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=111435.5, ans=0.125
2024-06-20 02:24:30,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=111435.5, ans=0.125
2024-06-20 02:24:30,719 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.573e+02 3.685e+02 4.120e+02 4.852e+02 6.556e+02, threshold=8.240e+02, percent-clipped=0.0
2024-06-20 02:24:43,955 INFO [train.py:1028] (1/2) Epoch 7, batch 100, loss[loss=0.2914, simple_loss=0.3239, pruned_loss=0.1294, over 13225.00 frames. ], tot_loss[loss=0.3072, simple_loss=0.3284, pruned_loss=0.143, over 1017260.95 frames. ], batch size: 46, lr: 8.69e-03, grad_scale: 16.0
2024-06-20 02:24:46,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=111472.16666666667, ans=0.025
2024-06-20 02:24:51,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=111490.5, ans=0.0
2024-06-20 02:24:52,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111490.5, ans=0.125
2024-06-20 02:25:02,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=111527.16666666667, ans=0.09899494936611666
2024-06-20 02:25:11,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=111545.5, ans=0.125
2024-06-20 02:25:15,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0
2024-06-20 02:25:15,662 INFO [train.py:1028] (1/2) Epoch 7, batch 150, loss[loss=0.3045, simple_loss=0.3344, pruned_loss=0.1373, over 12973.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.3276, pruned_loss=0.1412, over 1365686.80 frames. ], batch size: 30, lr: 8.69e-03, grad_scale: 16.0
2024-06-20 02:25:24,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=111582.16666666667, ans=0.0
2024-06-20 02:25:31,354 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.72 vs. limit=22.5
2024-06-20 02:25:32,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0
2024-06-20 02:25:37,308 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.375e+02 3.884e+02 4.529e+02 7.630e+02, threshold=7.768e+02, percent-clipped=0.0
2024-06-20 02:25:40,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=111637.16666666667, ans=0.025
2024-06-20 02:25:40,648 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=15.0
2024-06-20 02:25:41,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=111637.16666666667, ans=0.0
2024-06-20 02:25:47,226 INFO [train.py:1028] (1/2) Epoch 7, batch 200, loss[loss=0.3302, simple_loss=0.3366, pruned_loss=0.1619, over 12556.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3263, pruned_loss=0.14, over 1635508.87 frames. ], batch size: 202, lr: 8.69e-03, grad_scale: 16.0
2024-06-20 02:25:47,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=111655.5, ans=0.025
2024-06-20 02:25:47,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=111655.5, ans=0.0
2024-06-20 02:25:51,371 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.86 vs. limit=6.0
2024-06-20 02:26:03,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=111692.16666666667, ans=0.2
2024-06-20 02:26:03,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.42 vs. limit=22.5
2024-06-20 02:26:05,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.91 vs. limit=22.5
2024-06-20 02:26:09,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=111710.5, ans=0.125
2024-06-20 02:26:22,920 INFO [train.py:1028] (1/2) Epoch 7, batch 250, loss[loss=0.2778, simple_loss=0.2984, pruned_loss=0.1286, over 13052.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3259, pruned_loss=0.1394, over 1846791.84 frames. ], batch size: 144, lr: 8.68e-03, grad_scale: 16.0
2024-06-20 02:26:28,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111747.16666666667, ans=0.125
2024-06-20 02:26:43,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.00 vs. limit=22.5
2024-06-20 02:26:48,008 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.525e+02 3.839e+02 4.348e+02 5.982e+02, threshold=7.677e+02, percent-clipped=0.0
2024-06-20 02:26:48,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=111802.16666666667, ans=0.125
2024-06-20 02:26:50,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.86 vs. limit=10.0
2024-06-20 02:26:55,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=111820.5, ans=0.0
2024-06-20 02:26:58,407 INFO [train.py:1028] (1/2) Epoch 7, batch 300, loss[loss=0.281, simple_loss=0.2987, pruned_loss=0.1316, over 13189.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3255, pruned_loss=0.1392, over 2010086.86 frames. ], batch size: 112, lr: 8.68e-03, grad_scale: 16.0
2024-06-20 02:27:02,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.71 vs. limit=10.0
2024-06-20 02:27:12,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=111875.5, ans=0.1
2024-06-20 02:27:30,172 INFO [train.py:1028] (1/2) Epoch 7, batch 350, loss[loss=0.2797, simple_loss=0.3143, pruned_loss=0.1226, over 12863.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3243, pruned_loss=0.138, over 2139504.53 frames. ], batch size: 33, lr: 8.68e-03, grad_scale: 16.0
2024-06-20 02:27:31,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=111930.5, ans=0.125
2024-06-20 02:27:32,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=111930.5, ans=0.025
2024-06-20 02:27:35,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=22.5
2024-06-20 02:27:51,843 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.595e+02 3.164e+02 3.644e+02 4.068e+02 7.617e+02, threshold=7.289e+02, percent-clipped=0.0
2024-06-20 02:27:58,451 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.832e-01
2024-06-20 02:28:02,012 INFO [train.py:1028] (1/2) Epoch 7, batch 400, loss[loss=0.2814, simple_loss=0.3169, pruned_loss=0.1229, over 13292.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3233, pruned_loss=0.1367, over 2239974.92 frames. ], batch size: 63, lr: 8.67e-03, grad_scale: 32.0
2024-06-20 02:28:16,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=112040.5, ans=0.0
2024-06-20 02:28:16,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=112040.5, ans=0.0
2024-06-20 02:28:17,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=112058.83333333333, ans=0.035
2024-06-20 02:28:22,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=112058.83333333333, ans=0.025
2024-06-20 02:28:27,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=112077.16666666667, ans=0.07
2024-06-20 02:28:32,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=112095.5, ans=0.125
2024-06-20 02:28:34,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=112095.5, ans=0.0
2024-06-20 02:28:36,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=112113.83333333333, ans=0.0
2024-06-20 02:28:36,600 INFO [train.py:1028] (1/2) Epoch 7, batch 450, loss[loss=0.3088, simple_loss=0.3362, pruned_loss=0.1407, over 13191.00 frames. ], tot_loss[loss=0.2985, simple_loss=0.3235, pruned_loss=0.1368, over 2314853.39 frames. ], batch size: 67, lr: 8.67e-03, grad_scale: 32.0
2024-06-20 02:28:50,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=112132.16666666667, ans=0.0
2024-06-20 02:28:50,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=112132.16666666667, ans=0.05
2024-06-20 02:28:51,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=112132.16666666667, ans=0.0
2024-06-20 02:28:58,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=112168.83333333333, ans=0.5
2024-06-20 02:29:01,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.33 vs. limit=22.5
2024-06-20 02:29:02,425 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.014e+02 3.303e+02 3.735e+02 5.209e+02, threshold=6.606e+02, percent-clipped=0.0
2024-06-20 02:29:04,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.55 vs. limit=15.0
2024-06-20 02:29:05,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=112187.16666666667, ans=0.0
2024-06-20 02:29:06,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=112187.16666666667, ans=0.0
2024-06-20 02:29:09,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=112187.16666666667, ans=0.0
2024-06-20 02:29:12,044 INFO [train.py:1028] (1/2) Epoch 7, batch 500, loss[loss=0.2824, simple_loss=0.3069, pruned_loss=0.1289, over 13085.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3236, pruned_loss=0.1365, over 2376402.46 frames. ], batch size: 121, lr: 8.66e-03, grad_scale: 16.0
2024-06-20 02:29:12,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=112205.5, ans=0.025
2024-06-20 02:29:16,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=112205.5, ans=10.0
2024-06-20 02:29:18,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=112223.83333333333, ans=0.125
2024-06-20 02:29:18,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.34 vs. limit=22.5
2024-06-20 02:29:23,313 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.24 vs. limit=15.0
2024-06-20 02:29:33,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=112260.5, ans=0.025
2024-06-20 02:29:33,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=112260.5, ans=0.125
2024-06-20 02:29:36,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0
2024-06-20 02:29:42,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=112297.16666666667, ans=0.125
2024-06-20 02:29:43,395 INFO [train.py:1028] (1/2) Epoch 7, batch 550, loss[loss=0.3149, simple_loss=0.3321, pruned_loss=0.1489, over 12918.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3235, pruned_loss=0.1365, over 2421349.71 frames. ], batch size: 158, lr: 8.66e-03, grad_scale: 16.0
2024-06-20 02:29:53,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=112315.5, ans=0.025
2024-06-20 02:29:58,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.76 vs. limit=15.0
2024-06-20 02:30:05,446 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 3.220e+02 3.613e+02 4.202e+02 5.786e+02, threshold=7.226e+02, percent-clipped=0.0
2024-06-20 02:30:11,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=112370.5, ans=0.025
2024-06-20 02:30:12,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=112370.5, ans=0.125
2024-06-20 02:30:14,871 INFO [train.py:1028] (1/2) Epoch 7, batch 600, loss[loss=0.2645, simple_loss=0.2833, pruned_loss=0.1229, over 13043.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3237, pruned_loss=0.1364, over 2457619.73 frames. ], batch size: 144, lr: 8.66e-03, grad_scale: 16.0
2024-06-20 02:30:51,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0
2024-06-20 02:30:51,474 INFO [train.py:1028] (1/2) Epoch 7, batch 650, loss[loss=0.2977, simple_loss=0.3244, pruned_loss=0.1354, over 13183.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3237, pruned_loss=0.1359, over 2489567.15 frames. ], batch size: 59, lr: 8.65e-03, grad_scale: 16.0
2024-06-20 02:30:52,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=112480.5, ans=0.125
2024-06-20 02:31:07,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=112498.83333333333, ans=0.2
2024-06-20 02:31:19,295 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.165e+02 3.568e+02 3.998e+02 5.322e+02, threshold=7.136e+02, percent-clipped=0.0
2024-06-20 02:31:25,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0
2024-06-20 02:31:28,906 INFO [train.py:1028] (1/2) Epoch 7, batch 700, loss[loss=0.2887, simple_loss=0.3192, pruned_loss=0.1291, over 13281.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.324, pruned_loss=0.1364, over 2511978.36 frames. ], batch size: 46, lr: 8.65e-03, grad_scale: 16.0
2024-06-20 02:31:31,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.75 vs. limit=15.0
2024-06-20 02:31:38,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.63 vs. limit=15.0
2024-06-20 02:31:50,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.14 vs. limit=15.0
2024-06-20 02:31:59,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=112645.5, ans=0.025
2024-06-20 02:31:59,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112645.5, ans=0.1
2024-06-20 02:32:00,862 INFO [train.py:1028] (1/2) Epoch 7, batch 750, loss[loss=0.2799, simple_loss=0.3156, pruned_loss=0.1221, over 13268.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3238, pruned_loss=0.1358, over 2527301.27 frames. ], batch size: 63, lr: 8.65e-03, grad_scale: 16.0
2024-06-20 02:32:01,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0
2024-06-20 02:32:10,912 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.40 vs. limit=10.0
2024-06-20 02:32:11,510 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.62 vs. limit=10.0
2024-06-20 02:32:16,990 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0
2024-06-20 02:32:18,357 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.58 vs. limit=22.5
2024-06-20 02:32:23,187 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.459e+02 3.159e+02 3.471e+02 4.010e+02 6.766e+02, threshold=6.943e+02, percent-clipped=0.0
2024-06-20 02:32:30,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=112737.16666666667, ans=0.0
2024-06-20 02:32:35,610 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.49 vs. limit=15.0
2024-06-20 02:32:35,780 INFO [train.py:1028] (1/2) Epoch 7, batch 800, loss[loss=0.2755, simple_loss=0.3076, pruned_loss=0.1217, over 12980.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3231, pruned_loss=0.1353, over 2539826.35 frames. ], batch size: 36, lr: 8.64e-03, grad_scale: 32.0
2024-06-20 02:32:40,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=112755.5, ans=0.0
2024-06-20 02:32:44,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=112773.83333333333, ans=0.025
2024-06-20 02:32:46,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=112773.83333333333, ans=0.2
2024-06-20 02:32:49,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.31 vs. limit=22.5
2024-06-20 02:32:53,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=112792.16666666667, ans=0.2
2024-06-20 02:32:55,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=112810.5, ans=0.0
2024-06-20 02:33:00,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.57 vs. limit=10.0
2024-06-20 02:33:08,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=112828.83333333333, ans=0.125
2024-06-20 02:33:10,601 INFO [train.py:1028] (1/2) Epoch 7, batch 850, loss[loss=0.266, simple_loss=0.2943, pruned_loss=0.1188, over 13142.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3218, pruned_loss=0.1346, over 2551473.97 frames. ], batch size: 95, lr: 8.64e-03, grad_scale: 32.0
2024-06-20 02:33:12,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=112847.16666666667, ans=0.125
2024-06-20 02:33:13,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=112847.16666666667, ans=0.0
2024-06-20 02:33:17,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=112865.5, ans=0.05
2024-06-20 02:33:18,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=112865.5, ans=0.1
2024-06-20 02:33:22,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=112883.83333333333, ans=0.125
2024-06-20 02:33:29,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=112902.16666666667, ans=0.125
2024-06-20 02:33:29,243 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:33:29,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.96 vs. limit=15.0
2024-06-20 02:33:33,530 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.292e+02 3.077e+02 3.365e+02 3.860e+02 6.017e+02, threshold=6.730e+02, percent-clipped=0.0
2024-06-20 02:33:34,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=112902.16666666667, ans=0.2
2024-06-20 02:33:42,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=112938.83333333333, ans=0.0
2024-06-20 02:33:43,179 INFO [train.py:1028] (1/2) Epoch 7, batch 900, loss[loss=0.2993, simple_loss=0.3349, pruned_loss=0.1319, over 12897.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3221, pruned_loss=0.1352, over 2556905.96 frames. ], batch size: 36, lr: 8.64e-03, grad_scale: 16.0
2024-06-20 02:34:02,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=112993.83333333333, ans=0.05
2024-06-20 02:34:05,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=112993.83333333333, ans=0.125
2024-06-20 02:34:11,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=113012.16666666667, ans=0.125
2024-06-20 02:34:14,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.65 vs. limit=8.0
2024-06-20 02:34:15,782 INFO [train.py:1028] (1/2) Epoch 7, batch 950, loss[loss=0.3096, simple_loss=0.3469, pruned_loss=0.1361, over 12961.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3224, pruned_loss=0.1351, over 2559747.19 frames. ], batch size: 39, lr: 8.63e-03, grad_scale: 16.0
2024-06-20 02:34:20,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.56 vs. limit=22.5
2024-06-20 02:34:32,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=113067.16666666667, ans=15.0
2024-06-20 02:34:34,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=113067.16666666667, ans=0.0
2024-06-20 02:34:42,587 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.502e+02 3.293e+02 3.623e+02 4.262e+02 9.004e+02, threshold=7.247e+02, percent-clipped=2.0
2024-06-20 02:34:44,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113103.83333333333, ans=0.125
2024-06-20 02:34:51,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=113122.16666666667, ans=0.125
2024-06-20 02:34:51,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.09 vs. limit=22.5
2024-06-20 02:34:51,568 INFO [train.py:1028] (1/2) Epoch 7, batch 1000, loss[loss=0.2962, simple_loss=0.3194, pruned_loss=0.1365, over 13246.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3223, pruned_loss=0.1356, over 2562281.82 frames. ], batch size: 49, lr: 8.63e-03, grad_scale: 16.0
2024-06-20 02:34:53,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=113122.16666666667, ans=0.035
2024-06-20 02:34:56,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=113122.16666666667, ans=0.0
2024-06-20 02:35:02,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=113140.5, ans=0.125
2024-06-20 02:35:24,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=113195.5, ans=0.0
2024-06-20 02:35:24,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=113195.5, ans=0.0
2024-06-20 02:35:26,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.85 vs. limit=22.5
2024-06-20 02:35:28,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113213.83333333333, ans=0.1
2024-06-20 02:35:28,720 INFO [train.py:1028] (1/2) Epoch 7, batch 1050, loss[loss=0.2765, simple_loss=0.3069, pruned_loss=0.123, over 13145.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3222, pruned_loss=0.1355, over 2565613.94 frames. ], batch size: 77, lr: 8.63e-03, grad_scale: 16.0
2024-06-20 02:35:36,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=113232.16666666667, ans=0.125
2024-06-20 02:35:37,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=113232.16666666667, ans=0.09899494936611666
2024-06-20 02:35:51,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=113268.83333333333, ans=0.125
2024-06-20 02:35:51,957 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.293e+02 3.684e+02 4.272e+02 6.892e+02, threshold=7.367e+02, percent-clipped=0.0
2024-06-20 02:35:56,369 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=15.0
2024-06-20 02:35:56,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=113287.16666666667, ans=0.2
2024-06-20 02:35:59,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=113287.16666666667, ans=0.025
2024-06-20 02:36:00,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.36 vs. limit=10.0
2024-06-20 02:36:00,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113287.16666666667, ans=0.125
2024-06-20 02:36:02,060 INFO [train.py:1028] (1/2) Epoch 7, batch 1100, loss[loss=0.2832, simple_loss=0.3182, pruned_loss=0.1241, over 13273.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3231, pruned_loss=0.1359, over 2570127.60 frames. ], batch size: 52, lr: 8.62e-03, grad_scale: 8.0
2024-06-20 02:36:02,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=113305.5, ans=0.125
2024-06-20 02:36:02,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=113305.5, ans=0.2
2024-06-20 02:36:11,550 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=4.072e+01
2024-06-20 02:36:28,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.36 vs. limit=12.0
2024-06-20 02:36:33,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=18.21 vs. limit=15.0
2024-06-20 02:36:34,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=113378.83333333333, ans=0.0
2024-06-20 02:36:37,381 INFO [train.py:1028] (1/2) Epoch 7, batch 1150, loss[loss=0.3024, simple_loss=0.3312, pruned_loss=0.1368, over 13289.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3235, pruned_loss=0.1361, over 2570914.97 frames. ], batch size: 52, lr: 8.62e-03, grad_scale: 8.0
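Note: the scaling.py:214 lines track ScheduledFloat values: scalar hyperparameters (dropout rates, skip rates, balancer probabilities) whose current value ("ans=...") is a function of batch_count. A plausible minimal model is a piecewise-linear schedule over (batch_count, value) breakpoints; the sketch below is a hedged reconstruction under that assumption, not the actual class in icefall's scaling.py.

```python
class PiecewiseLinearFloat:
    """Float that interpolates linearly between (batch_count, value) points.

    Hypothetical stand-in for the ScheduledFloat objects whose current
    values the log prints against batch_count; endpoints are held constant
    outside the given range.
    """

    def __init__(self, *points):
        self.points = sorted(points)  # e.g. (0.0, 0.1), (20000.0, 0.0)

    def value(self, batch_count: float) -> float:
        (x0, y0) = self.points[0]
        if batch_count <= x0:
            return y0
        for (x1, y1) in self.points[1:]:
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
            x0, y0 = x1, y1
        return y0

# e.g. a skip-rate that decays from 0.1 to 0.0 over the first 20k batches
# (the breakpoints here are made up for illustration):
skip_rate = PiecewiseLinearFloat((0.0, 0.1), (20000.0, 0.0))
assert skip_rate.value(113232.16666666667) == 0.0
```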
2024-06-20 02:36:38,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=113397.16666666667, ans=0.125
2024-06-20 02:36:39,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=113397.16666666667, ans=0.1
2024-06-20 02:36:44,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=113415.5, ans=0.125
2024-06-20 02:36:46,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113415.5, ans=0.125
2024-06-20 02:36:49,142 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.30 vs. limit=10.0
2024-06-20 02:36:56,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=113433.83333333333, ans=0.0
2024-06-20 02:37:04,126 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.082e+02 3.526e+02 4.163e+02 7.819e+02, threshold=7.052e+02, percent-clipped=1.0
2024-06-20 02:37:05,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=113470.5, ans=0.125
2024-06-20 02:37:10,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=113470.5, ans=0.125
2024-06-20 02:37:12,551 INFO [train.py:1028] (1/2) Epoch 7, batch 1200, loss[loss=0.2758, simple_loss=0.3161, pruned_loss=0.1177, over 13135.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3239, pruned_loss=0.1364, over 2573362.85 frames. ], batch size: 77, lr: 8.62e-03, grad_scale: 16.0
2024-06-20 02:37:22,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=113507.16666666667, ans=0.0
2024-06-20 02:37:27,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=113525.5, ans=0.125
2024-06-20 02:37:36,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0
2024-06-20 02:37:44,942 INFO [train.py:1028] (1/2) Epoch 7, batch 1250, loss[loss=0.2758, simple_loss=0.3114, pruned_loss=0.1201, over 13153.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3237, pruned_loss=0.136, over 2583045.94 frames. ], batch size: 112, lr: 8.61e-03, grad_scale: 16.0
2024-06-20 02:37:49,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=113580.5, ans=0.2
2024-06-20 02:37:51,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=113598.83333333333, ans=0.125
2024-06-20 02:38:00,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=113617.16666666667, ans=0.125
2024-06-20 02:38:03,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=113617.16666666667, ans=0.2
2024-06-20 02:38:03,736 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0
2024-06-20 02:38:09,278 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.036e+02 3.306e+02 3.625e+02 5.609e+02, threshold=6.611e+02, percent-clipped=0.0
2024-06-20 02:38:11,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.59 vs. limit=22.5
2024-06-20 02:38:16,657 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.04 vs. limit=10.0
2024-06-20 02:38:17,652 INFO [train.py:1028] (1/2) Epoch 7, batch 1300, loss[loss=0.3007, simple_loss=0.3221, pruned_loss=0.1396, over 12715.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3233, pruned_loss=0.1355, over 2583582.08 frames. ], batch size: 176, lr: 8.61e-03, grad_scale: 8.0
2024-06-20 02:38:21,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=113672.16666666667, ans=0.0
2024-06-20 02:38:33,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=113708.83333333333, ans=0.025
2024-06-20 02:38:43,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=113727.16666666667, ans=0.0
2024-06-20 02:38:45,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=113727.16666666667, ans=0.0
2024-06-20 02:38:49,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=113745.5, ans=0.0
2024-06-20 02:38:53,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.00 vs. limit=10.0
2024-06-20 02:38:54,170 INFO [train.py:1028] (1/2) Epoch 7, batch 1350, loss[loss=0.28, simple_loss=0.3082, pruned_loss=0.1259, over 13209.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3226, pruned_loss=0.1349, over 2584920.51 frames. ], batch size: 59, lr: 8.61e-03, grad_scale: 4.0
2024-06-20 02:39:02,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=113782.16666666667, ans=0.125
2024-06-20 02:39:05,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113782.16666666667, ans=0.1
2024-06-20 02:39:14,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=113800.5, ans=0.05
2024-06-20 02:39:20,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=113818.83333333333, ans=0.125
2024-06-20 02:39:22,275 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.20 vs. limit=22.5
2024-06-20 02:39:25,103 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:39:25,509 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.595e+02 3.135e+02 3.601e+02 4.113e+02 9.413e+02, threshold=7.203e+02, percent-clipped=3.0
2024-06-20 02:39:25,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113837.16666666667, ans=0.125
2024-06-20 02:39:29,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=113837.16666666667, ans=0.125
2024-06-20 02:39:31,933 INFO [train.py:1028] (1/2) Epoch 7, batch 1400, loss[loss=0.2746, simple_loss=0.3004, pruned_loss=0.1244, over 12469.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3226, pruned_loss=0.1353, over 2586012.86 frames. ], batch size: 25, lr: 8.60e-03, grad_scale: 8.0
2024-06-20 02:39:34,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.52 vs. limit=15.0
2024-06-20 02:39:38,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.84 vs. limit=22.5
2024-06-20 02:39:44,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=113873.83333333333, ans=0.125
2024-06-20 02:39:44,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=113873.83333333333, ans=0.0
2024-06-20 02:39:52,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=113910.5, ans=0.125
2024-06-20 02:39:57,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=113928.83333333333, ans=0.0
2024-06-20 02:40:00,687 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=9.643e-01
2024-06-20 02:40:01,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113928.83333333333, ans=0.1
2024-06-20 02:40:05,073 INFO [train.py:1028] (1/2) Epoch 7, batch 1450, loss[loss=0.2855, simple_loss=0.3083, pruned_loss=0.1313, over 13168.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3228, pruned_loss=0.1355, over 2585464.32 frames. ], batch size: 121, lr: 8.60e-03, grad_scale: 8.0
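Note: the scaling.py:1023 Whitening lines compare a per-module statistic ("metric") against a limit; the whitening penalty only engages once the metric exceeds the limit, which is why most lines sit just above or below it. One statistic with exactly this behaviour, equal to 1.0 for a perfectly isotropic (white) channel covariance and growing as the spectrum concentrates, is num_channels · tr(C²)/tr(C)². The function below computes that quantity; it is an assumed reconstruction, and the real metric in scaling.py may differ in its normalisation details.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Deviation of the channel covariance from a multiple of identity.

    x: (num_frames, num_channels). Returns 1.0 if the covariance of each
    channel group is isotropic, larger otherwise. Assumed form only.
    """
    num_frames, num_channels = x.shape
    group = num_channels // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * group:(g + 1) * group]
        xg = xg - xg.mean(dim=0)
        cov = (xg.T @ xg) / num_frames   # (group, group) covariance
        tr = cov.diagonal().sum()
        tr2 = (cov * cov).sum()          # trace(cov @ cov), cov is symmetric
        metrics.append(group * tr2 / (tr * tr + 1e-20))
    return torch.stack(metrics).mean().item()
```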
2024-06-20 02:40:15,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=113965.5, ans=0.0
2024-06-20 02:40:23,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=113983.83333333333, ans=0.0
2024-06-20 02:40:25,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=114002.16666666667, ans=0.125
2024-06-20 02:40:33,540 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.895e+02 3.254e+02 3.707e+02 1.249e+03, threshold=6.509e+02, percent-clipped=1.0
2024-06-20 02:40:39,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=114038.83333333333, ans=0.0
2024-06-20 02:40:40,204 INFO [train.py:1028] (1/2) Epoch 7, batch 1500, loss[loss=0.2831, simple_loss=0.3102, pruned_loss=0.128, over 13236.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.322, pruned_loss=0.1349, over 2587731.57 frames. ], batch size: 83, lr: 8.60e-03, grad_scale: 8.0
2024-06-20 02:40:41,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=114038.83333333333, ans=0.0
2024-06-20 02:40:45,362 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=7.183e+02
2024-06-20 02:40:51,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=114057.16666666667, ans=0.125
2024-06-20 02:40:59,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=114093.83333333333, ans=0.07
2024-06-20 02:41:02,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=114093.83333333333, ans=0.1
2024-06-20 02:41:04,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=114093.83333333333, ans=0.125
2024-06-20 02:41:04,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=114093.83333333333, ans=0.0
2024-06-20 02:41:10,614 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.69 vs. limit=15.0
2024-06-20 02:41:15,779 INFO [train.py:1028] (1/2) Epoch 7, batch 1550, loss[loss=0.29, simple_loss=0.3172, pruned_loss=0.1314, over 13173.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3233, pruned_loss=0.1357, over 2583225.81 frames. ], batch size: 103, lr: 8.59e-03, grad_scale: 8.0
2024-06-20 02:41:16,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=114130.5, ans=0.025
2024-06-20 02:41:17,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=114130.5, ans=0.0
2024-06-20 02:41:39,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.02 vs. limit=22.5
2024-06-20 02:41:39,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=114185.5, ans=0.0
2024-06-20 02:41:41,703 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.988e+02 3.266e+02 3.782e+02 7.300e+02, threshold=6.532e+02, percent-clipped=1.0
2024-06-20 02:41:45,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=114203.83333333333, ans=0.125
2024-06-20 02:41:45,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=114203.83333333333, ans=0.2
2024-06-20 02:41:48,068 INFO [train.py:1028] (1/2) Epoch 7, batch 1600, loss[loss=0.2843, simple_loss=0.3125, pruned_loss=0.1281, over 13167.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3227, pruned_loss=0.1354, over 2578704.67 frames. ], batch size: 77, lr: 8.59e-03, grad_scale: 16.0
2024-06-20 02:42:04,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=114258.83333333333, ans=0.125
2024-06-20 02:42:05,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114258.83333333333, ans=0.1
2024-06-20 02:42:07,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.61 vs. limit=15.0
2024-06-20 02:42:10,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=114277.16666666667, ans=0.0
2024-06-20 02:42:22,655 INFO [train.py:1028] (1/2) Epoch 7, batch 1650, loss[loss=0.2991, simple_loss=0.3196, pruned_loss=0.1393, over 13173.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3229, pruned_loss=0.1357, over 2573768.99 frames. ], batch size: 95, lr: 8.59e-03, grad_scale: 16.0
2024-06-20 02:42:25,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=114313.83333333333, ans=15.0
2024-06-20 02:42:26,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=114313.83333333333, ans=10.0
2024-06-20 02:42:29,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.33 vs. limit=15.0
2024-06-20 02:42:38,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=15.0
2024-06-20 02:42:40,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.47 vs. limit=10.0
2024-06-20 02:42:45,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=114368.83333333333, ans=0.0
2024-06-20 02:42:47,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=114368.83333333333, ans=0.2
2024-06-20 02:42:49,033 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.588e+02 2.836e+02 3.313e+02 4.541e+02, threshold=5.671e+02, percent-clipped=0.0
2024-06-20 02:42:54,160 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.97 vs. limit=22.5
2024-06-20 02:42:55,636 INFO [train.py:1028] (1/2) Epoch 7, batch 1700, loss[loss=0.2912, simple_loss=0.3254, pruned_loss=0.1285, over 12974.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3232, pruned_loss=0.1356, over 2579905.48 frames. ], batch size: 26, lr: 8.58e-03, grad_scale: 16.0
2024-06-20 02:42:57,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=114405.5, ans=0.125
2024-06-20 02:43:04,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=114423.83333333333, ans=0.125
2024-06-20 02:43:05,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.46 vs. limit=5.0
2024-06-20 02:43:11,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=114442.16666666667, ans=0.025
2024-06-20 02:43:14,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=114442.16666666667, ans=0.0
2024-06-20 02:43:31,735 INFO [train.py:1028] (1/2) Epoch 7, batch 1750, loss[loss=0.3249, simple_loss=0.3514, pruned_loss=0.1492, over 12497.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.323, pruned_loss=0.1351, over 2581042.65 frames. ], batch size: 22, lr: 8.58e-03, grad_scale: 16.0
2024-06-20 02:43:33,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=114497.16666666667, ans=0.95
2024-06-20 02:43:35,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=114497.16666666667, ans=0.125
2024-06-20 02:43:38,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.79 vs. limit=15.0
2024-06-20 02:43:39,673 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=2.586e-03
2024-06-20 02:43:51,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=114552.16666666667, ans=0.125
2024-06-20 02:43:52,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=114552.16666666667, ans=0.0
2024-06-20 02:43:52,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=114552.16666666667, ans=12.0
2024-06-20 02:43:53,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=114552.16666666667, ans=0.125
2024-06-20 02:43:54,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=114552.16666666667, ans=0.0
2024-06-20 02:43:56,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=114552.16666666667, ans=0.125
2024-06-20 02:43:57,836 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.929e+02 3.362e+02 3.859e+02 5.647e+02, threshold=6.723e+02, percent-clipped=0.0
2024-06-20 02:44:03,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.12 vs. limit=15.0
2024-06-20 02:44:04,000 INFO [train.py:1028] (1/2) Epoch 7, batch 1800, loss[loss=0.2898, simple_loss=0.3133, pruned_loss=0.1332, over 13278.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3237, pruned_loss=0.136, over 2581581.99 frames. ], batch size: 67, lr: 8.58e-03, grad_scale: 16.0
2024-06-20 02:44:09,495 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.21 vs. limit=15.0
2024-06-20 02:44:12,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=114607.16666666667, ans=0.125
2024-06-20 02:44:21,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=114625.5, ans=0.0
2024-06-20 02:44:27,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=114643.83333333333, ans=0.125
2024-06-20 02:44:29,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114643.83333333333, ans=0.1
2024-06-20 02:44:39,397 INFO [train.py:1028] (1/2) Epoch 7, batch 1850, loss[loss=0.2944, simple_loss=0.3221, pruned_loss=0.1334, over 13207.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3236, pruned_loss=0.1356, over 2582812.05 frames. ], batch size: 83, lr: 8.57e-03, grad_scale: 16.0
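Note: each train.py:1028 record prints three numbers per batch: loss, simple_loss and pruned_loss. With the simple_loss_scale of 0.5 from the startup configuration, every record in this section is consistent with loss = 0.5 · simple_loss + pruned_loss (batch 1800 above: 0.5 × 0.3133 + 0.1332 = 0.28985 ≈ 0.2898). A one-line check, assuming that weighting:

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Weighted combination consistent with the logged totals; the scale
    # value comes from the configuration printed at startup.
    return simple_loss_scale * simple_loss + pruned_loss

# Batch 1800 of epoch 7, from the record above:
assert abs(combined_loss(0.3133, 0.1332) - 0.2898) < 5e-4
```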
2024-06-20 02:44:40,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114680.5, ans=0.1
2024-06-20 02:44:42,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=114680.5, ans=0.0
2024-06-20 02:44:42,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=114680.5, ans=0.2
2024-06-20 02:44:43,109 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=12.0
2024-06-20 02:44:43,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.89 vs. limit=15.0
2024-06-20 02:44:49,971 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:45:00,079 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:45:05,058 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.736e+02 3.067e+02 3.410e+02 4.860e+02, threshold=6.133e+02, percent-clipped=0.0
2024-06-20 02:45:09,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=114753.83333333333, ans=15.0
2024-06-20 02:45:14,299 INFO [train.py:1028] (1/2) Epoch 7, batch 1900, loss[loss=0.284, simple_loss=0.3111, pruned_loss=0.1284, over 13169.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3226, pruned_loss=0.135, over 2584421.22 frames. ], batch size: 95, lr: 8.57e-03, grad_scale: 16.0
2024-06-20 02:45:25,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=114790.5, ans=0.0
2024-06-20 02:45:47,056 INFO [train.py:1028] (1/2) Epoch 7, batch 1950, loss[loss=0.2778, simple_loss=0.3114, pruned_loss=0.1221, over 13266.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3216, pruned_loss=0.1348, over 2590112.58 frames. ], batch size: 52, lr: 8.57e-03, grad_scale: 16.0
2024-06-20 02:45:48,059 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.96 vs. limit=6.0
2024-06-20 02:45:48,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=114863.83333333333, ans=0.2
2024-06-20 02:46:00,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=114900.5, ans=0.125
2024-06-20 02:46:01,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=114900.5, ans=0.0
2024-06-20 02:46:05,109 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.75 vs. limit=15.0
2024-06-20 02:46:07,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=114918.83333333333, ans=0.125
2024-06-20 02:46:12,784 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.618e+02 2.813e+02 3.234e+02 8.250e+02, threshold=5.625e+02, percent-clipped=1.0
2024-06-20 02:46:17,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114937.16666666667, ans=0.1
2024-06-20 02:46:18,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=114937.16666666667, ans=0.2
2024-06-20 02:46:22,171 INFO [train.py:1028] (1/2) Epoch 7, batch 2000, loss[loss=0.3244, simple_loss=0.3581, pruned_loss=0.1453, over 12457.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.322, pruned_loss=0.1352, over 2587139.27 frames. ], batch size: 22, lr: 8.56e-03, grad_scale: 32.0
2024-06-20 02:46:26,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=114955.5, ans=0.0
2024-06-20 02:46:28,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=114973.83333333333, ans=0.0
2024-06-20 02:46:29,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=114973.83333333333, ans=0.0
2024-06-20 02:46:30,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=114973.83333333333, ans=0.125
2024-06-20 02:46:33,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=114973.83333333333, ans=0.125
2024-06-20 02:46:36,819 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=25.54 vs. limit=22.5
2024-06-20 02:46:43,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=115010.5, ans=0.07
2024-06-20 02:46:47,149 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.01 vs. limit=15.0
2024-06-20 02:46:48,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=115028.83333333333, ans=0.125
2024-06-20 02:46:50,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=115028.83333333333, ans=0.2
2024-06-20 02:46:53,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=115028.83333333333, ans=0.125
2024-06-20 02:46:53,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=115047.16666666667, ans=0.0
2024-06-20 02:46:53,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=115047.16666666667, ans=0.025
2024-06-20 02:46:54,145 INFO [train.py:1028] (1/2) Epoch 7, batch 2050, loss[loss=0.2846, simple_loss=0.3208, pruned_loss=0.1243, over 12632.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3224, pruned_loss=0.1355, over 2582223.97 frames. ], batch size: 29, lr: 8.56e-03, grad_scale: 32.0
2024-06-20 02:46:58,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.08 vs. limit=10.0
2024-06-20 02:47:00,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=115065.5, ans=0.125
2024-06-20 02:47:10,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=115083.83333333333, ans=0.0
2024-06-20 02:47:11,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=115083.83333333333, ans=0.0
2024-06-20 02:47:11,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=115083.83333333333, ans=0.0
2024-06-20 02:47:22,456 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.871e+02 3.171e+02 3.710e+02 5.065e+02, threshold=6.343e+02, percent-clipped=0.0
2024-06-20 02:47:25,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=115120.5, ans=0.125
2024-06-20 02:47:28,263 INFO [train.py:1028] (1/2) Epoch 7, batch 2100, loss[loss=0.2865, simple_loss=0.315, pruned_loss=0.129, over 13203.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.322, pruned_loss=0.1345, over 2585245.90 frames. ], batch size: 59, lr: 8.56e-03, grad_scale: 16.0
2024-06-20 02:47:31,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=115138.83333333333, ans=0.0
2024-06-20 02:47:31,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=115138.83333333333, ans=0.0
2024-06-20 02:47:33,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=115138.83333333333, ans=0.2
2024-06-20 02:47:46,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115175.5, ans=0.1
2024-06-20 02:47:46,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=115193.83333333333, ans=0.125
2024-06-20 02:47:47,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=115193.83333333333, ans=0.125
2024-06-20 02:47:48,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115193.83333333333, ans=0.1
2024-06-20 02:47:57,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=115212.16666666667, ans=0.2
2024-06-20 02:47:59,989 INFO [train.py:1028] (1/2) Epoch 7, batch 2150, loss[loss=0.2963, simple_loss=0.3254, pruned_loss=0.1336, over 13219.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3223, pruned_loss=0.1345, over 2588238.98 frames. ], batch size: 52, lr: 8.55e-03, grad_scale: 16.0
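Note: grad_scale in these records is the dynamic fp16 loss scale (the run was started with use_fp16: True): it is halved when a step overflows and grown back after a run of clean steps, which is why it wanders between 4.0 and 32.0 across the batches above. Below is a generic torch.cuda.amp training step showing where such a scale lives; model, optimizer and compute_loss are placeholders, and the assumption that the logged value corresponds to GradScaler.get_scale() is ours, not the log's.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # holds the dynamic loss scale

def train_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # fp16 forward pass
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscales; skips step on inf/nan
    scaler.update()                           # halves on overflow, grows later
    return loss.detach(), scaler.get_scale()  # assumed source of "grad_scale"
```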
2024-06-20 02:48:05,963 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.496e+01
2024-06-20 02:48:07,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=115248.83333333333, ans=0.0
2024-06-20 02:48:16,438 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.14 vs. limit=6.0
2024-06-20 02:48:28,792 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.860e+02 3.126e+02 3.434e+02 5.560e+02, threshold=6.253e+02, percent-clipped=0.0
2024-06-20 02:48:29,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=115303.83333333333, ans=0.125
2024-06-20 02:48:29,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=115303.83333333333, ans=0.125
2024-06-20 02:48:29,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115303.83333333333, ans=0.1
2024-06-20 02:48:34,785 INFO [train.py:1028] (1/2) Epoch 7, batch 2200, loss[loss=0.3239, simple_loss=0.3378, pruned_loss=0.155, over 13213.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3227, pruned_loss=0.1351, over 2588306.31 frames. ], batch size: 83, lr: 8.55e-03, grad_scale: 16.0
2024-06-20 02:49:07,134 INFO [train.py:1028] (1/2) Epoch 7, batch 2250, loss[loss=0.2915, simple_loss=0.319, pruned_loss=0.132, over 13291.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3217, pruned_loss=0.1343, over 2587508.77 frames. ], batch size: 63, lr: 8.55e-03, grad_scale: 16.0
2024-06-20 02:49:28,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=115450.5, ans=0.125
2024-06-20 02:49:32,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.02 vs. limit=22.5
2024-06-20 02:49:36,961 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.802e+02 3.115e+02 3.434e+02 5.571e+02, threshold=6.230e+02, percent-clipped=0.0
2024-06-20 02:49:37,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=115487.16666666667, ans=0.5
2024-06-20 02:49:39,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=115487.16666666667, ans=0.125
2024-06-20 02:49:41,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.96 vs. limit=15.0
2024-06-20 02:49:42,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=115505.5, ans=0.125
2024-06-20 02:49:43,263 INFO [train.py:1028] (1/2) Epoch 7, batch 2300, loss[loss=0.2912, simple_loss=0.3314, pruned_loss=0.1255, over 12936.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3214, pruned_loss=0.1339, over 2582145.20 frames. ], batch size: 33, lr: 8.54e-03, grad_scale: 16.0
2024-06-20 02:49:45,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.76 vs. limit=22.5
2024-06-20 02:49:50,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.94 vs. limit=15.0
2024-06-20 02:49:58,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=115542.16666666667, ans=0.0
2024-06-20 02:50:05,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=115560.5, ans=0.125
2024-06-20 02:50:16,333 INFO [train.py:1028] (1/2) Epoch 7, batch 2350, loss[loss=0.3026, simple_loss=0.3311, pruned_loss=0.1371, over 13204.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3217, pruned_loss=0.1345, over 2584817.17 frames. ], batch size: 67, lr: 8.54e-03, grad_scale: 16.0
2024-06-20 02:50:21,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.04 vs. limit=12.0
2024-06-20 02:50:28,330 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.05 vs. limit=15.0
2024-06-20 02:50:29,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.70 vs. limit=15.0
2024-06-20 02:50:32,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=115633.83333333333, ans=0.0
2024-06-20 02:50:33,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115633.83333333333, ans=0.1
2024-06-20 02:50:46,567 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.439e+02 3.021e+02 3.471e+02 4.112e+02 5.806e+02, threshold=6.941e+02, percent-clipped=0.0
2024-06-20 02:50:51,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=115670.5, ans=0.125
2024-06-20 02:50:51,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=115688.83333333333, ans=0.0
2024-06-20 02:50:51,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=115688.83333333333, ans=0.125
2024-06-20 02:50:52,331 INFO [train.py:1028] (1/2) Epoch 7, batch 2400, loss[loss=0.2847, simple_loss=0.3174, pruned_loss=0.1261, over 13271.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3216, pruned_loss=0.1342, over 2587638.83 frames. ], batch size: 46, lr: 8.54e-03, grad_scale: 32.0
2024-06-20 02:51:12,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=115725.5, ans=0.2
2024-06-20 02:51:16,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=115743.83333333333, ans=0.025
2024-06-20 02:51:20,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=115762.16666666667, ans=0.025
2024-06-20 02:51:20,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=115762.16666666667, ans=0.025
2024-06-20 02:51:26,965 INFO [train.py:1028] (1/2) Epoch 7, batch 2450, loss[loss=0.3058, simple_loss=0.3256, pruned_loss=0.143, over 13297.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3204, pruned_loss=0.134, over 2584760.85 frames. ], batch size: 63, lr: 8.53e-03, grad_scale: 16.0
2024-06-20 02:51:29,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115780.5, ans=0.1
2024-06-20 02:51:36,993 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.150e+01
2024-06-20 02:51:40,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=115817.16666666667, ans=0.125
2024-06-20 02:51:43,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=115817.16666666667, ans=0.0
2024-06-20 02:51:46,713 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:51:53,442 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.262e+02 3.633e+02 4.202e+02 7.763e+02, threshold=7.267e+02, percent-clipped=2.0
2024-06-20 02:51:57,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115853.83333333333, ans=0.1
2024-06-20 02:51:58,922 INFO [train.py:1028] (1/2) Epoch 7, batch 2500, loss[loss=0.2704, simple_loss=0.3016, pruned_loss=0.1196, over 13199.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3191, pruned_loss=0.1334, over 2587768.68 frames. ], batch size: 83, lr: 8.53e-03, grad_scale: 16.0
2024-06-20 02:51:59,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=115872.16666666667, ans=0.125
2024-06-20 02:52:07,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=115890.5, ans=0.125
2024-06-20 02:52:22,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=115927.16666666667, ans=0.04949747468305833
2024-06-20 02:52:26,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=115927.16666666667, ans=0.125
2024-06-20 02:52:34,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=115945.5, ans=0.0
2024-06-20 02:52:35,714 INFO [train.py:1028] (1/2) Epoch 7, batch 2550, loss[loss=0.3073, simple_loss=0.3428, pruned_loss=0.1359, over 12452.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3183, pruned_loss=0.1333, over 2586446.13 frames. ], batch size: 22, lr: 8.53e-03, grad_scale: 16.0
2024-06-20 02:52:37,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=115963.83333333333, ans=0.0
2024-06-20 02:52:47,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=116000.5, ans=0.125
2024-06-20 02:53:02,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.05 vs. limit=15.0
2024-06-20 02:53:05,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=116037.16666666667, ans=0.125
2024-06-20 02:53:06,278 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.060e+02 3.371e+02 3.874e+02 6.411e+02, threshold=6.742e+02, percent-clipped=0.0
2024-06-20 02:53:11,225 INFO [train.py:1028] (1/2) Epoch 7, batch 2600, loss[loss=0.269, simple_loss=0.3038, pruned_loss=0.1171, over 13290.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3168, pruned_loss=0.133, over 2587288.75 frames. ], batch size: 52, lr: 8.52e-03, grad_scale: 16.0
2024-06-20 02:53:14,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=116055.5, ans=0.125
2024-06-20 02:53:22,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=116073.83333333333, ans=0.0
2024-06-20 02:53:32,136 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=4.617e+01
2024-06-20 02:53:37,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=116128.83333333333, ans=0.2
2024-06-20 02:53:38,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=116128.83333333333, ans=0.125
2024-06-20 02:53:43,818 INFO [train.py:1028] (1/2) Epoch 7, batch 2650, loss[loss=0.2757, simple_loss=0.2943, pruned_loss=0.1286, over 13000.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3155, pruned_loss=0.1326, over 2587869.79 frames. ], batch size: 144, lr: 8.52e-03, grad_scale: 16.0
2024-06-20 02:53:47,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116147.16666666667, ans=0.1
2024-06-20 02:53:55,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=116165.5, ans=0.125
2024-06-20 02:54:12,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=116220.5, ans=0.1
2024-06-20 02:54:13,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.752e+02 2.964e+02 3.316e+02 4.432e+02, threshold=5.928e+02, percent-clipped=0.0
2024-06-20 02:54:14,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=116220.5, ans=0.125
2024-06-20 02:54:18,719 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:54:19,134 INFO [train.py:1028] (1/2) Epoch 7, batch 2700, loss[loss=0.2702, simple_loss=0.2895, pruned_loss=0.1255, over 13280.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3141, pruned_loss=0.1322, over 2584963.63 frames. ], batch size: 89, lr: 8.52e-03, grad_scale: 16.0
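Note: "batch size" in these records is the number of cuts per batch, and it swings widely (from 22 up to 303 in this section) while the learning rate barely moves. That is the expected behaviour of the duration-budgeted sampler named in the startup log (DynamicBucketingSampler with max_duration: 550 and num_buckets: 30): batches are packed to a roughly constant total duration, so batches of short utterances contain many more cuts. A hedged usage sketch follows; the manifest path is hypothetical.

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

# Batches are packed to a duration budget rather than a fixed cut count,
# which is why the "batch size" in the records above fluctuates.
cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # hypothetical path
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=550.0,  # seconds of audio per batch, from the config
    num_buckets=30,      # from the config
    shuffle=True,
    drop_last=True,
)
for batch_cuts in sampler:
    ...  # variable number of cuts, roughly constant total duration
```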
2024-06-20 02:54:32,943 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.90 vs. limit=15.0
2024-06-20 02:54:36,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=116275.5, ans=0.125
2024-06-20 02:54:37,111 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.20 vs. limit=15.0
2024-06-20 02:54:48,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=116312.16666666667, ans=0.07
2024-06-20 02:54:54,708 INFO [train.py:1028] (1/2) Epoch 7, batch 2750, loss[loss=0.2743, simple_loss=0.3132, pruned_loss=0.1176, over 13251.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3124, pruned_loss=0.1308, over 2580854.17 frames. ], batch size: 43, lr: 8.51e-03, grad_scale: 16.0
2024-06-20 02:55:04,331 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:55:14,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=116367.16666666667, ans=0.2
2024-06-20 02:55:15,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=116385.5, ans=0.0
2024-06-20 02:55:23,072 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.663e+02 3.101e+02 3.520e+02 6.481e+02, threshold=6.202e+02, percent-clipped=1.0
2024-06-20 02:55:25,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=116403.83333333333, ans=0.025
2024-06-20 02:55:26,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=116403.83333333333, ans=0.125
2024-06-20 02:55:28,088 INFO [train.py:1028] (1/2) Epoch 7, batch 2800, loss[loss=0.3085, simple_loss=0.3135, pruned_loss=0.1518, over 11039.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3114, pruned_loss=0.1303, over 2578672.04 frames. ], batch size: 303, lr: 8.51e-03, grad_scale: 32.0
2024-06-20 02:55:40,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.65 vs. limit=15.0
2024-06-20 02:55:59,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=22.5
2024-06-20 02:56:03,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=116513.83333333333, ans=0.0
2024-06-20 02:56:03,535 INFO [train.py:1028] (1/2) Epoch 7, batch 2850, loss[loss=0.2668, simple_loss=0.2964, pruned_loss=0.1186, over 13259.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3104, pruned_loss=0.1303, over 2577084.10 frames. ], batch size: 49, lr: 8.51e-03, grad_scale: 32.0
2024-06-20 02:56:11,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=116532.16666666667, ans=0.125
2024-06-20 02:56:15,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=116532.16666666667, ans=0.125
2024-06-20 02:56:19,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=116550.5, ans=0.0
2024-06-20 02:56:22,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=116568.83333333333, ans=0.125
2024-06-20 02:56:27,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=116568.83333333333, ans=0.125
2024-06-20 02:56:29,825 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=12.0
2024-06-20 02:56:31,180 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.737e+02 3.057e+02 3.325e+02 5.006e+02, threshold=6.114e+02, percent-clipped=0.0
2024-06-20 02:56:36,487 INFO [train.py:1028] (1/2) Epoch 7, batch 2900, loss[loss=0.2544, simple_loss=0.2889, pruned_loss=0.11, over 13138.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3079, pruned_loss=0.1289, over 2584873.65 frames. ], batch size: 55, lr: 8.50e-03, grad_scale: 32.0
2024-06-20 02:56:39,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=116605.5, ans=0.125
2024-06-20 02:56:40,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=116605.5, ans=0.125
2024-06-20 02:56:45,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116623.83333333333, ans=0.1
2024-06-20 02:56:52,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=116642.16666666667, ans=0.0
2024-06-20 02:56:56,413 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.81 vs. limit=10.0
2024-06-20 02:56:56,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=116642.16666666667, ans=0.0
2024-06-20 02:56:56,899 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 02:57:03,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=116660.5, ans=0.125
2024-06-20 02:57:08,830 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=6.935e+00
2024-06-20 02:57:12,767 INFO [train.py:1028] (1/2) Epoch 7, batch 2950, loss[loss=0.2958, simple_loss=0.3186, pruned_loss=0.1365, over 13303.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3073, pruned_loss=0.1286, over 2579187.47 frames. ], batch size: 43, lr: 8.50e-03, grad_scale: 32.0
2024-06-20 02:57:25,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=116715.5, ans=0.2
2024-06-20 02:57:30,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=116733.83333333333, ans=0.125
2024-06-20 02:57:33,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=116752.16666666667, ans=0.0
2024-06-20 02:57:40,668 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.547e+02 2.899e+02 3.253e+02 4.782e+02, threshold=5.798e+02, percent-clipped=0.0
2024-06-20 02:57:41,197 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.35 vs. limit=15.0
2024-06-20 02:57:42,995 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0
2024-06-20 02:57:45,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116788.83333333333, ans=0.1
2024-06-20 02:57:45,936 INFO [train.py:1028] (1/2) Epoch 7, batch 3000, loss[loss=0.2535, simple_loss=0.2862, pruned_loss=0.1104, over 13222.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3048, pruned_loss=0.1268, over 2577635.95 frames. ], batch size: 59, lr: 8.50e-03, grad_scale: 32.0
2024-06-20 02:57:45,937 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 02:57:53,713 INFO [train.py:1060] (1/2) Epoch 7, validation: loss=0.2166, simple_loss=0.2775, pruned_loss=0.07786, over 351949.00 frames.
2024-06-20 02:57:53,714 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB
2024-06-20 02:57:57,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=116788.83333333333, ans=0.0
2024-06-20 02:58:03,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=12.0
2024-06-20 02:58:05,892 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.23 vs. limit=15.0
2024-06-20 02:58:12,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=116825.5, ans=0.0
2024-06-20 02:58:28,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=116862.16666666667, ans=0.125
2024-06-20 02:58:29,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=116880.5, ans=0.0
2024-06-20 02:58:30,072 INFO [train.py:1028] (1/2) Epoch 7, batch 3050, loss[loss=0.2752, simple_loss=0.3106, pruned_loss=0.1199, over 13359.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3049, pruned_loss=0.1272, over 2577538.70 frames. ], batch size: 46, lr: 8.49e-03, grad_scale: 32.0
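The validation pass above fires at a round batch index (batch 3000, which suggests a 3000-batch interval) and is followed by a report of peak GPU memory. A rough sketch of that loop; model, valid_loader, and compute_loss (returning a float per batch) are placeholders, and the frame-weighted averaging is simplified:

```python
import torch

# Hedged sketch of the periodic validation implied by the entries above.
def maybe_validate(model, valid_loader, compute_loss,
                   batch_idx: int, valid_interval: int = 3000):
    if batch_idx == 0 or batch_idx % valid_interval != 0:
        return None
    model.eval()
    with torch.no_grad():
        losses = [compute_loss(model, batch) for batch in valid_loader]
    model.train()
    # torch.cuda.max_memory_allocated() is the real API behind the
    # "Maximum memory allocated so far" lines.
    mem_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
    print(f"validation: loss={sum(losses) / len(losses):.4f}; "
          f"Maximum memory allocated so far is {mem_mb}MB")
    return losses
```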
2024-06-20 02:58:51,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=116935.5, ans=0.5
2024-06-20 02:58:52,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=116935.5, ans=0.125
2024-06-20 02:58:59,369 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.485e+02 2.826e+02 3.204e+02 5.343e+02, threshold=5.653e+02, percent-clipped=0.0
2024-06-20 02:59:04,874 INFO [train.py:1028] (1/2) Epoch 7, batch 3100, loss[loss=0.2771, simple_loss=0.2948, pruned_loss=0.1297, over 12989.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.304, pruned_loss=0.1263, over 2577905.47 frames. ], batch size: 144, lr: 8.49e-03, grad_scale: 32.0
2024-06-20 02:59:05,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=116972.16666666667, ans=0.025
2024-06-20 02:59:13,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=116990.5, ans=0.0
2024-06-20 02:59:15,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=116990.5, ans=0.04949747468305833
2024-06-20 02:59:22,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=117008.83333333333, ans=0.0
2024-06-20 02:59:36,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.58 vs. limit=6.0
2024-06-20 02:59:38,160 INFO [train.py:1028] (1/2) Epoch 7, batch 3150, loss[loss=0.2757, simple_loss=0.2992, pruned_loss=0.1261, over 12877.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3024, pruned_loss=0.1252, over 2580181.65 frames. ], batch size: 158, lr: 8.49e-03, grad_scale: 32.0
2024-06-20 02:59:46,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.03 vs. limit=15.0
2024-06-20 02:59:51,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=117082.16666666667, ans=0.0
2024-06-20 02:59:55,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=117100.5, ans=0.5
2024-06-20 02:59:59,242 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=7.067e+02
2024-06-20 03:00:08,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.386e+02 2.544e+02 2.932e+02 4.184e+02, threshold=5.088e+02, percent-clipped=0.0
2024-06-20 03:00:09,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=117137.16666666667, ans=0.09899494936611666
2024-06-20 03:00:11,453 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.97 vs. limit=15.0
2024-06-20 03:00:13,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=117155.5, ans=0.125
2024-06-20 03:00:14,129 INFO [train.py:1028] (1/2) Epoch 7, batch 3200, loss[loss=0.2502, simple_loss=0.2844, pruned_loss=0.108, over 13129.00 frames.
], tot_loss[loss=0.2754, simple_loss=0.3016, pruned_loss=0.1246, over 2580894.15 frames. ], batch size: 55, lr: 8.48e-03, grad_scale: 32.0 2024-06-20 03:00:14,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=117155.5, ans=0.025 2024-06-20 03:00:15,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117155.5, ans=0.1 2024-06-20 03:00:28,575 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.04 vs. limit=15.0 2024-06-20 03:00:36,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=117210.5, ans=0.025 2024-06-20 03:00:37,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=117210.5, ans=0.1 2024-06-20 03:00:50,331 INFO [train.py:1028] (1/2) Epoch 7, batch 3250, loss[loss=0.2607, simple_loss=0.2947, pruned_loss=0.1134, over 13261.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3013, pruned_loss=0.1246, over 2585268.03 frames. ], batch size: 72, lr: 8.48e-03, grad_scale: 32.0 2024-06-20 03:01:03,898 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.17 vs. limit=10.0 2024-06-20 03:01:05,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.04 vs. limit=15.0 2024-06-20 03:01:19,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.44 vs. limit=15.0 2024-06-20 03:01:19,526 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.393e+02 2.596e+02 2.860e+02 4.224e+02, threshold=5.192e+02, percent-clipped=0.0 2024-06-20 03:01:28,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=117320.5, ans=0.2 2024-06-20 03:01:30,175 INFO [train.py:1028] (1/2) Epoch 7, batch 3300, loss[loss=0.2954, simple_loss=0.3153, pruned_loss=0.1377, over 12709.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3004, pruned_loss=0.1241, over 2582649.91 frames. ], batch size: 176, lr: 8.48e-03, grad_scale: 32.0 2024-06-20 03:01:33,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=117338.83333333333, ans=0.0 2024-06-20 03:01:41,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=117357.16666666667, ans=0.0 2024-06-20 03:01:50,344 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.44 vs. limit=22.5 2024-06-20 03:02:01,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=117412.16666666667, ans=0.0 2024-06-20 03:02:04,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.78 vs. limit=15.0 2024-06-20 03:02:07,603 INFO [train.py:1028] (1/2) Epoch 7, batch 3350, loss[loss=0.2787, simple_loss=0.2917, pruned_loss=0.1329, over 12976.00 frames. 
], tot_loss[loss=0.2752, simple_loss=0.3005, pruned_loss=0.125, over 2576918.66 frames. ], batch size: 158, lr: 8.47e-03, grad_scale: 32.0 2024-06-20 03:02:11,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.69 vs. limit=22.5 2024-06-20 03:02:29,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=117485.5, ans=0.0 2024-06-20 03:02:34,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=117503.83333333333, ans=0.0 2024-06-20 03:02:35,269 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.736e+02 3.017e+02 3.321e+02 4.935e+02, threshold=6.034e+02, percent-clipped=0.0 2024-06-20 03:02:44,475 INFO [train.py:1028] (1/2) Epoch 7, batch 3400, loss[loss=0.3083, simple_loss=0.3224, pruned_loss=0.1472, over 12496.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3003, pruned_loss=0.1251, over 2575099.05 frames. ], batch size: 22, lr: 8.47e-03, grad_scale: 32.0 2024-06-20 03:02:45,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=117522.16666666667, ans=0.0 2024-06-20 03:02:55,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=117540.5, ans=0.025 2024-06-20 03:03:02,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=117558.83333333333, ans=0.125 2024-06-20 03:03:05,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=117577.16666666667, ans=0.125 2024-06-20 03:03:09,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=117577.16666666667, ans=0.015 2024-06-20 03:03:16,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=117595.5, ans=0.025 2024-06-20 03:03:16,249 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:03:17,838 INFO [train.py:1028] (1/2) Epoch 7, batch 3450, loss[loss=0.2937, simple_loss=0.3116, pruned_loss=0.1378, over 12719.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3, pruned_loss=0.1247, over 2576741.40 frames. ], batch size: 176, lr: 8.47e-03, grad_scale: 32.0 2024-06-20 03:03:23,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117613.83333333333, ans=0.1 2024-06-20 03:03:45,039 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=12.0 2024-06-20 03:03:45,290 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.567e+02 2.836e+02 3.247e+02 4.313e+02, threshold=5.672e+02, percent-clipped=0.0 2024-06-20 03:03:45,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.09 vs. 
limit=15.0 2024-06-20 03:03:50,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117705.5, ans=0.1 2024-06-20 03:03:50,964 INFO [train.py:1028] (1/2) Epoch 7, batch 3500, loss[loss=0.2443, simple_loss=0.2833, pruned_loss=0.1027, over 12858.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.2997, pruned_loss=0.1244, over 2576127.77 frames. ], batch size: 33, lr: 8.46e-03, grad_scale: 32.0 2024-06-20 03:03:54,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=117705.5, ans=0.0 2024-06-20 03:03:56,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=117705.5, ans=0.09899494936611666 2024-06-20 03:03:58,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=117705.5, ans=0.0 2024-06-20 03:03:59,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=117705.5, ans=0.125 2024-06-20 03:04:01,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117723.83333333333, ans=0.1 2024-06-20 03:04:03,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=117723.83333333333, ans=0.0 2024-06-20 03:04:09,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=117742.16666666667, ans=0.2 2024-06-20 03:04:19,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=117760.5, ans=0.125 2024-06-20 03:04:27,507 INFO [train.py:1028] (1/2) Epoch 7, batch 3550, loss[loss=0.2524, simple_loss=0.2799, pruned_loss=0.1125, over 13159.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.2987, pruned_loss=0.1239, over 2577996.87 frames. ], batch size: 95, lr: 8.46e-03, grad_scale: 32.0 2024-06-20 03:04:29,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=117797.16666666667, ans=0.125 2024-06-20 03:04:39,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=117815.5, ans=0.0 2024-06-20 03:04:58,447 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.622e+02 2.898e+02 3.303e+02 5.007e+02, threshold=5.795e+02, percent-clipped=0.0 2024-06-20 03:05:02,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=117870.5, ans=0.125 2024-06-20 03:05:03,933 INFO [train.py:1028] (1/2) Epoch 7, batch 3600, loss[loss=0.2613, simple_loss=0.2924, pruned_loss=0.115, over 13349.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.2978, pruned_loss=0.1235, over 2582435.96 frames. 
], batch size: 49, lr: 8.46e-03, grad_scale: 32.0 2024-06-20 03:05:13,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=117907.16666666667, ans=0.0 2024-06-20 03:05:24,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=117943.83333333333, ans=0.125 2024-06-20 03:05:31,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=117962.16666666667, ans=0.2 2024-06-20 03:05:34,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.45 vs. limit=22.5 2024-06-20 03:05:36,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=117962.16666666667, ans=0.125 2024-06-20 03:05:37,512 INFO [train.py:1028] (1/2) Epoch 7, batch 3650, loss[loss=0.2833, simple_loss=0.3034, pruned_loss=0.1316, over 13043.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.2975, pruned_loss=0.1231, over 2581028.78 frames. ], batch size: 102, lr: 8.45e-03, grad_scale: 32.0 2024-06-20 03:06:08,300 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.24 vs. limit=15.0 2024-06-20 03:06:08,425 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.591e+02 2.748e+02 3.061e+02 4.305e+02, threshold=5.497e+02, percent-clipped=0.0 2024-06-20 03:06:08,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118053.83333333333, ans=0.1 2024-06-20 03:06:14,109 INFO [train.py:1028] (1/2) Epoch 7, batch 3700, loss[loss=0.2669, simple_loss=0.2948, pruned_loss=0.1194, over 13215.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.2959, pruned_loss=0.1222, over 2585332.13 frames. ], batch size: 72, lr: 8.45e-03, grad_scale: 32.0 2024-06-20 03:06:19,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=118072.16666666667, ans=0.0 2024-06-20 03:06:44,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=118145.5, ans=0.1 2024-06-20 03:06:46,497 INFO [train.py:1028] (1/2) Epoch 7, batch 3750, loss[loss=0.2949, simple_loss=0.3223, pruned_loss=0.1337, over 12398.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.2951, pruned_loss=0.1217, over 2587672.83 frames. ], batch size: 22, lr: 8.45e-03, grad_scale: 32.0 2024-06-20 03:06:53,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.86 vs. 
limit=22.5
2024-06-20 03:07:08,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=118218.83333333333, ans=0.125
2024-06-20 03:07:17,504 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.521e+02 2.707e+02 3.048e+02 4.386e+02, threshold=5.414e+02, percent-clipped=0.0
2024-06-20 03:07:20,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=118237.16666666667, ans=0.2
2024-06-20 03:07:21,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=118237.16666666667, ans=0.125
2024-06-20 03:07:21,959 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.83 vs. limit=22.5
2024-06-20 03:07:22,778 INFO [train.py:1028] (1/2) Epoch 7, batch 3800, loss[loss=0.2877, simple_loss=0.3024, pruned_loss=0.1365, over 13222.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.295, pruned_loss=0.1212, over 2585179.59 frames. ], batch size: 83, lr: 8.44e-03, grad_scale: 32.0
2024-06-20 03:07:31,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=118273.83333333333, ans=0.0
2024-06-20 03:07:31,778 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.87 vs. limit=15.0
2024-06-20 03:07:43,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=118310.5, ans=0.0
2024-06-20 03:07:47,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=118310.5, ans=0.0
2024-06-20 03:07:47,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=118310.5, ans=0.125
2024-06-20 03:07:54,327 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.10 vs. limit=10.0
2024-06-20 03:07:55,863 INFO [train.py:1028] (1/2) Epoch 7, batch 3850, loss[loss=0.2474, simple_loss=0.2708, pruned_loss=0.112, over 13066.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.2946, pruned_loss=0.1205, over 2585056.68 frames. ], batch size: 144, lr: 8.44e-03, grad_scale: 32.0
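The scaling.py Whitening entries above fire when a module's output covariance drifts too far from white. A metric of this family is 1.0 when the per-group covariance is a multiple of the identity and grows toward the group size as the covariance collapses toward rank one, which matches the ranges logged here (e.g. metric=14.87 against 192 channels). A plausible reconstruction; icefall's actual implementation may differ in grouping and numerical details:

```python
import torch

# Hedged sketch of a whitening metric consistent with the log:
# trace(C @ C) * D / trace(C)**2 per channel group, averaged over groups.
# Equals 1.0 for C proportional to the identity, D for rank-one C.
def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    n, c = x.shape                     # (num_frames, num_channels)
    d = c // num_groups
    x = x.reshape(n, num_groups, d).transpose(0, 1)   # (groups, n, d)
    cov = x.transpose(1, 2) @ x / n                   # per-group covariance
    num = (cov ** 2).sum(dim=(1, 2))                  # trace(C @ C)
    den = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) ** 2  # trace(C)^2
    return (num / den * d).mean()
```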
2024-06-20 03:07:56,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118347.16666666667, ans=0.1
2024-06-20 03:07:59,209 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.101e+02
2024-06-20 03:07:59,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=118347.16666666667, ans=0.125
2024-06-20 03:08:09,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=118365.5, ans=0.2
2024-06-20 03:08:11,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=118365.5, ans=0.1
2024-06-20 03:08:19,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=118402.16666666667, ans=0.2
2024-06-20 03:08:26,158 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.286e+02 2.504e+02 2.695e+02 3.658e+02, threshold=5.007e+02, percent-clipped=0.0
2024-06-20 03:08:31,447 INFO [train.py:1028] (1/2) Epoch 7, batch 3900, loss[loss=0.2641, simple_loss=0.2909, pruned_loss=0.1186, over 13217.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.2946, pruned_loss=0.1207, over 2587808.90 frames. ], batch size: 83, lr: 8.44e-03, grad_scale: 32.0
2024-06-20 03:08:38,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=118457.16666666667, ans=0.2
2024-06-20 03:08:39,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.49 vs. limit=6.0
2024-06-20 03:08:47,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=118475.5, ans=0.125
2024-06-20 03:08:48,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.71 vs. limit=15.0
2024-06-20 03:08:49,231 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.17 vs. limit=15.0
2024-06-20 03:09:08,695 INFO [train.py:1028] (1/2) Epoch 7, batch 3950, loss[loss=0.2825, simple_loss=0.2974, pruned_loss=0.1338, over 13119.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.2934, pruned_loss=0.1197, over 2587958.51 frames. ], batch size: 132, lr: 8.43e-03, grad_scale: 32.0
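The scaling.py WithLoss entries track an auxiliary penalty accumulated on attention weights; most layers report loss-sum=0.000e+00, while a few (1.101e+02 just above) are being actively pushed back toward a well-behaved range. The exact penalty is not visible in the log; the wrapper below is a speculative illustration of the pattern, not icefall's implementation:

```python
import torch

# Speculative sketch: pass attention weights through unchanged while
# accumulating a penalty for values outside [0, 1]; the trainer would
# add the stashed sums (logged as "loss-sum") to the main loss.
class WithLoss(torch.nn.Module):
    def __init__(self, name: str):
        super().__init__()
        self.name = name
        self.loss_sum = torch.zeros(())

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        penalty = (attn_weights.clamp(max=0).square().sum()
                   + (attn_weights - 1).clamp(min=0).square().sum())
        self.loss_sum = self.loss_sum + penalty
        return attn_weights
```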
2024-06-20 03:09:10,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=118530.5, ans=0.0
2024-06-20 03:09:10,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=118530.5, ans=0.0
2024-06-20 03:09:19,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=118548.83333333333, ans=0.125
2024-06-20 03:09:23,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=118567.16666666667, ans=0.125
2024-06-20 03:09:26,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118567.16666666667, ans=0.1
2024-06-20 03:09:28,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=12.0
2024-06-20 03:09:30,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=118585.5, ans=0.0
2024-06-20 03:09:35,100 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=15.07 vs. limit=15.0
2024-06-20 03:09:36,476 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.205e+02 2.442e+02 2.563e+02 3.872e+02, threshold=4.884e+02, percent-clipped=0.0
2024-06-20 03:09:41,880 INFO [train.py:1028] (1/2) Epoch 7, batch 4000, loss[loss=0.2623, simple_loss=0.2996, pruned_loss=0.1125, over 13043.00 frames. ], tot_loss[loss=0.266, simple_loss=0.293, pruned_loss=0.1195, over 2583912.93 frames. ], batch size: 39, lr: 8.43e-03, grad_scale: 32.0
2024-06-20 03:09:42,956 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.30 vs. limit=15.0
2024-06-20 03:09:43,684 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.84 vs. limit=22.5
2024-06-20 03:09:50,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.87 vs. limit=15.0
2024-06-20 03:09:51,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=118640.5, ans=0.0
2024-06-20 03:09:53,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=118640.5, ans=0.125
2024-06-20 03:10:01,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=118677.16666666667, ans=0.025
2024-06-20 03:10:18,807 INFO [train.py:1028] (1/2) Epoch 7, batch 4050, loss[loss=0.261, simple_loss=0.2772, pruned_loss=0.1224, over 11013.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.2924, pruned_loss=0.1195, over 2582399.70 frames. ], batch size: 304, lr: 8.43e-03, grad_scale: 32.0
2024-06-20 03:10:22,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs.
limit=15.0 2024-06-20 03:10:29,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0 2024-06-20 03:10:31,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=118750.5, ans=0.04949747468305833 2024-06-20 03:10:35,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.75 vs. limit=15.0 2024-06-20 03:10:42,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118768.83333333333, ans=0.1 2024-06-20 03:10:43,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=118768.83333333333, ans=0.02 2024-06-20 03:10:44,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118787.16666666667, ans=0.1 2024-06-20 03:10:46,528 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.320e+02 2.579e+02 2.998e+02 4.314e+02, threshold=5.157e+02, percent-clipped=0.0 2024-06-20 03:10:51,737 INFO [train.py:1028] (1/2) Epoch 7, batch 4100, loss[loss=0.2796, simple_loss=0.2979, pruned_loss=0.1307, over 13077.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.2931, pruned_loss=0.12, over 2577809.57 frames. ], batch size: 102, lr: 8.42e-03, grad_scale: 32.0 2024-06-20 03:10:52,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118805.5, ans=0.1 2024-06-20 03:11:03,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=118823.83333333333, ans=0.125 2024-06-20 03:11:03,315 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.29 vs. limit=22.5 2024-06-20 03:11:04,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=118823.83333333333, ans=0.125 2024-06-20 03:11:04,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=118823.83333333333, ans=0.5 2024-06-20 03:11:04,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=118823.83333333333, ans=0.125 2024-06-20 03:11:07,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=118842.16666666667, ans=0.125 2024-06-20 03:11:08,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=118842.16666666667, ans=0.0 2024-06-20 03:11:08,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.36 vs. 
limit=12.0 2024-06-20 03:11:10,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=118842.16666666667, ans=0.0 2024-06-20 03:11:12,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=118842.16666666667, ans=0.125 2024-06-20 03:11:12,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=118842.16666666667, ans=0.0 2024-06-20 03:11:15,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=118860.5, ans=0.0 2024-06-20 03:11:22,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.70 vs. limit=15.0 2024-06-20 03:11:24,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=118878.83333333333, ans=0.125 2024-06-20 03:11:27,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.99 vs. limit=15.0 2024-06-20 03:11:27,857 INFO [train.py:1028] (1/2) Epoch 7, batch 4150, loss[loss=0.2287, simple_loss=0.261, pruned_loss=0.09818, over 13113.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.2922, pruned_loss=0.1193, over 2576870.93 frames. ], batch size: 55, lr: 8.42e-03, grad_scale: 32.0 2024-06-20 03:11:32,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=118897.16666666667, ans=0.125 2024-06-20 03:11:35,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=118915.5, ans=0.1 2024-06-20 03:11:36,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=118915.5, ans=0.0 2024-06-20 03:11:36,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118915.5, ans=0.1 2024-06-20 03:11:40,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=118915.5, ans=0.125 2024-06-20 03:11:48,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=118952.16666666667, ans=0.1 2024-06-20 03:11:49,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=118952.16666666667, ans=0.2 2024-06-20 03:11:56,050 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.525e+02 2.820e+02 3.215e+02 4.715e+02, threshold=5.639e+02, percent-clipped=0.0 2024-06-20 03:12:00,518 INFO [train.py:1028] (1/2) Epoch 7, batch 4200, loss[loss=0.2485, simple_loss=0.2736, pruned_loss=0.1117, over 13040.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.292, pruned_loss=0.1194, over 2579816.28 frames. ], batch size: 102, lr: 8.42e-03, grad_scale: 32.0 2024-06-20 03:12:03,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=118988.83333333333, ans=0.125 2024-06-20 03:12:04,219 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.73 vs. 
limit=12.0 2024-06-20 03:12:12,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=119007.16666666667, ans=0.0 2024-06-20 03:12:12,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119007.16666666667, ans=0.1 2024-06-20 03:12:14,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.85 vs. limit=10.0 2024-06-20 03:12:25,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=119043.83333333333, ans=0.125 2024-06-20 03:12:26,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=119043.83333333333, ans=0.07 2024-06-20 03:12:35,919 INFO [train.py:1028] (1/2) Epoch 7, batch 4250, loss[loss=0.2495, simple_loss=0.2846, pruned_loss=0.1072, over 13234.00 frames. ], tot_loss[loss=0.266, simple_loss=0.2926, pruned_loss=0.1197, over 2582842.05 frames. ], batch size: 46, lr: 8.41e-03, grad_scale: 32.0 2024-06-20 03:12:36,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=119080.5, ans=0.0 2024-06-20 03:12:41,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2024-06-20 03:12:49,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.88 vs. limit=12.0 2024-06-20 03:12:57,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=119135.5, ans=0.2 2024-06-20 03:13:04,277 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.470e+02 2.746e+02 3.092e+02 5.160e+02, threshold=5.491e+02, percent-clipped=0.0 2024-06-20 03:13:12,470 INFO [train.py:1028] (1/2) Epoch 7, batch 4300, loss[loss=0.253, simple_loss=0.2816, pruned_loss=0.1121, over 13227.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.2921, pruned_loss=0.1195, over 2581237.39 frames. ], batch size: 59, lr: 8.41e-03, grad_scale: 32.0 2024-06-20 03:13:13,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=119172.16666666667, ans=0.125 2024-06-20 03:13:15,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.70 vs. limit=15.0 2024-06-20 03:13:15,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=119172.16666666667, ans=0.125 2024-06-20 03:13:19,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=119190.5, ans=0.125 2024-06-20 03:13:22,764 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.01 vs. 
limit=15.0 2024-06-20 03:13:23,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=119190.5, ans=0.125 2024-06-20 03:13:37,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=119245.5, ans=0.07 2024-06-20 03:13:38,179 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.07 vs. limit=22.5 2024-06-20 03:13:45,043 INFO [train.py:1028] (1/2) Epoch 7, batch 4350, loss[loss=0.2792, simple_loss=0.3025, pruned_loss=0.1279, over 13205.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.2912, pruned_loss=0.1191, over 2585812.28 frames. ], batch size: 59, lr: 8.41e-03, grad_scale: 16.0 2024-06-20 03:13:45,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0 2024-06-20 03:13:46,121 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.62 vs. limit=15.0 2024-06-20 03:13:48,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=119263.83333333333, ans=0.025 2024-06-20 03:13:49,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.78 vs. limit=22.5 2024-06-20 03:13:50,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=119263.83333333333, ans=0.0 2024-06-20 03:13:51,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=119282.16666666667, ans=0.125 2024-06-20 03:13:53,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=119282.16666666667, ans=0.125 2024-06-20 03:13:57,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=119282.16666666667, ans=0.0 2024-06-20 03:14:00,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=119300.5, ans=0.125 2024-06-20 03:14:01,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=119300.5, ans=0.0 2024-06-20 03:14:15,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=119318.83333333333, ans=0.125 2024-06-20 03:14:15,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=119337.16666666667, ans=0.125 2024-06-20 03:14:16,206 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.82 vs. limit=15.0 2024-06-20 03:14:18,955 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.488e+02 2.711e+02 3.113e+02 4.480e+02, threshold=5.422e+02, percent-clipped=0.0 2024-06-20 03:14:21,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.69 vs. 
limit=22.5
2024-06-20 03:14:23,197 INFO [train.py:1028] (1/2) Epoch 7, batch 4400, loss[loss=0.2474, simple_loss=0.2779, pruned_loss=0.1084, over 13230.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.2907, pruned_loss=0.1188, over 2586442.04 frames. ], batch size: 83, lr: 8.40e-03, grad_scale: 32.0
2024-06-20 03:14:25,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=119355.5, ans=0.2
2024-06-20 03:14:31,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=119373.83333333333, ans=0.125
2024-06-20 03:14:46,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=119410.5, ans=0.1
2024-06-20 03:14:56,106 INFO [train.py:1028] (1/2) Epoch 7, batch 4450, loss[loss=0.2571, simple_loss=0.2927, pruned_loss=0.1108, over 12862.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.2916, pruned_loss=0.1198, over 2581714.04 frames. ], batch size: 33, lr: 8.40e-03, grad_scale: 16.0
2024-06-20 03:14:58,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=119447.16666666667, ans=0.2
2024-06-20 03:15:00,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119447.16666666667, ans=0.1
2024-06-20 03:15:01,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=119447.16666666667, ans=0.125
2024-06-20 03:15:10,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=119465.5, ans=0.2
2024-06-20 03:15:14,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0
2024-06-20 03:15:24,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=119520.5, ans=0.0
2024-06-20 03:15:28,420 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.570e+02 2.847e+02 3.179e+02 7.208e+02, threshold=5.693e+02, percent-clipped=1.0
2024-06-20 03:15:31,763 INFO [train.py:1028] (1/2) Epoch 7, batch 4500, loss[loss=0.2445, simple_loss=0.2672, pruned_loss=0.1109, over 13237.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.2897, pruned_loss=0.1186, over 2586017.05 frames. ], batch size: 89, lr: 8.40e-03, grad_scale: 16.0
2024-06-20 03:15:36,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.91 vs. limit=15.0
2024-06-20 03:15:43,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119557.16666666667, ans=0.1
2024-06-20 03:15:46,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=119575.5, ans=0.125
2024-06-20 03:15:46,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.54 vs. limit=22.5
2024-06-20 03:15:47,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=15.0
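The grad_scale field is the fp16 loss-scaling factor, and it halves when a scaled step overflows: 32.0 at batch 4400 above becomes 16.0 by batch 4450. PyTorch's stock GradScaler exhibits exactly these dynamics; the hyper-parameters below are library defaults shown for illustration, not icefall's settings:

```python
import torch

# Hedged sketch of the loss-scaling behaviour behind "grad_scale".
scaler = torch.cuda.amp.GradScaler(init_scale=16.0,
                                   backoff_factor=0.5,   # halve on overflow
                                   growth_factor=2.0,    # double after clean runs
                                   growth_interval=2000)

def train_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)     # skipped internally if grads overflowed
    scaler.update()            # adjusts the scale reported in the log
    return loss.detach(), scaler.get_scale()
```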
2024-06-20 03:15:52,937 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.07 vs. limit=6.0
2024-06-20 03:16:00,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=119612.16666666667, ans=0.0
2024-06-20 03:16:04,290 INFO [train.py:1028] (1/2) Epoch 7, batch 4550, loss[loss=0.2475, simple_loss=0.2864, pruned_loss=0.1043, over 13222.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.2892, pruned_loss=0.1181, over 2590253.66 frames. ], batch size: 52, lr: 8.40e-03, grad_scale: 16.0
2024-06-20 03:16:11,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.00 vs. limit=10.0
2024-06-20 03:16:12,273 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.07 vs. limit=15.0
2024-06-20 03:16:17,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=119648.83333333333, ans=0.125
2024-06-20 03:16:18,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0
2024-06-20 03:16:26,299 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.37 vs. limit=22.5
2024-06-20 03:16:31,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=119685.5, ans=0.0
2024-06-20 03:16:31,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=119685.5, ans=0.125
2024-06-20 03:16:33,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=119685.5, ans=0.0
2024-06-20 03:16:35,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=119703.83333333333, ans=0.2
2024-06-20 03:16:39,071 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.347e+02 2.548e+02 2.845e+02 6.369e+02, threshold=5.097e+02, percent-clipped=1.0
2024-06-20 03:16:41,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=119703.83333333333, ans=0.125
2024-06-20 03:16:42,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=119722.16666666667, ans=0.0
2024-06-20 03:16:42,478 INFO [train.py:1028] (1/2) Epoch 7, batch 4600, loss[loss=0.2925, simple_loss=0.3078, pruned_loss=0.1386, over 12564.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.2903, pruned_loss=0.1188, over 2584847.85 frames. ], batch size: 202, lr: 8.39e-03, grad_scale: 16.0
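The logged batch size swings by an order of magnitude in this stretch (from 22 up to 304; 202 in the entry above) because batches are assembled up to a total-duration budget rather than a fixed utterance count: many short cuts fit in one batch, few long ones do. A sketch using lhotse's DynamicBucketingSampler; the path and argument values are illustrative:

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

# Hedged sketch of duration-based batching. The manifest path and the
# numeric arguments are assumptions, not taken from this run's config.
cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=550.0,   # seconds of audio per mini-batch
    num_buckets=30,       # group cuts of similar length together
    shuffle=True,
    drop_last=True,
)
for batch_cuts in sampler:
    ...  # short cuts -> many per batch; long cuts -> few per batch
```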
2024-06-20 03:16:48,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=119740.5, ans=0.125
2024-06-20 03:16:49,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=119740.5, ans=0.2
2024-06-20 03:16:49,783 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.06 vs. limit=10.0
2024-06-20 03:16:50,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=119740.5, ans=0.0
2024-06-20 03:16:50,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=119740.5, ans=0.125
2024-06-20 03:17:09,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=119777.16666666667, ans=0.0
2024-06-20 03:17:18,242 INFO [train.py:1028] (1/2) Epoch 7, batch 4650, loss[loss=0.2605, simple_loss=0.2833, pruned_loss=0.1188, over 13091.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.2888, pruned_loss=0.1177, over 2587965.83 frames. ], batch size: 132, lr: 8.39e-03, grad_scale: 16.0
2024-06-20 03:17:30,860 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.41 vs. limit=15.0
2024-06-20 03:17:33,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.96 vs. limit=15.0
2024-06-20 03:17:40,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=119868.83333333333, ans=0.025
2024-06-20 03:17:47,546 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.379e+02 2.554e+02 2.861e+02 4.607e+02, threshold=5.109e+02, percent-clipped=0.0
2024-06-20 03:17:47,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=119887.16666666667, ans=0.0
2024-06-20 03:17:51,290 INFO [train.py:1028] (1/2) Epoch 7, batch 4700, loss[loss=0.2671, simple_loss=0.2936, pruned_loss=0.1203, over 12458.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.2892, pruned_loss=0.1182, over 2583095.86 frames. ], batch size: 25, lr: 8.39e-03, grad_scale: 16.0
2024-06-20 03:17:56,343 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.16 vs. limit=22.5
2024-06-20 03:17:57,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=119923.83333333333, ans=0.0
2024-06-20 03:18:05,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=119942.16666666667, ans=0.2
2024-06-20 03:18:08,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=119942.16666666667, ans=0.125
2024-06-20 03:18:21,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.15 vs.
limit=15.0 2024-06-20 03:18:26,786 INFO [train.py:1028] (1/2) Epoch 7, batch 4750, loss[loss=0.3034, simple_loss=0.3161, pruned_loss=0.1454, over 12559.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.2892, pruned_loss=0.1183, over 2580149.56 frames. ], batch size: 202, lr: 8.38e-03, grad_scale: 16.0 2024-06-20 03:18:33,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=120015.5, ans=0.07 2024-06-20 03:18:34,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120015.5, ans=0.1 2024-06-20 03:18:41,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.83 vs. limit=15.0 2024-06-20 03:18:42,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=120033.83333333333, ans=0.2 2024-06-20 03:18:43,641 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.325e+01 2024-06-20 03:18:47,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=120052.16666666667, ans=0.125 2024-06-20 03:18:51,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=120052.16666666667, ans=0.125 2024-06-20 03:18:56,362 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.311e+02 2.626e+02 3.007e+02 5.190e+02, threshold=5.252e+02, percent-clipped=1.0 2024-06-20 03:19:02,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.82 vs. limit=15.0 2024-06-20 03:19:03,747 INFO [train.py:1028] (1/2) Epoch 7, batch 4800, loss[loss=0.2644, simple_loss=0.2932, pruned_loss=0.1178, over 13293.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.2892, pruned_loss=0.118, over 2577310.30 frames. ], batch size: 63, lr: 8.38e-03, grad_scale: 32.0 2024-06-20 03:19:18,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=120125.5, ans=0.125 2024-06-20 03:19:21,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=120125.5, ans=0.1 2024-06-20 03:19:30,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.11 vs. limit=15.0 2024-06-20 03:19:36,025 INFO [train.py:1028] (1/2) Epoch 7, batch 4850, loss[loss=0.2548, simple_loss=0.2815, pruned_loss=0.114, over 13241.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.2888, pruned_loss=0.1177, over 2575810.43 frames. ], batch size: 89, lr: 8.38e-03, grad_scale: 32.0 2024-06-20 03:19:50,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.35 vs. 
limit=22.5
2024-06-20 03:20:03,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=120235.5, ans=0.125
2024-06-20 03:20:09,541 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.372e+02 2.668e+02 2.933e+02 4.071e+02, threshold=5.336e+02, percent-clipped=0.0
2024-06-20 03:20:12,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=120272.16666666667, ans=0.125
2024-06-20 03:20:13,007 INFO [train.py:1028] (1/2) Epoch 7, batch 4900, loss[loss=0.2429, simple_loss=0.2796, pruned_loss=0.1031, over 13196.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.2882, pruned_loss=0.1172, over 2576689.62 frames. ], batch size: 59, lr: 8.37e-03, grad_scale: 32.0
2024-06-20 03:20:16,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.20 vs. limit=15.0
2024-06-20 03:20:16,456 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.01 vs. limit=22.5
2024-06-20 03:20:18,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=120272.16666666667, ans=0.025
2024-06-20 03:20:19,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=120290.5, ans=0.2
2024-06-20 03:20:21,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=120290.5, ans=0.0
2024-06-20 03:20:25,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=120308.83333333333, ans=0.2
2024-06-20 03:20:26,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120308.83333333333, ans=0.1
2024-06-20 03:20:32,757 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 03:20:35,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=120327.16666666667, ans=0.0
2024-06-20 03:20:38,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=120327.16666666667, ans=0.125
2024-06-20 03:20:39,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=120345.5, ans=0.025
2024-06-20 03:20:45,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.02 vs. limit=10.0
2024-06-20 03:20:45,982 INFO [train.py:1028] (1/2) Epoch 7, batch 4950, loss[loss=0.2734, simple_loss=0.2845, pruned_loss=0.1311, over 11001.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.2886, pruned_loss=0.118, over 2570687.42 frames. ], batch size: 304, lr: 8.37e-03, grad_scale: 32.0
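The lr field decays smoothly across this stretch (8.52e-03 at its start down to 8.37e-03 here), consistent with an Eden-style schedule that discounts the base rate by both batch and epoch counts. The formula below is a sketch from memory of that schedule family; its constants are illustrative and have not been fitted to the values in this log:

```python
# Hedged sketch of an Eden-style learning-rate schedule. base_lr,
# lr_batches, and lr_epochs are assumed example values.
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```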
], batch size: 304, lr: 8.37e-03, grad_scale: 32.0 2024-06-20 03:20:49,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=120363.83333333333, ans=0.1 2024-06-20 03:21:01,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=120400.5, ans=0.2 2024-06-20 03:21:10,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=120418.83333333333, ans=0.2 2024-06-20 03:21:17,981 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.277e+02 2.597e+02 2.962e+02 4.569e+02, threshold=5.195e+02, percent-clipped=0.0 2024-06-20 03:21:21,324 INFO [train.py:1028] (1/2) Epoch 7, batch 5000, loss[loss=0.2747, simple_loss=0.2943, pruned_loss=0.1275, over 13140.00 frames. ], tot_loss[loss=0.261, simple_loss=0.2878, pruned_loss=0.1171, over 2574733.27 frames. ], batch size: 95, lr: 8.37e-03, grad_scale: 32.0 2024-06-20 03:21:24,575 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:21:31,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=120473.83333333333, ans=0.125 2024-06-20 03:21:34,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=120492.16666666667, ans=0.0 2024-06-20 03:21:34,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.49 vs. limit=22.5 2024-06-20 03:21:35,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=120492.16666666667, ans=0.0 2024-06-20 03:21:51,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.82 vs. limit=15.0 2024-06-20 03:21:54,754 INFO [train.py:1028] (1/2) Epoch 7, batch 5050, loss[loss=0.2458, simple_loss=0.2708, pruned_loss=0.1104, over 13031.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.2876, pruned_loss=0.1166, over 2573451.55 frames. ], batch size: 36, lr: 8.36e-03, grad_scale: 32.0 2024-06-20 03:21:57,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=120547.16666666667, ans=0.0 2024-06-20 03:22:02,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.63 vs. limit=10.0 2024-06-20 03:22:27,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=120565.5, ans=0.125 2024-06-20 03:22:33,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=120583.83333333333, ans=0.125 2024-06-20 03:22:36,419 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=8.141e-03 2024-06-20 03:22:46,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.20 vs. 
limit=12.0 2024-06-20 03:22:47,527 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.294e+02 2.491e+02 2.842e+02 3.979e+02, threshold=4.982e+02, percent-clipped=0.0 2024-06-20 03:22:52,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=120638.83333333333, ans=0.0 2024-06-20 03:22:52,753 INFO [train.py:1028] (1/2) Epoch 7, batch 5100, loss[loss=0.2626, simple_loss=0.2946, pruned_loss=0.1153, over 12909.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.2882, pruned_loss=0.1177, over 2570909.28 frames. ], batch size: 39, lr: 8.36e-03, grad_scale: 32.0 2024-06-20 03:22:57,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=120638.83333333333, ans=0.09899494936611666 2024-06-20 03:23:04,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=120657.16666666667, ans=0.0 2024-06-20 03:23:08,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=120675.5, ans=0.0 2024-06-20 03:23:18,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=120712.16666666667, ans=0.1 2024-06-20 03:23:18,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=120712.16666666667, ans=0.125 2024-06-20 03:23:28,119 INFO [train.py:1028] (1/2) Epoch 7, batch 5150, loss[loss=0.2474, simple_loss=0.2726, pruned_loss=0.1111, over 13171.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.2882, pruned_loss=0.1177, over 2572061.68 frames. ], batch size: 132, lr: 8.36e-03, grad_scale: 32.0 2024-06-20 03:23:41,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.52 vs. limit=15.0 2024-06-20 03:23:55,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=120803.83333333333, ans=0.2 2024-06-20 03:23:57,552 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.303e+02 2.458e+02 2.816e+02 3.894e+02, threshold=4.916e+02, percent-clipped=0.0 2024-06-20 03:23:57,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=120803.83333333333, ans=0.125 2024-06-20 03:23:58,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=120803.83333333333, ans=0.0 2024-06-20 03:23:59,980 INFO [train.py:1028] (1/2) Epoch 7, batch 5200, loss[loss=0.2531, simple_loss=0.2787, pruned_loss=0.1137, over 13092.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.2877, pruned_loss=0.1173, over 2575590.18 frames. 
], batch size: 95, lr: 8.35e-03, grad_scale: 32.0 2024-06-20 03:24:14,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=120858.83333333333, ans=0.025 2024-06-20 03:24:15,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=120858.83333333333, ans=0.5 2024-06-20 03:24:25,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=120877.16666666667, ans=0.0 2024-06-20 03:24:35,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120913.83333333333, ans=0.1 2024-06-20 03:24:35,726 INFO [train.py:1028] (1/2) Epoch 7, batch 5250, loss[loss=0.2718, simple_loss=0.3022, pruned_loss=0.1208, over 13279.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.2879, pruned_loss=0.1174, over 2572378.08 frames. ], batch size: 52, lr: 8.35e-03, grad_scale: 16.0 2024-06-20 03:24:43,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=120932.16666666667, ans=0.125 2024-06-20 03:24:44,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.31 vs. limit=15.0 2024-06-20 03:24:49,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=120950.5, ans=0.1 2024-06-20 03:24:50,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=120950.5, ans=0.125 2024-06-20 03:24:52,437 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=12.0 2024-06-20 03:24:53,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=120950.5, ans=0.0 2024-06-20 03:25:06,984 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.261e+02 2.482e+02 2.850e+02 4.798e+02, threshold=4.964e+02, percent-clipped=0.0 2024-06-20 03:25:08,919 INFO [train.py:1028] (1/2) Epoch 7, batch 5300, loss[loss=0.2525, simple_loss=0.2771, pruned_loss=0.1139, over 13042.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.2874, pruned_loss=0.1172, over 2568806.41 frames. ], batch size: 144, lr: 8.35e-03, grad_scale: 16.0 2024-06-20 03:25:10,446 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:25:13,909 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.28 vs. limit=15.0 2024-06-20 03:25:25,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=121042.16666666667, ans=0.0 2024-06-20 03:25:25,231 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.08 vs. 
limit=22.5 2024-06-20 03:25:25,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=121042.16666666667, ans=0.1 2024-06-20 03:25:33,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=121060.5, ans=0.125 2024-06-20 03:25:34,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=121060.5, ans=0.0 2024-06-20 03:25:40,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=121078.83333333333, ans=0.5 2024-06-20 03:25:41,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121078.83333333333, ans=0.1 2024-06-20 03:25:42,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121078.83333333333, ans=0.1 2024-06-20 03:25:45,448 INFO [train.py:1028] (1/2) Epoch 7, batch 5350, loss[loss=0.2441, simple_loss=0.283, pruned_loss=0.1026, over 12168.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.2868, pruned_loss=0.1168, over 2576813.77 frames. ], batch size: 17, lr: 8.34e-03, grad_scale: 16.0 2024-06-20 03:25:48,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=121097.16666666667, ans=0.0 2024-06-20 03:25:54,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=121115.5, ans=0.0 2024-06-20 03:25:59,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=121133.83333333333, ans=0.125 2024-06-20 03:26:05,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=12.0 2024-06-20 03:26:06,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=121152.16666666667, ans=0.125 2024-06-20 03:26:18,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=121170.5, ans=0.125 2024-06-20 03:26:19,528 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.219e+02 2.469e+02 2.731e+02 4.565e+02, threshold=4.939e+02, percent-clipped=0.0 2024-06-20 03:26:20,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=121188.83333333333, ans=0.125 2024-06-20 03:26:21,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.66 vs. limit=22.5 2024-06-20 03:26:21,517 INFO [train.py:1028] (1/2) Epoch 7, batch 5400, loss[loss=0.3124, simple_loss=0.3129, pruned_loss=0.1559, over 12272.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.2879, pruned_loss=0.1179, over 2568381.86 frames. 
], batch size: 241, lr: 8.34e-03, grad_scale: 16.0 2024-06-20 03:26:21,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=121188.83333333333, ans=0.125 2024-06-20 03:26:22,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121188.83333333333, ans=0.1 2024-06-20 03:26:23,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=121188.83333333333, ans=0.025 2024-06-20 03:26:26,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=121188.83333333333, ans=0.125 2024-06-20 03:26:26,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2024-06-20 03:26:30,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=121207.16666666667, ans=0.0 2024-06-20 03:26:36,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=121225.5, ans=0.125 2024-06-20 03:26:42,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=121243.83333333333, ans=0.09899494936611666 2024-06-20 03:26:54,993 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=8.0 2024-06-20 03:26:55,080 INFO [train.py:1028] (1/2) Epoch 7, batch 5450, loss[loss=0.2695, simple_loss=0.2908, pruned_loss=0.1241, over 12201.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.287, pruned_loss=0.1171, over 2571260.18 frames. ], batch size: 25, lr: 8.34e-03, grad_scale: 16.0 2024-06-20 03:26:55,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=121280.5, ans=0.025 2024-06-20 03:27:17,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.91 vs. limit=15.0 2024-06-20 03:27:26,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=121353.83333333333, ans=0.125 2024-06-20 03:27:29,211 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.228e+02 2.507e+02 2.768e+02 3.960e+02, threshold=5.014e+02, percent-clipped=0.0 2024-06-20 03:27:31,316 INFO [train.py:1028] (1/2) Epoch 7, batch 5500, loss[loss=0.2989, simple_loss=0.3091, pruned_loss=0.1444, over 12265.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.2875, pruned_loss=0.1173, over 2563707.40 frames. 
], batch size: 241, lr: 8.34e-03, grad_scale: 16.0 2024-06-20 03:27:32,100 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:27:36,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=121372.16666666667, ans=0.125 2024-06-20 03:27:51,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=121427.16666666667, ans=0.0 2024-06-20 03:27:54,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=121427.16666666667, ans=0.125 2024-06-20 03:27:57,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=121445.5, ans=0.125 2024-06-20 03:27:57,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=121445.5, ans=0.025 2024-06-20 03:28:04,117 INFO [train.py:1028] (1/2) Epoch 7, batch 5550, loss[loss=0.2646, simple_loss=0.2911, pruned_loss=0.1191, over 13255.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.2864, pruned_loss=0.1162, over 2568469.01 frames. ], batch size: 43, lr: 8.33e-03, grad_scale: 16.0 2024-06-20 03:28:12,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=121463.83333333333, ans=0.125 2024-06-20 03:28:28,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=121518.83333333333, ans=0.0 2024-06-20 03:28:34,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=121537.16666666667, ans=0.2 2024-06-20 03:28:38,029 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.185e+02 2.488e+02 2.819e+02 3.782e+02, threshold=4.975e+02, percent-clipped=0.0 2024-06-20 03:28:39,932 INFO [train.py:1028] (1/2) Epoch 7, batch 5600, loss[loss=0.2414, simple_loss=0.2661, pruned_loss=0.1083, over 13203.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.2858, pruned_loss=0.1159, over 2570552.80 frames. ], batch size: 89, lr: 8.33e-03, grad_scale: 32.0 2024-06-20 03:28:40,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=121555.5, ans=0.125 2024-06-20 03:29:00,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=121610.5, ans=0.1 2024-06-20 03:29:07,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=121628.83333333333, ans=0.125 2024-06-20 03:29:15,912 INFO [train.py:1028] (1/2) Epoch 7, batch 5650, loss[loss=0.2886, simple_loss=0.3036, pruned_loss=0.1368, over 12589.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.2858, pruned_loss=0.1155, over 2576086.61 frames. ], batch size: 202, lr: 8.33e-03, grad_scale: 16.0 2024-06-20 03:29:33,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=121683.83333333333, ans=0.95 2024-06-20 03:29:33,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.93 vs. 
limit=15.0 2024-06-20 03:29:37,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.43 vs. limit=22.5 2024-06-20 03:29:41,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2024-06-20 03:29:42,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=121720.5, ans=0.125 2024-06-20 03:29:45,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=121720.5, ans=0.125 2024-06-20 03:29:47,414 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.213e+02 2.396e+02 2.699e+02 4.366e+02, threshold=4.793e+02, percent-clipped=0.0 2024-06-20 03:29:48,806 INFO [train.py:1028] (1/2) Epoch 7, batch 5700, loss[loss=0.2594, simple_loss=0.2919, pruned_loss=0.1134, over 13291.00 frames. ], tot_loss[loss=0.259, simple_loss=0.2862, pruned_loss=0.1159, over 2580870.52 frames. ], batch size: 63, lr: 8.32e-03, grad_scale: 16.0 2024-06-20 03:29:50,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=121738.83333333333, ans=0.2 2024-06-20 03:29:56,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.99 vs. limit=15.0 2024-06-20 03:30:06,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2024-06-20 03:30:16,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=121793.83333333333, ans=0.125 2024-06-20 03:30:25,719 INFO [train.py:1028] (1/2) Epoch 7, batch 5750, loss[loss=0.2657, simple_loss=0.2852, pruned_loss=0.1231, over 12746.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.2877, pruned_loss=0.1168, over 2581545.02 frames. ], batch size: 176, lr: 8.32e-03, grad_scale: 16.0 2024-06-20 03:30:27,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=121830.5, ans=0.125 2024-06-20 03:30:27,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=121830.5, ans=0.125 2024-06-20 03:30:30,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=121830.5, ans=0.125 2024-06-20 03:30:33,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.12 vs. limit=15.0 2024-06-20 03:30:35,413 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=17.08 vs. limit=15.0 2024-06-20 03:30:35,993 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.31 vs. 
limit=22.5 2024-06-20 03:30:44,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=121867.16666666667, ans=0.125 2024-06-20 03:30:44,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=121867.16666666667, ans=0.07 2024-06-20 03:30:45,308 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.31 vs. limit=15.0 2024-06-20 03:30:58,606 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.499e+02 2.844e+02 3.088e+02 5.345e+02, threshold=5.687e+02, percent-clipped=1.0 2024-06-20 03:30:58,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=121903.83333333333, ans=0.125 2024-06-20 03:30:59,866 INFO [train.py:1028] (1/2) Epoch 7, batch 5800, loss[loss=0.3049, simple_loss=0.3165, pruned_loss=0.1466, over 12791.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.2891, pruned_loss=0.1178, over 2579653.12 frames. ], batch size: 176, lr: 8.32e-03, grad_scale: 16.0 2024-06-20 03:31:06,093 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.58 vs. limit=22.5 2024-06-20 03:31:16,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.76 vs. limit=10.0 2024-06-20 03:31:23,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121958.83333333333, ans=0.1 2024-06-20 03:31:25,141 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2024-06-20 03:31:36,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=121995.5, ans=0.125 2024-06-20 03:31:37,679 INFO [train.py:1028] (1/2) Epoch 7, batch 5850, loss[loss=0.2984, simple_loss=0.3111, pruned_loss=0.1428, over 12633.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.2912, pruned_loss=0.119, over 2576552.12 frames. ], batch size: 202, lr: 8.31e-03, grad_scale: 16.0 2024-06-20 03:31:39,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=122013.83333333333, ans=0.2 2024-06-20 03:32:12,588 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.369e+02 2.652e+02 3.014e+02 4.273e+02, threshold=5.303e+02, percent-clipped=0.0 2024-06-20 03:32:13,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.41 vs. limit=22.5 2024-06-20 03:32:14,085 INFO [train.py:1028] (1/2) Epoch 7, batch 5900, loss[loss=0.2556, simple_loss=0.2854, pruned_loss=0.1129, over 13068.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.2929, pruned_loss=0.1196, over 2576603.24 frames. 
], batch size: 121, lr: 8.31e-03, grad_scale: 16.0 2024-06-20 03:32:14,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=122105.5, ans=0.125 2024-06-20 03:32:16,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=122105.5, ans=0.025 2024-06-20 03:32:20,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=122123.83333333333, ans=0.2 2024-06-20 03:32:22,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122123.83333333333, ans=0.1 2024-06-20 03:32:30,885 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.54 vs. limit=15.0 2024-06-20 03:32:31,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122142.16666666667, ans=0.1 2024-06-20 03:32:33,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=122160.5, ans=10.0 2024-06-20 03:32:37,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=122160.5, ans=0.0 2024-06-20 03:32:38,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-20 03:32:39,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=122160.5, ans=0.1 2024-06-20 03:32:41,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=122178.83333333333, ans=0.125 2024-06-20 03:32:44,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=122178.83333333333, ans=0.125 2024-06-20 03:32:46,470 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=15.0 2024-06-20 03:32:48,091 INFO [train.py:1028] (1/2) Epoch 7, batch 5950, loss[loss=0.2548, simple_loss=0.2852, pruned_loss=0.1123, over 13122.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.2945, pruned_loss=0.1203, over 2580401.75 frames. ], batch size: 121, lr: 8.31e-03, grad_scale: 16.0 2024-06-20 03:32:51,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=122197.16666666667, ans=0.125 2024-06-20 03:32:54,197 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.50 vs. limit=15.0 2024-06-20 03:32:54,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122215.5, ans=0.125 2024-06-20 03:33:04,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.46 vs. 
limit=5.0 2024-06-20 03:33:06,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=122233.83333333333, ans=0.2 2024-06-20 03:33:09,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=122252.16666666667, ans=0.125 2024-06-20 03:33:10,997 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0 2024-06-20 03:33:24,058 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.377e+02 2.580e+02 3.005e+02 4.739e+02, threshold=5.159e+02, percent-clipped=0.0 2024-06-20 03:33:25,323 INFO [train.py:1028] (1/2) Epoch 7, batch 6000, loss[loss=0.3621, simple_loss=0.3575, pruned_loss=0.1834, over 12119.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.296, pruned_loss=0.1212, over 2574670.24 frames. ], batch size: 241, lr: 8.30e-03, grad_scale: 32.0 2024-06-20 03:33:25,324 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 03:33:33,298 INFO [train.py:1060] (1/2) Epoch 7, validation: loss=0.2156, simple_loss=0.277, pruned_loss=0.07715, over 351949.00 frames. 2024-06-20 03:33:33,298 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 03:33:39,663 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:33:40,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=122307.16666666667, ans=0.2 2024-06-20 03:33:45,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122307.16666666667, ans=0.1 2024-06-20 03:33:46,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.48 vs. limit=10.0 2024-06-20 03:33:47,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=122325.5, ans=0.125 2024-06-20 03:33:55,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122343.83333333333, ans=0.1 2024-06-20 03:33:59,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=122362.16666666667, ans=0.125 2024-06-20 03:34:03,896 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.36 vs. limit=15.0 2024-06-20 03:34:07,652 INFO [train.py:1028] (1/2) Epoch 7, batch 6050, loss[loss=0.2726, simple_loss=0.2978, pruned_loss=0.1237, over 12975.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.297, pruned_loss=0.1216, over 2577706.57 frames. ], batch size: 39, lr: 8.30e-03, grad_scale: 32.0 2024-06-20 03:34:13,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.81 vs. 
limit=10.0 2024-06-20 03:34:20,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=122398.83333333333, ans=0.125 2024-06-20 03:34:21,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=12.0 2024-06-20 03:34:24,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=122417.16666666667, ans=0.125 2024-06-20 03:34:34,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122435.5, ans=0.1 2024-06-20 03:34:36,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=122435.5, ans=0.125 2024-06-20 03:34:42,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.62 vs. limit=12.0 2024-06-20 03:34:44,462 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.234e+02 2.408e+02 2.721e+02 3.814e+02, threshold=4.817e+02, percent-clipped=0.0 2024-06-20 03:34:45,874 INFO [train.py:1028] (1/2) Epoch 7, batch 6100, loss[loss=0.2801, simple_loss=0.3042, pruned_loss=0.128, over 13146.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.2982, pruned_loss=0.122, over 2579695.31 frames. ], batch size: 121, lr: 8.30e-03, grad_scale: 32.0 2024-06-20 03:34:48,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=122472.16666666667, ans=0.125 2024-06-20 03:35:02,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=122508.83333333333, ans=0.125 2024-06-20 03:35:09,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=122527.16666666667, ans=0.0 2024-06-20 03:35:12,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122545.5, ans=0.125 2024-06-20 03:35:19,523 INFO [train.py:1028] (1/2) Epoch 7, batch 6150, loss[loss=0.2861, simple_loss=0.298, pruned_loss=0.1371, over 10938.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3006, pruned_loss=0.1232, over 2578377.99 frames. ], batch size: 304, lr: 8.30e-03, grad_scale: 32.0 2024-06-20 03:35:20,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=122563.83333333333, ans=0.07 2024-06-20 03:35:35,508 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=18.46 vs. limit=15.0 2024-06-20 03:35:41,396 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.94 vs. limit=15.0 2024-06-20 03:35:53,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.23 vs. limit=15.0 2024-06-20 03:35:57,343 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.462e+02 2.680e+02 2.939e+02 3.804e+02, threshold=5.360e+02, percent-clipped=0.0 2024-06-20 03:35:58,682 INFO [train.py:1028] (1/2) Epoch 7, batch 6200, loss[loss=0.3098, simple_loss=0.3418, pruned_loss=0.1389, over 13231.00 frames. 
], tot_loss[loss=0.276, simple_loss=0.3032, pruned_loss=0.1244, over 2575298.64 frames. ], batch size: 89, lr: 8.29e-03, grad_scale: 32.0 2024-06-20 03:36:19,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122692.16666666667, ans=0.1 2024-06-20 03:36:22,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=122710.5, ans=0.0 2024-06-20 03:36:27,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=122710.5, ans=0.125 2024-06-20 03:36:29,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.40 vs. limit=15.0 2024-06-20 03:36:32,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=122728.83333333333, ans=0.025 2024-06-20 03:36:35,393 INFO [train.py:1028] (1/2) Epoch 7, batch 6250, loss[loss=0.2749, simple_loss=0.3065, pruned_loss=0.1217, over 13173.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3045, pruned_loss=0.1252, over 2568153.76 frames. ], batch size: 83, lr: 8.29e-03, grad_scale: 32.0 2024-06-20 03:36:35,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=122747.16666666667, ans=0.125 2024-06-20 03:36:47,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.07 vs. limit=15.0 2024-06-20 03:36:48,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=122783.83333333333, ans=0.2 2024-06-20 03:36:48,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=122783.83333333333, ans=0.125 2024-06-20 03:36:52,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=122783.83333333333, ans=0.125 2024-06-20 03:37:00,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=122802.16666666667, ans=0.0 2024-06-20 03:37:01,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122820.5, ans=0.1 2024-06-20 03:37:01,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.77 vs. limit=15.0 2024-06-20 03:37:04,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=122820.5, ans=0.07 2024-06-20 03:37:06,622 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.294e+02 2.622e+02 2.981e+02 5.644e+02, threshold=5.244e+02, percent-clipped=1.0 2024-06-20 03:37:07,945 INFO [train.py:1028] (1/2) Epoch 7, batch 6300, loss[loss=0.2953, simple_loss=0.3181, pruned_loss=0.1363, over 11053.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3064, pruned_loss=0.1263, over 2562536.00 frames. 
], batch size: 16, lr: 8.29e-03, grad_scale: 32.0 2024-06-20 03:37:10,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122838.83333333333, ans=0.1 2024-06-20 03:37:15,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=122857.16666666667, ans=0.2 2024-06-20 03:37:18,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122857.16666666667, ans=0.1 2024-06-20 03:37:19,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=122857.16666666667, ans=0.025 2024-06-20 03:37:19,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=122857.16666666667, ans=0.2 2024-06-20 03:37:28,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=122893.83333333333, ans=0.0 2024-06-20 03:37:31,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=122893.83333333333, ans=0.0 2024-06-20 03:37:33,254 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.370e+01 2024-06-20 03:37:44,003 INFO [train.py:1028] (1/2) Epoch 7, batch 6350, loss[loss=0.3361, simple_loss=0.3447, pruned_loss=0.1638, over 12522.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.308, pruned_loss=0.1266, over 2571789.26 frames. ], batch size: 202, lr: 8.28e-03, grad_scale: 32.0 2024-06-20 03:38:00,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122967.16666666667, ans=0.1 2024-06-20 03:38:15,041 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.62 vs. limit=15.0 2024-06-20 03:38:15,798 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.78 vs. limit=22.5 2024-06-20 03:38:15,891 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.235e+02 2.504e+02 2.773e+02 3.984e+02, threshold=5.008e+02, percent-clipped=0.0 2024-06-20 03:38:17,165 INFO [train.py:1028] (1/2) Epoch 7, batch 6400, loss[loss=0.2642, simple_loss=0.3032, pruned_loss=0.1126, over 13236.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.311, pruned_loss=0.1281, over 2573943.68 frames. ], batch size: 67, lr: 8.28e-03, grad_scale: 32.0 2024-06-20 03:38:22,007 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.549e-01 2024-06-20 03:38:23,572 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.86 vs. 
limit=22.5 2024-06-20 03:38:29,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=123040.5, ans=0.125 2024-06-20 03:38:30,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=123040.5, ans=0.125 2024-06-20 03:38:30,654 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:38:37,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=123058.83333333333, ans=0.0 2024-06-20 03:38:39,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123077.16666666667, ans=0.1 2024-06-20 03:38:45,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=123077.16666666667, ans=0.125 2024-06-20 03:38:46,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=123095.5, ans=0.125 2024-06-20 03:38:47,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=123095.5, ans=0.125 2024-06-20 03:38:47,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=123095.5, ans=0.025 2024-06-20 03:38:52,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123113.83333333333, ans=0.1 2024-06-20 03:38:53,339 INFO [train.py:1028] (1/2) Epoch 7, batch 6450, loss[loss=0.3237, simple_loss=0.3364, pruned_loss=0.1555, over 12576.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3136, pruned_loss=0.1295, over 2579514.74 frames. ], batch size: 202, lr: 8.28e-03, grad_scale: 32.0 2024-06-20 03:39:00,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=123132.16666666667, ans=0.09899494936611666 2024-06-20 03:39:14,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=123168.83333333333, ans=0.0 2024-06-20 03:39:16,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123168.83333333333, ans=0.1 2024-06-20 03:39:18,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=123168.83333333333, ans=0.0 2024-06-20 03:39:19,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=123187.16666666667, ans=0.125 2024-06-20 03:39:19,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=123187.16666666667, ans=0.0 2024-06-20 03:39:23,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.71 vs. 
limit=10.0 2024-06-20 03:39:24,724 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 2.304e+02 2.497e+02 2.882e+02 5.271e+02, threshold=4.994e+02, percent-clipped=1.0 2024-06-20 03:39:25,973 INFO [train.py:1028] (1/2) Epoch 7, batch 6500, loss[loss=0.322, simple_loss=0.3346, pruned_loss=0.1548, over 10850.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3147, pruned_loss=0.1297, over 2582627.22 frames. ], batch size: 304, lr: 8.27e-03, grad_scale: 32.0 2024-06-20 03:39:44,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123242.16666666667, ans=0.1 2024-06-20 03:39:53,616 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.09 vs. limit=6.0 2024-06-20 03:40:00,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.95 vs. limit=12.0 2024-06-20 03:40:01,352 INFO [train.py:1028] (1/2) Epoch 7, batch 6550, loss[loss=0.2748, simple_loss=0.3074, pruned_loss=0.1211, over 12512.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3157, pruned_loss=0.1293, over 2587657.02 frames. ], batch size: 22, lr: 8.27e-03, grad_scale: 32.0 2024-06-20 03:40:17,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=123333.83333333333, ans=0.0 2024-06-20 03:40:30,957 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.53 vs. limit=22.5 2024-06-20 03:40:31,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=123370.5, ans=0.0 2024-06-20 03:40:35,605 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.373e+02 2.639e+02 2.924e+02 4.252e+02, threshold=5.278e+02, percent-clipped=0.0 2024-06-20 03:40:36,913 INFO [train.py:1028] (1/2) Epoch 7, batch 6600, loss[loss=0.2971, simple_loss=0.3333, pruned_loss=0.1305, over 13207.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3158, pruned_loss=0.1295, over 2589755.83 frames. ], batch size: 72, lr: 8.27e-03, grad_scale: 32.0 2024-06-20 03:40:38,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=123388.83333333333, ans=0.1 2024-06-20 03:40:48,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.02 vs. limit=15.0 2024-06-20 03:40:52,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123425.5, ans=0.1 2024-06-20 03:40:54,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=123425.5, ans=15.0 2024-06-20 03:40:59,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=123443.83333333333, ans=10.0 2024-06-20 03:41:00,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.81 vs. 
limit=15.0 2024-06-20 03:41:02,983 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.118e-01 2024-06-20 03:41:06,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123462.16666666667, ans=0.1 2024-06-20 03:41:09,179 INFO [train.py:1028] (1/2) Epoch 7, batch 6650, loss[loss=0.3337, simple_loss=0.3504, pruned_loss=0.1585, over 12958.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3185, pruned_loss=0.1311, over 2584783.44 frames. ], batch size: 158, lr: 8.26e-03, grad_scale: 32.0 2024-06-20 03:41:19,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=123498.83333333333, ans=0.125 2024-06-20 03:41:46,465 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.454e+02 2.649e+02 3.041e+02 4.260e+02, threshold=5.297e+02, percent-clipped=0.0 2024-06-20 03:41:47,922 INFO [train.py:1028] (1/2) Epoch 7, batch 6700, loss[loss=0.3021, simple_loss=0.3227, pruned_loss=0.1407, over 12707.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3191, pruned_loss=0.1314, over 2584714.16 frames. ], batch size: 176, lr: 8.26e-03, grad_scale: 32.0 2024-06-20 03:42:00,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=123608.83333333333, ans=0.125 2024-06-20 03:42:03,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.99 vs. limit=10.0 2024-06-20 03:42:03,179 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.24 vs. limit=22.5 2024-06-20 03:42:12,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=123627.16666666667, ans=0.0 2024-06-20 03:42:17,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=123645.5, ans=0.125 2024-06-20 03:42:18,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=123645.5, ans=0.125 2024-06-20 03:42:21,455 INFO [train.py:1028] (1/2) Epoch 7, batch 6750, loss[loss=0.3917, simple_loss=0.3889, pruned_loss=0.1972, over 12162.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3203, pruned_loss=0.1326, over 2578214.02 frames. ], batch size: 241, lr: 8.26e-03, grad_scale: 32.0 2024-06-20 03:42:28,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=123663.83333333333, ans=0.0 2024-06-20 03:42:34,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=123682.16666666667, ans=0.0 2024-06-20 03:42:49,263 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.13 vs. 
limit=15.0 2024-06-20 03:42:50,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=123737.16666666667, ans=0.0 2024-06-20 03:42:51,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=123737.16666666667, ans=0.0 2024-06-20 03:42:55,537 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.388e+02 2.766e+02 3.010e+02 4.712e+02, threshold=5.532e+02, percent-clipped=0.0 2024-06-20 03:42:55,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=123737.16666666667, ans=0.125 2024-06-20 03:42:56,793 INFO [train.py:1028] (1/2) Epoch 7, batch 6800, loss[loss=0.2696, simple_loss=0.302, pruned_loss=0.1186, over 13278.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3224, pruned_loss=0.1332, over 2580519.86 frames. ], batch size: 67, lr: 8.26e-03, grad_scale: 32.0 2024-06-20 03:43:03,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=123773.83333333333, ans=0.0 2024-06-20 03:43:05,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=123773.83333333333, ans=0.07 2024-06-20 03:43:19,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=123810.5, ans=0.1 2024-06-20 03:43:19,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=123810.5, ans=0.2 2024-06-20 03:43:28,950 INFO [train.py:1028] (1/2) Epoch 7, batch 6850, loss[loss=0.3046, simple_loss=0.3406, pruned_loss=0.1343, over 13272.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3235, pruned_loss=0.1334, over 2585653.48 frames. ], batch size: 63, lr: 8.25e-03, grad_scale: 32.0 2024-06-20 03:43:40,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=123865.5, ans=0.2 2024-06-20 03:43:50,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123883.83333333333, ans=0.1 2024-06-20 03:43:53,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=123902.16666666667, ans=0.125 2024-06-20 03:43:56,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123902.16666666667, ans=0.1 2024-06-20 03:43:58,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=123920.5, ans=0.05 2024-06-20 03:44:00,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.52 vs. limit=15.0 2024-06-20 03:44:04,155 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.286e+02 2.541e+02 2.820e+02 3.805e+02, threshold=5.082e+02, percent-clipped=0.0 2024-06-20 03:44:05,379 INFO [train.py:1028] (1/2) Epoch 7, batch 6900, loss[loss=0.2928, simple_loss=0.3299, pruned_loss=0.1279, over 13313.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3241, pruned_loss=0.1334, over 2587694.50 frames. 
], batch size: 49, lr: 8.25e-03, grad_scale: 32.0 2024-06-20 03:44:05,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=123938.83333333333, ans=0.125 2024-06-20 03:44:07,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123938.83333333333, ans=0.1 2024-06-20 03:44:18,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=123975.5, ans=0.125 2024-06-20 03:44:25,689 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.99 vs. limit=15.0 2024-06-20 03:44:27,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123993.83333333333, ans=0.1 2024-06-20 03:44:29,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2024-06-20 03:44:41,540 INFO [train.py:1028] (1/2) Epoch 7, batch 6950, loss[loss=0.2437, simple_loss=0.2915, pruned_loss=0.09793, over 12245.00 frames. ], tot_loss[loss=0.295, simple_loss=0.324, pruned_loss=0.133, over 2582252.19 frames. ], batch size: 18, lr: 8.25e-03, grad_scale: 16.0 2024-06-20 03:44:48,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.43 vs. limit=15.0 2024-06-20 03:45:07,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=124103.83333333333, ans=0.95 2024-06-20 03:45:09,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.55 vs. limit=22.5 2024-06-20 03:45:14,030 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.246e+02 2.472e+02 2.850e+02 3.703e+02, threshold=4.943e+02, percent-clipped=0.0 2024-06-20 03:45:14,683 INFO [train.py:1028] (1/2) Epoch 7, batch 7000, loss[loss=0.3066, simple_loss=0.3267, pruned_loss=0.1433, over 12923.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3225, pruned_loss=0.132, over 2578735.57 frames. ], batch size: 158, lr: 8.24e-03, grad_scale: 16.0 2024-06-20 03:45:31,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.77 vs. limit=10.0 2024-06-20 03:45:42,208 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.488e-01 2024-06-20 03:45:42,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0 2024-06-20 03:45:48,345 INFO [train.py:1028] (1/2) Epoch 7, batch 7050, loss[loss=0.359, simple_loss=0.374, pruned_loss=0.172, over 12756.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3231, pruned_loss=0.1319, over 2585599.34 frames. 
], batch size: 176, lr: 8.24e-03, grad_scale: 16.0 2024-06-20 03:45:52,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=124213.83333333333, ans=0.05 2024-06-20 03:45:52,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=124213.83333333333, ans=0.0 2024-06-20 03:45:55,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124213.83333333333, ans=0.1 2024-06-20 03:45:55,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.08 vs. limit=12.0 2024-06-20 03:46:00,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.02 vs. limit=15.0 2024-06-20 03:46:01,428 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.05 vs. limit=15.0 2024-06-20 03:46:02,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.65 vs. limit=15.0 2024-06-20 03:46:07,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=124250.5, ans=0.125 2024-06-20 03:46:11,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.27 vs. limit=15.0 2024-06-20 03:46:17,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=124287.16666666667, ans=0.2 2024-06-20 03:46:18,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=124287.16666666667, ans=0.2 2024-06-20 03:46:19,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=124287.16666666667, ans=0.125 2024-06-20 03:46:23,426 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.285e+02 2.499e+02 2.793e+02 4.009e+02, threshold=4.999e+02, percent-clipped=0.0 2024-06-20 03:46:24,056 INFO [train.py:1028] (1/2) Epoch 7, batch 7100, loss[loss=0.3224, simple_loss=0.3474, pruned_loss=0.1487, over 13159.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3246, pruned_loss=0.1331, over 2578076.04 frames. 
], batch size: 112, lr: 8.24e-03, grad_scale: 16.0 2024-06-20 03:46:27,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=124305.5, ans=0.125 2024-06-20 03:46:28,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=124305.5, ans=0.125 2024-06-20 03:46:29,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=124305.5, ans=0.125 2024-06-20 03:46:35,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=124323.83333333333, ans=0.125 2024-06-20 03:46:45,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=124342.16666666667, ans=0.025 2024-06-20 03:46:47,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.72 vs. limit=15.0 2024-06-20 03:46:48,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.54 vs. limit=22.5 2024-06-20 03:46:52,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.11 vs. limit=15.0 2024-06-20 03:46:55,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=124378.83333333333, ans=0.0 2024-06-20 03:47:00,722 INFO [train.py:1028] (1/2) Epoch 7, batch 7150, loss[loss=0.3275, simple_loss=0.3467, pruned_loss=0.1541, over 12582.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3253, pruned_loss=0.1332, over 2575508.80 frames. ], batch size: 202, lr: 8.23e-03, grad_scale: 16.0 2024-06-20 03:47:00,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=124397.16666666667, ans=0.125 2024-06-20 03:47:04,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.07 vs. limit=22.5 2024-06-20 03:47:09,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124415.5, ans=0.1 2024-06-20 03:47:09,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2024-06-20 03:47:18,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=124433.83333333333, ans=0.125 2024-06-20 03:47:25,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124452.16666666667, ans=0.1 2024-06-20 03:47:27,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124470.5, ans=0.1 2024-06-20 03:47:31,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.71 vs. 
limit=10.0 2024-06-20 03:47:32,576 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.252e+02 2.494e+02 2.765e+02 4.689e+02, threshold=4.988e+02, percent-clipped=0.0 2024-06-20 03:47:33,205 INFO [train.py:1028] (1/2) Epoch 7, batch 7200, loss[loss=0.3174, simple_loss=0.3485, pruned_loss=0.1432, over 13179.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.326, pruned_loss=0.1333, over 2580876.67 frames. ], batch size: 112, lr: 8.23e-03, grad_scale: 32.0 2024-06-20 03:47:37,456 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.07 vs. limit=15.0 2024-06-20 03:47:37,832 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:47:41,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.11 vs. limit=15.0 2024-06-20 03:47:46,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=124525.5, ans=0.125 2024-06-20 03:47:47,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0 2024-06-20 03:47:50,490 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.58 vs. limit=15.0 2024-06-20 03:47:53,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.95 vs. limit=15.0 2024-06-20 03:47:54,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=124543.83333333333, ans=0.125 2024-06-20 03:47:54,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=124543.83333333333, ans=0.0 2024-06-20 03:47:57,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.70 vs. limit=10.0 2024-06-20 03:48:06,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=124562.16666666667, ans=0.1 2024-06-20 03:48:09,289 INFO [train.py:1028] (1/2) Epoch 7, batch 7250, loss[loss=0.2537, simple_loss=0.2989, pruned_loss=0.1043, over 12776.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3265, pruned_loss=0.1329, over 2580978.48 frames. ], batch size: 36, lr: 8.23e-03, grad_scale: 32.0 2024-06-20 03:48:12,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=124580.5, ans=0.125 2024-06-20 03:48:13,740 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.34 vs. limit=15.0 2024-06-20 03:48:19,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.33 vs. limit=12.0 2024-06-20 03:48:21,965 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.38 vs. 
limit=22.5 2024-06-20 03:48:25,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=124617.16666666667, ans=0.125 2024-06-20 03:48:29,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=124635.5, ans=0.2 2024-06-20 03:48:29,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=124635.5, ans=0.95 2024-06-20 03:48:33,323 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.34 vs. limit=10.0 2024-06-20 03:48:33,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=124635.5, ans=0.125 2024-06-20 03:48:33,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.53 vs. limit=22.5 2024-06-20 03:48:50,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.82 vs. limit=15.0 2024-06-20 03:48:51,095 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.132e+02 2.298e+02 2.578e+02 3.533e+02, threshold=4.595e+02, percent-clipped=0.0 2024-06-20 03:48:51,742 INFO [train.py:1028] (1/2) Epoch 7, batch 7300, loss[loss=0.2955, simple_loss=0.3224, pruned_loss=0.1343, over 12989.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3275, pruned_loss=0.1333, over 2580914.68 frames. ], batch size: 36, lr: 8.23e-03, grad_scale: 32.0 2024-06-20 03:48:53,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=124672.16666666667, ans=0.125 2024-06-20 03:49:01,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=124690.5, ans=0.125 2024-06-20 03:49:05,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=124708.83333333333, ans=0.025 2024-06-20 03:49:08,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=124708.83333333333, ans=0.125 2024-06-20 03:49:10,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=124708.83333333333, ans=15.0 2024-06-20 03:49:10,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=124727.16666666667, ans=0.025 2024-06-20 03:49:13,185 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.89 vs. limit=22.5 2024-06-20 03:49:15,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.95 vs. limit=15.0 2024-06-20 03:49:24,411 INFO [train.py:1028] (1/2) Epoch 7, batch 7350, loss[loss=0.3103, simple_loss=0.3394, pruned_loss=0.1406, over 13308.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.329, pruned_loss=0.1341, over 2581799.42 frames. 
], batch size: 46, lr: 8.22e-03, grad_scale: 32.0 2024-06-20 03:49:33,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.66 vs. limit=12.0 2024-06-20 03:49:36,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.90 vs. limit=15.0 2024-06-20 03:49:39,527 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.01 vs. limit=15.0 2024-06-20 03:49:42,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=124800.5, ans=0.125 2024-06-20 03:49:51,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=124837.16666666667, ans=10.0 2024-06-20 03:49:56,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=124837.16666666667, ans=0.125 2024-06-20 03:49:57,150 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.291e+02 2.553e+02 2.865e+02 4.209e+02, threshold=5.106e+02, percent-clipped=0.0 2024-06-20 03:49:57,814 INFO [train.py:1028] (1/2) Epoch 7, batch 7400, loss[loss=0.3022, simple_loss=0.3445, pruned_loss=0.1299, over 13256.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3283, pruned_loss=0.1332, over 2586949.95 frames. ], batch size: 63, lr: 8.22e-03, grad_scale: 32.0 2024-06-20 03:50:23,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.33 vs. limit=22.5 2024-06-20 03:50:25,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=124910.5, ans=0.04949747468305833 2024-06-20 03:50:28,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=124928.83333333333, ans=0.2 2024-06-20 03:50:31,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=124928.83333333333, ans=0.125 2024-06-20 03:50:34,802 INFO [train.py:1028] (1/2) Epoch 7, batch 7450, loss[loss=0.2811, simple_loss=0.3184, pruned_loss=0.1219, over 12748.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3286, pruned_loss=0.1328, over 2580287.24 frames. 
], batch size: 29, lr: 8.22e-03, grad_scale: 32.0 2024-06-20 03:50:47,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=124983.83333333333, ans=0.025 2024-06-20 03:50:59,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=125002.16666666667, ans=0.125 2024-06-20 03:51:00,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125002.16666666667, ans=0.125 2024-06-20 03:51:01,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125002.16666666667, ans=0.125 2024-06-20 03:51:01,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125002.16666666667, ans=0.1 2024-06-20 03:51:11,255 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.228e+02 2.483e+02 2.678e+02 4.093e+02, threshold=4.966e+02, percent-clipped=0.0 2024-06-20 03:51:11,888 INFO [train.py:1028] (1/2) Epoch 7, batch 7500, loss[loss=0.3264, simple_loss=0.3327, pruned_loss=0.1601, over 10750.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3298, pruned_loss=0.134, over 2578582.40 frames. ], batch size: 303, lr: 8.21e-03, grad_scale: 32.0 2024-06-20 03:51:12,406 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.14 vs. limit=22.5 2024-06-20 03:51:23,993 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=5.278e-03 2024-06-20 03:51:30,684 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.78 vs. limit=22.5 2024-06-20 03:51:37,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=125093.83333333333, ans=0.125 2024-06-20 03:51:39,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=125112.16666666667, ans=0.125 2024-06-20 03:51:39,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=125112.16666666667, ans=0.0 2024-06-20 03:51:43,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125112.16666666667, ans=0.1 2024-06-20 03:51:45,011 INFO [train.py:1028] (1/2) Epoch 7, batch 7550, loss[loss=0.306, simple_loss=0.3297, pruned_loss=0.1411, over 12950.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3309, pruned_loss=0.1349, over 2577909.33 frames. ], batch size: 158, lr: 8.21e-03, grad_scale: 32.0 2024-06-20 03:51:49,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.83 vs. 
limit=22.5 2024-06-20 03:51:52,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=125148.83333333333, ans=0.05 2024-06-20 03:52:00,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=125167.16666666667, ans=0.5 2024-06-20 03:52:01,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=125167.16666666667, ans=0.125 2024-06-20 03:52:02,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125167.16666666667, ans=0.0 2024-06-20 03:52:03,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=125167.16666666667, ans=0.125 2024-06-20 03:52:14,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=125203.83333333333, ans=0.125 2024-06-20 03:52:20,535 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.339e+02 2.532e+02 2.868e+02 4.356e+02, threshold=5.063e+02, percent-clipped=0.0 2024-06-20 03:52:21,392 INFO [train.py:1028] (1/2) Epoch 7, batch 7600, loss[loss=0.3085, simple_loss=0.3345, pruned_loss=0.1413, over 13216.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.331, pruned_loss=0.135, over 2576882.45 frames. ], batch size: 83, lr: 8.21e-03, grad_scale: 32.0 2024-06-20 03:52:30,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.33 vs. limit=22.5 2024-06-20 03:52:42,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2024-06-20 03:52:57,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=125295.5, ans=0.125 2024-06-20 03:52:58,350 INFO [train.py:1028] (1/2) Epoch 7, batch 7650, loss[loss=0.2778, simple_loss=0.3191, pruned_loss=0.1182, over 12954.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3306, pruned_loss=0.1346, over 2573242.85 frames. ], batch size: 33, lr: 8.20e-03, grad_scale: 32.0 2024-06-20 03:53:13,162 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.14 vs. 
limit=10.0 2024-06-20 03:53:14,939 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=2.520e+02 2024-06-20 03:53:25,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=125387.16666666667, ans=0.0 2024-06-20 03:53:29,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=125387.16666666667, ans=0.2 2024-06-20 03:53:31,704 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.224e+02 2.553e+02 2.901e+02 4.356e+02, threshold=5.106e+02, percent-clipped=0.0 2024-06-20 03:53:31,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125405.5, ans=0.1 2024-06-20 03:53:32,370 INFO [train.py:1028] (1/2) Epoch 7, batch 7700, loss[loss=0.2935, simple_loss=0.3321, pruned_loss=0.1274, over 13258.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3318, pruned_loss=0.1354, over 2569771.61 frames. ], batch size: 63, lr: 8.20e-03, grad_scale: 32.0 2024-06-20 03:53:33,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=125405.5, ans=0.125 2024-06-20 03:53:37,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.77 vs. limit=15.0 2024-06-20 03:53:39,080 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.445e-02 2024-06-20 03:53:39,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=125423.83333333333, ans=0.0 2024-06-20 03:53:42,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=125423.83333333333, ans=0.125 2024-06-20 03:53:55,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=125460.5, ans=0.2 2024-06-20 03:54:06,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=125478.83333333333, ans=0.125 2024-06-20 03:54:08,545 INFO [train.py:1028] (1/2) Epoch 7, batch 7750, loss[loss=0.2978, simple_loss=0.3366, pruned_loss=0.1295, over 13217.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3319, pruned_loss=0.1355, over 2574563.16 frames. ], batch size: 72, lr: 8.20e-03, grad_scale: 32.0 2024-06-20 03:54:15,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.72 vs. 
limit=15.0 2024-06-20 03:54:15,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=125515.5, ans=0.125 2024-06-20 03:54:43,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=125570.5, ans=8.0 2024-06-20 03:54:44,472 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.099e+02 2.254e+02 2.407e+02 3.252e+02, threshold=4.508e+02, percent-clipped=0.0 2024-06-20 03:54:44,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=125588.83333333333, ans=0.025 2024-06-20 03:54:45,168 INFO [train.py:1028] (1/2) Epoch 7, batch 7800, loss[loss=0.3237, simple_loss=0.3505, pruned_loss=0.1484, over 13165.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3328, pruned_loss=0.1356, over 2579056.15 frames. ], batch size: 95, lr: 8.20e-03, grad_scale: 32.0 2024-06-20 03:54:45,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=125588.83333333333, ans=0.2 2024-06-20 03:54:47,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=125588.83333333333, ans=0.025 2024-06-20 03:54:52,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125607.16666666667, ans=0.1 2024-06-20 03:54:57,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=125625.5, ans=0.0 2024-06-20 03:55:08,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=125643.83333333333, ans=0.0 2024-06-20 03:55:13,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2024-06-20 03:55:17,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=125662.16666666667, ans=0.0 2024-06-20 03:55:18,993 INFO [train.py:1028] (1/2) Epoch 7, batch 7850, loss[loss=0.2886, simple_loss=0.3171, pruned_loss=0.1301, over 11033.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3337, pruned_loss=0.1363, over 2573087.99 frames. ], batch size: 16, lr: 8.19e-03, grad_scale: 32.0 2024-06-20 03:55:23,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=125680.5, ans=0.2 2024-06-20 03:55:36,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=125717.16666666667, ans=0.125 2024-06-20 03:55:37,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.18 vs. limit=22.5 2024-06-20 03:55:39,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125735.5, ans=0.1 2024-06-20 03:55:42,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.77 vs. 
limit=15.0 2024-06-20 03:55:47,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125753.83333333333, ans=0.0 2024-06-20 03:55:50,803 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.218e+02 2.493e+02 2.893e+02 4.281e+02, threshold=4.985e+02, percent-clipped=0.0 2024-06-20 03:55:51,425 INFO [train.py:1028] (1/2) Epoch 7, batch 7900, loss[loss=0.3123, simple_loss=0.345, pruned_loss=0.1398, over 13214.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3346, pruned_loss=0.1368, over 2572390.00 frames. ], batch size: 77, lr: 8.19e-03, grad_scale: 32.0 2024-06-20 03:56:01,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125772.16666666667, ans=0.125 2024-06-20 03:56:06,601 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.70 vs. limit=6.0 2024-06-20 03:56:22,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=125827.16666666667, ans=15.0 2024-06-20 03:56:23,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125845.5, ans=0.125 2024-06-20 03:56:23,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=125845.5, ans=0.125 2024-06-20 03:56:28,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=125845.5, ans=0.125 2024-06-20 03:56:29,435 INFO [train.py:1028] (1/2) Epoch 7, batch 7950, loss[loss=0.3218, simple_loss=0.3465, pruned_loss=0.1485, over 10756.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3342, pruned_loss=0.1363, over 2575380.46 frames. 
], batch size: 303, lr: 8.19e-03, grad_scale: 32.0 2024-06-20 03:56:46,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=125900.5, ans=0.125 2024-06-20 03:56:53,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=125918.83333333333, ans=0.125 2024-06-20 03:56:56,947 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.680e+00 2024-06-20 03:56:57,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125918.83333333333, ans=0.125 2024-06-20 03:56:57,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=125918.83333333333, ans=0.0 2024-06-20 03:56:57,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125918.83333333333, ans=0.125 2024-06-20 03:56:58,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=125937.16666666667, ans=0.0 2024-06-20 03:57:05,570 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.094e+02 2.302e+02 2.785e+02 4.899e+02, threshold=4.603e+02, percent-clipped=0.0 2024-06-20 03:57:06,250 INFO [train.py:1028] (1/2) Epoch 7, batch 8000, loss[loss=0.2519, simple_loss=0.2981, pruned_loss=0.1029, over 12709.00 frames. ], tot_loss[loss=0.304, simple_loss=0.3348, pruned_loss=0.1366, over 2572491.56 frames. ], batch size: 29, lr: 8.18e-03, grad_scale: 32.0 2024-06-20 03:57:29,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=126010.5, ans=0.025 2024-06-20 03:57:35,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=126028.83333333333, ans=0.025 2024-06-20 03:57:38,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.72 vs. limit=15.0 2024-06-20 03:57:39,089 INFO [train.py:1028] (1/2) Epoch 7, batch 8050, loss[loss=0.2958, simple_loss=0.3348, pruned_loss=0.1284, over 13218.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3337, pruned_loss=0.1357, over 2572383.08 frames. ], batch size: 83, lr: 8.18e-03, grad_scale: 32.0 2024-06-20 03:57:41,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126047.16666666667, ans=0.1 2024-06-20 03:58:10,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2024-06-20 03:58:14,229 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.136e+02 2.299e+02 2.519e+02 3.627e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-20 03:58:14,927 INFO [train.py:1028] (1/2) Epoch 7, batch 8100, loss[loss=0.3253, simple_loss=0.3517, pruned_loss=0.1494, over 13163.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3343, pruned_loss=0.1362, over 2575957.12 frames. 
], batch size: 112, lr: 8.18e-03, grad_scale: 32.0 2024-06-20 03:58:28,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=126175.5, ans=0.0 2024-06-20 03:58:35,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=126193.83333333333, ans=0.2 2024-06-20 03:58:45,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=126193.83333333333, ans=0.125 2024-06-20 03:58:47,899 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.21 vs. limit=15.0 2024-06-20 03:58:53,419 INFO [train.py:1028] (1/2) Epoch 7, batch 8150, loss[loss=0.2817, simple_loss=0.3139, pruned_loss=0.1247, over 13064.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3347, pruned_loss=0.1359, over 2579378.41 frames. ], batch size: 121, lr: 8.18e-03, grad_scale: 32.0 2024-06-20 03:58:58,334 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=4.815e+01 2024-06-20 03:59:01,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=126248.83333333333, ans=0.125 2024-06-20 03:59:25,604 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 2.102e+02 2.236e+02 2.437e+02 3.028e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-20 03:59:25,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=126322.16666666667, ans=0.0 2024-06-20 03:59:26,321 INFO [train.py:1028] (1/2) Epoch 7, batch 8200, loss[loss=0.3148, simple_loss=0.3468, pruned_loss=0.1414, over 13126.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3343, pruned_loss=0.1354, over 2582782.04 frames. ], batch size: 112, lr: 8.17e-03, grad_scale: 32.0 2024-06-20 03:59:31,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=126322.16666666667, ans=10.0 2024-06-20 03:59:42,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=126358.83333333333, ans=0.125 2024-06-20 03:59:44,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.42 vs. limit=15.0 2024-06-20 03:59:45,694 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.33 vs. limit=10.0 2024-06-20 03:59:46,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=126377.16666666667, ans=0.125 2024-06-20 03:59:48,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=126377.16666666667, ans=0.2 2024-06-20 03:59:51,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=126377.16666666667, ans=0.0 2024-06-20 04:00:02,879 INFO [train.py:1028] (1/2) Epoch 7, batch 8250, loss[loss=0.3058, simple_loss=0.3348, pruned_loss=0.1385, over 13232.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3347, pruned_loss=0.1356, over 2583446.66 frames. 
], batch size: 52, lr: 8.17e-03, grad_scale: 32.0 2024-06-20 04:00:10,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=126432.16666666667, ans=0.2 2024-06-20 04:00:12,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=126432.16666666667, ans=0.025 2024-06-20 04:00:19,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=126450.5, ans=0.0 2024-06-20 04:00:22,236 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2024-06-20 04:00:23,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=126468.83333333333, ans=0.0 2024-06-20 04:00:35,509 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.169e+02 2.350e+02 2.705e+02 3.497e+02, threshold=4.701e+02, percent-clipped=0.0 2024-06-20 04:00:36,259 INFO [train.py:1028] (1/2) Epoch 7, batch 8300, loss[loss=0.3012, simple_loss=0.3282, pruned_loss=0.1371, over 13027.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3341, pruned_loss=0.1352, over 2579987.45 frames. ], batch size: 102, lr: 8.17e-03, grad_scale: 32.0 2024-06-20 04:00:36,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=126505.5, ans=0.0 2024-06-20 04:00:37,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=126505.5, ans=0.125 2024-06-20 04:00:37,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=126505.5, ans=0.1 2024-06-20 04:00:56,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2024-06-20 04:00:59,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=126560.5, ans=0.125 2024-06-20 04:01:03,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126560.5, ans=0.1 2024-06-20 04:01:09,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=126578.83333333333, ans=0.125 2024-06-20 04:01:13,841 INFO [train.py:1028] (1/2) Epoch 7, batch 8350, loss[loss=0.3046, simple_loss=0.3316, pruned_loss=0.1388, over 13157.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3342, pruned_loss=0.1351, over 2578512.13 frames. ], batch size: 112, lr: 8.16e-03, grad_scale: 32.0 2024-06-20 04:01:14,717 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:01:18,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=126597.16666666667, ans=0.07 2024-06-20 04:01:20,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=126615.5, ans=0.0 2024-06-20 04:01:20,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.51 vs. 
limit=15.0 2024-06-20 04:01:22,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=126615.5, ans=0.125 2024-06-20 04:01:26,761 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.097e+00 2024-06-20 04:01:29,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=126633.83333333333, ans=0.2 2024-06-20 04:01:37,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=126652.16666666667, ans=0.125 2024-06-20 04:01:45,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=15.0 2024-06-20 04:01:48,106 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.099e+02 2.322e+02 2.571e+02 3.909e+02, threshold=4.644e+02, percent-clipped=0.0 2024-06-20 04:01:48,896 INFO [train.py:1028] (1/2) Epoch 7, batch 8400, loss[loss=0.2743, simple_loss=0.3067, pruned_loss=0.1209, over 12898.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3342, pruned_loss=0.1353, over 2574054.51 frames. ], batch size: 39, lr: 8.16e-03, grad_scale: 32.0 2024-06-20 04:01:52,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.38 vs. limit=15.0 2024-06-20 04:01:53,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=126688.83333333333, ans=0.125 2024-06-20 04:01:56,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=126707.16666666667, ans=0.0 2024-06-20 04:02:03,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=126725.5, ans=0.125 2024-06-20 04:02:19,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=126762.16666666667, ans=0.2 2024-06-20 04:02:21,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=126762.16666666667, ans=0.2 2024-06-20 04:02:22,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=126762.16666666667, ans=0.0 2024-06-20 04:02:25,812 INFO [train.py:1028] (1/2) Epoch 7, batch 8450, loss[loss=0.3074, simple_loss=0.3437, pruned_loss=0.1356, over 13170.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3361, pruned_loss=0.1361, over 2575956.57 frames. ], batch size: 112, lr: 8.16e-03, grad_scale: 32.0 2024-06-20 04:02:44,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=126817.16666666667, ans=0.125 2024-06-20 04:03:01,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=126853.83333333333, ans=0.1 2024-06-20 04:03:04,471 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.162e+02 2.289e+02 2.556e+02 3.325e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-20 04:03:05,196 INFO [train.py:1028] (1/2) Epoch 7, batch 8500, loss[loss=0.3083, simple_loss=0.3384, pruned_loss=0.1391, over 12534.00 frames. 
], tot_loss[loss=0.3062, simple_loss=0.3376, pruned_loss=0.1374, over 2574857.40 frames. ], batch size: 29, lr: 8.16e-03, grad_scale: 32.0 2024-06-20 04:03:19,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=126908.83333333333, ans=0.2 2024-06-20 04:03:22,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126908.83333333333, ans=0.1 2024-06-20 04:03:27,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=126927.16666666667, ans=0.2 2024-06-20 04:03:28,093 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.74 vs. limit=10.0 2024-06-20 04:03:39,004 INFO [train.py:1028] (1/2) Epoch 7, batch 8550, loss[loss=0.271, simple_loss=0.3176, pruned_loss=0.1122, over 12578.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3367, pruned_loss=0.1366, over 2572641.48 frames. ], batch size: 22, lr: 8.15e-03, grad_scale: 32.0 2024-06-20 04:03:39,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=126963.83333333333, ans=0.125 2024-06-20 04:03:45,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=126982.16666666667, ans=0.2 2024-06-20 04:03:48,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=126982.16666666667, ans=0.125 2024-06-20 04:03:51,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=126982.16666666667, ans=0.025 2024-06-20 04:04:08,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=127037.16666666667, ans=0.125 2024-06-20 04:04:10,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=127037.16666666667, ans=0.0 2024-06-20 04:04:11,478 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.065e+02 2.254e+02 2.560e+02 3.907e+02, threshold=4.508e+02, percent-clipped=0.0 2024-06-20 04:04:12,091 INFO [train.py:1028] (1/2) Epoch 7, batch 8600, loss[loss=0.2923, simple_loss=0.32, pruned_loss=0.1323, over 13044.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3366, pruned_loss=0.1365, over 2571281.12 frames. ], batch size: 121, lr: 8.15e-03, grad_scale: 32.0 2024-06-20 04:04:14,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=127055.5, ans=0.125 2024-06-20 04:04:31,648 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.14 vs. 
limit=15.0 2024-06-20 04:04:33,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=127092.16666666667, ans=0.0 2024-06-20 04:04:41,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=127110.5, ans=0.0 2024-06-20 04:04:42,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=127110.5, ans=0.2 2024-06-20 04:04:49,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.44 vs. limit=15.0 2024-06-20 04:04:49,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=127147.16666666667, ans=0.125 2024-06-20 04:04:50,219 INFO [train.py:1028] (1/2) Epoch 7, batch 8650, loss[loss=0.3094, simple_loss=0.3368, pruned_loss=0.141, over 13139.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3364, pruned_loss=0.136, over 2575499.70 frames. ], batch size: 103, lr: 8.15e-03, grad_scale: 32.0 2024-06-20 04:05:13,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=127202.16666666667, ans=0.2 2024-06-20 04:05:24,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=127220.5, ans=0.025 2024-06-20 04:05:25,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.093e+02 2.303e+02 2.586e+02 3.647e+02, threshold=4.606e+02, percent-clipped=0.0 2024-06-20 04:05:26,566 INFO [train.py:1028] (1/2) Epoch 7, batch 8700, loss[loss=0.2994, simple_loss=0.3418, pruned_loss=0.1285, over 13203.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3364, pruned_loss=0.136, over 2572896.50 frames. ], batch size: 59, lr: 8.14e-03, grad_scale: 32.0 2024-06-20 04:05:30,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=127238.83333333333, ans=0.0 2024-06-20 04:05:30,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=127238.83333333333, ans=0.05 2024-06-20 04:05:32,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127257.16666666667, ans=0.1 2024-06-20 04:05:39,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=127275.5, ans=0.125 2024-06-20 04:05:51,144 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.54 vs. limit=15.0 2024-06-20 04:05:59,727 INFO [train.py:1028] (1/2) Epoch 7, batch 8750, loss[loss=0.333, simple_loss=0.3524, pruned_loss=0.1568, over 13133.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.337, pruned_loss=0.1362, over 2569212.99 frames. 
], batch size: 121, lr: 8.14e-03, grad_scale: 32.0 2024-06-20 04:06:05,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=127348.83333333333, ans=0.0 2024-06-20 04:06:10,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=127348.83333333333, ans=0.0 2024-06-20 04:06:18,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=127367.16666666667, ans=0.0 2024-06-20 04:06:23,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=127385.5, ans=0.0 2024-06-20 04:06:31,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=127403.83333333333, ans=0.025 2024-06-20 04:06:35,853 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.072e+02 2.268e+02 2.504e+02 3.569e+02, threshold=4.537e+02, percent-clipped=0.0 2024-06-20 04:06:36,594 INFO [train.py:1028] (1/2) Epoch 7, batch 8800, loss[loss=0.2811, simple_loss=0.3223, pruned_loss=0.12, over 13064.00 frames. ], tot_loss[loss=0.305, simple_loss=0.337, pruned_loss=0.1365, over 2574849.22 frames. ], batch size: 71, lr: 8.14e-03, grad_scale: 32.0 2024-06-20 04:06:37,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=15.0 2024-06-20 04:06:40,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=127422.16666666667, ans=0.95 2024-06-20 04:06:42,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.84 vs. limit=12.0 2024-06-20 04:06:52,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=127440.5, ans=0.02 2024-06-20 04:06:57,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=127458.83333333333, ans=0.125 2024-06-20 04:06:58,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127458.83333333333, ans=0.1 2024-06-20 04:07:03,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.84 vs. limit=15.0 2024-06-20 04:07:10,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127495.5, ans=0.1 2024-06-20 04:07:13,455 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=15.0 2024-06-20 04:07:13,623 INFO [train.py:1028] (1/2) Epoch 7, batch 8850, loss[loss=0.3496, simple_loss=0.3721, pruned_loss=0.1635, over 12550.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.3369, pruned_loss=0.1367, over 2564045.64 frames. ], batch size: 202, lr: 8.13e-03, grad_scale: 32.0 2024-06-20 04:07:14,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.11 vs. 
limit=12.0 2024-06-20 04:07:15,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=127513.83333333333, ans=0.125 2024-06-20 04:07:18,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.40 vs. limit=15.0 2024-06-20 04:07:34,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=127568.83333333333, ans=0.0 2024-06-20 04:07:38,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=127568.83333333333, ans=0.0 2024-06-20 04:07:42,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=127587.16666666667, ans=0.125 2024-06-20 04:07:46,207 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.197e+02 2.424e+02 2.632e+02 3.905e+02, threshold=4.848e+02, percent-clipped=0.0 2024-06-20 04:07:46,898 INFO [train.py:1028] (1/2) Epoch 7, batch 8900, loss[loss=0.3146, simple_loss=0.3557, pruned_loss=0.1368, over 12948.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3383, pruned_loss=0.1373, over 2562066.19 frames. ], batch size: 33, lr: 8.13e-03, grad_scale: 32.0 2024-06-20 04:07:47,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=127605.5, ans=0.0 2024-06-20 04:08:04,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.55 vs. limit=10.0 2024-06-20 04:08:21,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=127678.83333333333, ans=0.125 2024-06-20 04:08:23,316 INFO [train.py:1028] (1/2) Epoch 7, batch 8950, loss[loss=0.3539, simple_loss=0.3739, pruned_loss=0.1669, over 12504.00 frames. ], tot_loss[loss=0.3072, simple_loss=0.3391, pruned_loss=0.1376, over 2561681.02 frames. ], batch size: 202, lr: 8.13e-03, grad_scale: 64.0 2024-06-20 04:08:25,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=127697.16666666667, ans=0.2 2024-06-20 04:08:25,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=127697.16666666667, ans=0.0 2024-06-20 04:08:33,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=127715.5, ans=0.0 2024-06-20 04:08:35,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=127733.83333333333, ans=0.125 2024-06-20 04:08:59,410 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.147e+02 2.353e+02 2.653e+02 5.180e+02, threshold=4.707e+02, percent-clipped=1.0 2024-06-20 04:09:00,035 INFO [train.py:1028] (1/2) Epoch 7, batch 9000, loss[loss=0.2779, simple_loss=0.3123, pruned_loss=0.1217, over 13278.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3381, pruned_loss=0.1364, over 2566633.08 frames. 
], batch size: 46, lr: 8.13e-03, grad_scale: 64.0 2024-06-20 04:09:00,035 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 04:09:07,800 INFO [train.py:1060] (1/2) Epoch 7, validation: loss=0.2116, simple_loss=0.2733, pruned_loss=0.07497, over 351949.00 frames. 2024-06-20 04:09:07,800 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 04:09:11,936 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.520e+01 2024-06-20 04:09:25,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.77 vs. limit=22.5 2024-06-20 04:09:30,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=127843.83333333333, ans=0.0 2024-06-20 04:09:32,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=127843.83333333333, ans=0.025 2024-06-20 04:09:32,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=127843.83333333333, ans=0.125 2024-06-20 04:09:36,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=127862.16666666667, ans=0.025 2024-06-20 04:09:40,513 INFO [train.py:1028] (1/2) Epoch 7, batch 9050, loss[loss=0.2405, simple_loss=0.2836, pruned_loss=0.09867, over 11988.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3384, pruned_loss=0.1364, over 2567952.53 frames. ], batch size: 18, lr: 8.12e-03, grad_scale: 64.0 2024-06-20 04:09:41,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=127880.5, ans=0.0 2024-06-20 04:09:42,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2024-06-20 04:09:59,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=127935.5, ans=0.125 2024-06-20 04:10:09,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.81 vs. limit=15.0 2024-06-20 04:10:09,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=127953.83333333333, ans=0.09899494936611666 2024-06-20 04:10:12,124 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.022e+02 2.190e+02 2.409e+02 3.025e+02, threshold=4.381e+02, percent-clipped=0.0 2024-06-20 04:10:12,902 INFO [train.py:1028] (1/2) Epoch 7, batch 9100, loss[loss=0.2833, simple_loss=0.3209, pruned_loss=0.1229, over 13253.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3372, pruned_loss=0.1357, over 2569193.37 frames. 
], batch size: 72, lr: 8.12e-03, grad_scale: 64.0 2024-06-20 04:10:20,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=127990.5, ans=0.1 2024-06-20 04:10:20,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=127990.5, ans=0.2 2024-06-20 04:10:21,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=127990.5, ans=0.125 2024-06-20 04:10:22,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=127990.5, ans=0.125 2024-06-20 04:10:33,014 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:10:35,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=128027.16666666667, ans=0.125 2024-06-20 04:10:39,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=128045.5, ans=0.125 2024-06-20 04:10:44,822 INFO [train.py:1028] (1/2) Epoch 7, batch 9150, loss[loss=0.2988, simple_loss=0.3354, pruned_loss=0.1311, over 13105.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.337, pruned_loss=0.1353, over 2569391.88 frames. ], batch size: 77, lr: 8.12e-03, grad_scale: 64.0 2024-06-20 04:10:53,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128082.16666666667, ans=0.1 2024-06-20 04:10:54,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.19 vs. limit=6.0 2024-06-20 04:11:07,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=128118.83333333333, ans=0.025 2024-06-20 04:11:08,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.78 vs. limit=15.0 2024-06-20 04:11:10,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=128137.16666666667, ans=0.2 2024-06-20 04:11:10,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.00 vs. limit=15.0 2024-06-20 04:11:19,007 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.095e+02 2.387e+02 2.748e+02 4.088e+02, threshold=4.775e+02, percent-clipped=0.0 2024-06-20 04:11:19,748 INFO [train.py:1028] (1/2) Epoch 7, batch 9200, loss[loss=0.285, simple_loss=0.3205, pruned_loss=0.1248, over 13000.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3363, pruned_loss=0.1345, over 2572409.15 frames. 
], batch size: 36, lr: 8.11e-03, grad_scale: 64.0 2024-06-20 04:11:26,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=128173.83333333333, ans=0.2 2024-06-20 04:11:27,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128173.83333333333, ans=0.1 2024-06-20 04:11:28,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=128173.83333333333, ans=0.0 2024-06-20 04:11:32,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=128192.16666666667, ans=0.0 2024-06-20 04:11:36,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=128192.16666666667, ans=0.125 2024-06-20 04:11:51,084 INFO [train.py:1028] (1/2) Epoch 7, batch 9250, loss[loss=0.3094, simple_loss=0.3426, pruned_loss=0.138, over 13253.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3351, pruned_loss=0.1335, over 2575121.47 frames. ], batch size: 67, lr: 8.11e-03, grad_scale: 64.0 2024-06-20 04:11:53,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128247.16666666667, ans=0.125 2024-06-20 04:11:54,755 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:12:07,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=128283.83333333333, ans=0.125 2024-06-20 04:12:09,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.93 vs. limit=22.5 2024-06-20 04:12:12,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=128302.16666666667, ans=0.0 2024-06-20 04:12:17,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=128320.5, ans=0.0 2024-06-20 04:12:23,035 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 2.026e+02 2.193e+02 2.383e+02 3.728e+02, threshold=4.386e+02, percent-clipped=0.0 2024-06-20 04:12:23,705 INFO [train.py:1028] (1/2) Epoch 7, batch 9300, loss[loss=0.2692, simple_loss=0.3009, pruned_loss=0.1188, over 12900.00 frames. ], tot_loss[loss=0.3003, simple_loss=0.3344, pruned_loss=0.1331, over 2572281.34 frames. 
], batch size: 39, lr: 8.11e-03, grad_scale: 64.0 2024-06-20 04:12:29,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=128338.83333333333, ans=0.2 2024-06-20 04:12:33,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=128357.16666666667, ans=0.125 2024-06-20 04:12:35,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128357.16666666667, ans=0.125 2024-06-20 04:12:36,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=128357.16666666667, ans=0.125 2024-06-20 04:12:38,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=128375.5, ans=0.1 2024-06-20 04:12:57,429 INFO [train.py:1028] (1/2) Epoch 7, batch 9350, loss[loss=0.2916, simple_loss=0.3333, pruned_loss=0.125, over 12411.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3348, pruned_loss=0.1333, over 2569540.12 frames. ], batch size: 22, lr: 8.11e-03, grad_scale: 64.0 2024-06-20 04:13:12,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=128467.16666666667, ans=0.125 2024-06-20 04:13:13,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=128467.16666666667, ans=0.125 2024-06-20 04:13:27,766 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.132e+02 2.252e+02 2.529e+02 3.663e+02, threshold=4.504e+02, percent-clipped=0.0 2024-06-20 04:13:28,447 INFO [train.py:1028] (1/2) Epoch 7, batch 9400, loss[loss=0.3076, simple_loss=0.3477, pruned_loss=0.1338, over 13219.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3355, pruned_loss=0.1336, over 2568019.87 frames. ], batch size: 52, lr: 8.10e-03, grad_scale: 64.0 2024-06-20 04:13:30,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.34 vs. limit=15.0 2024-06-20 04:13:59,194 INFO [train.py:1028] (1/2) Epoch 7, batch 9450, loss[loss=0.3085, simple_loss=0.344, pruned_loss=0.1365, over 12639.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3366, pruned_loss=0.1344, over 2568104.81 frames. ], batch size: 22, lr: 8.10e-03, grad_scale: 64.0 2024-06-20 04:14:07,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=128632.16666666667, ans=10.0 2024-06-20 04:14:11,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=128650.5, ans=0.0 2024-06-20 04:14:24,121 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.40 vs. limit=10.0 2024-06-20 04:14:24,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=128687.16666666667, ans=0.0 2024-06-20 04:14:29,347 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.111e+02 2.313e+02 2.707e+02 3.495e+02, threshold=4.627e+02, percent-clipped=0.0 2024-06-20 04:14:32,301 INFO [train.py:1028] (1/2) Epoch 7, batch 9500, loss[loss=0.2839, simple_loss=0.329, pruned_loss=0.1194, over 13235.00 frames. 
], tot_loss[loss=0.3021, simple_loss=0.3363, pruned_loss=0.134, over 2576769.17 frames. ], batch size: 43, lr: 8.10e-03, grad_scale: 64.0 2024-06-20 04:14:36,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=128705.5, ans=0.0 2024-06-20 04:14:39,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=128723.83333333333, ans=0.125 2024-06-20 04:14:45,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=128742.16666666667, ans=0.125 2024-06-20 04:14:46,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=128742.16666666667, ans=0.07 2024-06-20 04:14:49,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128742.16666666667, ans=0.1 2024-06-20 04:15:03,044 INFO [train.py:1028] (1/2) Epoch 7, batch 9550, loss[loss=0.2884, simple_loss=0.3262, pruned_loss=0.1252, over 12937.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3361, pruned_loss=0.134, over 2570961.50 frames. ], batch size: 39, lr: 8.09e-03, grad_scale: 64.0 2024-06-20 04:15:10,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=128815.5, ans=0.0 2024-06-20 04:15:19,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=128833.83333333333, ans=0.0 2024-06-20 04:15:20,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128833.83333333333, ans=0.1 2024-06-20 04:15:28,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128870.5, ans=0.125 2024-06-20 04:15:30,306 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.73 vs. limit=15.0 2024-06-20 04:15:34,751 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 2.183e+02 2.469e+02 2.901e+02 4.662e+02, threshold=4.938e+02, percent-clipped=1.0 2024-06-20 04:15:35,421 INFO [train.py:1028] (1/2) Epoch 7, batch 9600, loss[loss=0.332, simple_loss=0.3465, pruned_loss=0.1588, over 10343.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3358, pruned_loss=0.1342, over 2570176.73 frames. 
], batch size: 303, lr: 8.09e-03, grad_scale: 64.0 2024-06-20 04:15:35,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=128888.83333333333, ans=0.2 2024-06-20 04:15:46,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=128907.16666666667, ans=0.0 2024-06-20 04:15:55,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=128943.83333333333, ans=0.125 2024-06-20 04:15:59,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=128962.16666666667, ans=0.125 2024-06-20 04:16:01,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=128962.16666666667, ans=0.125 2024-06-20 04:16:05,831 INFO [train.py:1028] (1/2) Epoch 7, batch 9650, loss[loss=0.2981, simple_loss=0.3221, pruned_loss=0.1371, over 13084.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3365, pruned_loss=0.1354, over 2560085.98 frames. ], batch size: 132, lr: 8.09e-03, grad_scale: 32.0 2024-06-20 04:16:07,725 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.376e+02 2024-06-20 04:16:10,277 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:16:11,612 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.46 vs. limit=6.0 2024-06-20 04:16:11,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.12 vs. limit=22.5 2024-06-20 04:16:16,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128998.83333333333, ans=0.125 2024-06-20 04:16:18,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=129017.16666666667, ans=0.125 2024-06-20 04:16:24,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129035.5, ans=0.1 2024-06-20 04:16:25,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=129035.5, ans=0.125 2024-06-20 04:16:31,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=129053.83333333333, ans=0.2 2024-06-20 04:16:36,131 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 2.126e+02 2.322e+02 2.631e+02 4.430e+02, threshold=4.643e+02, percent-clipped=0.0 2024-06-20 04:16:36,159 INFO [train.py:1028] (1/2) Epoch 7, batch 9700, loss[loss=0.3101, simple_loss=0.3335, pruned_loss=0.1434, over 12997.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3362, pruned_loss=0.1354, over 2556178.84 frames. ], batch size: 144, lr: 8.09e-03, grad_scale: 32.0 2024-06-20 04:16:39,344 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.17 vs. 
limit=15.0 2024-06-20 04:16:40,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2024-06-20 04:16:43,591 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=6.393e+01 2024-06-20 04:16:44,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129090.5, ans=0.1 2024-06-20 04:16:55,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=129127.16666666667, ans=0.0 2024-06-20 04:17:01,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=129145.5, ans=0.125 2024-06-20 04:17:09,932 INFO [train.py:1028] (1/2) Epoch 7, batch 9750, loss[loss=0.2915, simple_loss=0.3216, pruned_loss=0.1308, over 13047.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3354, pruned_loss=0.1348, over 2552419.58 frames. ], batch size: 132, lr: 8.08e-03, grad_scale: 32.0 2024-06-20 04:17:10,687 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:17:18,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=129182.16666666667, ans=0.04949747468305833 2024-06-20 04:17:26,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=129200.5, ans=0.125 2024-06-20 04:17:26,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=129200.5, ans=0.0 2024-06-20 04:17:30,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=129218.83333333333, ans=0.0 2024-06-20 04:17:32,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=129218.83333333333, ans=0.0 2024-06-20 04:17:34,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=129237.16666666667, ans=0.125 2024-06-20 04:17:38,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=129237.16666666667, ans=0.0 2024-06-20 04:17:40,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=129255.5, ans=0.125 2024-06-20 04:17:40,821 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.017e+02 2.158e+02 2.396e+02 3.987e+02, threshold=4.316e+02, percent-clipped=0.0 2024-06-20 04:17:40,849 INFO [train.py:1028] (1/2) Epoch 7, batch 9800, loss[loss=0.2953, simple_loss=0.3322, pruned_loss=0.1292, over 12849.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.334, pruned_loss=0.1335, over 2545169.15 frames. 
], batch size: 39, lr: 8.08e-03, grad_scale: 32.0 2024-06-20 04:17:44,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129255.5, ans=0.1 2024-06-20 04:17:47,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=129273.83333333333, ans=10.0 2024-06-20 04:17:50,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=129273.83333333333, ans=0.125 2024-06-20 04:17:57,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=129310.5, ans=0.04949747468305833 2024-06-20 04:18:10,582 INFO [train.py:1028] (1/2) Epoch 7, batch 9850, loss[loss=0.3206, simple_loss=0.3456, pruned_loss=0.1478, over 13171.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.3336, pruned_loss=0.1337, over 2538944.02 frames. ], batch size: 103, lr: 8.08e-03, grad_scale: 32.0 2024-06-20 04:18:10,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129347.16666666667, ans=0.1 2024-06-20 04:18:23,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=129383.83333333333, ans=0.04949747468305833 2024-06-20 04:18:25,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=129383.83333333333, ans=0.025 2024-06-20 04:18:31,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.21 vs. limit=15.0 2024-06-20 04:18:37,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.65 vs. limit=22.5 2024-06-20 04:18:38,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=129420.5, ans=0.125 2024-06-20 04:18:41,649 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.089e+02 2.217e+02 2.392e+02 3.088e+02, threshold=4.434e+02, percent-clipped=0.0 2024-06-20 04:18:41,678 INFO [train.py:1028] (1/2) Epoch 7, batch 9900, loss[loss=0.2909, simple_loss=0.3289, pruned_loss=0.1264, over 12948.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3336, pruned_loss=0.134, over 2531766.98 frames. ], batch size: 39, lr: 8.07e-03, grad_scale: 32.0 2024-06-20 04:18:42,022 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0 2024-06-20 04:18:46,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=129438.83333333333, ans=0.0 2024-06-20 04:18:50,336 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.22 vs. 
limit=15.0 2024-06-20 04:18:53,913 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:18:56,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=129475.5, ans=0.0 2024-06-20 04:19:00,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=129493.83333333333, ans=0.125 2024-06-20 04:19:02,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=129493.83333333333, ans=0.125 2024-06-20 04:19:04,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=129493.83333333333, ans=0.1 2024-06-20 04:19:12,881 INFO [train.py:1028] (1/2) Epoch 7, batch 9950, loss[loss=0.2958, simple_loss=0.3316, pruned_loss=0.13, over 12661.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3315, pruned_loss=0.1335, over 2524085.05 frames. ], batch size: 29, lr: 8.07e-03, grad_scale: 32.0 2024-06-20 04:19:16,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=129530.5, ans=0.125 2024-06-20 04:19:27,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=129567.16666666667, ans=0.0 2024-06-20 04:19:32,872 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.756e+00 2024-06-20 04:19:34,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=129585.5, ans=0.1 2024-06-20 04:19:35,728 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.96 vs. limit=10.0 2024-06-20 04:19:38,018 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.235e+02 2024-06-20 04:19:39,522 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.40 vs. limit=15.0 2024-06-20 04:19:43,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=129622.16666666667, ans=0.125 2024-06-20 04:19:44,295 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.092e+02 2.284e+02 2.594e+02 3.631e+02, threshold=4.569e+02, percent-clipped=0.0 2024-06-20 04:19:44,325 INFO [train.py:1028] (1/2) Epoch 7, batch 10000, loss[loss=0.2845, simple_loss=0.3243, pruned_loss=0.1223, over 12625.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3315, pruned_loss=0.134, over 2486210.48 frames. 
], batch size: 22, lr: 8.07e-03, grad_scale: 32.0 2024-06-20 04:19:46,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=129622.16666666667, ans=0.025 2024-06-20 04:19:54,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=129640.5, ans=0.125 2024-06-20 04:20:13,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=129695.5, ans=0.2 2024-06-20 04:20:13,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.78 vs. limit=10.0 2024-06-20 04:20:14,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=129695.5, ans=0.025 2024-06-20 04:20:16,587 INFO [train.py:1028] (1/2) Epoch 7, batch 10050, loss[loss=0.2832, simple_loss=0.33, pruned_loss=0.1182, over 12512.00 frames. ], tot_loss[loss=0.301, simple_loss=0.332, pruned_loss=0.135, over 2442220.34 frames. ], batch size: 22, lr: 8.07e-03, grad_scale: 32.0 2024-06-20 04:20:23,057 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=12.0 2024-06-20 04:20:33,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=129750.5, ans=0.125 2024-06-20 04:20:36,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=129768.83333333333, ans=0.2 2024-06-20 04:20:39,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.51 vs. limit=22.5 2024-06-20 04:20:47,040 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.086e+02 2.257e+02 2.510e+02 3.489e+02, threshold=4.515e+02, percent-clipped=0.0 2024-06-20 04:20:47,069 INFO [train.py:1028] (1/2) Epoch 7, batch 10100, loss[loss=0.3131, simple_loss=0.3404, pruned_loss=0.1429, over 11426.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.331, pruned_loss=0.1339, over 2423006.73 frames. ], batch size: 17, lr: 8.06e-03, grad_scale: 32.0 2024-06-20 04:20:47,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=129805.5, ans=0.125 2024-06-20 04:23:04,749 INFO [train.py:1028] (1/2) Epoch 8, batch 0, loss[loss=0.292, simple_loss=0.3256, pruned_loss=0.1291, over 12981.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3256, pruned_loss=0.1291, over 12981.00 frames. ], batch size: 36, lr: 7.60e-03, grad_scale: 32.0 2024-06-20 04:23:04,750 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 04:23:11,863 INFO [train.py:1060] (1/2) Epoch 8, validation: loss=0.2135, simple_loss=0.2755, pruned_loss=0.07574, over 351949.00 frames. 2024-06-20 04:23:11,864 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 04:23:12,247 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.68 vs. 
limit=15.0 2024-06-20 04:23:38,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=129910.0, ans=0.1 2024-06-20 04:23:38,735 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.86 vs. limit=10.0 2024-06-20 04:23:43,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=129910.0, ans=0.125 2024-06-20 04:23:45,377 INFO [train.py:1028] (1/2) Epoch 8, batch 50, loss[loss=0.2698, simple_loss=0.3106, pruned_loss=0.1145, over 12771.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3116, pruned_loss=0.1236, over 575307.39 frames. ], batch size: 29, lr: 7.59e-03, grad_scale: 32.0 2024-06-20 04:24:02,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.78 vs. limit=12.0 2024-06-20 04:24:03,261 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.17 vs. limit=15.0 2024-06-20 04:24:06,432 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 1.946e+02 2.137e+02 2.339e+02 3.831e+02, threshold=4.274e+02, percent-clipped=0.0 2024-06-20 04:24:07,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=129983.33333333333, ans=0.015 2024-06-20 04:24:18,668 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.96 vs. limit=10.0 2024-06-20 04:24:19,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=130001.66666666667, ans=0.07 2024-06-20 04:24:23,262 INFO [train.py:1028] (1/2) Epoch 8, batch 100, loss[loss=0.2782, simple_loss=0.3234, pruned_loss=0.1165, over 13346.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3092, pruned_loss=0.1222, over 1018048.12 frames. ], batch size: 46, lr: 7.59e-03, grad_scale: 32.0 2024-06-20 04:24:27,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=130020.0, ans=0.125 2024-06-20 04:24:36,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=130056.66666666667, ans=0.2 2024-06-20 04:24:41,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2024-06-20 04:24:47,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=130075.0, ans=0.0 2024-06-20 04:24:55,362 INFO [train.py:1028] (1/2) Epoch 8, batch 150, loss[loss=0.287, simple_loss=0.3229, pruned_loss=0.1256, over 13010.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3094, pruned_loss=0.1213, over 1365775.95 frames. 
], batch size: 30, lr: 7.59e-03, grad_scale: 32.0 2024-06-20 04:25:02,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=130130.0, ans=0.0 2024-06-20 04:25:07,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130130.0, ans=0.1 2024-06-20 04:25:07,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=130148.33333333333, ans=0.04949747468305833 2024-06-20 04:25:14,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=130166.66666666667, ans=0.0 2024-06-20 04:25:14,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=130166.66666666667, ans=0.0 2024-06-20 04:25:16,748 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.845e+02 2.009e+02 2.176e+02 2.733e+02, threshold=4.017e+02, percent-clipped=0.0 2024-06-20 04:25:21,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=130185.0, ans=0.035 2024-06-20 04:25:27,792 INFO [train.py:1028] (1/2) Epoch 8, batch 200, loss[loss=0.3118, simple_loss=0.3283, pruned_loss=0.1476, over 12532.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.309, pruned_loss=0.1212, over 1634835.25 frames. ], batch size: 202, lr: 7.59e-03, grad_scale: 32.0 2024-06-20 04:25:34,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=130221.66666666667, ans=0.04949747468305833 2024-06-20 04:25:42,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=130240.0, ans=0.125 2024-06-20 04:25:44,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=130240.0, ans=0.025 2024-06-20 04:25:59,834 INFO [train.py:1028] (1/2) Epoch 8, batch 250, loss[loss=0.2705, simple_loss=0.2991, pruned_loss=0.1209, over 13042.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3079, pruned_loss=0.1201, over 1845973.17 frames. ], batch size: 144, lr: 7.58e-03, grad_scale: 32.0 2024-06-20 04:26:08,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.53 vs. limit=22.5 2024-06-20 04:26:08,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=130313.33333333333, ans=0.125 2024-06-20 04:26:15,491 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.69 vs. limit=15.0 2024-06-20 04:26:27,142 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.953e+02 2.197e+02 2.441e+02 3.300e+02, threshold=4.394e+02, percent-clipped=0.0 2024-06-20 04:26:29,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=130350.0, ans=0.025 2024-06-20 04:26:30,170 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.22 vs. 
limit=22.5 2024-06-20 04:26:31,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=130368.33333333333, ans=0.0 2024-06-20 04:26:33,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130368.33333333333, ans=0.125 2024-06-20 04:26:38,022 INFO [train.py:1028] (1/2) Epoch 8, batch 300, loss[loss=0.2788, simple_loss=0.3012, pruned_loss=0.1282, over 13157.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3072, pruned_loss=0.1198, over 2009585.18 frames. ], batch size: 112, lr: 7.58e-03, grad_scale: 32.0 2024-06-20 04:26:39,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.70 vs. limit=22.5 2024-06-20 04:26:45,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=130405.0, ans=0.025 2024-06-20 04:26:47,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=130405.0, ans=0.05 2024-06-20 04:26:49,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=130405.0, ans=0.125 2024-06-20 04:27:01,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.05 vs. limit=22.5 2024-06-20 04:27:03,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=130460.0, ans=0.125 2024-06-20 04:27:09,904 INFO [train.py:1028] (1/2) Epoch 8, batch 350, loss[loss=0.289, simple_loss=0.3277, pruned_loss=0.1251, over 12916.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3061, pruned_loss=0.1187, over 2138892.68 frames. ], batch size: 33, lr: 7.58e-03, grad_scale: 32.0 2024-06-20 04:27:16,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.75 vs. limit=15.0 2024-06-20 04:27:19,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=130496.66666666667, ans=0.5 2024-06-20 04:27:19,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=130496.66666666667, ans=0.125 2024-06-20 04:27:27,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=130515.0, ans=0.0 2024-06-20 04:27:31,390 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.931e+02 2.079e+02 2.331e+02 3.600e+02, threshold=4.157e+02, percent-clipped=0.0 2024-06-20 04:27:36,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=130551.66666666667, ans=0.0 2024-06-20 04:27:42,153 INFO [train.py:1028] (1/2) Epoch 8, batch 400, loss[loss=0.24, simple_loss=0.2889, pruned_loss=0.09548, over 13235.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3055, pruned_loss=0.1179, over 2239256.71 frames. 
], batch size: 63, lr: 7.57e-03, grad_scale: 32.0 2024-06-20 04:27:44,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=130570.0, ans=0.0 2024-06-20 04:27:50,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.93 vs. limit=10.0 2024-06-20 04:27:52,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=130588.33333333333, ans=0.125 2024-06-20 04:27:55,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130606.66666666667, ans=0.1 2024-06-20 04:28:17,510 INFO [train.py:1028] (1/2) Epoch 8, batch 450, loss[loss=0.2714, simple_loss=0.3129, pruned_loss=0.1149, over 13156.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3049, pruned_loss=0.1174, over 2313527.96 frames. ], batch size: 67, lr: 7.57e-03, grad_scale: 32.0 2024-06-20 04:28:18,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=130661.66666666667, ans=0.2 2024-06-20 04:28:20,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=130661.66666666667, ans=0.125 2024-06-20 04:28:20,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=130661.66666666667, ans=0.125 2024-06-20 04:28:22,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.23 vs. limit=15.0 2024-06-20 04:28:24,954 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0 2024-06-20 04:28:39,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=130716.66666666667, ans=0.125 2024-06-20 04:28:41,944 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.030e+02 2.195e+02 2.445e+02 3.763e+02, threshold=4.391e+02, percent-clipped=0.0 2024-06-20 04:28:42,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=130716.66666666667, ans=0.0 2024-06-20 04:28:44,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130716.66666666667, ans=0.125 2024-06-20 04:28:48,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=130735.0, ans=0.125 2024-06-20 04:28:52,973 INFO [train.py:1028] (1/2) Epoch 8, batch 500, loss[loss=0.2778, simple_loss=0.3059, pruned_loss=0.1249, over 13149.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3054, pruned_loss=0.1174, over 2375592.87 frames. 
], batch size: 121, lr: 7.57e-03, grad_scale: 32.0 2024-06-20 04:28:58,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=130771.66666666667, ans=0.0 2024-06-20 04:29:17,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=130826.66666666667, ans=0.0 2024-06-20 04:29:24,554 INFO [train.py:1028] (1/2) Epoch 8, batch 550, loss[loss=0.2868, simple_loss=0.3123, pruned_loss=0.1306, over 12949.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3058, pruned_loss=0.1179, over 2420766.35 frames. ], batch size: 158, lr: 7.57e-03, grad_scale: 32.0 2024-06-20 04:29:26,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=130845.0, ans=0.5 2024-06-20 04:29:32,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=130863.33333333333, ans=0.2 2024-06-20 04:29:37,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.12 vs. limit=15.0 2024-06-20 04:29:40,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2024-06-20 04:29:45,673 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.935e+02 2.087e+02 2.296e+02 3.128e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-20 04:29:56,716 INFO [train.py:1028] (1/2) Epoch 8, batch 600, loss[loss=0.2514, simple_loss=0.2812, pruned_loss=0.1108, over 13046.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3056, pruned_loss=0.1178, over 2459826.49 frames. ], batch size: 144, lr: 7.56e-03, grad_scale: 32.0 2024-06-20 04:30:17,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0 2024-06-20 04:30:27,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=131010.0, ans=0.2 2024-06-20 04:30:33,019 INFO [train.py:1028] (1/2) Epoch 8, batch 650, loss[loss=0.2528, simple_loss=0.2963, pruned_loss=0.1047, over 13195.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3058, pruned_loss=0.1175, over 2490149.42 frames. ], batch size: 59, lr: 7.56e-03, grad_scale: 32.0 2024-06-20 04:30:38,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=131028.33333333333, ans=0.125 2024-06-20 04:30:46,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=131046.66666666667, ans=0.125 2024-06-20 04:30:46,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=131046.66666666667, ans=0.0 2024-06-20 04:30:50,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=131065.0, ans=0.125 2024-06-20 04:30:55,343 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.86 vs. 
limit=22.5 2024-06-20 04:30:57,406 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.852e+02 1.968e+02 2.136e+02 2.544e+02, threshold=3.936e+02, percent-clipped=0.0 2024-06-20 04:31:01,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=131101.66666666666, ans=0.0 2024-06-20 04:31:02,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.39 vs. limit=15.0 2024-06-20 04:31:08,102 INFO [train.py:1028] (1/2) Epoch 8, batch 700, loss[loss=0.3201, simple_loss=0.3414, pruned_loss=0.1494, over 13262.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3055, pruned_loss=0.1177, over 2512153.80 frames. ], batch size: 46, lr: 7.56e-03, grad_scale: 32.0 2024-06-20 04:31:09,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=131120.0, ans=0.125 2024-06-20 04:31:40,205 INFO [train.py:1028] (1/2) Epoch 8, batch 750, loss[loss=0.2814, simple_loss=0.3225, pruned_loss=0.1201, over 13260.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3052, pruned_loss=0.1176, over 2528091.17 frames. ], batch size: 63, lr: 7.56e-03, grad_scale: 32.0 2024-06-20 04:31:43,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=131211.66666666666, ans=0.125 2024-06-20 04:31:46,341 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.79 vs. limit=22.5 2024-06-20 04:31:48,422 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.025e+02 2024-06-20 04:31:55,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.07 vs. limit=6.0 2024-06-20 04:32:01,256 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 1.971e+02 2.173e+02 2.578e+02 4.718e+02, threshold=4.347e+02, percent-clipped=3.0 2024-06-20 04:32:01,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=131266.66666666666, ans=0.0 2024-06-20 04:32:09,404 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.99 vs. limit=15.0 2024-06-20 04:32:12,326 INFO [train.py:1028] (1/2) Epoch 8, batch 800, loss[loss=0.2655, simple_loss=0.3043, pruned_loss=0.1134, over 12872.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3055, pruned_loss=0.1181, over 2541116.91 frames. ], batch size: 36, lr: 7.55e-03, grad_scale: 32.0 2024-06-20 04:32:12,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=131303.33333333334, ans=0.0 2024-06-20 04:32:20,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=131321.66666666666, ans=0.2 2024-06-20 04:32:22,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.96 vs. 
limit=15.0 2024-06-20 04:32:48,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=131376.66666666666, ans=0.125 2024-06-20 04:32:49,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.60 vs. limit=22.5 2024-06-20 04:32:49,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=131376.66666666666, ans=0.125 2024-06-20 04:32:51,042 INFO [train.py:1028] (1/2) Epoch 8, batch 850, loss[loss=0.2676, simple_loss=0.3054, pruned_loss=0.1149, over 13169.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3051, pruned_loss=0.1177, over 2550487.90 frames. ], batch size: 95, lr: 7.55e-03, grad_scale: 32.0 2024-06-20 04:32:52,477 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.877e+01 2024-06-20 04:32:59,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.01 vs. limit=22.5 2024-06-20 04:33:02,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=131413.33333333334, ans=15.0 2024-06-20 04:33:07,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=131431.66666666666, ans=0.125 2024-06-20 04:33:07,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=131431.66666666666, ans=15.0 2024-06-20 04:33:11,870 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.878e+02 2.052e+02 2.275e+02 3.212e+02, threshold=4.104e+02, percent-clipped=0.0 2024-06-20 04:33:15,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=131468.33333333334, ans=0.0 2024-06-20 04:33:16,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=131468.33333333334, ans=0.025 2024-06-20 04:33:17,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=131468.33333333334, ans=0.0 2024-06-20 04:33:23,091 INFO [train.py:1028] (1/2) Epoch 8, batch 900, loss[loss=0.2758, simple_loss=0.3156, pruned_loss=0.118, over 12995.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3043, pruned_loss=0.1176, over 2555999.12 frames. 
], batch size: 36, lr: 7.55e-03, grad_scale: 32.0 2024-06-20 04:33:30,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=131505.0, ans=0.5 2024-06-20 04:33:31,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=131505.0, ans=0.07 2024-06-20 04:33:33,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=131505.0, ans=0.1 2024-06-20 04:33:37,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131523.33333333334, ans=0.1 2024-06-20 04:33:47,130 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.09 vs. limit=15.0 2024-06-20 04:33:48,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=131541.66666666666, ans=0.125 2024-06-20 04:33:55,923 INFO [train.py:1028] (1/2) Epoch 8, batch 950, loss[loss=0.2716, simple_loss=0.3118, pruned_loss=0.1157, over 12998.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3049, pruned_loss=0.1177, over 2558936.65 frames. ], batch size: 39, lr: 7.55e-03, grad_scale: 32.0 2024-06-20 04:33:57,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=131578.33333333334, ans=0.0 2024-06-20 04:33:58,176 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.16 vs. limit=22.5 2024-06-20 04:34:03,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.23 vs. limit=15.0 2024-06-20 04:34:07,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=131596.66666666666, ans=0.0 2024-06-20 04:34:07,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=131596.66666666666, ans=0.2 2024-06-20 04:34:12,508 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.25 vs. limit=15.0 2024-06-20 04:34:16,855 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.900e+02 2.110e+02 2.325e+02 3.310e+02, threshold=4.221e+02, percent-clipped=0.0 2024-06-20 04:34:19,415 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.55 vs. limit=15.0 2024-06-20 04:34:19,993 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.11 vs. limit=22.5 2024-06-20 04:34:29,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=131651.66666666666, ans=0.0 2024-06-20 04:34:29,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=131651.66666666666, ans=0.1 2024-06-20 04:34:30,984 INFO [train.py:1028] (1/2) Epoch 8, batch 1000, loss[loss=0.2699, simple_loss=0.3134, pruned_loss=0.1132, over 13289.00 frames. 
], tot_loss[loss=0.2705, simple_loss=0.3051, pruned_loss=0.118, over 2561442.13 frames. ], batch size: 49, lr: 7.54e-03, grad_scale: 32.0 2024-06-20 04:34:49,898 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.39 vs. limit=15.0 2024-06-20 04:34:55,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=131725.0, ans=0.125 2024-06-20 04:34:58,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=131725.0, ans=0.0 2024-06-20 04:35:06,785 INFO [train.py:1028] (1/2) Epoch 8, batch 1050, loss[loss=0.2562, simple_loss=0.2943, pruned_loss=0.1091, over 13116.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3054, pruned_loss=0.1178, over 2564194.05 frames. ], batch size: 77, lr: 7.54e-03, grad_scale: 32.0 2024-06-20 04:35:11,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=131761.66666666666, ans=0.125 2024-06-20 04:35:12,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=131761.66666666666, ans=0.1 2024-06-20 04:35:19,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=131798.33333333334, ans=0.025 2024-06-20 04:35:20,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=131798.33333333334, ans=0.0 2024-06-20 04:35:22,125 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.541e+02 2024-06-20 04:35:22,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=131798.33333333334, ans=0.1 2024-06-20 04:35:27,628 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.653e+02 1.910e+02 2.088e+02 2.447e+02 3.403e+02, threshold=4.176e+02, percent-clipped=0.0 2024-06-20 04:35:28,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=131816.66666666666, ans=0.07 2024-06-20 04:35:31,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=131835.0, ans=0.2 2024-06-20 04:35:39,424 INFO [train.py:1028] (1/2) Epoch 8, batch 1100, loss[loss=0.2677, simple_loss=0.3095, pruned_loss=0.1129, over 13286.00 frames. ], tot_loss[loss=0.271, simple_loss=0.306, pruned_loss=0.118, over 2568932.57 frames. ], batch size: 52, lr: 7.54e-03, grad_scale: 32.0 2024-06-20 04:36:03,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.47 vs. limit=15.0 2024-06-20 04:36:06,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=131926.66666666666, ans=0.0 2024-06-20 04:36:11,942 INFO [train.py:1028] (1/2) Epoch 8, batch 1150, loss[loss=0.2886, simple_loss=0.3254, pruned_loss=0.1259, over 13265.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3069, pruned_loss=0.1186, over 2571323.94 frames. 
], batch size: 52, lr: 7.54e-03, grad_scale: 32.0 2024-06-20 04:36:22,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=131963.33333333334, ans=0.0 2024-06-20 04:36:30,023 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:36:32,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=131981.66666666666, ans=0.0 2024-06-20 04:36:44,049 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.893e+02 2.037e+02 2.311e+02 3.157e+02, threshold=4.073e+02, percent-clipped=0.0 2024-06-20 04:36:44,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.81 vs. limit=6.0 2024-06-20 04:36:53,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=132018.33333333334, ans=0.125 2024-06-20 04:36:54,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. limit=6.0 2024-06-20 04:36:55,104 INFO [train.py:1028] (1/2) Epoch 8, batch 1200, loss[loss=0.2574, simple_loss=0.2962, pruned_loss=0.1093, over 13190.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.306, pruned_loss=0.1182, over 2573292.18 frames. ], batch size: 77, lr: 7.53e-03, grad_scale: 32.0 2024-06-20 04:37:07,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=132055.0, ans=0.0 2024-06-20 04:37:22,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=132110.0, ans=0.125 2024-06-20 04:37:25,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=132110.0, ans=0.0 2024-06-20 04:37:27,373 INFO [train.py:1028] (1/2) Epoch 8, batch 1250, loss[loss=0.2413, simple_loss=0.2817, pruned_loss=0.1005, over 13156.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3058, pruned_loss=0.118, over 2583063.63 frames. ], batch size: 112, lr: 7.53e-03, grad_scale: 32.0 2024-06-20 04:37:34,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=132146.66666666666, ans=0.125 2024-06-20 04:37:43,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=132165.0, ans=0.2 2024-06-20 04:37:49,085 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.880e+02 1.982e+02 2.228e+02 3.428e+02, threshold=3.964e+02, percent-clipped=0.0 2024-06-20 04:37:51,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=132183.33333333334, ans=0.125 2024-06-20 04:37:59,696 INFO [train.py:1028] (1/2) Epoch 8, batch 1300, loss[loss=0.2786, simple_loss=0.3074, pruned_loss=0.1248, over 12803.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3064, pruned_loss=0.1184, over 2582997.98 frames. ], batch size: 177, lr: 7.53e-03, grad_scale: 32.0 2024-06-20 04:38:01,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.32 vs. 
limit=15.0 2024-06-20 04:38:03,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=132220.0, ans=0.0 2024-06-20 04:38:04,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=132220.0, ans=0.1 2024-06-20 04:38:16,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=132256.66666666666, ans=0.025 2024-06-20 04:38:19,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=132275.0, ans=0.125 2024-06-20 04:38:20,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=132275.0, ans=0.125 2024-06-20 04:38:36,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.79 vs. limit=22.5 2024-06-20 04:38:36,297 INFO [train.py:1028] (1/2) Epoch 8, batch 1350, loss[loss=0.268, simple_loss=0.3069, pruned_loss=0.1146, over 13219.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3056, pruned_loss=0.1178, over 2584384.91 frames. ], batch size: 59, lr: 7.53e-03, grad_scale: 32.0 2024-06-20 04:38:43,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=132330.0, ans=0.125 2024-06-20 04:38:43,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=132330.0, ans=0.125 2024-06-20 04:38:43,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=132330.0, ans=0.0 2024-06-20 04:38:44,754 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.07 vs. limit=15.0 2024-06-20 04:38:45,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132330.0, ans=0.1 2024-06-20 04:38:48,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=132348.33333333334, ans=0.1 2024-06-20 04:38:54,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132348.33333333334, ans=0.1 2024-06-20 04:38:55,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=132348.33333333334, ans=0.125 2024-06-20 04:39:00,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.46 vs. limit=15.0 2024-06-20 04:39:01,728 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.826e+02 2.016e+02 2.228e+02 3.319e+02, threshold=4.032e+02, percent-clipped=0.0 2024-06-20 04:39:11,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=132385.0, ans=0.0 2024-06-20 04:39:13,081 INFO [train.py:1028] (1/2) Epoch 8, batch 1400, loss[loss=0.2965, simple_loss=0.3304, pruned_loss=0.1313, over 12871.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3057, pruned_loss=0.1183, over 2586534.00 frames. 
], batch size: 26, lr: 7.52e-03, grad_scale: 32.0 2024-06-20 04:39:18,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=132421.66666666666, ans=0.0 2024-06-20 04:39:18,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=132421.66666666666, ans=0.2 2024-06-20 04:39:26,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.89 vs. limit=15.0 2024-06-20 04:39:28,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132440.0, ans=0.1 2024-06-20 04:39:32,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132458.33333333334, ans=0.1 2024-06-20 04:39:36,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=132458.33333333334, ans=0.125 2024-06-20 04:39:44,564 INFO [train.py:1028] (1/2) Epoch 8, batch 1450, loss[loss=0.2371, simple_loss=0.2703, pruned_loss=0.102, over 13098.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.306, pruned_loss=0.1186, over 2587098.11 frames. ], batch size: 121, lr: 7.52e-03, grad_scale: 32.0 2024-06-20 04:39:54,244 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=4.868e-01 2024-06-20 04:39:56,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=132513.33333333334, ans=0.1 2024-06-20 04:40:05,100 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:40:05,495 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.666e+02 1.890e+02 2.006e+02 2.301e+02 3.397e+02, threshold=4.012e+02, percent-clipped=0.0 2024-06-20 04:40:07,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=132550.0, ans=0.125 2024-06-20 04:40:16,344 INFO [train.py:1028] (1/2) Epoch 8, batch 1500, loss[loss=0.261, simple_loss=0.2995, pruned_loss=0.1113, over 13262.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3062, pruned_loss=0.1186, over 2590215.51 frames. ], batch size: 83, lr: 7.52e-03, grad_scale: 64.0 2024-06-20 04:40:17,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=132586.66666666666, ans=0.125 2024-06-20 04:40:31,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=132623.33333333334, ans=0.025 2024-06-20 04:40:39,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132641.66666666666, ans=0.1 2024-06-20 04:40:41,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=132641.66666666666, ans=0.125 2024-06-20 04:40:46,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=132641.66666666666, ans=0.125 2024-06-20 04:40:50,564 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.90 vs. 
limit=15.0 2024-06-20 04:40:53,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-20 04:40:54,454 INFO [train.py:1028] (1/2) Epoch 8, batch 1550, loss[loss=0.2691, simple_loss=0.2952, pruned_loss=0.1215, over 13128.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3071, pruned_loss=0.1193, over 2585909.50 frames. ], batch size: 103, lr: 7.51e-03, grad_scale: 64.0 2024-06-20 04:40:58,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132678.33333333334, ans=0.1 2024-06-20 04:41:07,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=132715.0, ans=0.0 2024-06-20 04:41:16,194 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.869e+02 2.006e+02 2.177e+02 2.963e+02, threshold=4.011e+02, percent-clipped=0.0 2024-06-20 04:41:18,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=132733.33333333334, ans=0.125 2024-06-20 04:41:26,728 INFO [train.py:1028] (1/2) Epoch 8, batch 1600, loss[loss=0.2648, simple_loss=0.3011, pruned_loss=0.1143, over 13143.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3066, pruned_loss=0.119, over 2580714.57 frames. ], batch size: 77, lr: 7.51e-03, grad_scale: 64.0 2024-06-20 04:41:27,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=132770.0, ans=0.125 2024-06-20 04:41:34,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=132788.33333333334, ans=0.125 2024-06-20 04:41:35,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.24 vs. limit=10.0 2024-06-20 04:41:43,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132806.66666666666, ans=0.1 2024-06-20 04:41:50,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=132825.0, ans=0.2 2024-06-20 04:41:51,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=132843.33333333334, ans=0.125 2024-06-20 04:41:54,669 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.53 vs. limit=15.0 2024-06-20 04:41:58,174 INFO [train.py:1028] (1/2) Epoch 8, batch 1650, loss[loss=0.2628, simple_loss=0.2982, pruned_loss=0.1137, over 13100.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3062, pruned_loss=0.119, over 2576066.39 frames. ], batch size: 95, lr: 7.51e-03, grad_scale: 64.0 2024-06-20 04:42:22,975 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 2.000e+02 2.347e+02 2.779e+02 4.375e+02, threshold=4.694e+02, percent-clipped=3.0 2024-06-20 04:42:30,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=8.0 2024-06-20 04:42:34,094 INFO [train.py:1028] (1/2) Epoch 8, batch 1700, loss[loss=0.2736, simple_loss=0.3239, pruned_loss=0.1116, over 12374.00 frames. 
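], tot_loss[loss=0.2714, simple_loss=0.3061, pruned_loss=0.1184, over 2580975.65 frames. ], batch size: 25, lr: 7.51e-03, grad_scale: 64.0

The recurring `Whitening: name=..., metric=M vs. limit=L` lines come from an auxiliary regularizer that nudges each tagged module's activations toward a whitened (identity-like) channel covariance. The printed metric behaves like E[lambda^2] / E[lambda]^2 over the eigenvalues of the per-group covariance: it is 1.0 when the covariance is a multiple of the identity and grows as channels become correlated or unevenly scaled, with a penalty presumably applied only once it exceeds the logged limit. Below is a minimal sketch of that statistic; it illustrates the idea rather than reproducing the actual scaling.py code, and the function name and eps value are mine.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int, eps: float = 1e-4) -> torch.Tensor:
    """E[lambda^2] / E[lambda]^2 over eigenvalues of the per-group
    (uncentered) channel covariance of x, shape (frames, num_channels).
    Equals 1.0 for perfectly 'white' features; larger values mean the
    covariance is further from a multiple of the identity."""
    num_channels = x.shape[-1]
    assert num_channels % num_groups == 0
    c = num_channels // num_groups
    x = x.reshape(-1, num_groups, c).transpose(0, 1)       # (groups, frames, c)
    cov = torch.matmul(x.transpose(1, 2), x) / x.shape[1]  # (groups, c, c)
    mean_eig = cov.diagonal(dim1=1, dim2=2).mean()             # trace(C)/c
    mean_sq_eig = (cov @ cov).diagonal(dim1=1, dim2=2).mean()  # trace(C^2)/c
    return mean_sq_eig / (mean_eig ** 2 + eps)
```

On this reading, an entry like `metric=4.47 vs. limit=6.0` (first entry below) means those attention keys are already close enough to white that no push is needed, while entries such as `metric=23.55 vs. limit=22.5` mark modules whose activations the regularizer is actively reshaping.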
2024-06-20 04:43:04,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.47 vs. limit=6.0 2024-06-20 04:43:09,330 INFO [train.py:1028] (1/2) Epoch 8, batch 1750, loss[loss=0.3014, simple_loss=0.3354, pruned_loss=0.1337, over 12491.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3065, pruned_loss=0.1183, over 2581899.20 frames. ], batch size: 22, lr: 7.50e-03, grad_scale: 64.0 2024-06-20 04:43:21,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=133081.66666666666, ans=0.125 2024-06-20 04:43:27,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133081.66666666666, ans=0.1 2024-06-20 04:43:30,767 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 1.879e+02 2.009e+02 2.239e+02 3.803e+02, threshold=4.019e+02, percent-clipped=0.0 2024-06-20 04:43:41,293 INFO [train.py:1028] (1/2) Epoch 8, batch 1800, loss[loss=0.2457, simple_loss=0.2891, pruned_loss=0.1011, over 13281.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3069, pruned_loss=0.1186, over 2581664.20 frames. ], batch size: 67, lr: 7.50e-03, grad_scale: 64.0 2024-06-20 04:43:42,842 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:43:43,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.81 vs. limit=10.0 2024-06-20 04:43:44,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=133136.66666666666, ans=0.2 2024-06-20 04:43:54,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133173.33333333334, ans=0.125 2024-06-20 04:43:54,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=133173.33333333334, ans=10.0 2024-06-20 04:43:55,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=133173.33333333334, ans=0.1 2024-06-20 04:44:11,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0 2024-06-20 04:44:13,798 INFO [train.py:1028] (1/2) Epoch 8, batch 1850, loss[loss=0.2653, simple_loss=0.3035, pruned_loss=0.1136, over 13156.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3073, pruned_loss=0.1185, over 2583192.34 frames. ], batch size: 83, lr: 7.50e-03, grad_scale: 64.0 2024-06-20 04:44:38,410 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.898e+02 2.086e+02 2.250e+02 3.242e+02, threshold=4.173e+02, percent-clipped=0.0 2024-06-20 04:44:43,242 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.74 vs.
limit=15.0 2024-06-20 04:44:43,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=133301.66666666666, ans=0.125 2024-06-20 04:44:48,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=133301.66666666666, ans=0.1 2024-06-20 04:44:51,813 INFO [train.py:1028] (1/2) Epoch 8, batch 1900, loss[loss=0.2572, simple_loss=0.2935, pruned_loss=0.1105, over 13168.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3063, pruned_loss=0.1179, over 2585624.20 frames. ], batch size: 95, lr: 7.50e-03, grad_scale: 64.0 2024-06-20 04:44:52,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=133320.0, ans=10.0 2024-06-20 04:45:08,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=133356.66666666666, ans=0.125 2024-06-20 04:45:18,168 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.69 vs. limit=10.0 2024-06-20 04:45:24,141 INFO [train.py:1028] (1/2) Epoch 8, batch 1950, loss[loss=0.2554, simple_loss=0.2929, pruned_loss=0.1089, over 13263.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3057, pruned_loss=0.1176, over 2591408.55 frames. ], batch size: 52, lr: 7.49e-03, grad_scale: 64.0 2024-06-20 04:45:30,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=133430.0, ans=0.2 2024-06-20 04:45:31,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=133430.0, ans=0.0 2024-06-20 04:45:38,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=133448.33333333334, ans=0.0 2024-06-20 04:45:45,347 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.965e+02 2.146e+02 2.380e+02 3.098e+02, threshold=4.291e+02, percent-clipped=0.0 2024-06-20 04:45:45,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=133466.66666666666, ans=0.035 2024-06-20 04:45:56,506 INFO [train.py:1028] (1/2) Epoch 8, batch 2000, loss[loss=0.2994, simple_loss=0.3403, pruned_loss=0.1293, over 12650.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3056, pruned_loss=0.1176, over 2587432.68 frames. ], batch size: 22, lr: 7.49e-03, grad_scale: 64.0 2024-06-20 04:45:57,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.62 vs. 
limit=22.5 2024-06-20 04:45:59,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=133503.33333333334, ans=0.125 2024-06-20 04:46:01,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=133503.33333333334, ans=0.125 2024-06-20 04:46:02,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=133521.66666666666, ans=0.125 2024-06-20 04:46:08,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=133540.0, ans=0.0 2024-06-20 04:46:18,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=133558.33333333334, ans=0.2 2024-06-20 04:46:21,802 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.78 vs. limit=15.0 2024-06-20 04:46:26,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=133576.66666666666, ans=0.2 2024-06-20 04:46:26,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=15.0 2024-06-20 04:46:26,385 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=22.22 vs. limit=15.0 2024-06-20 04:46:31,806 INFO [train.py:1028] (1/2) Epoch 8, batch 2050, loss[loss=0.2752, simple_loss=0.3074, pruned_loss=0.1215, over 12750.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3064, pruned_loss=0.1182, over 2583677.30 frames. ], batch size: 29, lr: 7.49e-03, grad_scale: 64.0 2024-06-20 04:46:34,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=133595.0, ans=0.125 2024-06-20 04:46:43,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=133613.33333333334, ans=0.125 2024-06-20 04:46:46,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=133613.33333333334, ans=0.0 2024-06-20 04:46:49,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.71 vs. limit=15.0 2024-06-20 04:46:55,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=133650.0, ans=0.125 2024-06-20 04:46:57,761 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.843e+02 1.980e+02 2.155e+02 2.736e+02, threshold=3.959e+02, percent-clipped=0.0 2024-06-20 04:47:08,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133668.33333333334, ans=0.1 2024-06-20 04:47:09,140 INFO [train.py:1028] (1/2) Epoch 8, batch 2100, loss[loss=0.2596, simple_loss=0.2944, pruned_loss=0.1123, over 13188.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3066, pruned_loss=0.1177, over 2585898.55 frames. 
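], batch size: 59, lr: 7.49e-03, grad_scale: 64.0

The dense `ScheduledFloat: name=..., batch_count=..., ans=...` traffic records hyperparameters (skip rates, balancer probabilities, dropout p, scale minima) that are not constants but functions of training progress: `ans` is the value in effect at that float-valued `batch_count`. A plausible minimal reading is piecewise-linear interpolation between (batch_count, value) breakpoints, clamped at both ends; the class name and constructor signature in this sketch are invented for illustration, not the real scaling.py API.

```python
from bisect import bisect_right

class ScheduledFloatSketch:
    """A float that depends on batch_count via piecewise-linear
    interpolation, e.g. (0, 0.3) -> (20000, 0.1) decays from 0.3
    to 0.1 over the first 20k batches, then stays at 0.1."""

    def __init__(self, *points: tuple):
        self.points = sorted(points)   # (batch_count, value) breakpoints
        self.batch_count = 0.0         # advanced by the training loop

    def __float__(self) -> float:
        xs = [x for x, _ in self.points]
        i = bisect_right(xs, self.batch_count)
        if i == 0:
            return float(self.points[0][1])    # before first breakpoint
        if i == len(xs):
            return float(self.points[-1][1])   # past the last breakpoint
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        t = (self.batch_count - x0) / (x1 - x0)
        return float(y0 + t * (y1 - y0))

skip = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
skip.batch_count = 133778.33   # a batch_count from the surrounding log
print(float(skip))             # 0.1: fully decayed
```

Under that reading, the repeating `ans` values in this stretch (0.0, 0.1, 0.125, 0.2) look like end-of-schedule plateaus, consistent with epoch 8 being well past any warm-up breakpoints.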
2024-06-20 04:47:09,493 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2024-06-20 04:47:14,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=133686.66666666666, ans=0.125 2024-06-20 04:47:15,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=133705.0, ans=0.125 2024-06-20 04:47:23,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=133723.33333333334, ans=0.1 2024-06-20 04:47:27,409 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:47:30,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5 2024-06-20 04:47:35,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.26 vs. limit=12.0 2024-06-20 04:47:36,239 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:47:41,094 INFO [train.py:1028] (1/2) Epoch 8, batch 2150, loss[loss=0.2703, simple_loss=0.3126, pruned_loss=0.114, over 13267.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.306, pruned_loss=0.1176, over 2588265.99 frames. ], batch size: 52, lr: 7.48e-03, grad_scale: 64.0 2024-06-20 04:47:42,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.19 vs. limit=10.0 2024-06-20 04:47:43,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133778.33333333334, ans=0.1 2024-06-20 04:47:56,600 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=12.13 vs. limit=12.0 2024-06-20 04:48:02,684 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.825e+02 1.965e+02 2.203e+02 3.024e+02, threshold=3.930e+02, percent-clipped=0.0 2024-06-20 04:48:05,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=133833.33333333334, ans=0.125 2024-06-20 04:48:08,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.97 vs. limit=22.5 2024-06-20 04:48:12,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133851.66666666666, ans=0.125 2024-06-20 04:48:13,768 INFO [train.py:1028] (1/2) Epoch 8, batch 2200, loss[loss=0.2703, simple_loss=0.3032, pruned_loss=0.1187, over 13189.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3061, pruned_loss=0.1177, over 2587743.44 frames.
], batch size: 83, lr: 7.48e-03, grad_scale: 64.0 2024-06-20 04:48:17,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133870.0, ans=0.125 2024-06-20 04:48:17,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.55 vs. limit=22.5 2024-06-20 04:48:17,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0 2024-06-20 04:48:20,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=133888.33333333334, ans=0.04949747468305833 2024-06-20 04:48:20,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=133888.33333333334, ans=0.025 2024-06-20 04:48:32,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=133906.66666666666, ans=0.0 2024-06-20 04:48:40,277 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.97 vs. limit=22.5 2024-06-20 04:48:42,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=133943.33333333334, ans=10.0 2024-06-20 04:48:48,862 INFO [train.py:1028] (1/2) Epoch 8, batch 2250, loss[loss=0.2608, simple_loss=0.3033, pruned_loss=0.1092, over 13259.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.306, pruned_loss=0.1176, over 2586596.18 frames. ], batch size: 63, lr: 7.48e-03, grad_scale: 64.0 2024-06-20 04:48:56,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=133961.66666666666, ans=0.0 2024-06-20 04:48:57,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=133980.0, ans=0.125 2024-06-20 04:49:01,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.65 vs. limit=15.0 2024-06-20 04:49:04,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=133998.33333333334, ans=0.125 2024-06-20 04:49:07,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=133998.33333333334, ans=0.0 2024-06-20 04:49:10,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=134016.66666666666, ans=0.07 2024-06-20 04:49:13,135 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.937e+02 2.149e+02 2.614e+02 4.057e+02, threshold=4.298e+02, percent-clipped=1.0 2024-06-20 04:49:23,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=134035.0, ans=0.09899494936611666 2024-06-20 04:49:24,211 INFO [train.py:1028] (1/2) Epoch 8, batch 2300, loss[loss=0.2687, simple_loss=0.3069, pruned_loss=0.1153, over 12848.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3056, pruned_loss=0.1172, over 2579801.74 frames. 
], batch size: 33, lr: 7.48e-03, grad_scale: 64.0 2024-06-20 04:49:24,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=134053.33333333334, ans=0.125 2024-06-20 04:49:25,865 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.89 vs. limit=15.0 2024-06-20 04:49:33,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=134071.66666666666, ans=0.125 2024-06-20 04:49:44,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=134108.33333333334, ans=0.125 2024-06-20 04:49:57,043 INFO [train.py:1028] (1/2) Epoch 8, batch 2350, loss[loss=0.2902, simple_loss=0.328, pruned_loss=0.1262, over 13195.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3057, pruned_loss=0.1172, over 2583591.51 frames. ], batch size: 67, lr: 7.47e-03, grad_scale: 64.0 2024-06-20 04:50:15,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=134181.66666666666, ans=0.125 2024-06-20 04:50:23,215 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.881e+02 2.019e+02 2.216e+02 2.953e+02, threshold=4.038e+02, percent-clipped=0.0 2024-06-20 04:50:23,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=134200.0, ans=0.125 2024-06-20 04:50:26,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=134200.0, ans=0.0 2024-06-20 04:50:34,373 INFO [train.py:1028] (1/2) Epoch 8, batch 2400, loss[loss=0.2753, simple_loss=0.3094, pruned_loss=0.1206, over 13358.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3047, pruned_loss=0.1168, over 2586599.23 frames. ], batch size: 46, lr: 7.47e-03, grad_scale: 64.0 2024-06-20 04:50:50,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134273.33333333334, ans=0.1 2024-06-20 04:51:02,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=134310.0, ans=0.125 2024-06-20 04:51:02,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=134310.0, ans=0.0 2024-06-20 04:51:04,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=27.40 vs. limit=22.5 2024-06-20 04:51:06,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.03 vs. limit=15.0 2024-06-20 04:51:06,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.96 vs. limit=15.0 2024-06-20 04:51:08,847 INFO [train.py:1028] (1/2) Epoch 8, batch 2450, loss[loss=0.2494, simple_loss=0.283, pruned_loss=0.1079, over 13255.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3033, pruned_loss=0.1169, over 2582701.39 frames. 
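], batch size: 63, lr: 7.47e-03, grad_scale: 64.0

Each `WARNING [optim.py:487] Clipping_scale=2.0, grad-norm quartiles ...` line summarizes adaptive gradient clipping: the five numbers read as a min / 25% / median / 75% / max summary of recently observed gradient norms, and the threshold tracks the median times the clipping scale. The numbers in this log bear that out: in the most recent warning above, 2.0 x the 2.019e+02 median gives exactly the 4.038e+02 threshold. Below is a sketch of that bookkeeping; the class name, the window size, and the reporting format are my assumptions, not the real optim.py implementation.

```python
import torch

class GradNormClipper:
    """Clip gradients against clipping_scale * median of a window of
    recent global grad norms; track how often clipping fires."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms: list[float] = []
        self.num_seen = 0
        self.num_clipped = 0

    def clip_(self, params) -> None:
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads]))
        self.norms = (self.norms + [float(norm)])[-self.window:]
        threshold = self.clipping_scale * float(
            torch.median(torch.tensor(self.norms)))
        self.num_seen += 1
        if float(norm) > threshold:
            self.num_clipped += 1
            for g in grads:
                g.mul_(threshold / norm)   # rescale to the threshold

    def summary(self) -> str:
        q = torch.quantile(torch.tensor(self.norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        pct = 100.0 * self.num_clipped / max(1, self.num_seen)
        return ("grad-norm quartiles "
                + " ".join(f"{float(v):.3e}" for v in q)
                + f", threshold={self.clipping_scale * float(q[2]):.3e}"
                + f", percent-clipped={pct:.1f}")
```

`percent-clipped=0.0` throughout most of this stretch says the norm has stayed under twice its own recent median, i.e. training is currently stable; the isolated `percent-clipped=3.0` and `percent-clipped=1.0` entries mark brief spikes.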
2024-06-20 04:51:10,736 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.38 vs. limit=10.0 2024-06-20 04:51:23,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134365.0, ans=0.1 2024-06-20 04:51:25,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=134365.0, ans=0.0 2024-06-20 04:51:30,658 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.874e+02 1.988e+02 2.256e+02 3.124e+02, threshold=3.975e+02, percent-clipped=0.0 2024-06-20 04:51:33,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=134383.33333333334, ans=0.125 2024-06-20 04:51:36,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=134401.66666666666, ans=0.2 2024-06-20 04:51:41,473 INFO [train.py:1028] (1/2) Epoch 8, batch 2500, loss[loss=0.2659, simple_loss=0.2958, pruned_loss=0.1181, over 13265.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3025, pruned_loss=0.1165, over 2584809.68 frames. ], batch size: 83, lr: 7.47e-03, grad_scale: 64.0 2024-06-20 04:51:58,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2024-06-20 04:52:17,176 INFO [train.py:1028] (1/2) Epoch 8, batch 2550, loss[loss=0.2531, simple_loss=0.2986, pruned_loss=0.1038, over 12611.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3015, pruned_loss=0.1161, over 2585905.94 frames. ], batch size: 22, lr: 7.46e-03, grad_scale: 64.0 2024-06-20 04:52:18,327 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2024-06-20 04:52:19,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=134511.66666666666, ans=0.125 2024-06-20 04:52:22,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=134511.66666666666, ans=0.5 2024-06-20 04:52:22,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=134511.66666666666, ans=0.025 2024-06-20 04:52:24,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=134530.0, ans=0.125 2024-06-20 04:52:41,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.25 vs. limit=22.5 2024-06-20 04:52:41,920 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.867e+02 2.044e+02 2.234e+02 2.958e+02, threshold=4.088e+02, percent-clipped=0.0 2024-06-20 04:52:45,840 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.14 vs. limit=22.5 2024-06-20 04:52:45,880 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.14 vs.
limit=22.5 2024-06-20 04:52:53,211 INFO [train.py:1028] (1/2) Epoch 8, batch 2600, loss[loss=0.252, simple_loss=0.2927, pruned_loss=0.1056, over 13292.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3005, pruned_loss=0.1156, over 2585907.04 frames. ], batch size: 52, lr: 7.46e-03, grad_scale: 64.0 2024-06-20 04:52:54,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.65 vs. limit=15.0 2024-06-20 04:53:03,802 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.23 vs. limit=22.5 2024-06-20 04:53:08,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=134640.0, ans=0.0 2024-06-20 04:53:13,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=134658.33333333334, ans=0.125 2024-06-20 04:53:15,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=134658.33333333334, ans=0.0 2024-06-20 04:53:18,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=134658.33333333334, ans=0.125 2024-06-20 04:53:24,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.57 vs. limit=15.0 2024-06-20 04:53:24,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=134676.66666666666, ans=0.125 2024-06-20 04:53:25,934 INFO [train.py:1028] (1/2) Epoch 8, batch 2650, loss[loss=0.2465, simple_loss=0.2752, pruned_loss=0.1089, over 13059.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.2991, pruned_loss=0.1148, over 2586075.35 frames. ], batch size: 144, lr: 7.46e-03, grad_scale: 64.0 2024-06-20 04:53:33,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=134713.33333333334, ans=0.125 2024-06-20 04:53:34,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.52 vs. limit=22.5 2024-06-20 04:53:38,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=134731.66666666666, ans=0.1 2024-06-20 04:53:41,754 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=20.95 vs. 
limit=15.0 2024-06-20 04:53:42,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=134731.66666666666, ans=0.125 2024-06-20 04:53:47,045 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.870e+02 1.982e+02 2.217e+02 3.323e+02, threshold=3.965e+02, percent-clipped=0.0 2024-06-20 04:53:51,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=134768.33333333334, ans=0.125 2024-06-20 04:53:51,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=134768.33333333334, ans=0.1 2024-06-20 04:53:58,363 INFO [train.py:1028] (1/2) Epoch 8, batch 2700, loss[loss=0.2554, simple_loss=0.2918, pruned_loss=0.1095, over 13273.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.2972, pruned_loss=0.1143, over 2584723.81 frames. ], batch size: 89, lr: 7.46e-03, grad_scale: 64.0 2024-06-20 04:54:04,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=134786.66666666666, ans=0.1 2024-06-20 04:54:04,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134786.66666666666, ans=0.1 2024-06-20 04:54:08,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=134805.0, ans=0.5 2024-06-20 04:54:15,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=134823.33333333334, ans=0.125 2024-06-20 04:54:23,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=134841.66666666666, ans=0.025 2024-06-20 04:54:37,051 INFO [train.py:1028] (1/2) Epoch 8, batch 2750, loss[loss=0.2568, simple_loss=0.2864, pruned_loss=0.1136, over 13249.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.2959, pruned_loss=0.1133, over 2581835.26 frames. ], batch size: 43, lr: 7.45e-03, grad_scale: 64.0 2024-06-20 04:54:40,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=134878.33333333334, ans=0.95 2024-06-20 04:54:42,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=134878.33333333334, ans=22.5 2024-06-20 04:54:55,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=134915.0, ans=0.125 2024-06-20 04:54:58,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=134933.33333333334, ans=0.125 2024-06-20 04:54:58,717 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.772e+02 1.968e+02 2.085e+02 2.881e+02, threshold=3.936e+02, percent-clipped=0.0 2024-06-20 04:54:58,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=134933.33333333334, ans=0.2 2024-06-20 04:55:09,797 INFO [train.py:1028] (1/2) Epoch 8, batch 2800, loss[loss=0.2897, simple_loss=0.2982, pruned_loss=0.1406, over 10773.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.2946, pruned_loss=0.1126, over 2579362.30 frames. 
], batch size: 303, lr: 7.45e-03, grad_scale: 64.0 2024-06-20 04:55:20,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=134988.33333333334, ans=0.125 2024-06-20 04:55:29,900 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.255e+01 2024-06-20 04:55:31,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=135025.0, ans=0.0 2024-06-20 04:55:34,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=12.0 2024-06-20 04:55:41,605 INFO [train.py:1028] (1/2) Epoch 8, batch 2850, loss[loss=0.2664, simple_loss=0.3018, pruned_loss=0.1156, over 13251.00 frames. ], tot_loss[loss=0.26, simple_loss=0.2941, pruned_loss=0.113, over 2576651.00 frames. ], batch size: 49, lr: 7.45e-03, grad_scale: 64.0 2024-06-20 04:55:41,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=135061.66666666666, ans=0.125 2024-06-20 04:55:53,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2024-06-20 04:55:54,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=135080.0, ans=0.07 2024-06-20 04:55:56,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=135098.33333333334, ans=0.2 2024-06-20 04:55:58,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=135098.33333333334, ans=0.125 2024-06-20 04:56:04,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.18 vs. limit=15.0 2024-06-20 04:56:05,533 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.884e+02 2.077e+02 2.389e+02 3.806e+02, threshold=4.155e+02, percent-clipped=0.0 2024-06-20 04:56:09,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=135135.0, ans=0.0 2024-06-20 04:56:19,161 INFO [train.py:1028] (1/2) Epoch 8, batch 2900, loss[loss=0.2561, simple_loss=0.299, pruned_loss=0.1066, over 13159.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.2927, pruned_loss=0.1124, over 2584758.97 frames. ], batch size: 55, lr: 7.45e-03, grad_scale: 64.0 2024-06-20 04:56:20,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=135153.33333333334, ans=0.125 2024-06-20 04:56:21,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.25 vs. limit=15.0 2024-06-20 04:56:39,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=135208.33333333334, ans=0.2 2024-06-20 04:56:48,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=135226.66666666666, ans=0.2 2024-06-20 04:56:51,973 INFO [train.py:1028] (1/2) Epoch 8, batch 2950, loss[loss=0.2498, simple_loss=0.2936, pruned_loss=0.103, over 13272.00 frames. 
], tot_loss[loss=0.2587, simple_loss=0.2924, pruned_loss=0.1125, over 2579138.73 frames. ], batch size: 43, lr: 7.44e-03, grad_scale: 64.0 2024-06-20 04:56:58,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=135263.33333333334, ans=0.0 2024-06-20 04:57:00,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=135263.33333333334, ans=0.0 2024-06-20 04:57:08,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.65 vs. limit=10.0 2024-06-20 04:57:12,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=135300.0, ans=0.125 2024-06-20 04:57:13,758 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.806e+02 1.932e+02 2.126e+02 3.138e+02, threshold=3.865e+02, percent-clipped=0.0 2024-06-20 04:57:16,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=135300.0, ans=0.125 2024-06-20 04:57:16,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=135300.0, ans=0.125 2024-06-20 04:57:18,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=135318.33333333334, ans=0.0 2024-06-20 04:57:20,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=135318.33333333334, ans=0.125 2024-06-20 04:57:24,563 INFO [train.py:1028] (1/2) Epoch 8, batch 3000, loss[loss=0.246, simple_loss=0.2882, pruned_loss=0.1019, over 13198.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.2906, pruned_loss=0.1116, over 2577455.69 frames. ], batch size: 59, lr: 7.44e-03, grad_scale: 64.0 2024-06-20 04:57:24,564 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 04:57:30,234 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.2801, 3.8153, 4.2365, 3.8866], device='cuda:1') 2024-06-20 04:57:32,457 INFO [train.py:1060] (1/2) Epoch 8, validation: loss=0.208, simple_loss=0.2702, pruned_loss=0.07292, over 351949.00 frames. 2024-06-20 04:57:32,457 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 04:57:37,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=135336.66666666666, ans=0.0 2024-06-20 04:58:08,945 INFO [train.py:1028] (1/2) Epoch 8, batch 3050, loss[loss=0.251, simple_loss=0.2822, pruned_loss=0.1099, over 13249.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.2896, pruned_loss=0.1114, over 2578207.19 frames. ], batch size: 46, lr: 7.44e-03, grad_scale: 64.0 2024-06-20 04:58:09,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.91 vs. 
limit=15.0 2024-06-20 04:58:20,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=135446.66666666666, ans=0.0 2024-06-20 04:58:21,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135446.66666666666, ans=0.1 2024-06-20 04:58:33,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.98 vs. limit=15.0 2024-06-20 04:58:33,720 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.797e+02 1.907e+02 2.114e+02 2.900e+02, threshold=3.814e+02, percent-clipped=0.0 2024-06-20 04:58:35,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.22 vs. limit=6.0 2024-06-20 04:58:36,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.23 vs. limit=22.5 2024-06-20 04:58:39,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=135501.66666666666, ans=0.125 2024-06-20 04:58:44,688 INFO [train.py:1028] (1/2) Epoch 8, batch 3100, loss[loss=0.2433, simple_loss=0.269, pruned_loss=0.1088, over 13038.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.2881, pruned_loss=0.1106, over 2578357.65 frames. ], batch size: 144, lr: 7.44e-03, grad_scale: 64.0 2024-06-20 04:58:51,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=135538.33333333334, ans=0.125 2024-06-20 04:58:52,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.93 vs. limit=15.0 2024-06-20 04:58:53,399 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.61 vs. limit=15.0 2024-06-20 04:58:54,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=135538.33333333334, ans=6.0 2024-06-20 04:59:09,717 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.14 vs. limit=10.0 2024-06-20 04:59:10,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.63 vs. limit=15.0 2024-06-20 04:59:12,030 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:59:17,082 INFO [train.py:1028] (1/2) Epoch 8, batch 3150, loss[loss=0.2431, simple_loss=0.274, pruned_loss=0.1061, over 12935.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.2873, pruned_loss=0.11, over 2581174.54 frames. 
], batch size: 158, lr: 7.43e-03, grad_scale: 64.0 2024-06-20 04:59:38,538 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.909e+02 2.068e+02 2.341e+02 3.439e+02, threshold=4.137e+02, percent-clipped=0.0 2024-06-20 04:59:39,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=135666.66666666666, ans=0.0 2024-06-20 04:59:40,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=135666.66666666666, ans=0.125 2024-06-20 04:59:50,798 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=12.81 vs. limit=15.0 2024-06-20 04:59:52,837 INFO [train.py:1028] (1/2) Epoch 8, batch 3200, loss[loss=0.2346, simple_loss=0.2734, pruned_loss=0.09785, over 13164.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.2874, pruned_loss=0.1099, over 2582344.96 frames. ], batch size: 55, lr: 7.43e-03, grad_scale: 64.0 2024-06-20 04:59:54,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=135703.33333333334, ans=0.2 2024-06-20 04:59:56,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2024-06-20 05:00:01,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=135721.66666666666, ans=0.125 2024-06-20 05:00:08,690 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=12.0 2024-06-20 05:00:13,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=135758.33333333334, ans=0.0 2024-06-20 05:00:23,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=135776.66666666666, ans=6.0 2024-06-20 05:00:24,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135776.66666666666, ans=0.1 2024-06-20 05:00:29,061 INFO [train.py:1028] (1/2) Epoch 8, batch 3250, loss[loss=0.24, simple_loss=0.2774, pruned_loss=0.1013, over 13145.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.2866, pruned_loss=0.1097, over 2585392.02 frames. ], batch size: 72, lr: 7.43e-03, grad_scale: 64.0 2024-06-20 05:00:30,914 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.22 vs. 
limit=15.0 2024-06-20 05:00:34,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=135795.0, ans=0.09899494936611666 2024-06-20 05:00:39,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=135813.33333333334, ans=0.125 2024-06-20 05:00:44,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=135831.66666666666, ans=0.0 2024-06-20 05:00:44,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=135831.66666666666, ans=22.5 2024-06-20 05:00:48,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=135831.66666666666, ans=0.04949747468305833 2024-06-20 05:00:49,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=135831.66666666666, ans=0.0 2024-06-20 05:00:52,680 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.820e+02 2.058e+02 2.250e+02 4.292e+02, threshold=4.116e+02, percent-clipped=1.0 2024-06-20 05:00:54,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=135850.0, ans=0.0 2024-06-20 05:00:54,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=135850.0, ans=0.125 2024-06-20 05:01:04,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=135886.66666666666, ans=0.125 2024-06-20 05:01:04,576 INFO [train.py:1028] (1/2) Epoch 8, batch 3300, loss[loss=0.2578, simple_loss=0.2877, pruned_loss=0.1139, over 12765.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.2863, pruned_loss=0.1092, over 2581035.42 frames. ], batch size: 176, lr: 7.43e-03, grad_scale: 64.0 2024-06-20 05:01:07,916 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:01:17,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135923.33333333334, ans=0.1 2024-06-20 05:01:29,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=135941.66666666666, ans=0.035 2024-06-20 05:01:40,417 INFO [train.py:1028] (1/2) Epoch 8, batch 3350, loss[loss=0.2531, simple_loss=0.2827, pruned_loss=0.1118, over 12887.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.286, pruned_loss=0.1094, over 2575691.71 frames. ], batch size: 158, lr: 7.42e-03, grad_scale: 64.0 2024-06-20 05:01:41,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=135978.33333333334, ans=0.0 2024-06-20 05:01:42,785 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. 
limit=15.0 2024-06-20 05:01:49,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=135996.66666666666, ans=0.125 2024-06-20 05:01:55,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=136015.0, ans=0.125 2024-06-20 05:01:56,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=136015.0, ans=0.025 2024-06-20 05:02:02,193 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.848e+02 1.987e+02 2.217e+02 3.289e+02, threshold=3.973e+02, percent-clipped=0.0 2024-06-20 05:02:15,861 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=22.5 2024-06-20 05:02:16,193 INFO [train.py:1028] (1/2) Epoch 8, batch 3400, loss[loss=0.2417, simple_loss=0.2767, pruned_loss=0.1033, over 12649.00 frames. ], tot_loss[loss=0.253, simple_loss=0.2862, pruned_loss=0.1099, over 2575035.42 frames. ], batch size: 22, lr: 7.42e-03, grad_scale: 64.0 2024-06-20 05:02:16,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=136070.0, ans=0.125 2024-06-20 05:02:16,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.62 vs. limit=22.5 2024-06-20 05:02:23,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=136088.33333333334, ans=0.125 2024-06-20 05:02:24,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=136088.33333333334, ans=0.125 2024-06-20 05:02:28,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=136088.33333333334, ans=0.125 2024-06-20 05:02:30,629 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.83 vs. limit=15.0 2024-06-20 05:02:36,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=136125.0, ans=0.025 2024-06-20 05:02:41,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=136143.33333333334, ans=0.05 2024-06-20 05:02:47,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=136143.33333333334, ans=0.125 2024-06-20 05:02:49,126 INFO [train.py:1028] (1/2) Epoch 8, batch 3450, loss[loss=0.2738, simple_loss=0.2968, pruned_loss=0.1254, over 12761.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.2848, pruned_loss=0.109, over 2576294.89 frames. 
], batch size: 176, lr: 7.42e-03, grad_scale: 64.0 2024-06-20 05:02:51,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=136161.66666666666, ans=0.125 2024-06-20 05:02:53,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=136161.66666666666, ans=0.125 2024-06-20 05:03:03,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.79 vs. limit=10.0 2024-06-20 05:03:10,750 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.781e+02 1.932e+02 2.153e+02 2.844e+02, threshold=3.864e+02, percent-clipped=0.0 2024-06-20 05:03:17,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.46 vs. limit=15.0 2024-06-20 05:03:22,690 INFO [train.py:1028] (1/2) Epoch 8, batch 3500, loss[loss=0.2691, simple_loss=0.2955, pruned_loss=0.1214, over 12952.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.2841, pruned_loss=0.1084, over 2575953.15 frames. ], batch size: 33, lr: 7.42e-03, grad_scale: 128.0 2024-06-20 05:03:29,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.92 vs. limit=15.0 2024-06-20 05:03:42,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=136290.0, ans=0.02 2024-06-20 05:03:56,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. limit=10.0 2024-06-20 05:03:58,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=136326.66666666666, ans=0.125 2024-06-20 05:03:59,373 INFO [train.py:1028] (1/2) Epoch 8, batch 3550, loss[loss=0.2446, simple_loss=0.2752, pruned_loss=0.107, over 13148.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.2836, pruned_loss=0.1081, over 2576801.09 frames. ], batch size: 95, lr: 7.41e-03, grad_scale: 128.0 2024-06-20 05:04:08,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.01 vs. limit=15.0 2024-06-20 05:04:16,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.23 vs. limit=12.0 2024-06-20 05:04:24,123 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.786e+02 1.910e+02 2.334e+02 3.313e+02, threshold=3.819e+02, percent-clipped=0.0 2024-06-20 05:04:35,465 INFO [train.py:1028] (1/2) Epoch 8, batch 3600, loss[loss=0.2263, simple_loss=0.2648, pruned_loss=0.0939, over 13300.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.2833, pruned_loss=0.1082, over 2580271.56 frames. 
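
The scaling.py:214 lines each record a ScheduledFloat: a module hyperparameter (a dropout p, a skip rate, a minimum scale, ...) whose current value, printed as ans, is looked up from the float batch_count on a piecewise-linear schedule. A minimal sketch of that lookup follows; the breakpoints are purely illustrative and are not this run's actual schedules.

```python
# Minimal piecewise-linear schedule evaluated at a float batch_count.
# The breakpoints below are illustrative only.
import bisect

class PiecewiseLinear:
    def __init__(self, *points):  # points: (batch_count, value) pairs, sorted
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, x: float) -> float:
        if x <= self.xs[0]:
            return self.ys[0]
        if x >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, x)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# e.g. a skip rate that decays to zero early in training and stays there:
conv_skip_rate = PiecewiseLinear((0.0, 0.5), (4000.0, 0.05), (20000.0, 0.0))
print(conv_skip_rate(135850.0))  # 0.0, matching the ans=0.0 entries at this batch_count
```
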
], batch size: 49, lr: 7.41e-03, grad_scale: 128.0 2024-06-20 05:04:45,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=136455.0, ans=0.125 2024-06-20 05:04:50,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=136473.33333333334, ans=0.5 2024-06-20 05:04:51,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=136473.33333333334, ans=0.1 2024-06-20 05:04:51,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136473.33333333334, ans=0.1 2024-06-20 05:04:57,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=136491.66666666666, ans=0.2 2024-06-20 05:05:05,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=136510.0, ans=0.125 2024-06-20 05:05:08,168 INFO [train.py:1028] (1/2) Epoch 8, batch 3650, loss[loss=0.2553, simple_loss=0.2819, pruned_loss=0.1143, over 13037.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.2832, pruned_loss=0.1079, over 2578258.46 frames. ], batch size: 102, lr: 7.41e-03, grad_scale: 128.0 2024-06-20 05:05:08,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=136528.33333333334, ans=0.125 2024-06-20 05:05:08,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=136528.33333333334, ans=0.1 2024-06-20 05:05:08,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=136528.33333333334, ans=0.125 2024-06-20 05:05:10,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=136528.33333333334, ans=0.0 2024-06-20 05:05:24,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136565.0, ans=0.1 2024-06-20 05:05:30,574 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.720e+02 1.874e+02 2.055e+02 2.444e+02, threshold=3.749e+02, percent-clipped=0.0 2024-06-20 05:05:31,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.13 vs. 
limit=15.0 2024-06-20 05:05:32,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=136583.33333333334, ans=0.125 2024-06-20 05:05:41,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=136601.66666666666, ans=0.125 2024-06-20 05:05:42,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=136601.66666666666, ans=0.125 2024-06-20 05:05:42,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136601.66666666666, ans=0.1 2024-06-20 05:05:42,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=136601.66666666666, ans=0.0 2024-06-20 05:05:45,022 INFO [train.py:1028] (1/2) Epoch 8, batch 3700, loss[loss=0.2269, simple_loss=0.267, pruned_loss=0.09338, over 13223.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.2814, pruned_loss=0.1068, over 2583632.66 frames. ], batch size: 72, lr: 7.41e-03, grad_scale: 64.0 2024-06-20 05:05:56,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=136638.33333333334, ans=0.125 2024-06-20 05:06:04,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.87 vs. limit=22.5 2024-06-20 05:06:07,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=136675.0, ans=0.125 2024-06-20 05:06:11,637 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.821e+01 2024-06-20 05:06:19,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=136693.33333333334, ans=0.0 2024-06-20 05:06:21,500 INFO [train.py:1028] (1/2) Epoch 8, batch 3750, loss[loss=0.2242, simple_loss=0.2638, pruned_loss=0.09229, over 12618.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2807, pruned_loss=0.1065, over 2586041.41 frames. ], batch size: 22, lr: 7.40e-03, grad_scale: 64.0 2024-06-20 05:06:26,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=136711.66666666666, ans=0.125 2024-06-20 05:06:27,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=136730.0, ans=0.09899494936611666 2024-06-20 05:06:28,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=136730.0, ans=0.125 2024-06-20 05:06:43,735 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.818e+02 1.938e+02 2.146e+02 2.930e+02, threshold=3.875e+02, percent-clipped=0.0 2024-06-20 05:06:54,359 INFO [train.py:1028] (1/2) Epoch 8, batch 3800, loss[loss=0.2654, simple_loss=0.2943, pruned_loss=0.1182, over 13284.00 frames. ], tot_loss[loss=0.247, simple_loss=0.2808, pruned_loss=0.1066, over 2585059.75 frames. 
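
The optim.py:487 warnings summarize the recent gradient-norm distribution as five quartiles (min / 25% / median / 75% / max). In every warning in this excerpt the printed threshold equals clipping_scale times the median: 2.0 x 2.058e+02 = 4.116e+02 in the first one, and 2.0 x 1.938e+02 = 3.876e+02 here (3.875e+02 up to display rounding). So the clipping threshold appears to track a running median grad-norm, and percent-clipped is the share of recent steps that exceeded it. A sketch under that reading:

```python
# Reading of the warnings: threshold = clipping_scale * median(recent grad norms),
# and a gradient whose norm exceeds it is scaled down to the threshold.
clipping_scale = 2.0
quartiles = [1.524e+02, 1.818e+02, 1.938e+02, 2.146e+02, 2.930e+02]  # min,25%,50%,75%,max
threshold = clipping_scale * quartiles[2]
print(f"threshold={threshold:.3e}")  # 3.876e+02, matching the logged value up to rounding

def clip_factor(grad_norm: float) -> float:
    """Factor to multiply gradients by (1.0 when under the threshold)."""
    return min(1.0, threshold / grad_norm)

print(clip_factor(4.5e+02))  # an outlier step gets scaled down; percent-clipped counts these
```
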
], batch size: 83, lr: 7.40e-03, grad_scale: 64.0 2024-06-20 05:06:56,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=136803.33333333334, ans=0.2 2024-06-20 05:07:03,987 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=15.0 2024-06-20 05:07:16,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=136858.33333333334, ans=0.0 2024-06-20 05:07:26,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=136895.0, ans=0.1 2024-06-20 05:07:26,628 INFO [train.py:1028] (1/2) Epoch 8, batch 3850, loss[loss=0.2573, simple_loss=0.2815, pruned_loss=0.1166, over 13002.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.2809, pruned_loss=0.1066, over 2584238.11 frames. ], batch size: 144, lr: 7.40e-03, grad_scale: 64.0 2024-06-20 05:07:50,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=136950.0, ans=0.125 2024-06-20 05:07:51,550 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.737e+02 1.900e+02 2.204e+02 3.164e+02, threshold=3.801e+02, percent-clipped=0.0 2024-06-20 05:07:55,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=136968.33333333334, ans=0.0 2024-06-20 05:07:57,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=136968.33333333334, ans=0.125 2024-06-20 05:08:01,771 INFO [train.py:1028] (1/2) Epoch 8, batch 3900, loss[loss=0.2233, simple_loss=0.2609, pruned_loss=0.09285, over 13166.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.2805, pruned_loss=0.1063, over 2586687.85 frames. ], batch size: 83, lr: 7.40e-03, grad_scale: 64.0 2024-06-20 05:08:05,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=136986.66666666666, ans=0.0 2024-06-20 05:08:18,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=137023.33333333334, ans=0.025 2024-06-20 05:08:23,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=137041.66666666666, ans=0.125 2024-06-20 05:08:30,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=137041.66666666666, ans=0.125 2024-06-20 05:08:37,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=137078.33333333334, ans=15.0 2024-06-20 05:08:37,731 INFO [train.py:1028] (1/2) Epoch 8, batch 3950, loss[loss=0.2672, simple_loss=0.2945, pruned_loss=0.1199, over 13130.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.2795, pruned_loss=0.1056, over 2587425.57 frames. 
], batch size: 132, lr: 7.39e-03, grad_scale: 64.0 2024-06-20 05:08:42,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=137078.33333333334, ans=15.0 2024-06-20 05:08:44,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=137096.66666666666, ans=0.0 2024-06-20 05:08:45,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=137096.66666666666, ans=0.025 2024-06-20 05:08:46,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=137096.66666666666, ans=0.0 2024-06-20 05:08:57,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=137133.33333333334, ans=0.125 2024-06-20 05:09:00,214 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.776e+02 1.920e+02 2.146e+02 2.812e+02, threshold=3.840e+02, percent-clipped=0.0 2024-06-20 05:09:06,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.06 vs. limit=22.5 2024-06-20 05:09:07,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=137151.66666666666, ans=0.125 2024-06-20 05:09:10,410 INFO [train.py:1028] (1/2) Epoch 8, batch 4000, loss[loss=0.2611, simple_loss=0.2955, pruned_loss=0.1134, over 12883.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.2792, pruned_loss=0.1057, over 2582336.49 frames. ], batch size: 39, lr: 7.39e-03, grad_scale: 64.0 2024-06-20 05:09:26,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=137206.66666666666, ans=0.125 2024-06-20 05:09:27,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=137206.66666666666, ans=0.0 2024-06-20 05:09:36,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=137243.33333333334, ans=0.125 2024-06-20 05:09:36,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=137243.33333333334, ans=0.07 2024-06-20 05:09:41,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.05 vs. limit=15.0 2024-06-20 05:09:47,229 INFO [train.py:1028] (1/2) Epoch 8, batch 4050, loss[loss=0.2533, simple_loss=0.2751, pruned_loss=0.1157, over 11038.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.2792, pruned_loss=0.1057, over 2580973.03 frames. 
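
Note that batch_count in the scaling.py lines is not the integer batch index from the summaries: between batch 3300 (nearest batch_count 135886.67) and batch 3400 (batch_count 136070.0) it advances by 183.33 over 100 optimizer steps, i.e. exactly 11/6 per step. That is consistent with a duration-weighted step counter (each step counted as the batch's audio duration over some reference duration) rather than a raw step count; this interpretation is inferred from the deltas only.

```python
# batch_count advances 11/6 per optimizer step in this excerpt, suggesting a
# duration-weighted counter; the values are copied from the lines above.
bc_near_3300 = 135886.66666666666
bc_near_3400 = 136070.0
per_step = (bc_near_3400 - bc_near_3300) / 100
print(per_step, 11 / 6)  # 1.8333... for both
```
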
], batch size: 304, lr: 7.39e-03, grad_scale: 64.0 2024-06-20 05:09:48,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=137261.66666666666, ans=0.0 2024-06-20 05:09:53,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=137280.0, ans=0.1 2024-06-20 05:10:07,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=137316.66666666666, ans=0.125 2024-06-20 05:10:09,392 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.844e+02 2.040e+02 2.255e+02 2.893e+02, threshold=4.079e+02, percent-clipped=0.0 2024-06-20 05:10:20,076 INFO [train.py:1028] (1/2) Epoch 8, batch 4100, loss[loss=0.2507, simple_loss=0.2776, pruned_loss=0.1119, over 12997.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.2792, pruned_loss=0.1061, over 2577392.70 frames. ], batch size: 102, lr: 7.39e-03, grad_scale: 64.0 2024-06-20 05:10:20,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=137353.33333333334, ans=0.125 2024-06-20 05:10:26,328 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.27 vs. limit=22.5 2024-06-20 05:10:29,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=137371.66666666666, ans=0.0 2024-06-20 05:10:42,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.72 vs. limit=6.0 2024-06-20 05:10:47,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=137408.33333333334, ans=0.2 2024-06-20 05:10:56,819 INFO [train.py:1028] (1/2) Epoch 8, batch 4150, loss[loss=0.2268, simple_loss=0.2701, pruned_loss=0.09176, over 13129.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.2791, pruned_loss=0.1061, over 2576316.92 frames. ], batch size: 55, lr: 7.38e-03, grad_scale: 64.0 2024-06-20 05:11:02,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=137445.0, ans=0.0 2024-06-20 05:11:13,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137481.66666666666, ans=0.1 2024-06-20 05:11:18,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=137500.0, ans=0.0 2024-06-20 05:11:19,359 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.819e+02 1.960e+02 2.131e+02 3.025e+02, threshold=3.921e+02, percent-clipped=0.0 2024-06-20 05:11:28,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=137518.33333333334, ans=0.0 2024-06-20 05:11:29,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137536.66666666666, ans=0.1 2024-06-20 05:11:29,918 INFO [train.py:1028] (1/2) Epoch 8, batch 4200, loss[loss=0.2443, simple_loss=0.2779, pruned_loss=0.1054, over 13038.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.2783, pruned_loss=0.1057, over 2579258.66 frames. 
], batch size: 102, lr: 7.38e-03, grad_scale: 64.0 2024-06-20 05:11:34,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=137536.66666666666, ans=0.0 2024-06-20 05:11:38,942 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=4.768e+01 2024-06-20 05:11:55,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137591.66666666666, ans=0.1 2024-06-20 05:11:58,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=137591.66666666666, ans=0.0 2024-06-20 05:12:06,714 INFO [train.py:1028] (1/2) Epoch 8, batch 4250, loss[loss=0.2469, simple_loss=0.2812, pruned_loss=0.1063, over 13282.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.2778, pruned_loss=0.1053, over 2581389.64 frames. ], batch size: 46, lr: 7.38e-03, grad_scale: 64.0 2024-06-20 05:12:28,895 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.736e+02 1.867e+02 2.078e+02 2.917e+02, threshold=3.733e+02, percent-clipped=0.0 2024-06-20 05:12:29,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=137683.33333333334, ans=0.0 2024-06-20 05:12:33,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=137683.33333333334, ans=0.125 2024-06-20 05:12:42,921 INFO [train.py:1028] (1/2) Epoch 8, batch 4300, loss[loss=0.2312, simple_loss=0.2677, pruned_loss=0.09733, over 13179.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.2775, pruned_loss=0.1049, over 2581192.47 frames. ], batch size: 59, lr: 7.38e-03, grad_scale: 64.0 2024-06-20 05:12:49,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=137738.33333333334, ans=0.2 2024-06-20 05:12:50,557 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.30 vs. limit=15.0 2024-06-20 05:12:54,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.57 vs. limit=15.0 2024-06-20 05:13:09,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=137793.33333333334, ans=0.2 2024-06-20 05:13:10,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.26 vs. limit=22.5 2024-06-20 05:13:13,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=137793.33333333334, ans=0.125 2024-06-20 05:13:15,292 INFO [train.py:1028] (1/2) Epoch 8, batch 4350, loss[loss=0.2556, simple_loss=0.2917, pruned_loss=0.1097, over 13204.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2768, pruned_loss=0.1047, over 2585534.41 frames. 
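
grad_scale in the summaries is the dynamic fp16 loss-scaling factor: it doubles after a long enough run of finite steps (64.0 through batch 3450, then 128.0 from batch 3500) and is halved again when a step overflows (back to 64.0 by batch 3700). The sketch below shows the standard torch.cuda.amp pattern that produces this kind of trace; it is a generic illustration, not the recipe's training loop.

```python
# Generic dynamic loss scaling with AMP; illustrates the grad_scale movement above.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(80, 512).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(init_scale=64.0)  # doubles on stable runs, halves on overflow

for step in range(4):
    opt.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):  # fp16 forward
        loss = model(torch.randn(8, 80, device=device)).pow(2).mean()
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(opt)               # unscales grads; skips the update on overflow
    scaler.update()                # adjusts the scale, i.e. the logged grad_scale
    if device == "cuda":
        print(step, scaler.get_scale())
```
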
], batch size: 59, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:13:23,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=137830.0, ans=0.0 2024-06-20 05:13:27,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=137848.33333333334, ans=0.0 2024-06-20 05:13:28,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=137848.33333333334, ans=0.0 2024-06-20 05:13:33,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137848.33333333334, ans=0.1 2024-06-20 05:13:34,589 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.38 vs. limit=10.0 2024-06-20 05:13:37,319 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.788e+02 1.998e+02 2.264e+02 3.018e+02, threshold=3.995e+02, percent-clipped=0.0 2024-06-20 05:13:44,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=137885.0, ans=0.09899494936611666 2024-06-20 05:13:45,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=137885.0, ans=0.0 2024-06-20 05:13:50,878 INFO [train.py:1028] (1/2) Epoch 8, batch 4400, loss[loss=0.2249, simple_loss=0.2607, pruned_loss=0.09459, over 13213.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.276, pruned_loss=0.1042, over 2585879.22 frames. ], batch size: 83, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:13:52,410 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.11 vs. limit=15.0 2024-06-20 05:14:01,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.76 vs. limit=6.0 2024-06-20 05:14:03,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=137921.66666666666, ans=0.125 2024-06-20 05:14:11,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=137958.33333333334, ans=0.05 2024-06-20 05:14:11,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=137958.33333333334, ans=0.2 2024-06-20 05:14:20,361 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.33 vs. limit=22.5 2024-06-20 05:14:24,581 INFO [train.py:1028] (1/2) Epoch 8, batch 4450, loss[loss=0.2581, simple_loss=0.2952, pruned_loss=0.1105, over 12928.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2764, pruned_loss=0.1049, over 2579742.48 frames. 
], batch size: 33, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:14:42,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=138031.66666666666, ans=0.0 2024-06-20 05:14:45,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=138031.66666666666, ans=0.125 2024-06-20 05:14:48,508 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.78 vs. limit=12.0 2024-06-20 05:14:49,461 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.796e+02 2.028e+02 2.357e+02 4.135e+02, threshold=4.056e+02, percent-clipped=2.0 2024-06-20 05:14:57,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=138068.33333333334, ans=0.0 2024-06-20 05:14:59,658 INFO [train.py:1028] (1/2) Epoch 8, batch 4500, loss[loss=0.2351, simple_loss=0.2689, pruned_loss=0.1006, over 13276.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.2754, pruned_loss=0.1044, over 2584646.63 frames. ], batch size: 89, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:15:10,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=138105.0, ans=0.0 2024-06-20 05:15:12,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2024-06-20 05:15:12,437 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.03 vs. limit=15.0 2024-06-20 05:15:14,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.32 vs. limit=22.5 2024-06-20 05:15:15,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=138123.33333333334, ans=0.0 2024-06-20 05:15:20,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=138141.66666666666, ans=0.0 2024-06-20 05:15:32,270 INFO [train.py:1028] (1/2) Epoch 8, batch 4550, loss[loss=0.2317, simple_loss=0.2712, pruned_loss=0.09614, over 13291.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.2763, pruned_loss=0.1046, over 2589171.88 frames. 
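
The batch sizes in this excerpt swing from 16 to 304 cuts while the per-batch frame totals stay near 12-13k: the sampler caps the total audio per batch, so batches of short utterances hold many cuts and batches of long ones hold few. For batch 4450 above, 12928 frames / 33 cuts is about 392 frames per cut, versus 12761 / 176 = 73 for batch 3450; assuming these are 4x-subsampled 10 ms frames, both batches carry roughly 510-520 s of audio, just under a cap of about 550 s (the cap value is an inference from these numbers). A toy duration-capped batcher:

```python
# Toy duration-capped batching: fill a batch until adding a cut would exceed the cap.
# The real sampler also buckets by length and shuffles; this shows only the cap.
def duration_capped_batches(durations, max_duration):
    batch, total = [], 0.0
    for i, d in enumerate(durations):
        if batch and total + d > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(i)
        total += d
    if batch:
        yield batch

cap = 550.0  # seconds per batch, inferred above
print(len(next(duration_capped_batches([2.9] * 1000, cap))))   # 189 short cuts per batch
print(len(next(duration_capped_batches([15.7] * 1000, cap))))  # 35 long cuts per batch
```
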
], batch size: 52, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:15:39,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=138196.66666666666, ans=0.125 2024-06-20 05:15:46,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=138215.0, ans=0.125 2024-06-20 05:15:58,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=138233.33333333334, ans=0.125 2024-06-20 05:15:59,808 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.708e+02 1.866e+02 2.088e+02 3.348e+02, threshold=3.731e+02, percent-clipped=0.0 2024-06-20 05:16:03,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=138251.66666666666, ans=0.2 2024-06-20 05:16:06,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=138251.66666666666, ans=0.2 2024-06-20 05:16:07,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=138251.66666666666, ans=0.0 2024-06-20 05:16:10,327 INFO [train.py:1028] (1/2) Epoch 8, batch 4600, loss[loss=0.288, simple_loss=0.3079, pruned_loss=0.134, over 12480.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.2769, pruned_loss=0.1048, over 2585805.25 frames. ], batch size: 202, lr: 7.36e-03, grad_scale: 64.0 2024-06-20 05:16:13,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138270.0, ans=0.1 2024-06-20 05:16:23,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=138306.66666666666, ans=0.04949747468305833 2024-06-20 05:16:23,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=138306.66666666666, ans=0.125 2024-06-20 05:16:34,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0 2024-06-20 05:16:39,263 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. limit=15.0 2024-06-20 05:16:45,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=138343.33333333334, ans=0.125 2024-06-20 05:16:47,225 INFO [train.py:1028] (1/2) Epoch 8, batch 4650, loss[loss=0.2413, simple_loss=0.2678, pruned_loss=0.1074, over 13077.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.276, pruned_loss=0.1043, over 2588346.15 frames. ], batch size: 132, lr: 7.36e-03, grad_scale: 64.0 2024-06-20 05:16:49,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2024-06-20 05:16:53,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=138380.0, ans=0.125 2024-06-20 05:16:58,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.95 vs. 
limit=15.0 2024-06-20 05:17:08,418 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.00 vs. limit=6.0 2024-06-20 05:17:09,876 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.740e+02 1.898e+02 2.187e+02 3.144e+02, threshold=3.796e+02, percent-clipped=0.0 2024-06-20 05:17:19,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.83 vs. limit=22.5 2024-06-20 05:17:20,914 INFO [train.py:1028] (1/2) Epoch 8, batch 4700, loss[loss=0.2751, simple_loss=0.311, pruned_loss=0.1196, over 12361.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.2757, pruned_loss=0.1042, over 2583503.19 frames. ], batch size: 25, lr: 7.36e-03, grad_scale: 64.0 2024-06-20 05:17:21,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=138453.33333333334, ans=0.5 2024-06-20 05:17:43,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=138508.33333333334, ans=0.125 2024-06-20 05:17:58,737 INFO [train.py:1028] (1/2) Epoch 8, batch 4750, loss[loss=0.2574, simple_loss=0.2811, pruned_loss=0.1169, over 12514.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.275, pruned_loss=0.1039, over 2580630.80 frames. ], batch size: 202, lr: 7.36e-03, grad_scale: 64.0 2024-06-20 05:18:07,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.49 vs. limit=22.5 2024-06-20 05:18:09,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=138563.33333333334, ans=0.125 2024-06-20 05:18:21,612 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.740e+02 1.921e+02 2.083e+02 3.210e+02, threshold=3.841e+02, percent-clipped=0.0 2024-06-20 05:18:30,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=138618.33333333334, ans=0.2 2024-06-20 05:18:35,903 INFO [train.py:1028] (1/2) Epoch 8, batch 4800, loss[loss=0.224, simple_loss=0.2597, pruned_loss=0.09416, over 13311.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.2745, pruned_loss=0.1037, over 2576613.77 frames. 
], batch size: 63, lr: 7.35e-03, grad_scale: 64.0 2024-06-20 05:18:37,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=138636.66666666666, ans=0.0 2024-06-20 05:18:48,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=138655.0, ans=0.125 2024-06-20 05:18:51,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=138673.33333333334, ans=0.025 2024-06-20 05:18:51,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138673.33333333334, ans=0.1 2024-06-20 05:18:59,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=138691.66666666666, ans=0.0 2024-06-20 05:19:01,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=138710.0, ans=0.125 2024-06-20 05:19:02,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=138710.0, ans=0.5 2024-06-20 05:19:03,614 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0 2024-06-20 05:19:08,482 INFO [train.py:1028] (1/2) Epoch 8, batch 4850, loss[loss=0.2412, simple_loss=0.2722, pruned_loss=0.1051, over 13216.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2736, pruned_loss=0.1031, over 2573655.67 frames. ], batch size: 89, lr: 7.35e-03, grad_scale: 64.0 2024-06-20 05:19:21,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=138765.0, ans=0.125 2024-06-20 05:19:31,438 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.849e+02 2.019e+02 2.279e+02 3.031e+02, threshold=4.038e+02, percent-clipped=0.0 2024-06-20 05:19:36,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138801.66666666666, ans=0.125 2024-06-20 05:19:42,204 INFO [train.py:1028] (1/2) Epoch 8, batch 4900, loss[loss=0.2378, simple_loss=0.2719, pruned_loss=0.1018, over 13187.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.273, pruned_loss=0.1027, over 2573745.86 frames. ], batch size: 59, lr: 7.35e-03, grad_scale: 64.0 2024-06-20 05:19:43,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2024-06-20 05:19:50,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=138820.0, ans=0.0 2024-06-20 05:19:58,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=138856.66666666666, ans=0.0 2024-06-20 05:19:58,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=138856.66666666666, ans=0.125 2024-06-20 05:20:01,715 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.09 vs. 
limit=15.0 2024-06-20 05:20:02,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=15.0 2024-06-20 05:20:09,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=138875.0, ans=0.0 2024-06-20 05:20:12,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=138893.33333333334, ans=0.125 2024-06-20 05:20:16,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.96 vs. limit=22.5 2024-06-20 05:20:18,490 INFO [train.py:1028] (1/2) Epoch 8, batch 4950, loss[loss=0.2481, simple_loss=0.2719, pruned_loss=0.1122, over 10987.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2738, pruned_loss=0.1034, over 2567574.31 frames. ], batch size: 304, lr: 7.35e-03, grad_scale: 64.0 2024-06-20 05:20:34,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=138948.33333333334, ans=0.0 2024-06-20 05:20:35,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0 2024-06-20 05:20:40,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.46 vs. limit=15.0 2024-06-20 05:20:44,315 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.741e+02 1.903e+02 2.103e+02 2.533e+02, threshold=3.806e+02, percent-clipped=0.0 2024-06-20 05:20:48,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=138985.0, ans=0.025 2024-06-20 05:20:49,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=138985.0, ans=0.2 2024-06-20 05:20:54,747 INFO [train.py:1028] (1/2) Epoch 8, batch 5000, loss[loss=0.2308, simple_loss=0.2649, pruned_loss=0.09832, over 13165.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2735, pruned_loss=0.1031, over 2572074.22 frames. 
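
With thousands of these lines per epoch it is easier to reduce the log to (epoch, batch, tot_loss, lr) rows before plotting. A small parser written against the exact format of the train.py:1028 summaries above:

```python
# Reduce the training log to (epoch, batch, tot_loss, lr) rows.
import re

PAT = re.compile(
    r"Epoch (\d+), batch (\d+),.*?"
    r"tot_loss\[loss=([\d.]+),.*?\], batch size: \d+, lr: ([\d.e-]+)",
    re.DOTALL,  # summaries can span line breaks in this log
)

def parse(log_text):
    for m in PAT.finditer(log_text):
        epoch, batch, loss, lr = m.groups()
        yield int(epoch), int(batch), float(loss), float(lr)

line = ("Epoch 8, batch 5000, loss[loss=0.2308, simple_loss=0.2649, pruned_loss=0.09832, "
        "over 13165.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2735, "
        "pruned_loss=0.1031, over 2572074.22 frames. ], batch size: 95, lr: 7.34e-03")
print(list(parse(line)))  # [(8, 5000, 0.2399, 0.00734)]
```
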
], batch size: 95, lr: 7.34e-03, grad_scale: 64.0 2024-06-20 05:21:09,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=139040.0, ans=0.125 2024-06-20 05:21:10,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=139040.0, ans=0.0 2024-06-20 05:21:11,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=139040.0, ans=0.0 2024-06-20 05:21:13,377 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:21:16,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=139058.33333333334, ans=0.0 2024-06-20 05:21:17,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=139058.33333333334, ans=0.5 2024-06-20 05:21:21,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=139076.66666666666, ans=0.125 2024-06-20 05:21:24,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=139076.66666666666, ans=0.125 2024-06-20 05:21:28,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=139095.0, ans=0.125 2024-06-20 05:21:28,524 INFO [train.py:1028] (1/2) Epoch 8, batch 5050, loss[loss=0.2175, simple_loss=0.2597, pruned_loss=0.08767, over 12906.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2735, pruned_loss=0.1029, over 2570835.71 frames. ], batch size: 36, lr: 7.34e-03, grad_scale: 64.0 2024-06-20 05:21:28,956 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.94 vs. limit=6.0 2024-06-20 05:21:31,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139095.0, ans=0.1 2024-06-20 05:21:42,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=139113.33333333334, ans=0.125 2024-06-20 05:21:45,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139131.66666666666, ans=0.1 2024-06-20 05:21:51,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=139150.0, ans=0.125 2024-06-20 05:21:54,753 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.734e+02 1.926e+02 2.133e+02 3.181e+02, threshold=3.853e+02, percent-clipped=0.0 2024-06-20 05:22:05,347 INFO [train.py:1028] (1/2) Epoch 8, batch 5100, loss[loss=0.2289, simple_loss=0.2726, pruned_loss=0.09263, over 13192.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.2737, pruned_loss=0.1035, over 2567122.69 frames. 
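
Most scaling.py:1119 WithLoss lines report loss-sum=0.000e+00, with occasional nonzero values (7.821e+01 and 4.768e+01 earlier in this excerpt). This reads as an auxiliary penalty attached to the named attention weights, accumulated between log points and then reset; it stays zero while the weights remain in their preferred range. The log does not show the penalty itself, so the sketch below only illustrates the accumulate-report-reset pattern with a made-up penalty.

```python
# Accumulate-report-reset pattern for an auxiliary penalty; the penalty form is made up.
import torch

class WithLossLogger:
    def __init__(self, name: str, limit: float = 1000.0):
        self.name, self.limit, self.loss_sum = name, limit, 0.0

    def penalty(self, attn_weights: torch.Tensor) -> torch.Tensor:
        excess = (attn_weights.abs() - self.limit).clamp(min=0.0)  # mass above the limit
        p = excess.sum()
        self.loss_sum += float(p)  # accumulated between log points
        return p                   # would be added to the training loss

    def log_and_reset(self):
        print(f"WithLoss: name={self.name}, loss-sum={self.loss_sum:.3e}")
        self.loss_sum = 0.0

logger = WithLossLogger("encoder.encoders.4.encoder.layers.1.self_attn_weights")
logger.penalty(torch.randn(4, 100, 100))  # well-behaved weights -> zero penalty
logger.log_and_reset()                    # prints loss-sum=0.000e+00, as in most lines
```
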
], batch size: 40, lr: 7.34e-03, grad_scale: 64.0 2024-06-20 05:22:16,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=139205.0, ans=0.125 2024-06-20 05:22:29,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=139241.66666666666, ans=0.025 2024-06-20 05:22:41,119 INFO [train.py:1028] (1/2) Epoch 8, batch 5150, loss[loss=0.2393, simple_loss=0.2704, pruned_loss=0.1042, over 13132.00 frames. ], tot_loss[loss=0.24, simple_loss=0.2734, pruned_loss=0.1034, over 2569752.09 frames. ], batch size: 132, lr: 7.34e-03, grad_scale: 64.0 2024-06-20 05:22:53,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=139315.0, ans=0.0 2024-06-20 05:22:56,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.82 vs. limit=22.5 2024-06-20 05:22:58,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=139315.0, ans=0.125 2024-06-20 05:23:00,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=139333.33333333334, ans=0.1 2024-06-20 05:23:00,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=139333.33333333334, ans=0.125 2024-06-20 05:23:00,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=139333.33333333334, ans=0.05 2024-06-20 05:23:07,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=139333.33333333334, ans=0.035 2024-06-20 05:23:08,287 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.791e+02 1.973e+02 2.192e+02 3.813e+02, threshold=3.947e+02, percent-clipped=0.0 2024-06-20 05:23:19,088 INFO [train.py:1028] (1/2) Epoch 8, batch 5200, loss[loss=0.2342, simple_loss=0.2667, pruned_loss=0.1009, over 13145.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.2738, pruned_loss=0.1036, over 2573796.44 frames. ], batch size: 95, lr: 7.33e-03, grad_scale: 64.0 2024-06-20 05:23:23,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.21 vs. limit=10.0 2024-06-20 05:23:26,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139388.33333333334, ans=0.1 2024-06-20 05:23:26,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=139388.33333333334, ans=0.025 2024-06-20 05:23:28,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=139388.33333333334, ans=0.125 2024-06-20 05:23:38,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=139425.0, ans=0.125 2024-06-20 05:23:54,768 INFO [train.py:1028] (1/2) Epoch 8, batch 5250, loss[loss=0.2299, simple_loss=0.2677, pruned_loss=0.09605, over 13240.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.2739, pruned_loss=0.1038, over 2568863.57 frames. 
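
The lr column decays slowly across the excerpt (7.43e-03 at batch 3300 down to 7.30e-03 by batch 5900). The values are consistent with an Eden-style schedule, lr = base_lr * ((step^2 + B^2)/B^2)^-0.25 * ((e^2 + E^2)/E^2)^-0.25, taking base_lr = 0.035, B = 7500 and E = 3.5 from the configuration logged at startup, e = 7 completed epochs, and step = batch_count / (11/6) per the note on batch_count above. This reconstruction is a numerical fit to the logged values, not a quote of the scheduler code.

```python
# Eden-style LR reconstruction; constants from the startup config, mapping inferred.
base_lr, B, E, completed_epochs = 0.035, 7500.0, 3.5, 7

def eden_lr(step: float) -> float:
    batch_factor = ((step**2 + B**2) / B**2) ** -0.25
    epoch_factor = ((completed_epochs**2 + E**2) / E**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(f"{eden_lr(136070.0 * 6 / 11):.2e}")  # 7.42e-03, the logged lr at batch 3400
print(f"{eden_lr(140653.3 * 6 / 11):.2e}")  # 7.30e-03, the logged lr at batch 5900
```
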
], batch size: 52, lr: 7.33e-03, grad_scale: 64.0 2024-06-20 05:23:58,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=139461.66666666666, ans=0.0 2024-06-20 05:24:02,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139480.0, ans=0.1 2024-06-20 05:24:07,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=6.0 2024-06-20 05:24:10,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.71 vs. limit=10.0 2024-06-20 05:24:16,944 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.725e+02 1.843e+02 1.998e+02 2.562e+02, threshold=3.687e+02, percent-clipped=0.0 2024-06-20 05:24:17,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=139516.66666666666, ans=0.125 2024-06-20 05:24:26,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139535.0, ans=0.1 2024-06-20 05:24:30,504 INFO [train.py:1028] (1/2) Epoch 8, batch 5300, loss[loss=0.2461, simple_loss=0.2735, pruned_loss=0.1094, over 13108.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.2737, pruned_loss=0.1036, over 2565319.13 frames. ], batch size: 144, lr: 7.33e-03, grad_scale: 64.0 2024-06-20 05:24:42,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=139590.0, ans=0.0 2024-06-20 05:24:42,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=139590.0, ans=0.125 2024-06-20 05:24:43,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139590.0, ans=0.1 2024-06-20 05:24:44,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=139590.0, ans=0.125 2024-06-20 05:24:54,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=139608.33333333334, ans=0.07 2024-06-20 05:24:55,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.01 vs. limit=10.0 2024-06-20 05:25:00,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=139626.66666666666, ans=0.025 2024-06-20 05:25:02,998 INFO [train.py:1028] (1/2) Epoch 8, batch 5350, loss[loss=0.2544, simple_loss=0.2969, pruned_loss=0.1059, over 11514.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2735, pruned_loss=0.1036, over 2571946.84 frames. 
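
The Whitening lines (for example whiten_keys with metric=5.64 vs. limit=6.0 just above) monitor how anisotropic a module's activations are: the metric is 1.0 when the feature covariance is isotropic ("white") and grows as a few directions dominate, and a corrective gradient kicks in once it exceeds the limit. One standard trace-based way to compute such a metric is sketched below; the exact formula in scaling.py may differ in details.

```python
# Covariance-anisotropy metric: 1.0 for white features, larger when a few directions
# dominate. Mirrors what the Whitening lines report; the exact formula may differ.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations."""
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]               # (C, C) covariance
    d = cov.shape[0]
    mean_eig = torch.diagonal(cov).sum() / d   # trace(C)/d   = mean eigenvalue
    mean_sq_eig = (cov * cov).sum() / d        # trace(C@C)/d = mean squared eigenvalue
    return (mean_sq_eig / mean_eig**2).item()  # equals 1.0 iff cov is isotropic

torch.manual_seed(0)
white = torch.randn(10000, 256)
spiky = white * torch.tensor([10.0] + [1.0] * 255)  # one dominant direction
print(whitening_metric(white))  # ~1.03, near the ideal 1.0
print(whitening_metric(spiky))  # ~21, the kind of value that trips a limit
```
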
], batch size: 16, lr: 7.33e-03, grad_scale: 64.0 2024-06-20 05:25:04,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=139645.0, ans=0.125 2024-06-20 05:25:07,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=139645.0, ans=0.07 2024-06-20 05:25:12,761 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.09 vs. limit=15.0 2024-06-20 05:25:19,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=139681.66666666666, ans=0.2 2024-06-20 05:25:22,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=139700.0, ans=0.2 2024-06-20 05:25:24,755 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.717e+02 1.878e+02 2.102e+02 2.786e+02, threshold=3.757e+02, percent-clipped=0.0 2024-06-20 05:25:28,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=139718.33333333334, ans=0.125 2024-06-20 05:25:38,477 INFO [train.py:1028] (1/2) Epoch 8, batch 5400, loss[loss=0.2603, simple_loss=0.2789, pruned_loss=0.1208, over 12291.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.2738, pruned_loss=0.104, over 2565492.25 frames. ], batch size: 241, lr: 7.32e-03, grad_scale: 64.0 2024-06-20 05:25:52,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=139773.33333333334, ans=0.125 2024-06-20 05:25:57,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=139773.33333333334, ans=0.125 2024-06-20 05:25:58,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=139791.66666666666, ans=0.2 2024-06-20 05:26:12,271 INFO [train.py:1028] (1/2) Epoch 8, batch 5450, loss[loss=0.2353, simple_loss=0.2731, pruned_loss=0.09878, over 12872.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2732, pruned_loss=0.1033, over 2569389.64 frames. ], batch size: 26, lr: 7.32e-03, grad_scale: 64.0 2024-06-20 05:26:15,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=15.0 2024-06-20 05:26:28,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139865.0, ans=0.1 2024-06-20 05:26:30,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=139865.0, ans=0.125 2024-06-20 05:26:31,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=139865.0, ans=0.025 2024-06-20 05:26:31,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.43 vs. limit=22.5 2024-06-20 05:26:36,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. 
limit=15.0 2024-06-20 05:26:37,349 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.762e+02 1.942e+02 2.178e+02 3.493e+02, threshold=3.884e+02, percent-clipped=0.0 2024-06-20 05:26:40,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=139901.66666666666, ans=0.07 2024-06-20 05:26:45,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=139901.66666666666, ans=0.2 2024-06-20 05:26:48,110 INFO [train.py:1028] (1/2) Epoch 8, batch 5500, loss[loss=0.2711, simple_loss=0.2878, pruned_loss=0.1272, over 12213.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.273, pruned_loss=0.1031, over 2563970.27 frames. ], batch size: 241, lr: 7.32e-03, grad_scale: 64.0 2024-06-20 05:26:55,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=139938.33333333334, ans=0.125 2024-06-20 05:26:56,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=139938.33333333334, ans=0.5 2024-06-20 05:26:57,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=139938.33333333334, ans=0.2 2024-06-20 05:27:04,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=139956.66666666666, ans=0.2 2024-06-20 05:27:06,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139956.66666666666, ans=0.1 2024-06-20 05:27:14,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139993.33333333334, ans=0.1 2024-06-20 05:27:21,073 INFO [train.py:1028] (1/2) Epoch 8, batch 5550, loss[loss=0.2356, simple_loss=0.2748, pruned_loss=0.09815, over 13269.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2729, pruned_loss=0.1027, over 2567934.85 frames. ], batch size: 43, lr: 7.32e-03, grad_scale: 64.0 2024-06-20 05:27:22,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=140011.66666666666, ans=0.125 2024-06-20 05:27:23,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=140011.66666666666, ans=0.0 2024-06-20 05:27:25,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=140011.66666666666, ans=0.0 2024-06-20 05:27:29,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=140030.0, ans=0.0 2024-06-20 05:27:41,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=140048.33333333334, ans=0.2 2024-06-20 05:27:46,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=140066.66666666666, ans=0.2 2024-06-20 05:27:47,900 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.818e+02 2.048e+02 2.320e+02 3.177e+02, threshold=4.095e+02, percent-clipped=0.0 2024-06-20 05:27:55,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.34 vs. 
limit=22.5 2024-06-20 05:27:58,309 INFO [train.py:1028] (1/2) Epoch 8, batch 5600, loss[loss=0.2282, simple_loss=0.264, pruned_loss=0.09615, over 13212.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.2726, pruned_loss=0.1025, over 2569822.36 frames. ], batch size: 89, lr: 7.31e-03, grad_scale: 64.0 2024-06-20 05:28:06,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=140121.66666666666, ans=0.09899494936611666 2024-06-20 05:28:16,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.89 vs. limit=22.5 2024-06-20 05:28:24,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=140176.66666666666, ans=0.125 2024-06-20 05:28:30,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.93 vs. limit=15.0 2024-06-20 05:28:33,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=140195.0, ans=0.125 2024-06-20 05:28:34,385 INFO [train.py:1028] (1/2) Epoch 8, batch 5650, loss[loss=0.2731, simple_loss=0.2974, pruned_loss=0.1244, over 12563.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.2727, pruned_loss=0.1024, over 2574579.04 frames. ], batch size: 203, lr: 7.31e-03, grad_scale: 64.0 2024-06-20 05:28:35,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=140195.0, ans=0.0 2024-06-20 05:28:38,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=140195.0, ans=0.2 2024-06-20 05:28:43,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140213.33333333334, ans=0.1 2024-06-20 05:28:51,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.67 vs. limit=10.0 2024-06-20 05:28:56,973 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.746e+02 1.903e+02 2.194e+02 3.581e+02, threshold=3.805e+02, percent-clipped=0.0 2024-06-20 05:28:59,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=140250.0, ans=0.0 2024-06-20 05:29:00,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=140268.33333333334, ans=0.09899494936611666 2024-06-20 05:29:05,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=140268.33333333334, ans=0.025 2024-06-20 05:29:07,397 INFO [train.py:1028] (1/2) Epoch 8, batch 5700, loss[loss=0.2504, simple_loss=0.2831, pruned_loss=0.1088, over 13261.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.2729, pruned_loss=0.1026, over 2577526.31 frames. ], batch size: 63, lr: 7.31e-03, grad_scale: 128.0 2024-06-20 05:29:16,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.07 vs. 
limit=15.0 2024-06-20 05:29:28,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=140341.66666666666, ans=0.125 2024-06-20 05:29:28,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=140341.66666666666, ans=0.125 2024-06-20 05:29:39,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=140360.0, ans=0.0 2024-06-20 05:29:43,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.08 vs. limit=10.0 2024-06-20 05:29:44,150 INFO [train.py:1028] (1/2) Epoch 8, batch 5750, loss[loss=0.2551, simple_loss=0.2856, pruned_loss=0.1123, over 12883.00 frames. ], tot_loss[loss=0.24, simple_loss=0.274, pruned_loss=0.103, over 2578097.86 frames. ], batch size: 177, lr: 7.31e-03, grad_scale: 128.0 2024-06-20 05:29:45,369 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=15.0 2024-06-20 05:29:46,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=140378.33333333334, ans=0.0 2024-06-20 05:29:55,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=140396.66666666666, ans=0.025 2024-06-20 05:29:56,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=140415.0, ans=0.125 2024-06-20 05:30:06,492 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.788e+02 1.951e+02 2.216e+02 3.537e+02, threshold=3.903e+02, percent-clipped=0.0 2024-06-20 05:30:15,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=140451.66666666666, ans=0.125 2024-06-20 05:30:20,685 INFO [train.py:1028] (1/2) Epoch 8, batch 5800, loss[loss=0.2581, simple_loss=0.2822, pruned_loss=0.117, over 12794.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.2754, pruned_loss=0.1042, over 2578190.88 frames. ], batch size: 176, lr: 7.31e-03, grad_scale: 128.0 2024-06-20 05:30:28,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=140488.33333333334, ans=10.0 2024-06-20 05:30:29,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=140488.33333333334, ans=0.0 2024-06-20 05:30:31,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=140488.33333333334, ans=0.125 2024-06-20 05:30:40,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=140525.0, ans=0.0 2024-06-20 05:30:54,249 INFO [train.py:1028] (1/2) Epoch 8, batch 5850, loss[loss=0.2659, simple_loss=0.2868, pruned_loss=0.1225, over 12477.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.2774, pruned_loss=0.105, over 2575826.98 frames. ], batch size: 202, lr: 7.30e-03, grad_scale: 128.0 2024-06-20 05:31:04,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.58 vs. 
limit=15.0 2024-06-20 05:31:09,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=140598.33333333334, ans=0.125 2024-06-20 05:31:14,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=140616.66666666666, ans=0.0 2024-06-20 05:31:17,532 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.965e+02 2.248e+02 2.647e+02 4.334e+02, threshold=4.497e+02, percent-clipped=1.0 2024-06-20 05:31:19,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=140616.66666666666, ans=0.125 2024-06-20 05:31:28,251 INFO [train.py:1028] (1/2) Epoch 8, batch 5900, loss[loss=0.2225, simple_loss=0.254, pruned_loss=0.09548, over 13102.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.2792, pruned_loss=0.1055, over 2575742.29 frames. ], batch size: 121, lr: 7.30e-03, grad_scale: 128.0 2024-06-20 05:31:35,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=140653.33333333334, ans=0.0 2024-06-20 05:31:36,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=140653.33333333334, ans=0.04949747468305833 2024-06-20 05:31:37,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=140671.66666666666, ans=0.125 2024-06-20 05:31:38,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.86 vs. limit=22.5 2024-06-20 05:31:43,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=140671.66666666666, ans=0.125 2024-06-20 05:31:49,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=140690.0, ans=0.2 2024-06-20 05:31:58,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=140726.66666666666, ans=0.0 2024-06-20 05:32:04,699 INFO [train.py:1028] (1/2) Epoch 8, batch 5950, loss[loss=0.2484, simple_loss=0.2763, pruned_loss=0.1103, over 13090.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.281, pruned_loss=0.1064, over 2580537.58 frames. ], batch size: 121, lr: 7.30e-03, grad_scale: 128.0 2024-06-20 05:32:16,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.09 vs. limit=15.0 2024-06-20 05:32:19,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=140781.66666666666, ans=0.2 2024-06-20 05:32:22,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.78 vs. 
limit=15.0 2024-06-20 05:32:29,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=140800.0, ans=0.0 2024-06-20 05:32:30,708 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.882e+02 2.019e+02 2.236e+02 3.501e+02, threshold=4.039e+02, percent-clipped=0.0 2024-06-20 05:32:41,578 INFO [train.py:1028] (1/2) Epoch 8, batch 6000, loss[loss=0.3005, simple_loss=0.3248, pruned_loss=0.1381, over 12206.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2823, pruned_loss=0.1072, over 2574063.43 frames. ], batch size: 240, lr: 7.30e-03, grad_scale: 128.0 2024-06-20 05:32:41,578 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 05:32:49,447 INFO [train.py:1060] (1/2) Epoch 8, validation: loss=0.2064, simple_loss=0.2689, pruned_loss=0.07195, over 351949.00 frames. 2024-06-20 05:32:49,447 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 05:32:57,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=140855.0, ans=0.125 2024-06-20 05:32:58,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=140855.0, ans=0.0 2024-06-20 05:33:08,614 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.92 vs. limit=22.5 2024-06-20 05:33:11,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=140891.66666666666, ans=0.0 2024-06-20 05:33:23,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.84 vs. limit=10.0 2024-06-20 05:33:24,051 INFO [train.py:1028] (1/2) Epoch 8, batch 6050, loss[loss=0.2645, simple_loss=0.297, pruned_loss=0.116, over 12883.00 frames. ], tot_loss[loss=0.25, simple_loss=0.284, pruned_loss=0.108, over 2577398.51 frames. ], batch size: 39, lr: 7.29e-03, grad_scale: 128.0 2024-06-20 05:33:25,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.41 vs. limit=15.0 2024-06-20 05:33:49,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=140983.33333333334, ans=0.1 2024-06-20 05:33:50,091 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.839e+02 2.010e+02 2.266e+02 3.304e+02, threshold=4.020e+02, percent-clipped=0.0 2024-06-20 05:33:52,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=140983.33333333334, ans=0.125 2024-06-20 05:33:56,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=141001.66666666666, ans=0.125 2024-06-20 05:33:58,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=141001.66666666666, ans=0.0 2024-06-20 05:34:00,726 INFO [train.py:1028] (1/2) Epoch 8, batch 6100, loss[loss=0.2527, simple_loss=0.2782, pruned_loss=0.1136, over 13129.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.2859, pruned_loss=0.1088, over 2579615.80 frames. 
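], batch size: 121, lr: 7.29e-03, grad_scale: 128.0

A note on the loss fields in the train.py entries: throughout this log the reported total satisfies loss = 0.5 * simple_loss + pruned_loss. The batch-6000 validation above gives 0.5 * 0.2689 + 0.07195 = 0.2064, and the batch-6000 training tot_loss gives 0.5 * 0.2823 + 0.1072 ≈ 0.2483. This is consistent with pruned-transducer training, where a cheap "simple" joiner loss is added at a fixed scale to the pruned RNN-T loss. The fractional frame counts in tot_loss (e.g. "over 2579615.80 frames") suggest a decaying running aggregate rather than an exact epoch total, and grad_scale is the fp16 dynamic loss scale, which doubled from 64.0 to 128.0 at batch 5700 after a stretch of overflow-free steps. A small sketch checking the arithmetic (the 0.5 scale is inferred from the logged numbers, not read from the code):

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        """Total loss as it appears in this log's loss[...] fields."""
        return simple_loss_scale * simple_loss + pruned_loss

    # Batch-6000 validation: loss=0.2064, simple_loss=0.2689, pruned_loss=0.07195
    assert abs(combined_loss(0.2689, 0.07195) - 0.2064) < 1e-4
    # Batch-6000 training tot_loss: loss=0.2483, simple_loss=0.2823, pruned_loss=0.1072
    assert abs(combined_loss(0.2823, 0.1072) - 0.2483) < 1e-3
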
2024-06-20 05:34:00,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=141020.0, ans=0.125 2024-06-20 05:34:02,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=141020.0, ans=0.125 2024-06-20 05:34:07,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=141038.33333333334, ans=0.125 2024-06-20 05:34:08,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=141038.33333333334, ans=0.0 2024-06-20 05:34:08,891 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.54 vs. limit=15.0 2024-06-20 05:34:09,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=141038.33333333334, ans=0.125 2024-06-20 05:34:11,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=141038.33333333334, ans=0.125 2024-06-20 05:34:37,546 INFO [train.py:1028] (1/2) Epoch 8, batch 6150, loss[loss=0.2731, simple_loss=0.2887, pruned_loss=0.1288, over 10800.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.2883, pruned_loss=0.1102, over 2578563.14 frames. ], batch size: 303, lr: 7.29e-03, grad_scale: 128.0 2024-06-20 05:34:48,437 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.61 vs. limit=10.0 2024-06-20 05:34:53,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=141148.33333333334, ans=0.125 2024-06-20 05:35:00,508 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 1.965e+02 2.194e+02 2.508e+02 4.834e+02, threshold=4.388e+02, percent-clipped=2.0 2024-06-20 05:35:11,079 INFO [train.py:1028] (1/2) Epoch 8, batch 6200, loss[loss=0.2584, simple_loss=0.301, pruned_loss=0.1079, over 13217.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.2901, pruned_loss=0.111, over 2574741.96 frames. ], batch size: 89, lr: 7.29e-03, grad_scale: 128.0 2024-06-20 05:35:11,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.95 vs. limit=10.0 2024-06-20 05:35:17,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=141221.66666666666, ans=0.5 2024-06-20 05:35:19,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=141221.66666666666, ans=0.0 2024-06-20 05:35:30,019 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.73 vs. limit=15.0 2024-06-20 05:35:40,504 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:35:46,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.76 vs.
limit=15.0 2024-06-20 05:35:48,646 INFO [train.py:1028] (1/2) Epoch 8, batch 6250, loss[loss=0.2538, simple_loss=0.2953, pruned_loss=0.1062, over 13194.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.2909, pruned_loss=0.1114, over 2566975.26 frames. ], batch size: 83, lr: 7.28e-03, grad_scale: 128.0 2024-06-20 05:35:51,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=141295.0, ans=0.125 2024-06-20 05:36:03,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=141331.66666666666, ans=0.125 2024-06-20 05:36:03,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2024-06-20 05:36:04,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=141331.66666666666, ans=0.125 2024-06-20 05:36:09,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=141350.0, ans=0.125 2024-06-20 05:36:11,464 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.893e+02 2.016e+02 2.226e+02 3.019e+02, threshold=4.032e+02, percent-clipped=0.0 2024-06-20 05:36:14,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=141368.33333333334, ans=0.2 2024-06-20 05:36:21,857 INFO [train.py:1028] (1/2) Epoch 8, batch 6300, loss[loss=0.2214, simple_loss=0.2613, pruned_loss=0.09074, over 11417.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.2924, pruned_loss=0.112, over 2561257.93 frames. ], batch size: 16, lr: 7.28e-03, grad_scale: 128.0 2024-06-20 05:36:25,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=141386.66666666666, ans=0.125 2024-06-20 05:36:26,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=141386.66666666666, ans=0.05 2024-06-20 05:36:30,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=141405.0, ans=0.0 2024-06-20 05:36:58,078 INFO [train.py:1028] (1/2) Epoch 8, batch 6350, loss[loss=0.2979, simple_loss=0.3156, pruned_loss=0.1401, over 12543.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.2936, pruned_loss=0.112, over 2571300.85 frames. 
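], batch size: 202, lr: 7.28e-03, grad_scale: 128.0

The Whitening entries from scaling.py:1023 report diagnostics from the Zipformer's whitening regularizers: each instrumented module measures how "white" its activations are and compares the measurement against a limit (7.5, 10.0, 15.0 or 22.5 in this run); metrics well above the limit attract a corrective gradient penalty, which is why most logged values hover near their limits. A rough sketch of such a metric, based on one reading of icefall's scaling.py (treat the details as approximate): the ratio below is 1.0 when the grouped feature covariance is proportional to the identity and grows with the spread of its eigenvalues.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        """Ratio >= 1.0; equals 1.0 for perfectly whitened features."""
        x = x.reshape(-1, x.shape[-1])                      # (frames, channels)
        frames, channels = x.shape
        cpg = channels // num_groups                        # channels per group
        x = x.reshape(frames, num_groups, cpg).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)                 # center each group
        covar = torch.matmul(x.transpose(1, 2), x)          # (groups, cpg, cpg)
        mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
        mean_sq = (covar ** 2).sum() / (num_groups * cpg)
        return mean_sq / (mean_diag ** 2 + 1e-20)
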
2024-06-20 05:37:00,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=141478.33333333334, ans=0.95 2024-06-20 05:37:08,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=141496.66666666666, ans=0.0 2024-06-20 05:37:13,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141515.0, ans=0.1 2024-06-20 05:37:18,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=141533.33333333334, ans=0.125 2024-06-20 05:37:20,361 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.860e+02 2.051e+02 2.229e+02 3.594e+02, threshold=4.103e+02, percent-clipped=0.0 2024-06-20 05:37:22,812 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.13 vs. limit=15.0 2024-06-20 05:37:24,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.81 vs. limit=15.0 2024-06-20 05:37:25,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=141551.66666666666, ans=22.5 2024-06-20 05:37:31,001 INFO [train.py:1028] (1/2) Epoch 8, batch 6400, loss[loss=0.2432, simple_loss=0.2837, pruned_loss=0.1013, over 13243.00 frames. ], tot_loss[loss=0.261, simple_loss=0.2958, pruned_loss=0.1131, over 2572949.29 frames. ], batch size: 67, lr: 7.28e-03, grad_scale: 128.0 2024-06-20 05:37:33,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141570.0, ans=0.1 2024-06-20 05:37:35,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.75 vs. limit=15.0 2024-06-20 05:37:41,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=141588.33333333334, ans=0.0 2024-06-20 05:37:43,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=141588.33333333334, ans=0.125 2024-06-20 05:38:00,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=141643.33333333334, ans=0.125 2024-06-20 05:38:02,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=141643.33333333334, ans=0.2 2024-06-20 05:38:04,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.63 vs. limit=15.0 2024-06-20 05:38:06,402 INFO [train.py:1028] (1/2) Epoch 8, batch 6450, loss[loss=0.2806, simple_loss=0.3084, pruned_loss=0.1264, over 12543.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.2978, pruned_loss=0.1137, over 2578744.76 frames.
], batch size: 202, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:38:09,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=141661.66666666666, ans=0.2 2024-06-20 05:38:18,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=141680.0, ans=0.1 2024-06-20 05:38:18,185 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.97 vs. limit=22.5 2024-06-20 05:38:19,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141698.33333333334, ans=0.1 2024-06-20 05:38:21,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=141698.33333333334, ans=0.2 2024-06-20 05:38:26,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=141716.66666666666, ans=0.125 2024-06-20 05:38:28,722 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.978e+02 2.153e+02 2.645e+02 4.398e+02, threshold=4.307e+02, percent-clipped=1.0 2024-06-20 05:38:29,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=141716.66666666666, ans=0.1 2024-06-20 05:38:42,535 INFO [train.py:1028] (1/2) Epoch 8, batch 6500, loss[loss=0.2828, simple_loss=0.2978, pruned_loss=0.1339, over 10710.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.2997, pruned_loss=0.1143, over 2582944.07 frames. ], batch size: 303, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:38:48,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141771.66666666666, ans=0.1 2024-06-20 05:38:52,858 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.35 vs. limit=22.5 2024-06-20 05:38:58,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=141790.0, ans=0.125 2024-06-20 05:39:00,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=141790.0, ans=0.2 2024-06-20 05:39:03,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=141808.33333333334, ans=0.125 2024-06-20 05:39:06,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=141808.33333333334, ans=0.0 2024-06-20 05:39:14,913 INFO [train.py:1028] (1/2) Epoch 8, batch 6550, loss[loss=0.2622, simple_loss=0.3038, pruned_loss=0.1103, over 12498.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3015, pruned_loss=0.115, over 2586241.89 frames. 
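], batch size: 22, lr: 7.27e-03, grad_scale: 128.0

The ScheduledFloat entries from scaling.py:214, which make up most of this log, record regularization hyperparameters (dropout probabilities, skip rates, balancer bounds, bypass scale minima) that are scheduled against the global batch count; ans is the value in effect at the logged batch_count. A minimal sketch of such a piecewise-linear schedule (an illustration of the idea, not scaling.py's actual class; the breakpoints in the example are invented):

    from bisect import bisect_right

    def scheduled_float(batch_count: float, points) -> float:
        """Piecewise-linear interpolation over (batch_count, value) breakpoints."""
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        if batch_count <= xs[0]:
            return ys[0]
        if batch_count >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, batch_count)
        t = (batch_count - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + t * (ys[i] - ys[i - 1])

    # e.g. a skip rate that decays from 0.5 to 0.025 over the first 4000 batches:
    assert scheduled_float(141845.0, [(0.0, 0.5), (4000.0, 0.025)]) == 0.025
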
2024-06-20 05:39:16,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=141845.0, ans=0.2 2024-06-20 05:39:22,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=141863.33333333334, ans=0.125 2024-06-20 05:39:27,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=141881.66666666666, ans=0.0 2024-06-20 05:39:32,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=141881.66666666666, ans=0.025 2024-06-20 05:39:34,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=141900.0, ans=0.125 2024-06-20 05:39:36,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=141900.0, ans=0.025 2024-06-20 05:39:36,692 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.899e+02 2.049e+02 2.219e+02 3.326e+02, threshold=4.097e+02, percent-clipped=0.0 2024-06-20 05:39:37,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=141900.0, ans=0.125 2024-06-20 05:39:37,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=141900.0, ans=0.1 2024-06-20 05:39:52,237 INFO [train.py:1028] (1/2) Epoch 8, batch 6600, loss[loss=0.2642, simple_loss=0.3077, pruned_loss=0.1103, over 13248.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3019, pruned_loss=0.1152, over 2588717.33 frames. ], batch size: 72, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:39:53,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=141936.66666666666, ans=0.0 2024-06-20 05:39:53,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=141936.66666666666, ans=0.125 2024-06-20 05:40:02,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.61 vs. limit=12.0 2024-06-20 05:40:10,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=141973.33333333334, ans=0.0 2024-06-20 05:40:15,788 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2024-06-20 05:40:17,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141991.66666666666, ans=0.1 2024-06-20 05:40:17,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=141991.66666666666, ans=0.2 2024-06-20 05:40:23,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=142010.0, ans=0.025 2024-06-20 05:40:26,259 INFO [train.py:1028] (1/2) Epoch 8, batch 6650, loss[loss=0.2731, simple_loss=0.308, pruned_loss=0.1191, over 12931.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3031, pruned_loss=0.1156, over 2583015.55 frames.
], batch size: 158, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:40:29,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=142028.33333333334, ans=0.0 2024-06-20 05:40:30,247 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2024-06-20 05:40:33,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.85 vs. limit=22.5 2024-06-20 05:40:35,303 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.52 vs. limit=15.0 2024-06-20 05:40:41,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.62 vs. limit=22.5 2024-06-20 05:40:52,989 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.959e+02 2.113e+02 2.309e+02 3.321e+02, threshold=4.226e+02, percent-clipped=0.0 2024-06-20 05:40:54,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.68 vs. limit=15.0 2024-06-20 05:40:55,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=142083.33333333334, ans=0.125 2024-06-20 05:41:03,827 INFO [train.py:1028] (1/2) Epoch 8, batch 6700, loss[loss=0.3191, simple_loss=0.3502, pruned_loss=0.144, over 12808.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3041, pruned_loss=0.1158, over 2582931.63 frames. ], batch size: 177, lr: 7.26e-03, grad_scale: 128.0 2024-06-20 05:41:06,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=142120.0, ans=0.125 2024-06-20 05:41:10,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=142138.33333333334, ans=0.025 2024-06-20 05:41:18,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=142156.66666666666, ans=0.125 2024-06-20 05:41:37,450 INFO [train.py:1028] (1/2) Epoch 8, batch 6750, loss[loss=0.3456, simple_loss=0.3602, pruned_loss=0.1654, over 12243.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3049, pruned_loss=0.1166, over 2576873.00 frames. ], batch size: 241, lr: 7.26e-03, grad_scale: 128.0 2024-06-20 05:41:39,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.67 vs. 
limit=15.0 2024-06-20 05:41:39,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=142211.66666666666, ans=0.0 2024-06-20 05:41:43,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=142230.0, ans=0.07 2024-06-20 05:42:02,751 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.939e+02 2.153e+02 2.424e+02 3.567e+02, threshold=4.305e+02, percent-clipped=0.0 2024-06-20 05:42:08,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=142285.0, ans=0.0 2024-06-20 05:42:13,160 INFO [train.py:1028] (1/2) Epoch 8, batch 6800, loss[loss=0.2369, simple_loss=0.2812, pruned_loss=0.09634, over 13244.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3066, pruned_loss=0.1171, over 2579325.32 frames. ], batch size: 67, lr: 7.26e-03, grad_scale: 128.0 2024-06-20 05:42:19,538 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.820e+02 2024-06-20 05:42:37,713 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.51 vs. limit=10.0 2024-06-20 05:42:45,586 INFO [train.py:1028] (1/2) Epoch 8, batch 6850, loss[loss=0.2787, simple_loss=0.3279, pruned_loss=0.1148, over 13274.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3076, pruned_loss=0.1171, over 2582932.92 frames. ], batch size: 63, lr: 7.26e-03, grad_scale: 128.0 2024-06-20 05:42:45,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=142395.0, ans=0.125 2024-06-20 05:42:45,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142395.0, ans=0.125 2024-06-20 05:42:58,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.66 vs. limit=15.0 2024-06-20 05:43:10,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=142450.0, ans=0.0 2024-06-20 05:43:11,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=142450.0, ans=0.125 2024-06-20 05:43:11,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=142450.0, ans=0.2 2024-06-20 05:43:12,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=142450.0, ans=0.125 2024-06-20 05:43:12,877 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.885e+02 2.030e+02 2.236e+02 2.788e+02, threshold=4.059e+02, percent-clipped=0.0 2024-06-20 05:43:13,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.00 vs. limit=12.0 2024-06-20 05:43:13,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. 
limit=15.0 2024-06-20 05:43:16,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=142468.33333333334, ans=0.0 2024-06-20 05:43:20,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=142468.33333333334, ans=0.2 2024-06-20 05:43:23,550 INFO [train.py:1028] (1/2) Epoch 8, batch 6900, loss[loss=0.269, simple_loss=0.3072, pruned_loss=0.1154, over 12989.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3087, pruned_loss=0.1176, over 2585417.51 frames. ], batch size: 48, lr: 7.25e-03, grad_scale: 128.0 2024-06-20 05:43:25,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2024-06-20 05:43:30,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=142505.0, ans=0.125 2024-06-20 05:43:36,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=142523.33333333334, ans=0.125 2024-06-20 05:43:39,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.61 vs. limit=22.5 2024-06-20 05:43:47,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.72 vs. limit=6.0 2024-06-20 05:43:51,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142560.0, ans=0.1 2024-06-20 05:44:00,529 INFO [train.py:1028] (1/2) Epoch 8, batch 6950, loss[loss=0.2543, simple_loss=0.2938, pruned_loss=0.1074, over 11294.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3095, pruned_loss=0.1178, over 2579590.65 frames. ], batch size: 16, lr: 7.25e-03, grad_scale: 128.0 2024-06-20 05:44:06,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.46 vs. limit=12.0 2024-06-20 05:44:22,712 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.895e+02 2.030e+02 2.279e+02 3.825e+02, threshold=4.060e+02, percent-clipped=0.0 2024-06-20 05:44:25,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=12.0 2024-06-20 05:44:32,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=142651.66666666666, ans=0.125 2024-06-20 05:44:33,442 INFO [train.py:1028] (1/2) Epoch 8, batch 7000, loss[loss=0.2696, simple_loss=0.3068, pruned_loss=0.1162, over 12930.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3084, pruned_loss=0.1171, over 2577319.75 frames. ], batch size: 158, lr: 7.25e-03, grad_scale: 64.0 2024-06-20 05:44:35,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.94 vs. limit=15.0 2024-06-20 05:44:38,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.56 vs. 
limit=15.0 2024-06-20 05:44:50,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.36 vs. limit=10.0 2024-06-20 05:45:06,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=142743.33333333334, ans=0.2 2024-06-20 05:45:10,819 INFO [train.py:1028] (1/2) Epoch 8, batch 7050, loss[loss=0.3001, simple_loss=0.3308, pruned_loss=0.1347, over 12705.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3095, pruned_loss=0.1175, over 2583388.27 frames. ], batch size: 176, lr: 7.25e-03, grad_scale: 64.0 2024-06-20 05:45:29,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=12.0 2024-06-20 05:45:32,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.42 vs. limit=22.5 2024-06-20 05:45:33,350 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.986e+02 2.162e+02 2.563e+02 3.817e+02, threshold=4.323e+02, percent-clipped=0.0 2024-06-20 05:45:35,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=142816.66666666666, ans=0.09899494936611666 2024-06-20 05:45:38,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=15.0 2024-06-20 05:45:39,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=142835.0, ans=0.1 2024-06-20 05:45:40,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=142835.0, ans=0.2 2024-06-20 05:45:40,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=142835.0, ans=0.0 2024-06-20 05:45:43,119 INFO [train.py:1028] (1/2) Epoch 8, batch 7100, loss[loss=0.298, simple_loss=0.3342, pruned_loss=0.1309, over 13219.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3109, pruned_loss=0.1187, over 2574502.51 frames. ], batch size: 112, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:45:49,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=142871.66666666666, ans=0.125 2024-06-20 05:45:56,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=142890.0, ans=0.0 2024-06-20 05:46:09,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.27 vs. limit=15.0 2024-06-20 05:46:14,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2024-06-20 05:46:15,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=142926.66666666666, ans=0.125 2024-06-20 05:46:17,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=142926.66666666666, ans=0.125 2024-06-20 05:46:19,679 INFO [train.py:1028] (1/2) Epoch 8, batch 7150, loss[loss=0.3481, simple_loss=0.3699, pruned_loss=0.1632, over 12448.00 frames. 
], tot_loss[loss=0.2738, simple_loss=0.3108, pruned_loss=0.1184, over 2573351.95 frames. ], batch size: 202, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:46:27,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=142963.33333333334, ans=0.125 2024-06-20 05:46:29,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=142963.33333333334, ans=0.125 2024-06-20 05:46:36,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=142981.66666666666, ans=0.0 2024-06-20 05:46:39,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=143000.0, ans=0.07 2024-06-20 05:46:43,688 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 1.903e+02 2.096e+02 2.339e+02 3.206e+02, threshold=4.191e+02, percent-clipped=0.0 2024-06-20 05:46:46,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=143018.33333333334, ans=0.09899494936611666 2024-06-20 05:46:46,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=143018.33333333334, ans=0.125 2024-06-20 05:46:52,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=143018.33333333334, ans=0.125 2024-06-20 05:46:53,309 INFO [train.py:1028] (1/2) Epoch 8, batch 7200, loss[loss=0.2922, simple_loss=0.3273, pruned_loss=0.1285, over 13163.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3127, pruned_loss=0.1193, over 2578355.89 frames. ], batch size: 112, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:46:56,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=143036.66666666666, ans=0.125 2024-06-20 05:47:06,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=143055.0, ans=0.2 2024-06-20 05:47:16,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143073.33333333334, ans=0.1 2024-06-20 05:47:28,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.09 vs. limit=15.0 2024-06-20 05:47:28,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=143110.0, ans=0.125 2024-06-20 05:47:30,576 INFO [train.py:1028] (1/2) Epoch 8, batch 7250, loss[loss=0.2607, simple_loss=0.3063, pruned_loss=0.1075, over 13009.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3132, pruned_loss=0.1192, over 2578399.05 frames. 
], batch size: 36, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:47:34,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143128.33333333334, ans=0.1 2024-06-20 05:47:34,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=143128.33333333334, ans=0.125 2024-06-20 05:47:40,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=143146.66666666666, ans=0.125 2024-06-20 05:47:47,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=143165.0, ans=0.125 2024-06-20 05:47:49,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=143183.33333333334, ans=0.125 2024-06-20 05:47:50,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=143183.33333333334, ans=0.04949747468305833 2024-06-20 05:47:53,684 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 1.876e+02 2.052e+02 2.258e+02 3.578e+02, threshold=4.104e+02, percent-clipped=0.0 2024-06-20 05:47:55,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143183.33333333334, ans=0.1 2024-06-20 05:47:56,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=143201.66666666666, ans=0.125 2024-06-20 05:48:03,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=143220.0, ans=0.0 2024-06-20 05:48:03,615 INFO [train.py:1028] (1/2) Epoch 8, batch 7300, loss[loss=0.2824, simple_loss=0.3281, pruned_loss=0.1183, over 12952.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3144, pruned_loss=0.1197, over 2578970.36 frames. ], batch size: 36, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:48:19,455 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=22.5 2024-06-20 05:48:19,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=143256.66666666666, ans=0.0 2024-06-20 05:48:22,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.08 vs. limit=22.5 2024-06-20 05:48:27,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=143275.0, ans=0.125 2024-06-20 05:48:35,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143293.33333333334, ans=0.1 2024-06-20 05:48:40,317 INFO [train.py:1028] (1/2) Epoch 8, batch 7350, loss[loss=0.2692, simple_loss=0.3041, pruned_loss=0.1171, over 13251.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3151, pruned_loss=0.12, over 2580961.96 frames. ], batch size: 46, lr: 7.23e-03, grad_scale: 64.0 2024-06-20 05:48:41,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.01 vs. 
limit=15.0 2024-06-20 05:48:44,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=143311.66666666666, ans=0.125 2024-06-20 05:48:45,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=143311.66666666666, ans=0.025 2024-06-20 05:48:48,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=143330.0, ans=0.0 2024-06-20 05:49:03,057 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 1.924e+02 2.048e+02 2.252e+02 3.780e+02, threshold=4.096e+02, percent-clipped=0.0 2024-06-20 05:49:09,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=143385.0, ans=0.0 2024-06-20 05:49:10,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=143385.0, ans=0.125 2024-06-20 05:49:12,401 INFO [train.py:1028] (1/2) Epoch 8, batch 7400, loss[loss=0.2697, simple_loss=0.3136, pruned_loss=0.1129, over 13262.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3144, pruned_loss=0.1193, over 2585880.14 frames. ], batch size: 63, lr: 7.23e-03, grad_scale: 64.0 2024-06-20 05:49:14,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=143403.33333333334, ans=0.125 2024-06-20 05:49:33,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=143440.0, ans=0.025 2024-06-20 05:49:33,292 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.97 vs. limit=12.0 2024-06-20 05:49:49,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=143495.0, ans=0.05 2024-06-20 05:49:49,666 INFO [train.py:1028] (1/2) Epoch 8, batch 7450, loss[loss=0.25, simple_loss=0.2911, pruned_loss=0.1045, over 12697.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3141, pruned_loss=0.1192, over 2580737.15 frames. ], batch size: 29, lr: 7.23e-03, grad_scale: 64.0 2024-06-20 05:49:58,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=143513.33333333334, ans=0.0 2024-06-20 05:50:14,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.41 vs. 
limit=15.0 2024-06-20 05:50:16,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=143550.0, ans=0.2 2024-06-20 05:50:18,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=143550.0, ans=0.05 2024-06-20 05:50:18,898 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.653e+02 1.965e+02 2.100e+02 2.447e+02 3.091e+02, threshold=4.201e+02, percent-clipped=0.0 2024-06-20 05:50:21,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=143550.0, ans=0.0 2024-06-20 05:50:21,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=143550.0, ans=0.0 2024-06-20 05:50:21,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143568.33333333334, ans=0.1 2024-06-20 05:50:28,785 INFO [train.py:1028] (1/2) Epoch 8, batch 7500, loss[loss=0.291, simple_loss=0.3124, pruned_loss=0.1348, over 10622.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.315, pruned_loss=0.1197, over 2578564.70 frames. ], batch size: 303, lr: 7.23e-03, grad_scale: 64.0 2024-06-20 05:50:34,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=143605.0, ans=0.2 2024-06-20 05:50:38,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143605.0, ans=0.1 2024-06-20 05:50:46,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=143623.33333333334, ans=0.0 2024-06-20 05:50:49,236 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-20 05:50:53,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=143641.66666666666, ans=0.5 2024-06-20 05:50:53,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=143641.66666666666, ans=0.125 2024-06-20 05:50:54,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.46 vs. limit=15.0 2024-06-20 05:51:01,370 INFO [train.py:1028] (1/2) Epoch 8, batch 7550, loss[loss=0.2912, simple_loss=0.324, pruned_loss=0.1292, over 12947.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3161, pruned_loss=0.1207, over 2578119.34 frames. ], batch size: 158, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:51:01,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.73 vs. 
limit=22.5 2024-06-20 05:51:11,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=143696.66666666666, ans=0.0 2024-06-20 05:51:19,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=143715.0, ans=0.0 2024-06-20 05:51:27,938 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.902e+02 2.095e+02 2.248e+02 3.118e+02, threshold=4.190e+02, percent-clipped=0.0 2024-06-20 05:51:37,608 INFO [train.py:1028] (1/2) Epoch 8, batch 7600, loss[loss=0.2736, simple_loss=0.3111, pruned_loss=0.1181, over 13196.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3167, pruned_loss=0.121, over 2577403.03 frames. ], batch size: 83, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:51:48,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=143788.33333333334, ans=0.125 2024-06-20 05:51:57,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143825.0, ans=0.1 2024-06-20 05:52:02,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.17 vs. limit=15.0 2024-06-20 05:52:07,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=143843.33333333334, ans=0.125 2024-06-20 05:52:14,595 INFO [train.py:1028] (1/2) Epoch 8, batch 7650, loss[loss=0.2985, simple_loss=0.3364, pruned_loss=0.1303, over 12913.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3169, pruned_loss=0.121, over 2572830.97 frames. ], batch size: 33, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:52:27,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.85 vs. limit=15.0 2024-06-20 05:52:35,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=143916.66666666666, ans=0.0 2024-06-20 05:52:37,624 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.005e+02 2.196e+02 2.544e+02 4.080e+02, threshold=4.392e+02, percent-clipped=0.0 2024-06-20 05:52:43,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143935.0, ans=0.1 2024-06-20 05:52:47,653 INFO [train.py:1028] (1/2) Epoch 8, batch 7700, loss[loss=0.294, simple_loss=0.3416, pruned_loss=0.1232, over 13260.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3171, pruned_loss=0.1209, over 2568616.01 frames. ], batch size: 63, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:52:53,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.17 vs. limit=15.0 2024-06-20 05:52:59,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.89 vs. limit=15.0 2024-06-20 05:53:01,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=143990.0, ans=0.2 2024-06-20 05:53:01,371 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.97 vs. 
limit=22.5 2024-06-20 05:53:10,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=144008.33333333334, ans=0.2 2024-06-20 05:53:17,179 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.94 vs. limit=15.0 2024-06-20 05:53:23,421 INFO [train.py:1028] (1/2) Epoch 8, batch 7750, loss[loss=0.2594, simple_loss=0.3032, pruned_loss=0.1078, over 13277.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3181, pruned_loss=0.1218, over 2572647.96 frames. ], batch size: 72, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:53:24,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=144045.0, ans=0.0 2024-06-20 05:53:30,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.85 vs. limit=15.0 2024-06-20 05:53:32,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=144063.33333333334, ans=0.0 2024-06-20 05:53:42,168 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.55 vs. limit=15.0 2024-06-20 05:53:46,681 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.988e+02 2.328e+02 2.652e+02 4.025e+02, threshold=4.655e+02, percent-clipped=0.0 2024-06-20 05:53:47,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=15.0 2024-06-20 05:53:48,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=144100.0, ans=0.125 2024-06-20 05:53:56,401 INFO [train.py:1028] (1/2) Epoch 8, batch 7800, loss[loss=0.2804, simple_loss=0.3192, pruned_loss=0.1208, over 13165.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3185, pruned_loss=0.1216, over 2577309.17 frames. ], batch size: 95, lr: 7.21e-03, grad_scale: 64.0 2024-06-20 05:54:12,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=144173.33333333334, ans=0.125 2024-06-20 05:54:27,456 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.53 vs. limit=22.5 2024-06-20 05:54:32,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=15.0 2024-06-20 05:54:32,983 INFO [train.py:1028] (1/2) Epoch 8, batch 7850, loss[loss=0.2838, simple_loss=0.3286, pruned_loss=0.1195, over 11322.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3196, pruned_loss=0.1223, over 2571821.94 frames. ], batch size: 17, lr: 7.21e-03, grad_scale: 64.0 2024-06-20 05:54:38,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=144246.66666666666, ans=0.125 2024-06-20 05:54:39,249 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.62 vs. 
limit=22.5 2024-06-20 05:54:39,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=144246.66666666666, ans=0.0 2024-06-20 05:54:40,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=144246.66666666666, ans=0.0 2024-06-20 05:54:44,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=144246.66666666666, ans=0.025 2024-06-20 05:54:48,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=144265.0, ans=0.0 2024-06-20 05:54:51,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=144283.33333333334, ans=0.125 2024-06-20 05:54:52,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144283.33333333334, ans=0.1 2024-06-20 05:54:54,306 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.75 vs. limit=22.5 2024-06-20 05:54:55,149 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 1.930e+02 2.140e+02 2.417e+02 3.663e+02, threshold=4.281e+02, percent-clipped=0.0 2024-06-20 05:55:07,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=144320.0, ans=0.125 2024-06-20 05:55:08,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=144320.0, ans=0.0 2024-06-20 05:55:08,579 INFO [train.py:1028] (1/2) Epoch 8, batch 7900, loss[loss=0.2624, simple_loss=0.312, pruned_loss=0.1065, over 13177.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3196, pruned_loss=0.1224, over 2570894.36 frames. ], batch size: 77, lr: 7.21e-03, grad_scale: 64.0 2024-06-20 05:55:09,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=144320.0, ans=0.95 2024-06-20 05:55:13,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=144320.0, ans=0.125 2024-06-20 05:55:27,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=144375.0, ans=0.025 2024-06-20 05:55:28,111 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.86 vs. limit=15.0 2024-06-20 05:55:41,320 INFO [train.py:1028] (1/2) Epoch 8, batch 7950, loss[loss=0.2667, simple_loss=0.297, pruned_loss=0.1182, over 10455.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3198, pruned_loss=0.1221, over 2574343.20 frames. 
], batch size: 303, lr: 7.21e-03, grad_scale: 64.0 2024-06-20 05:55:44,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=144411.66666666666, ans=0.2 2024-06-20 05:55:46,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=144411.66666666666, ans=0.125 2024-06-20 05:55:47,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=144430.0, ans=0.04949747468305833 2024-06-20 05:56:05,066 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.65 vs. limit=15.0 2024-06-20 05:56:09,637 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.001e+02 2.236e+02 2.541e+02 3.791e+02, threshold=4.473e+02, percent-clipped=0.0 2024-06-20 05:56:09,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=144466.66666666666, ans=0.125 2024-06-20 05:56:19,854 INFO [train.py:1028] (1/2) Epoch 8, batch 8000, loss[loss=0.2457, simple_loss=0.2981, pruned_loss=0.09666, over 12991.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3207, pruned_loss=0.1226, over 2571116.53 frames. ], batch size: 30, lr: 7.20e-03, grad_scale: 64.0 2024-06-20 05:56:24,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=144503.33333333334, ans=0.2 2024-06-20 05:56:24,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.37 vs. limit=12.0 2024-06-20 05:56:25,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=144521.66666666666, ans=0.025 2024-06-20 05:56:28,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=144521.66666666666, ans=0.2 2024-06-20 05:56:30,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=144521.66666666666, ans=0.0 2024-06-20 05:56:36,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.03 vs. 
limit=15.0 2024-06-20 05:56:38,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=144540.0, ans=0.2 2024-06-20 05:56:38,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=144540.0, ans=0.025 2024-06-20 05:56:38,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=144540.0, ans=0.0 2024-06-20 05:56:39,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=144558.33333333334, ans=0.2 2024-06-20 05:56:45,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144558.33333333334, ans=0.1 2024-06-20 05:56:46,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=144576.66666666666, ans=0.0 2024-06-20 05:56:49,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=144576.66666666666, ans=0.04949747468305833 2024-06-20 05:56:51,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.12 vs. limit=22.5 2024-06-20 05:56:53,481 INFO [train.py:1028] (1/2) Epoch 8, batch 8050, loss[loss=0.2718, simple_loss=0.3148, pruned_loss=0.1144, over 13263.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3211, pruned_loss=0.1227, over 2569727.35 frames. ], batch size: 83, lr: 7.20e-03, grad_scale: 64.0 2024-06-20 05:56:59,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144613.33333333334, ans=0.1 2024-06-20 05:57:16,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=144650.0, ans=0.025 2024-06-20 05:57:19,411 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 2.012e+02 2.211e+02 2.540e+02 3.507e+02, threshold=4.423e+02, percent-clipped=0.0 2024-06-20 05:57:28,970 INFO [train.py:1028] (1/2) Epoch 8, batch 8100, loss[loss=0.2874, simple_loss=0.3211, pruned_loss=0.1268, over 13162.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3218, pruned_loss=0.1232, over 2574546.19 frames. ], batch size: 112, lr: 7.20e-03, grad_scale: 64.0 2024-06-20 05:57:39,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=144705.0, ans=0.125 2024-06-20 05:57:44,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=144723.33333333334, ans=0.07 2024-06-20 05:57:45,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.07 vs. limit=15.0 2024-06-20 05:57:46,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=144723.33333333334, ans=0.0 2024-06-20 05:57:52,748 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:58:02,708 INFO [train.py:1028] (1/2) Epoch 8, batch 8150, loss[loss=0.2913, simple_loss=0.3294, pruned_loss=0.1266, over 13100.00 frames. 
], tot_loss[loss=0.283, simple_loss=0.3213, pruned_loss=0.1223, over 2577429.67 frames. ], batch size: 121, lr: 7.20e-03, grad_scale: 64.0 2024-06-20 05:58:10,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=144778.33333333334, ans=0.125 2024-06-20 05:58:20,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=144815.0, ans=0.025 2024-06-20 05:58:28,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=144833.33333333334, ans=15.0 2024-06-20 05:58:29,258 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.010e+02 2.176e+02 2.502e+02 3.519e+02, threshold=4.353e+02, percent-clipped=0.0 2024-06-20 05:58:33,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=144851.66666666666, ans=0.125 2024-06-20 05:58:39,211 INFO [train.py:1028] (1/2) Epoch 8, batch 8200, loss[loss=0.2924, simple_loss=0.3325, pruned_loss=0.1262, over 13149.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3213, pruned_loss=0.1221, over 2581685.61 frames. ], batch size: 112, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 05:58:41,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=144870.0, ans=0.0 2024-06-20 05:59:09,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.83 vs. limit=15.0 2024-06-20 05:59:10,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=144943.33333333334, ans=0.125 2024-06-20 05:59:15,742 INFO [train.py:1028] (1/2) Epoch 8, batch 8250, loss[loss=0.2825, simple_loss=0.3222, pruned_loss=0.1214, over 13245.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.322, pruned_loss=0.1227, over 2582194.02 frames. ], batch size: 52, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 05:59:20,976 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.18 vs. limit=22.5 2024-06-20 05:59:21,610 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.15 vs. limit=22.5 2024-06-20 05:59:21,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.42 vs. limit=15.0 2024-06-20 05:59:35,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145016.66666666666, ans=0.1 2024-06-20 05:59:38,274 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 1.889e+02 2.044e+02 2.251e+02 2.932e+02, threshold=4.088e+02, percent-clipped=0.0 2024-06-20 05:59:48,325 INFO [train.py:1028] (1/2) Epoch 8, batch 8300, loss[loss=0.2657, simple_loss=0.3025, pruned_loss=0.1145, over 13033.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.321, pruned_loss=0.1218, over 2580007.93 frames. 
], batch size: 102, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 05:59:50,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=145053.33333333334, ans=0.0 2024-06-20 05:59:51,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=145053.33333333334, ans=0.0 2024-06-20 05:59:57,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=145071.66666666666, ans=0.125 2024-06-20 06:00:02,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=145071.66666666666, ans=0.125 2024-06-20 06:00:14,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=145108.33333333334, ans=0.125 2024-06-20 06:00:16,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=145108.33333333334, ans=0.0 2024-06-20 06:00:26,583 INFO [train.py:1028] (1/2) Epoch 8, batch 8350, loss[loss=0.2571, simple_loss=0.2952, pruned_loss=0.1095, over 13174.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3202, pruned_loss=0.1211, over 2579799.51 frames. ], batch size: 112, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 06:00:28,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=145145.0, ans=0.125 2024-06-20 06:00:36,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=145163.33333333334, ans=0.125 2024-06-20 06:00:44,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=145181.66666666666, ans=0.0 2024-06-20 06:00:50,507 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.942e+02 2.124e+02 2.391e+02 3.448e+02, threshold=4.248e+02, percent-clipped=0.0 2024-06-20 06:00:59,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=145218.33333333334, ans=0.125 2024-06-20 06:01:00,614 INFO [train.py:1028] (1/2) Epoch 8, batch 8400, loss[loss=0.293, simple_loss=0.3218, pruned_loss=0.1321, over 12983.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3209, pruned_loss=0.1216, over 2576174.77 frames. ], batch size: 39, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 06:01:01,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.32 vs. 
limit=15.0 2024-06-20 06:01:04,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=145236.66666666666, ans=0.09899494936611666 2024-06-20 06:01:10,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=145255.0, ans=0.0 2024-06-20 06:01:19,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=145273.33333333334, ans=0.125 2024-06-20 06:01:22,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=145273.33333333334, ans=0.0 2024-06-20 06:01:23,039 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.63 vs. limit=15.0 2024-06-20 06:01:24,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=145291.66666666666, ans=0.125 2024-06-20 06:01:25,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=145291.66666666666, ans=0.2 2024-06-20 06:01:30,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=145310.0, ans=0.0 2024-06-20 06:01:31,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=145310.0, ans=0.1 2024-06-20 06:01:32,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=145310.0, ans=0.0 2024-06-20 06:01:37,408 INFO [train.py:1028] (1/2) Epoch 8, batch 8450, loss[loss=0.2659, simple_loss=0.3123, pruned_loss=0.1098, over 13160.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3216, pruned_loss=0.1218, over 2578590.97 frames. ], batch size: 112, lr: 7.18e-03, grad_scale: 64.0 2024-06-20 06:01:40,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=6.0 2024-06-20 06:01:46,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=145346.66666666666, ans=0.0 2024-06-20 06:01:50,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=145365.0, ans=0.125 2024-06-20 06:01:55,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.44 vs. limit=15.0 2024-06-20 06:01:58,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=145383.33333333334, ans=0.125 2024-06-20 06:02:00,643 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 1.986e+02 2.160e+02 2.375e+02 3.461e+02, threshold=4.320e+02, percent-clipped=0.0 2024-06-20 06:02:09,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=145401.66666666666, ans=0.125 2024-06-20 06:02:14,247 INFO [train.py:1028] (1/2) Epoch 8, batch 8500, loss[loss=0.2809, simple_loss=0.3264, pruned_loss=0.1177, over 12618.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3229, pruned_loss=0.1224, over 2576694.64 frames. 
], batch size: 29, lr: 7.18e-03, grad_scale: 64.0 2024-06-20 06:02:15,414 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.17 vs. limit=15.0 2024-06-20 06:02:24,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=145438.33333333334, ans=0.5 2024-06-20 06:02:35,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145475.0, ans=0.1 2024-06-20 06:02:35,535 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.91 vs. limit=6.0 2024-06-20 06:02:38,852 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:02:48,774 INFO [train.py:1028] (1/2) Epoch 8, batch 8550, loss[loss=0.2858, simple_loss=0.3218, pruned_loss=0.1249, over 12616.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3222, pruned_loss=0.1222, over 2575113.06 frames. ], batch size: 22, lr: 7.18e-03, grad_scale: 64.0 2024-06-20 06:02:51,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145511.66666666666, ans=0.1 2024-06-20 06:02:55,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=145530.0, ans=0.025 2024-06-20 06:02:55,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.76 vs. limit=22.5 2024-06-20 06:03:03,337 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.00 vs. limit=15.0 2024-06-20 06:03:09,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.10 vs. limit=15.0 2024-06-20 06:03:09,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=145566.66666666666, ans=0.125 2024-06-20 06:03:12,930 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 2.171e+02 2.366e+02 2.808e+02 3.872e+02, threshold=4.732e+02, percent-clipped=0.0 2024-06-20 06:03:13,316 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.83 vs. limit=22.5 2024-06-20 06:03:15,304 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.19 vs. limit=22.5 2024-06-20 06:03:25,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0 2024-06-20 06:03:26,102 INFO [train.py:1028] (1/2) Epoch 8, batch 8600, loss[loss=0.2881, simple_loss=0.3332, pruned_loss=0.1215, over 13097.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3226, pruned_loss=0.1223, over 2572926.01 frames. 
], batch size: 121, lr: 7.18e-03, grad_scale: 64.0 2024-06-20 06:03:41,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=145640.0, ans=0.125 2024-06-20 06:03:43,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=145640.0, ans=0.0 2024-06-20 06:03:43,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=145640.0, ans=0.125 2024-06-20 06:03:52,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145676.66666666666, ans=0.1 2024-06-20 06:03:55,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=145676.66666666666, ans=0.1 2024-06-20 06:03:56,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=145676.66666666666, ans=0.0 2024-06-20 06:03:57,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=145676.66666666666, ans=0.125 2024-06-20 06:03:59,832 INFO [train.py:1028] (1/2) Epoch 8, batch 8650, loss[loss=0.2943, simple_loss=0.3279, pruned_loss=0.1304, over 13034.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3228, pruned_loss=0.1223, over 2576417.45 frames. ], batch size: 102, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:04:01,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=145695.0, ans=0.125 2024-06-20 06:04:03,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=145695.0, ans=0.04949747468305833 2024-06-20 06:04:18,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=145731.66666666666, ans=0.1 2024-06-20 06:04:25,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=145750.0, ans=0.125 2024-06-20 06:04:25,803 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.098e+02 2.285e+02 2.659e+02 3.625e+02, threshold=4.570e+02, percent-clipped=0.0 2024-06-20 06:04:26,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.99 vs. limit=15.0 2024-06-20 06:04:27,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=145750.0, ans=0.09899494936611666 2024-06-20 06:04:33,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=145768.33333333334, ans=0.0 2024-06-20 06:04:35,715 INFO [train.py:1028] (1/2) Epoch 8, batch 8700, loss[loss=0.2941, simple_loss=0.3417, pruned_loss=0.1233, over 13165.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3238, pruned_loss=0.1229, over 2573374.62 frames. 
], batch size: 59, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:04:39,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=145786.66666666666, ans=0.0 2024-06-20 06:04:40,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145786.66666666666, ans=0.1 2024-06-20 06:04:43,228 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.28 vs. limit=22.5 2024-06-20 06:04:50,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=145823.33333333334, ans=0.0 2024-06-20 06:05:04,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. limit=6.0 2024-06-20 06:05:12,004 INFO [train.py:1028] (1/2) Epoch 8, batch 8750, loss[loss=0.2751, simple_loss=0.3139, pruned_loss=0.1181, over 13111.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3237, pruned_loss=0.123, over 2567647.02 frames. ], batch size: 121, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:05:15,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=145878.33333333334, ans=0.125 2024-06-20 06:05:20,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.94 vs. limit=15.0 2024-06-20 06:05:20,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=145896.66666666666, ans=0.2 2024-06-20 06:05:27,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2024-06-20 06:05:33,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.09 vs. limit=10.0 2024-06-20 06:05:35,241 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.71 vs. limit=15.0 2024-06-20 06:05:35,344 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 1.980e+02 2.148e+02 2.405e+02 3.825e+02, threshold=4.296e+02, percent-clipped=0.0 2024-06-20 06:05:35,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=145933.33333333334, ans=0.0 2024-06-20 06:05:37,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=15.0 2024-06-20 06:05:38,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=145951.66666666666, ans=0.0 2024-06-20 06:05:43,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=145951.66666666666, ans=0.125 2024-06-20 06:05:46,345 INFO [train.py:1028] (1/2) Epoch 8, batch 8800, loss[loss=0.2748, simple_loss=0.3197, pruned_loss=0.1149, over 13241.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3241, pruned_loss=0.1235, over 2572759.87 frames. 
], batch size: 72, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:05:47,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=145970.0, ans=0.125 2024-06-20 06:05:51,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.04 vs. limit=15.0 2024-06-20 06:06:10,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=15.0 2024-06-20 06:06:11,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.05 vs. limit=22.5 2024-06-20 06:06:12,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=146025.0, ans=0.125 2024-06-20 06:06:24,097 INFO [train.py:1028] (1/2) Epoch 8, batch 8850, loss[loss=0.3218, simple_loss=0.351, pruned_loss=0.1463, over 12523.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3247, pruned_loss=0.1243, over 2561132.77 frames. ], batch size: 202, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:06:24,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=146061.66666666666, ans=0.95 2024-06-20 06:06:26,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=146061.66666666666, ans=0.125 2024-06-20 06:06:31,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=146080.0, ans=0.125 2024-06-20 06:06:31,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=146080.0, ans=0.125 2024-06-20 06:06:36,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=146080.0, ans=0.125 2024-06-20 06:06:42,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0 2024-06-20 06:06:47,440 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 1.971e+02 2.144e+02 2.478e+02 3.434e+02, threshold=4.288e+02, percent-clipped=0.0 2024-06-20 06:06:52,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.55 vs. limit=15.0 2024-06-20 06:06:55,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=146135.0, ans=0.125 2024-06-20 06:06:57,479 INFO [train.py:1028] (1/2) Epoch 8, batch 8900, loss[loss=0.2812, simple_loss=0.322, pruned_loss=0.1202, over 12910.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3247, pruned_loss=0.1244, over 2558317.46 frames. 
], batch size: 33, lr: 7.16e-03, grad_scale: 64.0 2024-06-20 06:07:15,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=146190.0, ans=0.0 2024-06-20 06:07:21,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=146190.0, ans=0.125 2024-06-20 06:07:21,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=146190.0, ans=0.0 2024-06-20 06:07:22,949 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.97 vs. limit=15.0 2024-06-20 06:07:26,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=146208.33333333334, ans=0.0 2024-06-20 06:07:28,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=146226.66666666666, ans=0.0 2024-06-20 06:07:35,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=146226.66666666666, ans=0.07 2024-06-20 06:07:36,446 INFO [train.py:1028] (1/2) Epoch 8, batch 8950, loss[loss=0.3239, simple_loss=0.3463, pruned_loss=0.1508, over 12521.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3247, pruned_loss=0.124, over 2559759.90 frames. ], batch size: 202, lr: 7.16e-03, grad_scale: 64.0 2024-06-20 06:07:38,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=146245.0, ans=0.125 2024-06-20 06:07:42,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=146263.33333333334, ans=0.125 2024-06-20 06:07:44,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=146263.33333333334, ans=10.0 2024-06-20 06:08:00,092 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.017e+02 2.283e+02 2.688e+02 4.509e+02, threshold=4.567e+02, percent-clipped=1.0 2024-06-20 06:08:02,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.64 vs. limit=22.5 2024-06-20 06:08:13,456 INFO [train.py:1028] (1/2) Epoch 8, batch 9000, loss[loss=0.2921, simple_loss=0.3277, pruned_loss=0.1283, over 13285.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3246, pruned_loss=0.1236, over 2566797.17 frames. ], batch size: 46, lr: 7.16e-03, grad_scale: 128.0 2024-06-20 06:08:13,456 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 06:08:21,242 INFO [train.py:1060] (1/2) Epoch 8, validation: loss=0.2055, simple_loss=0.2684, pruned_loss=0.07135, over 351949.00 frames. 2024-06-20 06:08:21,242 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 06:08:21,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=146336.66666666666, ans=0.0 2024-06-20 06:08:39,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=146373.33333333334, ans=0.125 2024-06-20 06:08:53,446 INFO [train.py:1028] (1/2) Epoch 8, batch 9050, loss[loss=0.2638, simple_loss=0.3047, pruned_loss=0.1115, over 11044.00 frames. 
], tot_loss[loss=0.2872, simple_loss=0.3257, pruned_loss=0.1243, over 2567307.32 frames. ], batch size: 16, lr: 7.16e-03, grad_scale: 128.0 2024-06-20 06:09:00,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=146446.66666666666, ans=0.125 2024-06-20 06:09:09,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=146465.0, ans=0.125 2024-06-20 06:09:13,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=146483.33333333334, ans=0.0 2024-06-20 06:09:16,522 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 1.982e+02 2.156e+02 2.383e+02 3.049e+02, threshold=4.313e+02, percent-clipped=0.0 2024-06-20 06:09:26,570 INFO [train.py:1028] (1/2) Epoch 8, batch 9100, loss[loss=0.2698, simple_loss=0.3195, pruned_loss=0.1101, over 13237.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3251, pruned_loss=0.1235, over 2569558.43 frames. ], batch size: 72, lr: 7.15e-03, grad_scale: 128.0 2024-06-20 06:09:44,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=146556.66666666666, ans=0.95 2024-06-20 06:09:44,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=146556.66666666666, ans=0.125 2024-06-20 06:09:48,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=146575.0, ans=0.125 2024-06-20 06:09:52,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=146593.33333333334, ans=0.09899494936611666 2024-06-20 06:09:52,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=146593.33333333334, ans=0.125 2024-06-20 06:09:54,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=146593.33333333334, ans=0.5 2024-06-20 06:09:58,265 INFO [train.py:1028] (1/2) Epoch 8, batch 9150, loss[loss=0.2682, simple_loss=0.303, pruned_loss=0.1167, over 13168.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3247, pruned_loss=0.1232, over 2570312.49 frames. ], batch size: 77, lr: 7.15e-03, grad_scale: 64.0 2024-06-20 06:10:00,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=146611.66666666666, ans=0.125 2024-06-20 06:10:03,237 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=12.0 2024-06-20 06:10:05,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.27 vs. limit=15.0 2024-06-20 06:10:09,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146630.0, ans=0.1 2024-06-20 06:10:09,484 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.51 vs. 
limit=15.0 2024-06-20 06:10:12,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=12.82 vs. limit=12.0 2024-06-20 06:10:28,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.06 vs. limit=22.5 2024-06-20 06:10:29,869 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.894e+02 2.037e+02 2.253e+02 2.783e+02, threshold=4.075e+02, percent-clipped=0.0 2024-06-20 06:10:38,789 INFO [train.py:1028] (1/2) Epoch 8, batch 9200, loss[loss=0.2553, simple_loss=0.3102, pruned_loss=0.1002, over 12874.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3242, pruned_loss=0.1225, over 2572484.27 frames. ], batch size: 36, lr: 7.15e-03, grad_scale: 64.0 2024-06-20 06:10:48,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.66 vs. limit=15.0 2024-06-20 06:10:50,520 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:10:51,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=146740.0, ans=0.2 2024-06-20 06:11:01,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.57 vs. limit=22.5 2024-06-20 06:11:10,281 INFO [train.py:1028] (1/2) Epoch 8, batch 9250, loss[loss=0.2645, simple_loss=0.3162, pruned_loss=0.1065, over 13268.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.324, pruned_loss=0.1223, over 2575554.69 frames. ], batch size: 67, lr: 7.15e-03, grad_scale: 64.0 2024-06-20 06:11:12,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146795.0, ans=0.1 2024-06-20 06:11:14,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146795.0, ans=0.1 2024-06-20 06:11:14,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.76 vs. limit=10.0 2024-06-20 06:11:16,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=146813.33333333334, ans=0.125 2024-06-20 06:11:20,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146813.33333333334, ans=0.1 2024-06-20 06:11:28,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=146850.0, ans=0.125 2024-06-20 06:11:30,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=146850.0, ans=0.125 2024-06-20 06:11:32,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. 
limit=15.0 2024-06-20 06:11:32,512 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.923e+02 2.143e+02 2.473e+02 3.321e+02, threshold=4.286e+02, percent-clipped=0.0 2024-06-20 06:11:35,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=146868.33333333334, ans=0.07 2024-06-20 06:11:40,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.70 vs. limit=15.0 2024-06-20 06:11:41,436 INFO [train.py:1028] (1/2) Epoch 8, batch 9300, loss[loss=0.282, simple_loss=0.3194, pruned_loss=0.1223, over 12973.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3235, pruned_loss=0.1217, over 2571408.90 frames. ], batch size: 39, lr: 7.15e-03, grad_scale: 64.0 2024-06-20 06:11:41,912 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.90 vs. limit=10.0 2024-06-20 06:11:49,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=146905.0, ans=0.125 2024-06-20 06:12:02,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=146941.66666666666, ans=0.125 2024-06-20 06:12:07,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=146941.66666666666, ans=0.2 2024-06-20 06:12:15,362 INFO [train.py:1028] (1/2) Epoch 8, batch 9350, loss[loss=0.248, simple_loss=0.2996, pruned_loss=0.09817, over 12501.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3241, pruned_loss=0.1222, over 2569469.36 frames. ], batch size: 22, lr: 7.14e-03, grad_scale: 64.0 2024-06-20 06:12:15,698 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.56 vs. limit=22.5 2024-06-20 06:12:16,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=146978.33333333334, ans=0.0 2024-06-20 06:12:22,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=146996.66666666666, ans=0.2 2024-06-20 06:12:26,003 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.33 vs. limit=15.0 2024-06-20 06:12:30,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=147015.0, ans=0.025 2024-06-20 06:12:37,815 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.98 vs. limit=15.0 2024-06-20 06:12:37,917 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 1.948e+02 2.046e+02 2.282e+02 3.279e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 06:12:46,398 INFO [train.py:1028] (1/2) Epoch 8, batch 9400, loss[loss=0.2859, simple_loss=0.328, pruned_loss=0.1219, over 13260.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3244, pruned_loss=0.1225, over 2568217.05 frames. 
], batch size: 52, lr: 7.14e-03, grad_scale: 64.0 2024-06-20 06:12:50,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=147070.0, ans=0.2 2024-06-20 06:12:56,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=147088.33333333334, ans=0.125 2024-06-20 06:12:58,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.25 vs. limit=15.0 2024-06-20 06:13:00,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=147106.66666666666, ans=0.0 2024-06-20 06:13:02,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=147106.66666666666, ans=0.05 2024-06-20 06:13:16,732 INFO [train.py:1028] (1/2) Epoch 8, batch 9450, loss[loss=0.2976, simple_loss=0.3366, pruned_loss=0.1294, over 12425.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3253, pruned_loss=0.1233, over 2569109.17 frames. ], batch size: 22, lr: 7.14e-03, grad_scale: 64.0 2024-06-20 06:13:17,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=147161.66666666666, ans=0.025 2024-06-20 06:13:17,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=147161.66666666666, ans=0.125 2024-06-20 06:13:18,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=147161.66666666666, ans=0.025 2024-06-20 06:13:21,063 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:13:30,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147198.33333333334, ans=0.1 2024-06-20 06:13:36,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=147198.33333333334, ans=0.125 2024-06-20 06:13:37,921 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.12 vs. limit=15.0 2024-06-20 06:13:41,521 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.961e+02 2.150e+02 2.399e+02 3.185e+02, threshold=4.301e+02, percent-clipped=0.0 2024-06-20 06:13:49,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=147235.0, ans=0.125 2024-06-20 06:13:50,336 INFO [train.py:1028] (1/2) Epoch 8, batch 9500, loss[loss=0.2636, simple_loss=0.3189, pruned_loss=0.1042, over 13216.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.325, pruned_loss=0.1226, over 2577819.61 frames. 
], batch size: 43, lr: 7.14e-03, grad_scale: 64.0 2024-06-20 06:13:50,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=147253.33333333334, ans=0.125 2024-06-20 06:13:54,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=147253.33333333334, ans=0.2 2024-06-20 06:14:03,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.31 vs. limit=15.0 2024-06-20 06:14:04,504 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.860e+01 2024-06-20 06:14:21,423 INFO [train.py:1028] (1/2) Epoch 8, batch 9550, loss[loss=0.2494, simple_loss=0.2958, pruned_loss=0.1015, over 12904.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3245, pruned_loss=0.1222, over 2573868.96 frames. ], batch size: 39, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:14:27,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=147363.33333333334, ans=0.025 2024-06-20 06:14:34,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=147381.66666666666, ans=0.2 2024-06-20 06:14:39,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=147400.0, ans=0.1 2024-06-20 06:14:44,019 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.945e+02 2.109e+02 2.331e+02 3.710e+02, threshold=4.217e+02, percent-clipped=0.0 2024-06-20 06:14:54,865 INFO [train.py:1028] (1/2) Epoch 8, batch 9600, loss[loss=0.2965, simple_loss=0.3227, pruned_loss=0.1351, over 10436.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3245, pruned_loss=0.1224, over 2572017.31 frames. ], batch size: 303, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:15:07,278 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.35 vs. limit=15.0 2024-06-20 06:15:07,679 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:15:14,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=147491.66666666666, ans=0.125 2024-06-20 06:15:17,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=147491.66666666666, ans=0.0 2024-06-20 06:15:26,043 INFO [train.py:1028] (1/2) Epoch 8, batch 9650, loss[loss=0.2937, simple_loss=0.3228, pruned_loss=0.1323, over 13093.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3249, pruned_loss=0.123, over 2562287.69 frames. ], batch size: 132, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:15:30,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147528.33333333334, ans=0.125 2024-06-20 06:15:36,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=147546.66666666666, ans=0.0 2024-06-20 06:15:47,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.61 vs. 
limit=10.0 2024-06-20 06:15:48,374 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.918e+02 2.087e+02 2.275e+02 3.787e+02, threshold=4.175e+02, percent-clipped=0.0 2024-06-20 06:15:59,888 INFO [train.py:1028] (1/2) Epoch 8, batch 9700, loss[loss=0.2784, simple_loss=0.3195, pruned_loss=0.1187, over 13049.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.324, pruned_loss=0.1226, over 2556660.12 frames. ], batch size: 144, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:16:13,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=147656.66666666666, ans=0.0 2024-06-20 06:16:15,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=147656.66666666666, ans=0.02 2024-06-20 06:16:18,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147675.0, ans=0.1 2024-06-20 06:16:30,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=147711.66666666666, ans=0.1 2024-06-20 06:16:30,489 INFO [train.py:1028] (1/2) Epoch 8, batch 9750, loss[loss=0.2833, simple_loss=0.3123, pruned_loss=0.1271, over 13050.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3227, pruned_loss=0.1217, over 2553168.69 frames. ], batch size: 132, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:16:33,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=147711.66666666666, ans=0.125 2024-06-20 06:16:39,492 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.09 vs. limit=6.0 2024-06-20 06:16:40,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147730.0, ans=0.1 2024-06-20 06:16:54,168 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.011e+02 2.168e+02 2.550e+02 3.783e+02, threshold=4.336e+02, percent-clipped=0.0 2024-06-20 06:16:55,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=147766.66666666666, ans=0.2 2024-06-20 06:17:02,855 INFO [train.py:1028] (1/2) Epoch 8, batch 9800, loss[loss=0.2595, simple_loss=0.3101, pruned_loss=0.1045, over 13287.00 frames. ], tot_loss[loss=0.282, simple_loss=0.322, pruned_loss=0.121, over 2546284.19 frames. ], batch size: 40, lr: 7.12e-03, grad_scale: 64.0 2024-06-20 06:17:05,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147803.33333333334, ans=0.1 2024-06-20 06:17:05,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=147803.33333333334, ans=0.0 2024-06-20 06:17:07,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.66 vs. 
limit=10.0 2024-06-20 06:17:22,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=147858.33333333334, ans=0.0 2024-06-20 06:17:22,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=147858.33333333334, ans=0.0 2024-06-20 06:17:24,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147858.33333333334, ans=0.0 2024-06-20 06:17:25,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.91 vs. limit=15.0 2024-06-20 06:17:31,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=147876.66666666666, ans=0.95 2024-06-20 06:17:33,695 INFO [train.py:1028] (1/2) Epoch 8, batch 9850, loss[loss=0.2975, simple_loss=0.3319, pruned_loss=0.1316, over 13020.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3211, pruned_loss=0.121, over 2538348.84 frames. ], batch size: 102, lr: 7.12e-03, grad_scale: 64.0 2024-06-20 06:17:33,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=147895.0, ans=10.0 2024-06-20 06:17:35,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=147895.0, ans=0.025 2024-06-20 06:17:46,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=147931.66666666666, ans=0.125 2024-06-20 06:17:47,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=147931.66666666666, ans=0.0 2024-06-20 06:17:51,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=147931.66666666666, ans=0.05 2024-06-20 06:17:54,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=147950.0, ans=0.0 2024-06-20 06:17:57,084 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.905e+02 2.078e+02 2.393e+02 3.804e+02, threshold=4.156e+02, percent-clipped=0.0 2024-06-20 06:17:57,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.06 vs. limit=15.0 2024-06-20 06:17:58,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147950.0, ans=0.1 2024-06-20 06:17:58,363 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:17:58,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=147950.0, ans=0.0 2024-06-20 06:17:58,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=147950.0, ans=0.04949747468305833 2024-06-20 06:18:05,419 INFO [train.py:1028] (1/2) Epoch 8, batch 9900, loss[loss=0.2742, simple_loss=0.3101, pruned_loss=0.1191, over 12900.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3205, pruned_loss=0.1213, over 2531534.31 frames. 
], batch size: 39, lr: 7.12e-03, grad_scale: 64.0 2024-06-20 06:18:20,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=148023.33333333334, ans=0.125 2024-06-20 06:18:24,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=148023.33333333334, ans=0.0 2024-06-20 06:18:27,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148041.66666666666, ans=0.1 2024-06-20 06:18:28,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.28 vs. limit=22.5 2024-06-20 06:18:35,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.40 vs. limit=15.0 2024-06-20 06:18:37,931 INFO [train.py:1028] (1/2) Epoch 8, batch 9950, loss[loss=0.2767, simple_loss=0.3179, pruned_loss=0.1178, over 12640.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3195, pruned_loss=0.1214, over 2523891.84 frames. ], batch size: 29, lr: 7.12e-03, grad_scale: 64.0 2024-06-20 06:18:43,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=148078.33333333334, ans=0.125 2024-06-20 06:18:51,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.14 vs. limit=22.5 2024-06-20 06:19:01,369 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 1.917e+02 2.109e+02 2.316e+02 3.279e+02, threshold=4.218e+02, percent-clipped=0.0 2024-06-20 06:19:10,916 INFO [train.py:1028] (1/2) Epoch 8, batch 10000, loss[loss=0.2744, simple_loss=0.3202, pruned_loss=0.1142, over 12407.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3198, pruned_loss=0.1219, over 2486167.65 frames. ], batch size: 22, lr: 7.11e-03, grad_scale: 64.0 2024-06-20 06:19:11,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=148170.0, ans=0.0 2024-06-20 06:19:13,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.10 vs. limit=10.0 2024-06-20 06:19:24,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=148206.66666666666, ans=0.125 2024-06-20 06:19:30,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=148225.0, ans=0.125 2024-06-20 06:19:38,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148243.33333333334, ans=0.1 2024-06-20 06:19:42,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148261.66666666666, ans=0.1 2024-06-20 06:19:43,428 INFO [train.py:1028] (1/2) Epoch 8, batch 10050, loss[loss=0.2625, simple_loss=0.3068, pruned_loss=0.1091, over 12411.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3209, pruned_loss=0.1237, over 2443772.22 frames. 
], batch size: 22, lr: 7.11e-03, grad_scale: 64.0 2024-06-20 06:19:44,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=148261.66666666666, ans=0.125 2024-06-20 06:19:49,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=148280.0, ans=0.2 2024-06-20 06:19:50,558 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.31 vs. limit=15.0 2024-06-20 06:19:59,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=148298.33333333334, ans=0.125 2024-06-20 06:20:03,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=148316.66666666666, ans=0.125 2024-06-20 06:20:04,542 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 2.090e+02 2.344e+02 2.693e+02 3.716e+02, threshold=4.688e+02, percent-clipped=0.0 2024-06-20 06:20:05,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148316.66666666666, ans=0.1 2024-06-20 06:20:13,396 INFO [train.py:1028] (1/2) Epoch 8, batch 10100, loss[loss=0.262, simple_loss=0.3055, pruned_loss=0.1092, over 11430.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3194, pruned_loss=0.1221, over 2425925.19 frames. ], batch size: 17, lr: 7.11e-03, grad_scale: 64.0 2024-06-20 06:20:13,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=22.5 2024-06-20 06:20:17,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148353.33333333334, ans=0.1 2024-06-20 06:20:17,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=148353.33333333334, ans=0.0 2024-06-20 06:20:19,572 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.88 vs. limit=15.0 2024-06-20 06:20:21,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=148371.66666666666, ans=0.0 2024-06-20 06:20:23,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.00 vs. limit=15.0 2024-06-20 06:22:12,610 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:22:27,277 INFO [train.py:1028] (1/2) Epoch 9, batch 0, loss[loss=0.2483, simple_loss=0.2937, pruned_loss=0.1014, over 12862.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2937, pruned_loss=0.1014, over 12862.00 frames. ], batch size: 36, lr: 6.73e-03, grad_scale: 64.0 2024-06-20 06:22:27,277 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 06:22:34,311 INFO [train.py:1060] (1/2) Epoch 9, validation: loss=0.2068, simple_loss=0.27, pruned_loss=0.07179, over 351949.00 frames. 
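The records above cross an epoch boundary: epoch 8 ends around batch 10100, a fresh loss-averaging window opens at epoch 9 batch 0 (so `tot_loss` briefly equals the single-batch loss), and a validation pass over 351949.00 frames is logged at loss=0.2068. The recurring `WARNING [optim.py:487] Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...` records report the distribution of recent gradient norms and the clipping threshold derived from it. To follow these trends without reading the raw stream, the loss and threshold fields can be scraped with a couple of regular expressions. This is a minimal sketch under stated assumptions: the log path is hypothetical, one record per line is assumed (as in the raw log file), and the patterns are derived only from the record formats visible here.

```python
import re

# Matches training records of the form seen above:
# "Epoch 9, batch 50, loss[...], tot_loss[loss=0.2629, ...], batch size: 29, lr: 6.73e-03, ..."
TRAIN_RE = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?tot_loss\[loss=(?P<tot_loss>[\d.]+)"
)
# Matches optimizer warnings of the form:
# "Clipping_scale=2.0, grad-norm quartiles 1.559e+02 ... 3.074e+02, threshold=3.962e+02, percent-clipped=0.0"
CLIP_RE = re.compile(
    r"grad-norm quartiles [\d.e+\- ]+, threshold=(?P<threshold>[\d.e+]+), "
    r"percent-clipped=(?P<clipped>[\d.]+)"
)

def scan_log(path="zipformer/exp/log/log-train"):  # hypothetical path, not from the log
    """Collect (epoch, batch, tot_loss) points and grad-norm clipping thresholds."""
    losses, thresholds = [], []
    with open(path) as f:
        for line in f:
            if m := TRAIN_RE.search(line):
                losses.append((int(m["epoch"]), int(m["batch"]), float(m["tot_loss"])))
            if m := CLIP_RE.search(line):
                thresholds.append(float(m["threshold"]))
    return losses, thresholds
```

Run over this stretch of the log, such a scan would show `tot_loss` drifting from roughly 0.28 in late epoch 8 to roughly 0.26 in early epoch 9, while the clipping threshold hovers near 4e+02 with `percent-clipped` almost always 0.0, i.e. clipping is rarely active at this stage of training.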
2024-06-20 06:22:34,311 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 06:22:36,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=148386.33333333334, ans=0.125 2024-06-20 06:22:39,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2024-06-20 06:22:39,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.64 vs. limit=15.0 2024-06-20 06:22:49,656 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.30 vs. limit=22.5 2024-06-20 06:22:53,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148423.0, ans=0.1 2024-06-20 06:23:04,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=148459.66666666666, ans=0.05 2024-06-20 06:23:07,402 INFO [train.py:1028] (1/2) Epoch 9, batch 50, loss[loss=0.2787, simple_loss=0.3189, pruned_loss=0.1193, over 12624.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3003, pruned_loss=0.1128, over 574502.98 frames. ], batch size: 29, lr: 6.73e-03, grad_scale: 64.0 2024-06-20 06:23:11,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=148478.0, ans=0.0 2024-06-20 06:23:16,174 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.75 vs. limit=12.0 2024-06-20 06:23:18,708 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.820e+02 1.981e+02 2.167e+02 3.074e+02, threshold=3.962e+02, percent-clipped=0.0 2024-06-20 06:23:20,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=148514.66666666666, ans=0.0 2024-06-20 06:23:21,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=148514.66666666666, ans=0.0 2024-06-20 06:23:23,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=148514.66666666666, ans=0.125 2024-06-20 06:23:30,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=148533.0, ans=0.0 2024-06-20 06:23:33,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=148551.33333333334, ans=0.0 2024-06-20 06:23:41,280 INFO [train.py:1028] (1/2) Epoch 9, batch 100, loss[loss=0.2326, simple_loss=0.2797, pruned_loss=0.09277, over 13313.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.2979, pruned_loss=0.1111, over 1017315.11 frames. 
], batch size: 46, lr: 6.73e-03, grad_scale: 64.0 2024-06-20 06:23:41,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=148569.66666666666, ans=0.0 2024-06-20 06:23:43,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=148569.66666666666, ans=0.05 2024-06-20 06:23:45,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=148569.66666666666, ans=0.125 2024-06-20 06:23:58,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=148606.33333333334, ans=0.0 2024-06-20 06:24:00,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=148606.33333333334, ans=0.125 2024-06-20 06:24:02,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=148606.33333333334, ans=0.0 2024-06-20 06:24:05,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=148624.66666666666, ans=0.0 2024-06-20 06:24:10,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=148643.0, ans=0.125 2024-06-20 06:24:16,670 INFO [train.py:1028] (1/2) Epoch 9, batch 150, loss[loss=0.2457, simple_loss=0.2871, pruned_loss=0.1021, over 12795.00 frames. ], tot_loss[loss=0.26, simple_loss=0.298, pruned_loss=0.111, over 1364614.22 frames. ], batch size: 29, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:24:16,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=148661.33333333334, ans=0.2 2024-06-20 06:24:28,112 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.804e+02 1.922e+02 2.111e+02 3.094e+02, threshold=3.845e+02, percent-clipped=0.0 2024-06-20 06:24:41,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=148716.33333333334, ans=15.0 2024-06-20 06:24:43,434 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.49 vs. limit=15.0 2024-06-20 06:24:44,841 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.03 vs. limit=15.0 2024-06-20 06:24:48,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=148753.0, ans=0.0 2024-06-20 06:24:48,750 INFO [train.py:1028] (1/2) Epoch 9, batch 200, loss[loss=0.2963, simple_loss=0.3245, pruned_loss=0.1341, over 12555.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.2979, pruned_loss=0.1108, over 1634073.56 frames. ], batch size: 202, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:24:52,284 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.27 vs. 
limit=15.0 2024-06-20 06:24:57,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=148771.33333333334, ans=0.09899494936611666 2024-06-20 06:25:00,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=148771.33333333334, ans=0.2 2024-06-20 06:25:05,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=148789.66666666666, ans=0.125 2024-06-20 06:25:07,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=148808.0, ans=0.0 2024-06-20 06:25:18,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=148826.33333333334, ans=0.2 2024-06-20 06:25:18,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.29 vs. limit=22.5 2024-06-20 06:25:19,782 INFO [train.py:1028] (1/2) Epoch 9, batch 250, loss[loss=0.2548, simple_loss=0.2892, pruned_loss=0.1102, over 13020.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.298, pruned_loss=0.1106, over 1845875.11 frames. ], batch size: 144, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:25:20,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=148844.66666666666, ans=0.0 2024-06-20 06:25:20,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.77 vs. limit=15.0 2024-06-20 06:25:21,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.00 vs. limit=10.0 2024-06-20 06:25:23,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=148844.66666666666, ans=0.0 2024-06-20 06:25:29,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=148863.0, ans=0.0 2024-06-20 06:25:30,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=148863.0, ans=0.2 2024-06-20 06:25:31,290 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.812e+02 1.972e+02 2.224e+02 4.192e+02, threshold=3.945e+02, percent-clipped=1.0 2024-06-20 06:25:45,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=148899.66666666666, ans=0.025 2024-06-20 06:25:47,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=148899.66666666666, ans=0.0 2024-06-20 06:25:58,228 INFO [train.py:1028] (1/2) Epoch 9, batch 300, loss[loss=0.2652, simple_loss=0.2967, pruned_loss=0.1169, over 13128.00 frames. ], tot_loss[loss=0.26, simple_loss=0.2985, pruned_loss=0.1108, over 2008886.25 frames. 
], batch size: 112, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:25:58,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=148936.33333333334, ans=0.0 2024-06-20 06:26:03,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=148936.33333333334, ans=0.2 2024-06-20 06:26:08,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=148954.66666666666, ans=0.025 2024-06-20 06:26:09,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=148954.66666666666, ans=0.125 2024-06-20 06:26:14,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=148973.0, ans=0.025 2024-06-20 06:26:15,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=148973.0, ans=0.0 2024-06-20 06:26:16,713 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.38 vs. limit=22.5 2024-06-20 06:26:18,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=148991.33333333334, ans=0.125 2024-06-20 06:26:18,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=148991.33333333334, ans=0.0 2024-06-20 06:26:21,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=148991.33333333334, ans=0.0 2024-06-20 06:26:30,750 INFO [train.py:1028] (1/2) Epoch 9, batch 350, loss[loss=0.2566, simple_loss=0.2895, pruned_loss=0.1118, over 12998.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.2968, pruned_loss=0.1097, over 2137402.98 frames. ], batch size: 33, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:26:35,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149028.0, ans=0.1 2024-06-20 06:26:42,787 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.873e+02 2.049e+02 2.269e+02 3.318e+02, threshold=4.098e+02, percent-clipped=0.0 2024-06-20 06:26:51,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=149083.0, ans=0.125 2024-06-20 06:26:56,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=149101.33333333334, ans=0.125 2024-06-20 06:27:02,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=149119.66666666666, ans=0.125 2024-06-20 06:27:03,010 INFO [train.py:1028] (1/2) Epoch 9, batch 400, loss[loss=0.2617, simple_loss=0.3073, pruned_loss=0.1081, over 13316.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.2971, pruned_loss=0.1095, over 2237908.03 frames. 
], batch size: 63, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:27:05,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=149119.66666666666, ans=0.0 2024-06-20 06:27:09,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=149138.0, ans=0.0 2024-06-20 06:27:12,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.19 vs. limit=22.5 2024-06-20 06:27:22,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=149174.66666666666, ans=0.125 2024-06-20 06:27:31,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-20 06:27:34,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=149211.33333333334, ans=0.125 2024-06-20 06:27:34,682 INFO [train.py:1028] (1/2) Epoch 9, batch 450, loss[loss=0.2535, simple_loss=0.3028, pruned_loss=0.1021, over 13200.00 frames. ], tot_loss[loss=0.258, simple_loss=0.297, pruned_loss=0.1095, over 2312138.38 frames. ], batch size: 67, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:27:34,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=149211.33333333334, ans=0.95 2024-06-20 06:27:36,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=149211.33333333334, ans=0.0 2024-06-20 06:27:43,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.23 vs. limit=22.5 2024-06-20 06:27:49,779 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.911e+02 2.029e+02 2.241e+02 3.321e+02, threshold=4.059e+02, percent-clipped=0.0 2024-06-20 06:27:50,252 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.83 vs. limit=15.0 2024-06-20 06:27:55,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149248.0, ans=0.1 2024-06-20 06:28:06,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=149284.66666666666, ans=0.125 2024-06-20 06:28:07,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=149284.66666666666, ans=0.0 2024-06-20 06:28:08,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=149284.66666666666, ans=0.125 2024-06-20 06:28:13,155 INFO [train.py:1028] (1/2) Epoch 9, batch 500, loss[loss=0.2486, simple_loss=0.2793, pruned_loss=0.1089, over 13123.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.2972, pruned_loss=0.1093, over 2375334.54 frames. 
], batch size: 121, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:28:15,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=149303.0, ans=0.0 2024-06-20 06:28:18,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=149321.33333333334, ans=0.2 2024-06-20 06:28:26,360 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.74 vs. limit=22.5 2024-06-20 06:28:27,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=149339.66666666666, ans=0.125 2024-06-20 06:28:35,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=149358.0, ans=0.2 2024-06-20 06:28:39,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149376.33333333334, ans=0.1 2024-06-20 06:28:40,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=149376.33333333334, ans=0.125 2024-06-20 06:28:45,080 INFO [train.py:1028] (1/2) Epoch 9, batch 550, loss[loss=0.2516, simple_loss=0.2895, pruned_loss=0.1068, over 12937.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.2974, pruned_loss=0.1094, over 2420153.14 frames. ], batch size: 158, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:28:53,023 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2024-06-20 06:28:56,386 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.823e+02 1.969e+02 2.175e+02 2.892e+02, threshold=3.938e+02, percent-clipped=0.0 2024-06-20 06:28:59,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=149431.33333333334, ans=15.0 2024-06-20 06:29:07,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149449.66666666666, ans=0.1 2024-06-20 06:29:11,920 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=6.480e+01 2024-06-20 06:29:14,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2024-06-20 06:29:16,435 INFO [train.py:1028] (1/2) Epoch 9, batch 600, loss[loss=0.2317, simple_loss=0.2573, pruned_loss=0.103, over 13087.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.2965, pruned_loss=0.109, over 2457544.62 frames. ], batch size: 144, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:29:20,643 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.12 vs. 
limit=15.0 2024-06-20 06:29:28,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=149523.0, ans=0.125 2024-06-20 06:29:31,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=149523.0, ans=0.2 2024-06-20 06:29:31,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=149523.0, ans=0.2 2024-06-20 06:29:32,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=149523.0, ans=0.0 2024-06-20 06:29:33,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=149523.0, ans=0.0 2024-06-20 06:29:33,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=149523.0, ans=0.125 2024-06-20 06:29:38,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=149541.33333333334, ans=0.125 2024-06-20 06:29:44,179 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.95 vs. limit=6.0 2024-06-20 06:29:47,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=149559.66666666666, ans=0.2 2024-06-20 06:29:48,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=149559.66666666666, ans=15.0 2024-06-20 06:29:51,645 INFO [train.py:1028] (1/2) Epoch 9, batch 650, loss[loss=0.2525, simple_loss=0.3029, pruned_loss=0.1011, over 13160.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.2967, pruned_loss=0.1085, over 2489317.83 frames. ], batch size: 59, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:29:52,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.70 vs. limit=15.0 2024-06-20 06:30:05,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=149596.33333333334, ans=0.025 2024-06-20 06:30:06,495 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.877e+02 1.985e+02 2.088e+02 2.810e+02, threshold=3.971e+02, percent-clipped=0.0 2024-06-20 06:30:11,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.43 vs. limit=15.0 2024-06-20 06:30:25,055 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.87 vs. limit=15.0 2024-06-20 06:30:27,720 INFO [train.py:1028] (1/2) Epoch 9, batch 700, loss[loss=0.2501, simple_loss=0.2911, pruned_loss=0.1045, over 13257.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.2963, pruned_loss=0.1087, over 2512539.20 frames. ], batch size: 46, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:30:37,521 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.59 vs. 
limit=10.0 2024-06-20 06:30:53,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=149743.0, ans=0.0 2024-06-20 06:30:55,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=149743.0, ans=0.0 2024-06-20 06:30:59,936 INFO [train.py:1028] (1/2) Epoch 9, batch 750, loss[loss=0.2241, simple_loss=0.2705, pruned_loss=0.08883, over 13295.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.2959, pruned_loss=0.1084, over 2528246.24 frames. ], batch size: 63, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:31:00,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=149761.33333333334, ans=0.125 2024-06-20 06:31:06,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=149779.66666666666, ans=0.125 2024-06-20 06:31:09,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=149779.66666666666, ans=0.125 2024-06-20 06:31:09,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=149779.66666666666, ans=0.04949747468305833 2024-06-20 06:31:11,614 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.806e+02 1.953e+02 2.153e+02 3.078e+02, threshold=3.906e+02, percent-clipped=0.0 2024-06-20 06:31:19,898 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.42 vs. limit=15.0 2024-06-20 06:31:26,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2024-06-20 06:31:32,437 INFO [train.py:1028] (1/2) Epoch 9, batch 800, loss[loss=0.2564, simple_loss=0.2949, pruned_loss=0.1089, over 12948.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.2965, pruned_loss=0.109, over 2541159.10 frames. ], batch size: 36, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:31:33,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=149853.0, ans=0.125 2024-06-20 06:31:36,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=149853.0, ans=0.2 2024-06-20 06:32:10,906 INFO [train.py:1028] (1/2) Epoch 9, batch 850, loss[loss=0.2614, simple_loss=0.3002, pruned_loss=0.1113, over 13116.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.2955, pruned_loss=0.1081, over 2551570.10 frames. 
], batch size: 95, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:32:14,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=149944.66666666666, ans=0.125 2024-06-20 06:32:22,348 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.805e+02 1.965e+02 2.176e+02 3.046e+02, threshold=3.930e+02, percent-clipped=0.0 2024-06-20 06:32:29,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=149999.66666666666, ans=0.125 2024-06-20 06:32:39,180 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:32:42,850 INFO [train.py:1028] (1/2) Epoch 9, batch 900, loss[loss=0.2454, simple_loss=0.2885, pruned_loss=0.1012, over 12892.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.2957, pruned_loss=0.1083, over 2556329.26 frames. ], batch size: 36, lr: 6.69e-03, grad_scale: 64.0 2024-06-20 06:32:49,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=150054.66666666666, ans=0.1 2024-06-20 06:33:01,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.40 vs. limit=15.0 2024-06-20 06:33:09,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.84 vs. limit=15.0 2024-06-20 06:33:12,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2024-06-20 06:33:14,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=150128.0, ans=0.025 2024-06-20 06:33:15,232 INFO [train.py:1028] (1/2) Epoch 9, batch 950, loss[loss=0.2739, simple_loss=0.3138, pruned_loss=0.117, over 12890.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.2964, pruned_loss=0.1086, over 2559260.46 frames. ], batch size: 39, lr: 6.69e-03, grad_scale: 64.0 2024-06-20 06:33:15,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=150128.0, ans=0.0 2024-06-20 06:33:16,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=150128.0, ans=0.0 2024-06-20 06:33:16,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150128.0, ans=0.1 2024-06-20 06:33:18,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=150128.0, ans=0.2 2024-06-20 06:33:25,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=150146.33333333334, ans=0.2 2024-06-20 06:33:25,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.12 vs. 
limit=6.0 2024-06-20 06:33:26,408 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.854e+02 2.047e+02 2.609e+02 4.175e+02, threshold=4.093e+02, percent-clipped=3.0 2024-06-20 06:33:32,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=150164.66666666666, ans=0.125 2024-06-20 06:33:34,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=150183.0, ans=0.125 2024-06-20 06:33:38,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2024-06-20 06:33:45,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=150201.33333333334, ans=0.0 2024-06-20 06:33:46,556 INFO [train.py:1028] (1/2) Epoch 9, batch 1000, loss[loss=0.2621, simple_loss=0.3014, pruned_loss=0.1114, over 13321.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.2956, pruned_loss=0.1084, over 2561241.37 frames. ], batch size: 49, lr: 6.69e-03, grad_scale: 64.0 2024-06-20 06:33:53,554 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:34:13,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=150274.66666666666, ans=0.1 2024-06-20 06:34:15,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=150274.66666666666, ans=0.0 2024-06-20 06:34:23,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2024-06-20 06:34:24,557 INFO [train.py:1028] (1/2) Epoch 9, batch 1050, loss[loss=0.2383, simple_loss=0.2857, pruned_loss=0.09547, over 13148.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.2963, pruned_loss=0.1086, over 2564818.46 frames. ], batch size: 77, lr: 6.69e-03, grad_scale: 128.0 2024-06-20 06:34:26,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=150311.33333333334, ans=0.1 2024-06-20 06:34:33,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=150329.66666666666, ans=0.125 2024-06-20 06:34:36,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.38 vs. 
limit=15.0 2024-06-20 06:34:36,128 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.819e+02 1.994e+02 2.168e+02 2.832e+02, threshold=3.989e+02, percent-clipped=0.0 2024-06-20 06:34:36,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=150329.66666666666, ans=0.0 2024-06-20 06:34:38,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=150348.0, ans=0.07 2024-06-20 06:34:41,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=150348.0, ans=0.2 2024-06-20 06:34:46,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150366.33333333334, ans=0.1 2024-06-20 06:34:49,929 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:34:53,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=150384.66666666666, ans=0.1 2024-06-20 06:34:53,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=150384.66666666666, ans=0.0 2024-06-20 06:34:53,460 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2024-06-20 06:34:57,187 INFO [train.py:1028] (1/2) Epoch 9, batch 1100, loss[loss=0.2618, simple_loss=0.3021, pruned_loss=0.1107, over 13226.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.2966, pruned_loss=0.1085, over 2570315.36 frames. ], batch size: 52, lr: 6.69e-03, grad_scale: 128.0 2024-06-20 06:35:07,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=150421.33333333334, ans=0.025 2024-06-20 06:35:17,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=150458.0, ans=0.125 2024-06-20 06:35:29,684 INFO [train.py:1028] (1/2) Epoch 9, batch 1150, loss[loss=0.2642, simple_loss=0.3062, pruned_loss=0.1111, over 13257.00 frames. ], tot_loss[loss=0.257, simple_loss=0.2966, pruned_loss=0.1087, over 2571347.71 frames. ], batch size: 52, lr: 6.68e-03, grad_scale: 128.0 2024-06-20 06:35:30,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=150494.66666666666, ans=0.125 2024-06-20 06:35:31,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=150494.66666666666, ans=0.0 2024-06-20 06:35:39,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=150513.0, ans=0.125 2024-06-20 06:35:40,583 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.821e+02 1.969e+02 2.222e+02 2.999e+02, threshold=3.938e+02, percent-clipped=0.0 2024-06-20 06:35:51,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.48 vs. 
limit=15.0 2024-06-20 06:36:05,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=150568.0, ans=0.125 2024-06-20 06:36:07,613 INFO [train.py:1028] (1/2) Epoch 9, batch 1200, loss[loss=0.2512, simple_loss=0.2976, pruned_loss=0.1024, over 13208.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2962, pruned_loss=0.1086, over 2573599.02 frames. ], batch size: 77, lr: 6.68e-03, grad_scale: 64.0 2024-06-20 06:36:09,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=150586.33333333334, ans=0.125 2024-06-20 06:36:09,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=150586.33333333334, ans=0.0 2024-06-20 06:36:12,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150586.33333333334, ans=0.1 2024-06-20 06:36:22,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=150623.0, ans=0.125 2024-06-20 06:36:23,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=150623.0, ans=0.125 2024-06-20 06:36:28,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=150641.33333333334, ans=0.0 2024-06-20 06:36:30,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=150641.33333333334, ans=0.125 2024-06-20 06:36:37,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=150659.66666666666, ans=0.04949747468305833 2024-06-20 06:36:39,354 INFO [train.py:1028] (1/2) Epoch 9, batch 1250, loss[loss=0.265, simple_loss=0.2989, pruned_loss=0.1155, over 13211.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.2957, pruned_loss=0.1083, over 2583207.94 frames. ], batch size: 112, lr: 6.68e-03, grad_scale: 64.0 2024-06-20 06:36:43,633 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.39 vs. limit=15.0 2024-06-20 06:36:51,518 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.857e+02 1.957e+02 2.192e+02 3.105e+02, threshold=3.915e+02, percent-clipped=0.0 2024-06-20 06:36:53,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=150714.66666666666, ans=0.125 2024-06-20 06:36:55,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=150714.66666666666, ans=0.0 2024-06-20 06:36:57,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.39 vs. limit=15.0 2024-06-20 06:36:58,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=150733.0, ans=0.2 2024-06-20 06:37:05,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.65 vs. 
limit=10.0 2024-06-20 06:37:06,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=150751.33333333334, ans=0.0 2024-06-20 06:37:11,582 INFO [train.py:1028] (1/2) Epoch 9, batch 1300, loss[loss=0.279, simple_loss=0.3118, pruned_loss=0.1231, over 12781.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.2964, pruned_loss=0.1089, over 2583562.32 frames. ], batch size: 176, lr: 6.68e-03, grad_scale: 64.0 2024-06-20 06:37:25,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2024-06-20 06:37:29,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=150806.33333333334, ans=0.05 2024-06-20 06:37:36,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=150824.66666666666, ans=0.125 2024-06-20 06:37:37,357 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.98 vs. limit=22.5 2024-06-20 06:37:44,162 INFO [train.py:1028] (1/2) Epoch 9, batch 1350, loss[loss=0.2583, simple_loss=0.3036, pruned_loss=0.1065, over 13194.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.2962, pruned_loss=0.1085, over 2586697.57 frames. ], batch size: 59, lr: 6.68e-03, grad_scale: 64.0 2024-06-20 06:37:48,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150861.33333333334, ans=0.1 2024-06-20 06:37:51,968 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2024-06-20 06:38:02,658 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 1.857e+02 2.030e+02 2.265e+02 2.900e+02, threshold=4.059e+02, percent-clipped=0.0 2024-06-20 06:38:04,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=150898.0, ans=0.125 2024-06-20 06:38:05,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=150898.0, ans=0.2 2024-06-20 06:38:08,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=150898.0, ans=0.05 2024-06-20 06:38:17,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=150934.66666666666, ans=0.2 2024-06-20 06:38:23,564 INFO [train.py:1028] (1/2) Epoch 9, batch 1400, loss[loss=0.2717, simple_loss=0.3089, pruned_loss=0.1173, over 12432.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.2959, pruned_loss=0.1087, over 2588316.88 frames. ], batch size: 25, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:38:29,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.10 vs. 
limit=15.0 2024-06-20 06:38:51,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=151026.33333333334, ans=0.2 2024-06-20 06:38:54,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=151026.33333333334, ans=0.125 2024-06-20 06:38:56,340 INFO [train.py:1028] (1/2) Epoch 9, batch 1450, loss[loss=0.2642, simple_loss=0.2992, pruned_loss=0.1146, over 13134.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2958, pruned_loss=0.1088, over 2588384.25 frames. ], batch size: 121, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:38:56,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=151044.66666666666, ans=0.025 2024-06-20 06:39:02,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.67 vs. limit=12.0 2024-06-20 06:39:09,221 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 1.860e+02 1.990e+02 2.165e+02 3.156e+02, threshold=3.980e+02, percent-clipped=0.0 2024-06-20 06:39:15,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=151081.33333333334, ans=0.125 2024-06-20 06:39:19,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2024-06-20 06:39:19,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=151099.66666666666, ans=0.0 2024-06-20 06:39:25,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.31 vs. limit=15.0 2024-06-20 06:39:28,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=151118.0, ans=0.125 2024-06-20 06:39:30,014 INFO [train.py:1028] (1/2) Epoch 9, batch 1500, loss[loss=0.241, simple_loss=0.2807, pruned_loss=0.1006, over 13199.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.2953, pruned_loss=0.1088, over 2590516.90 frames. ], batch size: 83, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:39:30,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=151136.33333333334, ans=0.125 2024-06-20 06:39:38,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=151154.66666666666, ans=0.025 2024-06-20 06:39:50,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=151173.0, ans=0.125 2024-06-20 06:39:55,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=151173.0, ans=0.0 2024-06-20 06:39:58,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=151191.33333333334, ans=0.04949747468305833 2024-06-20 06:40:00,278 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.25 vs. 
limit=22.5 2024-06-20 06:40:03,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=151209.66666666666, ans=0.2 2024-06-20 06:40:09,702 INFO [train.py:1028] (1/2) Epoch 9, batch 1550, loss[loss=0.2408, simple_loss=0.2775, pruned_loss=0.1021, over 13021.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.2957, pruned_loss=0.1089, over 2585479.05 frames. ], batch size: 102, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:40:14,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=151228.0, ans=0.09899494936611666 2024-06-20 06:40:15,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=151228.0, ans=0.2 2024-06-20 06:40:22,118 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.829e+02 1.950e+02 2.121e+02 3.084e+02, threshold=3.900e+02, percent-clipped=0.0 2024-06-20 06:40:29,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=151283.0, ans=0.2 2024-06-20 06:40:42,013 INFO [train.py:1028] (1/2) Epoch 9, batch 1600, loss[loss=0.2614, simple_loss=0.3015, pruned_loss=0.1106, over 13220.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.2956, pruned_loss=0.1086, over 2580372.23 frames. ], batch size: 77, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:40:47,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=151319.66666666666, ans=0.125 2024-06-20 06:40:50,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=151338.0, ans=0.125 2024-06-20 06:41:06,783 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:41:12,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.29 vs. limit=15.0 2024-06-20 06:41:13,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=151411.33333333334, ans=0.125 2024-06-20 06:41:13,826 INFO [train.py:1028] (1/2) Epoch 9, batch 1650, loss[loss=0.2771, simple_loss=0.3079, pruned_loss=0.1231, over 13123.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.296, pruned_loss=0.1088, over 2576795.37 frames. ], batch size: 95, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:41:19,417 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.64 vs. limit=15.0 2024-06-20 06:41:20,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=12.0 2024-06-20 06:41:21,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.22 vs. 
limit=22.5 2024-06-20 06:41:26,069 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 1.848e+02 1.997e+02 2.191e+02 2.862e+02, threshold=3.994e+02, percent-clipped=0.0 2024-06-20 06:41:34,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=151466.33333333334, ans=0.125 2024-06-20 06:41:39,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.53 vs. limit=10.0 2024-06-20 06:41:46,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=151484.66666666666, ans=0.0 2024-06-20 06:41:49,670 INFO [train.py:1028] (1/2) Epoch 9, batch 1700, loss[loss=0.2863, simple_loss=0.3166, pruned_loss=0.128, over 12766.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.296, pruned_loss=0.1084, over 2582199.56 frames. ], batch size: 26, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:41:52,530 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:41:55,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=151521.33333333334, ans=0.025 2024-06-20 06:41:56,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=151521.33333333334, ans=0.0 2024-06-20 06:42:03,648 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.80 vs. limit=10.0 2024-06-20 06:42:10,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=151539.66666666666, ans=0.125 2024-06-20 06:42:15,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=151558.0, ans=0.125 2024-06-20 06:42:24,879 INFO [train.py:1028] (1/2) Epoch 9, batch 1750, loss[loss=0.2702, simple_loss=0.3136, pruned_loss=0.1134, over 12637.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.2969, pruned_loss=0.1088, over 2583128.13 frames. ], batch size: 22, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:42:26,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=151594.66666666666, ans=0.125 2024-06-20 06:42:31,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=151613.0, ans=0.0 2024-06-20 06:42:32,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=151613.0, ans=0.025 2024-06-20 06:42:37,192 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.790e+02 1.965e+02 2.187e+02 2.836e+02, threshold=3.930e+02, percent-clipped=0.0 2024-06-20 06:42:43,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2024-06-20 06:42:57,355 INFO [train.py:1028] (1/2) Epoch 9, batch 1800, loss[loss=0.2569, simple_loss=0.294, pruned_loss=0.1099, over 13262.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.2979, pruned_loss=0.1096, over 2583264.80 frames. 
], batch size: 67, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:43:15,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=151723.0, ans=0.0 2024-06-20 06:43:15,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=151723.0, ans=0.125 2024-06-20 06:43:17,067 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.04 vs. limit=15.0 2024-06-20 06:43:19,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=151741.33333333334, ans=0.125 2024-06-20 06:43:23,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.13 vs. limit=15.0 2024-06-20 06:43:29,826 INFO [train.py:1028] (1/2) Epoch 9, batch 1850, loss[loss=0.2544, simple_loss=0.2902, pruned_loss=0.1093, over 13203.00 frames. ], tot_loss[loss=0.258, simple_loss=0.2976, pruned_loss=0.1092, over 2583144.32 frames. ], batch size: 83, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:43:29,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=151778.0, ans=0.125 2024-06-20 06:43:38,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=151796.33333333334, ans=0.125 2024-06-20 06:43:42,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=151796.33333333334, ans=0.125 2024-06-20 06:43:42,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.02 vs. limit=6.0 2024-06-20 06:43:42,631 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.882e+02 2.028e+02 2.295e+02 3.513e+02, threshold=4.056e+02, percent-clipped=0.0 2024-06-20 06:43:55,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=151833.0, ans=0.2 2024-06-20 06:44:03,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=151851.33333333334, ans=0.125 2024-06-20 06:44:07,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=151851.33333333334, ans=0.125 2024-06-20 06:44:08,297 INFO [train.py:1028] (1/2) Epoch 9, batch 1900, loss[loss=0.2454, simple_loss=0.2832, pruned_loss=0.1038, over 13122.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.2973, pruned_loss=0.1092, over 2585907.61 frames. 
], batch size: 95, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:44:19,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151888.0, ans=0.1 2024-06-20 06:44:27,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=151906.33333333334, ans=0.0 2024-06-20 06:44:39,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=151943.0, ans=0.0 2024-06-20 06:44:40,814 INFO [train.py:1028] (1/2) Epoch 9, batch 1950, loss[loss=0.2417, simple_loss=0.2866, pruned_loss=0.09841, over 13259.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2962, pruned_loss=0.1086, over 2592137.85 frames. ], batch size: 52, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:44:43,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=151961.33333333334, ans=0.0 2024-06-20 06:44:46,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=151961.33333333334, ans=0.125 2024-06-20 06:44:50,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=151979.66666666666, ans=0.025 2024-06-20 06:44:50,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=151979.66666666666, ans=0.1 2024-06-20 06:44:53,325 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.799e+02 1.909e+02 2.051e+02 3.642e+02, threshold=3.819e+02, percent-clipped=0.0 2024-06-20 06:44:58,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=151998.0, ans=0.0 2024-06-20 06:45:09,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.07 vs. limit=15.0 2024-06-20 06:45:12,137 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:45:13,329 INFO [train.py:1028] (1/2) Epoch 9, batch 2000, loss[loss=0.2842, simple_loss=0.3265, pruned_loss=0.121, over 12321.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.2968, pruned_loss=0.1092, over 2587366.10 frames. ], batch size: 22, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:45:14,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=152053.0, ans=0.125 2024-06-20 06:45:14,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=152053.0, ans=0.0 2024-06-20 06:45:15,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152053.0, ans=0.1 2024-06-20 06:45:17,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=152053.0, ans=0.125 2024-06-20 06:45:21,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=152071.33333333334, ans=0.0 2024-06-20 06:45:32,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.92 vs. 
limit=15.0 2024-06-20 06:45:47,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=152126.33333333334, ans=0.1 2024-06-20 06:45:47,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=152126.33333333334, ans=0.0 2024-06-20 06:45:48,990 INFO [train.py:1028] (1/2) Epoch 9, batch 2050, loss[loss=0.233, simple_loss=0.2847, pruned_loss=0.09063, over 12563.00 frames. ], tot_loss[loss=0.258, simple_loss=0.2969, pruned_loss=0.1095, over 2583227.07 frames. ], batch size: 29, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:45:49,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=152144.66666666666, ans=22.5 2024-06-20 06:45:57,481 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.90 vs. limit=22.5 2024-06-20 06:46:05,255 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.831e+02 1.980e+02 2.192e+02 2.951e+02, threshold=3.960e+02, percent-clipped=0.0 2024-06-20 06:46:11,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=152181.33333333334, ans=0.125 2024-06-20 06:46:23,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=12.0 2024-06-20 06:46:25,263 INFO [train.py:1028] (1/2) Epoch 9, batch 2100, loss[loss=0.2521, simple_loss=0.3044, pruned_loss=0.0999, over 13265.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.297, pruned_loss=0.1088, over 2585782.23 frames. ], batch size: 59, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:46:31,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=152254.66666666666, ans=0.5 2024-06-20 06:46:45,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=152291.33333333334, ans=0.025 2024-06-20 06:46:46,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=152291.33333333334, ans=0.125 2024-06-20 06:46:49,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=152291.33333333334, ans=0.1 2024-06-20 06:46:51,764 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=12.0 2024-06-20 06:46:54,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152309.66666666666, ans=0.1 2024-06-20 06:46:55,070 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.55 vs. limit=22.5 2024-06-20 06:46:57,910 INFO [train.py:1028] (1/2) Epoch 9, batch 2150, loss[loss=0.2473, simple_loss=0.2971, pruned_loss=0.09876, over 13285.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.2967, pruned_loss=0.1082, over 2588365.19 frames. 
], batch size: 52, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:46:59,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=152328.0, ans=0.5 2024-06-20 06:47:02,717 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.672e+01 2024-06-20 06:47:10,480 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.802e+02 2.027e+02 2.254e+02 3.096e+02, threshold=4.054e+02, percent-clipped=0.0 2024-06-20 06:47:12,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=152364.66666666666, ans=0.125 2024-06-20 06:47:14,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=152364.66666666666, ans=0.2 2024-06-20 06:47:15,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=152364.66666666666, ans=0.0 2024-06-20 06:47:21,731 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:47:24,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=152401.33333333334, ans=0.125 2024-06-20 06:47:30,717 INFO [train.py:1028] (1/2) Epoch 9, batch 2200, loss[loss=0.2685, simple_loss=0.2973, pruned_loss=0.1199, over 13197.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.2976, pruned_loss=0.1088, over 2587835.67 frames. ], batch size: 83, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:47:32,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.64 vs. limit=22.5 2024-06-20 06:47:41,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=152438.0, ans=0.1 2024-06-20 06:48:02,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=152493.0, ans=0.2 2024-06-20 06:48:08,933 INFO [train.py:1028] (1/2) Epoch 9, batch 2250, loss[loss=0.2739, simple_loss=0.3207, pruned_loss=0.1136, over 13311.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.2974, pruned_loss=0.1087, over 2586812.36 frames. ], batch size: 63, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:48:09,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=152511.33333333334, ans=0.125 2024-06-20 06:48:09,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=152511.33333333334, ans=0.5 2024-06-20 06:48:21,382 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.782e+02 1.903e+02 2.055e+02 2.931e+02, threshold=3.805e+02, percent-clipped=0.0 2024-06-20 06:48:22,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=152548.0, ans=0.125 2024-06-20 06:48:33,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.46 vs. limit=15.0 2024-06-20 06:48:42,142 INFO [train.py:1028] (1/2) Epoch 9, batch 2300, loss[loss=0.2501, simple_loss=0.2906, pruned_loss=0.1048, over 12811.00 frames. 
], tot_loss[loss=0.2578, simple_loss=0.2979, pruned_loss=0.1088, over 2580986.51 frames. ], batch size: 33, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:48:45,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=152603.0, ans=0.09899494936611666 2024-06-20 06:48:49,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=152621.33333333334, ans=0.125 2024-06-20 06:48:57,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.68 vs. limit=10.0 2024-06-20 06:49:00,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=152639.66666666666, ans=0.125 2024-06-20 06:49:00,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.94 vs. limit=10.0 2024-06-20 06:49:02,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152658.0, ans=0.1 2024-06-20 06:49:05,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0 2024-06-20 06:49:08,692 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.41 vs. limit=15.0 2024-06-20 06:49:12,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=12.0 2024-06-20 06:49:14,830 INFO [train.py:1028] (1/2) Epoch 9, batch 2350, loss[loss=0.242, simple_loss=0.2885, pruned_loss=0.09778, over 13199.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.2973, pruned_loss=0.1087, over 2584154.08 frames. ], batch size: 67, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:49:19,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=152694.66666666666, ans=0.125 2024-06-20 06:49:20,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=152694.66666666666, ans=0.0 2024-06-20 06:49:24,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.26 vs. limit=12.0 2024-06-20 06:49:26,867 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.39 vs. 
limit=15.0 2024-06-20 06:49:27,014 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.895e+02 2.085e+02 2.312e+02 2.820e+02, threshold=4.170e+02, percent-clipped=0.0 2024-06-20 06:49:31,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=152731.33333333334, ans=0.0 2024-06-20 06:49:37,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=152749.66666666666, ans=0.0 2024-06-20 06:49:43,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=152768.0, ans=0.125 2024-06-20 06:49:43,601 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.40 vs. limit=15.0 2024-06-20 06:49:50,076 INFO [train.py:1028] (1/2) Epoch 9, batch 2400, loss[loss=0.2503, simple_loss=0.2879, pruned_loss=0.1063, over 13301.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.2964, pruned_loss=0.1087, over 2587485.12 frames. ], batch size: 46, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:50:10,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=152823.0, ans=0.125 2024-06-20 06:50:12,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=152841.33333333334, ans=0.0 2024-06-20 06:50:19,653 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.58 vs. limit=22.5 2024-06-20 06:50:24,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=152878.0, ans=0.0 2024-06-20 06:50:24,768 INFO [train.py:1028] (1/2) Epoch 9, batch 2450, loss[loss=0.2341, simple_loss=0.273, pruned_loss=0.09755, over 13290.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.2952, pruned_loss=0.1086, over 2583513.22 frames. ], batch size: 63, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:50:26,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=152878.0, ans=0.0 2024-06-20 06:50:30,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=152896.33333333334, ans=0.125 2024-06-20 06:50:36,990 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.896e+02 2.041e+02 2.242e+02 2.984e+02, threshold=4.081e+02, percent-clipped=0.0 2024-06-20 06:50:40,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=152914.66666666666, ans=0.125 2024-06-20 06:50:42,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=152914.66666666666, ans=0.125 2024-06-20 06:50:55,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=152951.33333333334, ans=0.125 2024-06-20 06:50:57,377 INFO [train.py:1028] (1/2) Epoch 9, batch 2500, loss[loss=0.2559, simple_loss=0.2833, pruned_loss=0.1143, over 13147.00 frames. ], tot_loss[loss=0.256, simple_loss=0.2947, pruned_loss=0.1086, over 2586493.17 frames. 
], batch size: 83, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:51:00,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=152969.66666666666, ans=0.125 2024-06-20 06:51:00,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=152969.66666666666, ans=0.0 2024-06-20 06:51:05,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152988.0, ans=0.1 2024-06-20 06:51:18,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.70 vs. limit=22.5 2024-06-20 06:51:24,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=153043.0, ans=0.0 2024-06-20 06:51:25,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.09 vs. limit=10.0 2024-06-20 06:51:29,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=153043.0, ans=0.025 2024-06-20 06:51:30,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.73 vs. limit=12.0 2024-06-20 06:51:30,218 INFO [train.py:1028] (1/2) Epoch 9, batch 2550, loss[loss=0.2583, simple_loss=0.3032, pruned_loss=0.1067, over 12529.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.293, pruned_loss=0.1077, over 2585776.78 frames. ], batch size: 22, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:51:33,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=153061.33333333334, ans=0.025 2024-06-20 06:51:34,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=153061.33333333334, ans=0.0 2024-06-20 06:51:45,465 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.891e+02 2.094e+02 2.348e+02 3.499e+02, threshold=4.189e+02, percent-clipped=0.0 2024-06-20 06:51:46,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=153098.0, ans=0.0 2024-06-20 06:51:56,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=153116.33333333334, ans=0.025 2024-06-20 06:51:59,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=153116.33333333334, ans=0.125 2024-06-20 06:52:03,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=153134.66666666666, ans=0.125 2024-06-20 06:52:04,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.73 vs. limit=22.5 2024-06-20 06:52:05,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.28 vs. limit=6.0 2024-06-20 06:52:09,493 INFO [train.py:1028] (1/2) Epoch 9, batch 2600, loss[loss=0.2513, simple_loss=0.2909, pruned_loss=0.1058, over 13292.00 frames. 
], tot_loss[loss=0.2529, simple_loss=0.2915, pruned_loss=0.1072, over 2586694.02 frames. ], batch size: 52, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:52:11,622 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.595e-02 2024-06-20 06:52:20,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=153171.33333333334, ans=0.2 2024-06-20 06:52:21,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153171.33333333334, ans=0.1 2024-06-20 06:52:37,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=153226.33333333334, ans=0.125 2024-06-20 06:52:42,330 INFO [train.py:1028] (1/2) Epoch 9, batch 2650, loss[loss=0.2313, simple_loss=0.264, pruned_loss=0.09932, over 13049.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.2899, pruned_loss=0.1064, over 2587399.73 frames. ], batch size: 144, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:52:54,520 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.864e+02 2.033e+02 2.277e+02 2.772e+02, threshold=4.065e+02, percent-clipped=0.0 2024-06-20 06:52:55,435 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=22.5 2024-06-20 06:52:58,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153281.33333333334, ans=0.1 2024-06-20 06:53:02,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=153299.66666666666, ans=0.5 2024-06-20 06:53:06,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0 2024-06-20 06:53:14,627 INFO [train.py:1028] (1/2) Epoch 9, batch 2700, loss[loss=0.253, simple_loss=0.2824, pruned_loss=0.1118, over 13261.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.2883, pruned_loss=0.1058, over 2584662.38 frames. ], batch size: 89, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:53:24,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=153354.66666666666, ans=0.07 2024-06-20 06:53:24,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=153354.66666666666, ans=0.125 2024-06-20 06:53:26,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.66 vs. limit=22.5 2024-06-20 06:53:26,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.83 vs. limit=15.0 2024-06-20 06:53:27,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=153373.0, ans=0.125 2024-06-20 06:53:32,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.95 vs. 
limit=10.0 2024-06-20 06:53:52,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=153428.0, ans=0.125 2024-06-20 06:53:53,538 INFO [train.py:1028] (1/2) Epoch 9, batch 2750, loss[loss=0.2595, simple_loss=0.2937, pruned_loss=0.1127, over 13256.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2873, pruned_loss=0.1052, over 2580936.93 frames. ], batch size: 43, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:54:02,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.36 vs. limit=15.0 2024-06-20 06:54:05,982 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.774e+02 1.902e+02 2.055e+02 2.972e+02, threshold=3.803e+02, percent-clipped=0.0 2024-06-20 06:54:06,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.57 vs. limit=15.0 2024-06-20 06:54:06,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=153464.66666666666, ans=0.125 2024-06-20 06:54:13,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=153483.0, ans=0.125 2024-06-20 06:54:22,076 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.04 vs. limit=22.5 2024-06-20 06:54:23,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=153501.33333333334, ans=0.025 2024-06-20 06:54:26,638 INFO [train.py:1028] (1/2) Epoch 9, batch 2800, loss[loss=0.2398, simple_loss=0.2664, pruned_loss=0.1066, over 10794.00 frames. ], tot_loss[loss=0.248, simple_loss=0.2863, pruned_loss=0.1049, over 2579116.55 frames. ], batch size: 303, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:54:28,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=153519.66666666666, ans=0.125 2024-06-20 06:54:37,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=153538.0, ans=0.125 2024-06-20 06:54:44,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.65 vs. limit=15.0 2024-06-20 06:54:58,298 INFO [train.py:1028] (1/2) Epoch 9, batch 2850, loss[loss=0.2476, simple_loss=0.2846, pruned_loss=0.1053, over 13339.00 frames. ], tot_loss[loss=0.247, simple_loss=0.285, pruned_loss=0.1045, over 2577050.65 frames. ], batch size: 49, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:54:58,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.50 vs. 
limit=22.5 2024-06-20 06:55:10,602 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.766e+02 1.919e+02 2.055e+02 2.610e+02, threshold=3.839e+02, percent-clipped=0.0 2024-06-20 06:55:22,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=153666.33333333334, ans=0.04949747468305833 2024-06-20 06:55:25,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=153666.33333333334, ans=0.0 2024-06-20 06:55:29,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=153684.66666666666, ans=0.025 2024-06-20 06:55:36,256 INFO [train.py:1028] (1/2) Epoch 9, batch 2900, loss[loss=0.2464, simple_loss=0.2884, pruned_loss=0.1022, over 13101.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.2832, pruned_loss=0.1036, over 2584734.45 frames. ], batch size: 55, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:55:36,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153703.0, ans=0.1 2024-06-20 06:55:56,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=153758.0, ans=0.025 2024-06-20 06:56:09,440 INFO [train.py:1028] (1/2) Epoch 9, batch 2950, loss[loss=0.2101, simple_loss=0.2497, pruned_loss=0.08521, over 13234.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.2837, pruned_loss=0.1039, over 2579602.29 frames. ], batch size: 43, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:56:15,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=153794.66666666666, ans=0.2 2024-06-20 06:56:18,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153813.0, ans=0.1 2024-06-20 06:56:21,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.83 vs. limit=15.0 2024-06-20 06:56:22,639 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.756e+02 1.864e+02 2.050e+02 3.041e+02, threshold=3.727e+02, percent-clipped=0.0 2024-06-20 06:56:32,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=153849.66666666666, ans=0.2 2024-06-20 06:56:35,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=153868.0, ans=0.2 2024-06-20 06:56:38,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.40 vs. limit=12.0 2024-06-20 06:56:43,374 INFO [train.py:1028] (1/2) Epoch 9, batch 3000, loss[loss=0.2346, simple_loss=0.2755, pruned_loss=0.09691, over 13186.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.2827, pruned_loss=0.1034, over 2577778.76 frames. ], batch size: 59, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:56:43,374 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 06:56:51,191 INFO [train.py:1060] (1/2) Epoch 9, validation: loss=0.2027, simple_loss=0.2657, pruned_loss=0.06989, over 351949.00 frames. 
2024-06-20 06:56:51,192 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 06:57:02,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=153904.66666666666, ans=0.125 2024-06-20 06:57:18,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153941.33333333334, ans=0.125 2024-06-20 06:57:21,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=153959.66666666666, ans=15.0 2024-06-20 06:57:22,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=153959.66666666666, ans=0.125 2024-06-20 06:57:27,855 INFO [train.py:1028] (1/2) Epoch 9, batch 3050, loss[loss=0.2258, simple_loss=0.264, pruned_loss=0.09383, over 13292.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2815, pruned_loss=0.1032, over 2577593.40 frames. ], batch size: 46, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:57:48,355 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.787e+02 1.932e+02 2.125e+02 2.848e+02, threshold=3.865e+02, percent-clipped=0.0 2024-06-20 06:57:50,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=154014.66666666666, ans=0.0 2024-06-20 06:57:52,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=154014.66666666666, ans=0.2 2024-06-20 06:57:54,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.97 vs. limit=15.0 2024-06-20 06:58:02,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=154051.33333333334, ans=0.125 2024-06-20 06:58:08,571 INFO [train.py:1028] (1/2) Epoch 9, batch 3100, loss[loss=0.2396, simple_loss=0.2694, pruned_loss=0.1049, over 13025.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.2806, pruned_loss=0.1025, over 2579328.30 frames. ], batch size: 144, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:58:09,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=154069.66666666666, ans=0.0 2024-06-20 06:58:20,140 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.81 vs. limit=15.0 2024-06-20 06:58:27,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=154124.66666666666, ans=0.025 2024-06-20 06:58:38,600 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2024-06-20 06:58:41,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=154161.33333333334, ans=0.5 2024-06-20 06:58:41,513 INFO [train.py:1028] (1/2) Epoch 9, batch 3150, loss[loss=0.2527, simple_loss=0.2876, pruned_loss=0.1089, over 12938.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.2802, pruned_loss=0.1021, over 2581829.01 frames. 
], batch size: 158, lr: 6.60e-03, grad_scale: 64.0 2024-06-20 06:58:41,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=154161.33333333334, ans=0.125 2024-06-20 06:58:44,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.81 vs. limit=12.0 2024-06-20 06:58:51,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=154179.66666666666, ans=0.2 2024-06-20 06:58:51,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154179.66666666666, ans=0.1 2024-06-20 06:58:52,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.94 vs. limit=15.0 2024-06-20 06:58:54,070 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.782e+02 1.917e+02 2.096e+02 3.057e+02, threshold=3.834e+02, percent-clipped=0.0 2024-06-20 06:59:14,402 INFO [train.py:1028] (1/2) Epoch 9, batch 3200, loss[loss=0.2416, simple_loss=0.2851, pruned_loss=0.09905, over 13169.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.2796, pruned_loss=0.1015, over 2581386.91 frames. ], batch size: 55, lr: 6.60e-03, grad_scale: 128.0 2024-06-20 06:59:16,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=154253.0, ans=0.125 2024-06-20 06:59:18,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2024-06-20 06:59:20,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=154271.33333333334, ans=0.0 2024-06-20 06:59:34,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.57 vs. limit=10.0 2024-06-20 06:59:43,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=154308.0, ans=0.0 2024-06-20 06:59:49,728 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=15.0 2024-06-20 06:59:50,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=154326.33333333334, ans=0.125 2024-06-20 06:59:51,308 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.586e+01 2024-06-20 06:59:53,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.18 vs. limit=15.0 2024-06-20 06:59:53,733 INFO [train.py:1028] (1/2) Epoch 9, batch 3250, loss[loss=0.2471, simple_loss=0.2904, pruned_loss=0.1019, over 13282.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.2794, pruned_loss=0.1018, over 2585528.81 frames. 
], batch size: 72, lr: 6.60e-03, grad_scale: 128.0 2024-06-20 07:00:00,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=154363.0, ans=0.0 2024-06-20 07:00:00,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=154363.0, ans=0.015 2024-06-20 07:00:04,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0 2024-06-20 07:00:06,855 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.788e+02 1.947e+02 2.164e+02 2.876e+02, threshold=3.894e+02, percent-clipped=0.0 2024-06-20 07:00:07,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154381.33333333334, ans=0.1 2024-06-20 07:00:14,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154399.66666666666, ans=0.1 2024-06-20 07:00:15,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.29 vs. limit=15.0 2024-06-20 07:00:24,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=154418.0, ans=0.125 2024-06-20 07:00:27,201 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.37 vs. limit=15.0 2024-06-20 07:00:28,882 INFO [train.py:1028] (1/2) Epoch 9, batch 3300, loss[loss=0.2614, simple_loss=0.2928, pruned_loss=0.115, over 12737.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.2793, pruned_loss=0.1018, over 2582622.33 frames. ], batch size: 176, lr: 6.60e-03, grad_scale: 128.0 2024-06-20 07:00:29,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=154436.33333333334, ans=0.125 2024-06-20 07:00:35,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=154454.66666666666, ans=0.125 2024-06-20 07:00:38,909 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.29 vs. limit=22.5 2024-06-20 07:00:40,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=154454.66666666666, ans=0.04949747468305833 2024-06-20 07:00:44,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=154473.0, ans=0.125 2024-06-20 07:00:53,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=12.63 vs. limit=15.0 2024-06-20 07:00:55,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=154509.66666666666, ans=0.04949747468305833 2024-06-20 07:01:02,623 INFO [train.py:1028] (1/2) Epoch 9, batch 3350, loss[loss=0.2394, simple_loss=0.2675, pruned_loss=0.1056, over 12917.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2783, pruned_loss=0.1019, over 2577041.71 frames. 
], batch size: 158, lr: 6.60e-03, grad_scale: 128.0 2024-06-20 07:01:04,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154528.0, ans=0.1 2024-06-20 07:01:11,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=154546.33333333334, ans=0.2 2024-06-20 07:01:19,724 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.827e+02 1.974e+02 2.272e+02 3.567e+02, threshold=3.947e+02, percent-clipped=0.0 2024-06-20 07:01:21,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=154564.66666666666, ans=0.015 2024-06-20 07:01:28,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=154564.66666666666, ans=0.125 2024-06-20 07:01:29,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=154564.66666666666, ans=0.0 2024-06-20 07:01:30,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=154583.0, ans=0.1 2024-06-20 07:01:39,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=154601.33333333334, ans=0.025 2024-06-20 07:01:39,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=154601.33333333334, ans=0.0 2024-06-20 07:01:42,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.36 vs. limit=15.0 2024-06-20 07:01:44,279 INFO [train.py:1028] (1/2) Epoch 9, batch 3400, loss[loss=0.2508, simple_loss=0.2957, pruned_loss=0.1029, over 12570.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.2776, pruned_loss=0.1018, over 2576048.55 frames. ], batch size: 22, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:01:45,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=154619.66666666666, ans=0.125 2024-06-20 07:01:48,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=154619.66666666666, ans=0.2 2024-06-20 07:01:49,056 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.98 vs. limit=12.0 2024-06-20 07:01:49,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=154619.66666666666, ans=0.2 2024-06-20 07:01:59,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=154656.33333333334, ans=0.025 2024-06-20 07:02:01,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=154656.33333333334, ans=0.125 2024-06-20 07:02:17,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.16 vs. limit=15.0 2024-06-20 07:02:17,443 INFO [train.py:1028] (1/2) Epoch 9, batch 3450, loss[loss=0.2458, simple_loss=0.281, pruned_loss=0.1053, over 12816.00 frames. ], tot_loss[loss=0.24, simple_loss=0.277, pruned_loss=0.1015, over 2575572.58 frames. 
], batch size: 177, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:02:29,711 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.772e+02 1.848e+02 2.103e+02 3.399e+02, threshold=3.695e+02, percent-clipped=0.0 2024-06-20 07:02:31,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=154748.0, ans=0.2 2024-06-20 07:02:34,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.61 vs. limit=15.0 2024-06-20 07:02:37,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=154766.33333333334, ans=0.125 2024-06-20 07:02:42,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154766.33333333334, ans=0.1 2024-06-20 07:02:46,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=154784.66666666666, ans=0.125 2024-06-20 07:02:50,488 INFO [train.py:1028] (1/2) Epoch 9, batch 3500, loss[loss=0.2323, simple_loss=0.2747, pruned_loss=0.09492, over 12928.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2772, pruned_loss=0.1016, over 2574574.85 frames. ], batch size: 33, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:02:52,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=154803.0, ans=0.125 2024-06-20 07:03:02,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=154821.33333333334, ans=0.2 2024-06-20 07:03:06,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=154839.66666666666, ans=0.0 2024-06-20 07:03:09,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=154839.66666666666, ans=0.2 2024-06-20 07:03:23,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.93 vs. limit=15.0 2024-06-20 07:03:27,516 INFO [train.py:1028] (1/2) Epoch 9, batch 3550, loss[loss=0.218, simple_loss=0.2578, pruned_loss=0.08913, over 13124.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2765, pruned_loss=0.101, over 2575756.75 frames. ], batch size: 95, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:03:40,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=154913.0, ans=0.025 2024-06-20 07:03:41,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=154913.0, ans=0.125 2024-06-20 07:03:43,311 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.748e+02 1.921e+02 2.140e+02 2.883e+02, threshold=3.841e+02, percent-clipped=0.0 2024-06-20 07:03:54,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=18.51 vs. limit=15.0 2024-06-20 07:04:04,283 INFO [train.py:1028] (1/2) Epoch 9, batch 3600, loss[loss=0.2248, simple_loss=0.2653, pruned_loss=0.09215, over 13278.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2765, pruned_loss=0.1013, over 2579064.28 frames. 
], batch size: 49, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:04:10,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=155004.66666666666, ans=0.04949747468305833 2024-06-20 07:04:25,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=155041.33333333334, ans=0.125 2024-06-20 07:04:30,886 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.981e+01 2024-06-20 07:04:32,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=155059.66666666666, ans=0.07 2024-06-20 07:04:34,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=155059.66666666666, ans=0.0 2024-06-20 07:04:36,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.82 vs. limit=15.0 2024-06-20 07:04:37,468 INFO [train.py:1028] (1/2) Epoch 9, batch 3650, loss[loss=0.2302, simple_loss=0.2627, pruned_loss=0.09886, over 13033.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.2762, pruned_loss=0.101, over 2578729.61 frames. ], batch size: 102, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:04:49,823 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.717e+02 1.876e+02 2.016e+02 2.652e+02, threshold=3.753e+02, percent-clipped=0.0 2024-06-20 07:04:52,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0 2024-06-20 07:04:52,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155114.66666666666, ans=0.125 2024-06-20 07:04:52,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=155114.66666666666, ans=0.125 2024-06-20 07:05:06,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=155151.33333333334, ans=0.125 2024-06-20 07:05:10,453 INFO [train.py:1028] (1/2) Epoch 9, batch 3700, loss[loss=0.2352, simple_loss=0.2759, pruned_loss=0.0972, over 13247.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.2751, pruned_loss=0.1003, over 2583883.68 frames. ], batch size: 72, lr: 6.58e-03, grad_scale: 128.0 2024-06-20 07:05:14,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155169.66666666666, ans=0.125 2024-06-20 07:05:21,277 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.29 vs. limit=15.0 2024-06-20 07:05:38,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=155224.66666666666, ans=0.0 2024-06-20 07:05:41,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=155243.0, ans=0.0 2024-06-20 07:05:41,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.47 vs. 
limit=15.0 2024-06-20 07:05:46,414 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2024-06-20 07:05:47,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=155243.0, ans=10.0 2024-06-20 07:05:49,313 INFO [train.py:1028] (1/2) Epoch 9, batch 3750, loss[loss=0.2477, simple_loss=0.2908, pruned_loss=0.1023, over 12816.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.2745, pruned_loss=0.09996, over 2586963.44 frames. ], batch size: 22, lr: 6.58e-03, grad_scale: 128.0 2024-06-20 07:05:57,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=155279.66666666666, ans=0.025 2024-06-20 07:06:01,801 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.807e+02 1.963e+02 2.203e+02 3.308e+02, threshold=3.925e+02, percent-clipped=0.0 2024-06-20 07:06:02,173 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.06 vs. limit=12.0 2024-06-20 07:06:07,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=155298.0, ans=0.2 2024-06-20 07:06:20,273 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.47 vs. limit=15.0 2024-06-20 07:06:21,873 INFO [train.py:1028] (1/2) Epoch 9, batch 3800, loss[loss=0.2378, simple_loss=0.2745, pruned_loss=0.1006, over 13176.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2744, pruned_loss=0.09972, over 2584236.10 frames. ], batch size: 83, lr: 6.58e-03, grad_scale: 128.0 2024-06-20 07:06:39,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=155389.66666666666, ans=0.2 2024-06-20 07:06:41,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=155408.0, ans=0.2 2024-06-20 07:06:44,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=155408.0, ans=0.0 2024-06-20 07:06:46,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=155408.0, ans=0.95 2024-06-20 07:06:51,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=155426.33333333334, ans=0.0 2024-06-20 07:06:55,131 INFO [train.py:1028] (1/2) Epoch 9, batch 3850, loss[loss=0.2269, simple_loss=0.2624, pruned_loss=0.09574, over 13020.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.2738, pruned_loss=0.09916, over 2582440.47 frames. ], batch size: 144, lr: 6.58e-03, grad_scale: 128.0 2024-06-20 07:06:56,080 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.19 vs. 
limit=12.0 2024-06-20 07:06:57,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=155444.66666666666, ans=0.5 2024-06-20 07:07:00,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=155444.66666666666, ans=0.125 2024-06-20 07:07:07,487 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.769e+02 1.948e+02 2.324e+02 3.753e+02, threshold=3.897e+02, percent-clipped=0.0 2024-06-20 07:07:09,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2024-06-20 07:07:17,275 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=20.77 vs. limit=15.0 2024-06-20 07:07:21,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=155518.0, ans=0.2 2024-06-20 07:07:26,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=155518.0, ans=0.2 2024-06-20 07:07:27,303 INFO [train.py:1028] (1/2) Epoch 9, batch 3900, loss[loss=0.2381, simple_loss=0.2704, pruned_loss=0.1029, over 13203.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2732, pruned_loss=0.09894, over 2585834.71 frames. ], batch size: 83, lr: 6.58e-03, grad_scale: 64.0 2024-06-20 07:07:33,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=155536.33333333334, ans=0.125 2024-06-20 07:07:34,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155536.33333333334, ans=0.1 2024-06-20 07:07:47,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=155573.0, ans=0.0 2024-06-20 07:07:50,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155591.33333333334, ans=0.1 2024-06-20 07:07:50,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=155591.33333333334, ans=0.2 2024-06-20 07:07:57,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=155591.33333333334, ans=0.0 2024-06-20 07:08:01,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155609.66666666666, ans=0.1 2024-06-20 07:08:02,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=155609.66666666666, ans=0.125 2024-06-20 07:08:02,682 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.92 vs. limit=6.0 2024-06-20 07:08:06,591 INFO [train.py:1028] (1/2) Epoch 9, batch 3950, loss[loss=0.2412, simple_loss=0.2686, pruned_loss=0.1069, over 13106.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2721, pruned_loss=0.09799, over 2589118.88 frames. 
], batch size: 132, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:08:10,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.37 vs. limit=15.0 2024-06-20 07:08:17,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=155646.33333333334, ans=0.2 2024-06-20 07:08:19,463 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.759e+02 1.860e+02 2.045e+02 2.627e+02, threshold=3.720e+02, percent-clipped=0.0 2024-06-20 07:08:20,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=155664.66666666666, ans=0.025 2024-06-20 07:08:25,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=155683.0, ans=0.125 2024-06-20 07:08:30,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=155683.0, ans=0.0 2024-06-20 07:08:30,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155683.0, ans=0.1 2024-06-20 07:08:36,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=155701.33333333334, ans=0.125 2024-06-20 07:08:39,210 INFO [train.py:1028] (1/2) Epoch 9, batch 4000, loss[loss=0.225, simple_loss=0.2692, pruned_loss=0.0904, over 13011.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2717, pruned_loss=0.0981, over 2584417.59 frames. ], batch size: 39, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:08:44,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=155719.66666666666, ans=0.5 2024-06-20 07:08:45,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=155738.0, ans=0.125 2024-06-20 07:08:53,914 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.57 vs. limit=15.0 2024-06-20 07:09:03,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=155774.66666666666, ans=0.0 2024-06-20 07:09:04,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.67 vs. limit=10.0 2024-06-20 07:09:06,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.77 vs. limit=15.0 2024-06-20 07:09:07,830 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:09:11,895 INFO [train.py:1028] (1/2) Epoch 9, batch 4050, loss[loss=0.2511, simple_loss=0.2809, pruned_loss=0.1107, over 10879.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2715, pruned_loss=0.09825, over 2581613.86 frames. 
], batch size: 303, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:09:24,727 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.783e+02 1.918e+02 2.174e+02 3.057e+02, threshold=3.836e+02, percent-clipped=0.0 2024-06-20 07:09:39,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=155866.33333333334, ans=0.2 2024-06-20 07:09:47,708 INFO [train.py:1028] (1/2) Epoch 9, batch 4100, loss[loss=0.2443, simple_loss=0.2757, pruned_loss=0.1065, over 13025.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.2718, pruned_loss=0.09861, over 2577214.36 frames. ], batch size: 102, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:09:48,117 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.78 vs. limit=15.0 2024-06-20 07:09:56,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=155903.0, ans=0.125 2024-06-20 07:09:57,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=155921.33333333334, ans=0.2 2024-06-20 07:10:04,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155939.66666666666, ans=0.125 2024-06-20 07:10:06,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=155939.66666666666, ans=0.07 2024-06-20 07:10:10,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.13 vs. limit=15.0 2024-06-20 07:10:13,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=155958.0, ans=0.035 2024-06-20 07:10:23,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.62 vs. limit=15.0 2024-06-20 07:10:24,797 INFO [train.py:1028] (1/2) Epoch 9, batch 4150, loss[loss=0.2419, simple_loss=0.278, pruned_loss=0.1029, over 13164.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2716, pruned_loss=0.09846, over 2575282.93 frames. ], batch size: 55, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:10:30,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=156013.0, ans=0.95 2024-06-20 07:10:37,676 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.804e+02 2.073e+02 2.365e+02 3.487e+02, threshold=4.145e+02, percent-clipped=0.0 2024-06-20 07:10:41,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2024-06-20 07:10:43,348 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.75 vs. 
limit=15.0 2024-06-20 07:10:43,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=156049.66666666666, ans=0.025 2024-06-20 07:10:44,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156049.66666666666, ans=0.1 2024-06-20 07:10:50,535 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0 2024-06-20 07:10:55,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=156068.0, ans=0.125 2024-06-20 07:10:57,510 INFO [train.py:1028] (1/2) Epoch 9, batch 4200, loss[loss=0.2509, simple_loss=0.2747, pruned_loss=0.1135, over 12999.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2705, pruned_loss=0.09813, over 2578189.75 frames. ], batch size: 102, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:11:03,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=156104.66666666666, ans=0.2 2024-06-20 07:11:05,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=156104.66666666666, ans=6.0 2024-06-20 07:11:06,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=156104.66666666666, ans=0.025 2024-06-20 07:11:14,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.02 vs. limit=22.5 2024-06-20 07:11:20,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=156141.33333333334, ans=0.125 2024-06-20 07:11:25,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=156159.66666666666, ans=0.04949747468305833 2024-06-20 07:11:32,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156159.66666666666, ans=0.1 2024-06-20 07:11:34,610 INFO [train.py:1028] (1/2) Epoch 9, batch 4250, loss[loss=0.2288, simple_loss=0.277, pruned_loss=0.09031, over 13360.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2698, pruned_loss=0.09763, over 2580493.86 frames. ], batch size: 46, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:11:48,128 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.798e+02 2.015e+02 2.246e+02 3.303e+02, threshold=4.030e+02, percent-clipped=0.0 2024-06-20 07:11:49,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=156214.66666666666, ans=0.0 2024-06-20 07:12:11,086 INFO [train.py:1028] (1/2) Epoch 9, batch 4300, loss[loss=0.23, simple_loss=0.2668, pruned_loss=0.09655, over 13171.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2698, pruned_loss=0.09761, over 2580367.60 frames. 
], batch size: 59, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:12:25,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=156306.33333333334, ans=0.0 2024-06-20 07:12:27,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=156306.33333333334, ans=0.0 2024-06-20 07:12:31,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=156324.66666666666, ans=0.125 2024-06-20 07:12:42,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=156361.33333333334, ans=0.0 2024-06-20 07:12:43,232 INFO [train.py:1028] (1/2) Epoch 9, batch 4350, loss[loss=0.2198, simple_loss=0.2543, pruned_loss=0.09261, over 13215.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2695, pruned_loss=0.09752, over 2585241.15 frames. ], batch size: 59, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:12:45,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=156361.33333333334, ans=0.125 2024-06-20 07:12:45,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=156361.33333333334, ans=0.125 2024-06-20 07:12:47,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=156361.33333333334, ans=0.125 2024-06-20 07:12:55,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=12.0 2024-06-20 07:12:56,082 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.714e+02 1.879e+02 2.163e+02 3.055e+02, threshold=3.758e+02, percent-clipped=0.0 2024-06-20 07:13:13,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=156434.66666666666, ans=0.0 2024-06-20 07:13:15,725 INFO [train.py:1028] (1/2) Epoch 9, batch 4400, loss[loss=0.2389, simple_loss=0.274, pruned_loss=0.1019, over 13232.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2693, pruned_loss=0.09745, over 2586360.49 frames. ], batch size: 83, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:13:21,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=156471.33333333334, ans=0.025 2024-06-20 07:13:22,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=15.0 2024-06-20 07:13:55,408 INFO [train.py:1028] (1/2) Epoch 9, batch 4450, loss[loss=0.2195, simple_loss=0.2637, pruned_loss=0.08761, over 12923.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2695, pruned_loss=0.09755, over 2581191.84 frames. 
], batch size: 33, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:13:58,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=156544.66666666666, ans=0.0 2024-06-20 07:14:08,574 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.689e+02 1.812e+02 2.050e+02 2.708e+02, threshold=3.624e+02, percent-clipped=0.0 2024-06-20 07:14:12,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=156581.33333333334, ans=0.025 2024-06-20 07:14:28,529 INFO [train.py:1028] (1/2) Epoch 9, batch 4500, loss[loss=0.2196, simple_loss=0.2589, pruned_loss=0.09011, over 13278.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2688, pruned_loss=0.09734, over 2585617.59 frames. ], batch size: 89, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:14:32,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=156636.33333333334, ans=0.125 2024-06-20 07:14:36,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=156654.66666666666, ans=0.2 2024-06-20 07:14:46,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=156673.0, ans=0.125 2024-06-20 07:14:56,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156709.66666666666, ans=0.1 2024-06-20 07:15:01,141 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.96 vs. limit=15.0 2024-06-20 07:15:02,027 INFO [train.py:1028] (1/2) Epoch 9, batch 4550, loss[loss=0.2474, simple_loss=0.2794, pruned_loss=0.1077, over 13261.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2686, pruned_loss=0.0973, over 2589618.04 frames. ], batch size: 52, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:15:09,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=156746.33333333334, ans=0.025 2024-06-20 07:15:11,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=156746.33333333334, ans=0.0 2024-06-20 07:15:14,963 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.754e+02 1.883e+02 2.084e+02 3.556e+02, threshold=3.766e+02, percent-clipped=0.0 2024-06-20 07:15:21,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=156783.0, ans=0.125 2024-06-20 07:15:21,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=156783.0, ans=0.025 2024-06-20 07:15:34,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=156801.33333333334, ans=0.025 2024-06-20 07:15:38,613 INFO [train.py:1028] (1/2) Epoch 9, batch 4600, loss[loss=0.2485, simple_loss=0.2761, pruned_loss=0.1105, over 12514.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2678, pruned_loss=0.0966, over 2585014.73 frames. 
], batch size: 202, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:15:43,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=156819.66666666666, ans=0.125 2024-06-20 07:15:44,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=156819.66666666666, ans=0.025 2024-06-20 07:15:48,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156838.0, ans=0.1 2024-06-20 07:15:49,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.98 vs. limit=22.5 2024-06-20 07:15:50,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=156838.0, ans=0.04949747468305833 2024-06-20 07:16:01,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=156874.66666666666, ans=0.5 2024-06-20 07:16:05,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.42 vs. limit=15.0 2024-06-20 07:16:14,914 INFO [train.py:1028] (1/2) Epoch 9, batch 4650, loss[loss=0.2169, simple_loss=0.2459, pruned_loss=0.09396, over 13111.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2674, pruned_loss=0.09659, over 2589350.03 frames. ], batch size: 132, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:16:19,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156911.33333333334, ans=0.1 2024-06-20 07:16:27,692 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.700e+02 1.876e+02 2.101e+02 3.110e+02, threshold=3.751e+02, percent-clipped=0.0 2024-06-20 07:16:36,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=156966.33333333334, ans=0.0 2024-06-20 07:16:38,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.29 vs. limit=6.0 2024-06-20 07:16:39,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=156966.33333333334, ans=0.125 2024-06-20 07:16:42,159 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=12.0 2024-06-20 07:16:43,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=156984.66666666666, ans=0.0 2024-06-20 07:16:45,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=156984.66666666666, ans=0.0 2024-06-20 07:16:48,012 INFO [train.py:1028] (1/2) Epoch 9, batch 4700, loss[loss=0.2397, simple_loss=0.2831, pruned_loss=0.09814, over 12422.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2686, pruned_loss=0.09714, over 2584203.96 frames. 
], batch size: 25, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:16:55,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=157021.33333333334, ans=0.025 2024-06-20 07:17:01,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=157039.66666666666, ans=0.125 2024-06-20 07:17:11,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=157058.0, ans=15.0 2024-06-20 07:17:17,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=23.18 vs. limit=15.0 2024-06-20 07:17:21,785 INFO [train.py:1028] (1/2) Epoch 9, batch 4750, loss[loss=0.2687, simple_loss=0.2961, pruned_loss=0.1206, over 12619.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.268, pruned_loss=0.09689, over 2581096.89 frames. ], batch size: 202, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:17:21,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=157094.66666666666, ans=0.125 2024-06-20 07:17:22,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=157094.66666666666, ans=0.0 2024-06-20 07:17:38,330 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.853e+02 2.020e+02 2.304e+02 3.405e+02, threshold=4.040e+02, percent-clipped=0.0 2024-06-20 07:17:41,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=157131.33333333334, ans=0.125 2024-06-20 07:17:42,553 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=7.831e+01 2024-06-20 07:17:44,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=157149.66666666666, ans=0.2 2024-06-20 07:17:56,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=157168.0, ans=0.0 2024-06-20 07:18:02,240 INFO [train.py:1028] (1/2) Epoch 9, batch 4800, loss[loss=0.2289, simple_loss=0.2661, pruned_loss=0.09587, over 13230.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2682, pruned_loss=0.0971, over 2577352.61 frames. ], batch size: 63, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:18:09,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=157204.66666666666, ans=0.0 2024-06-20 07:18:12,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=157204.66666666666, ans=0.0 2024-06-20 07:18:12,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=157204.66666666666, ans=0.1 2024-06-20 07:18:17,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157223.0, ans=0.1 2024-06-20 07:18:24,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=157241.33333333334, ans=0.125 2024-06-20 07:18:34,332 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.16 vs. 
limit=15.0 2024-06-20 07:18:34,570 INFO [train.py:1028] (1/2) Epoch 9, batch 4850, loss[loss=0.2429, simple_loss=0.2758, pruned_loss=0.1049, over 13236.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2681, pruned_loss=0.09694, over 2574489.94 frames. ], batch size: 89, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:18:48,112 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.775e+02 1.970e+02 2.198e+02 2.876e+02, threshold=3.940e+02, percent-clipped=0.0 2024-06-20 07:18:50,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.24 vs. limit=22.5 2024-06-20 07:18:53,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.77 vs. limit=15.0 2024-06-20 07:18:53,808 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.40 vs. limit=15.0 2024-06-20 07:19:04,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=157351.33333333334, ans=0.125 2024-06-20 07:19:06,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.86 vs. limit=15.0 2024-06-20 07:19:06,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=157351.33333333334, ans=6.0 2024-06-20 07:19:07,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157369.66666666666, ans=0.1 2024-06-20 07:19:08,319 INFO [train.py:1028] (1/2) Epoch 9, batch 4900, loss[loss=0.224, simple_loss=0.2609, pruned_loss=0.09356, over 13220.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2677, pruned_loss=0.0969, over 2575135.00 frames. ], batch size: 59, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:19:09,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=157369.66666666666, ans=0.2 2024-06-20 07:19:09,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=157369.66666666666, ans=0.025 2024-06-20 07:19:12,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.51 vs. limit=15.0 2024-06-20 07:19:35,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=157424.66666666666, ans=0.0 2024-06-20 07:19:37,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.06 vs. limit=15.0 2024-06-20 07:19:44,997 INFO [train.py:1028] (1/2) Epoch 9, batch 4950, loss[loss=0.2467, simple_loss=0.2667, pruned_loss=0.1134, over 11176.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.268, pruned_loss=0.09747, over 2568757.56 frames. 
], batch size: 304, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:20:01,100 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.720e+02 1.891e+02 2.183e+02 3.125e+02, threshold=3.783e+02, percent-clipped=0.0 2024-06-20 07:20:05,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=157498.0, ans=0.07 2024-06-20 07:20:07,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=157516.33333333334, ans=0.1 2024-06-20 07:20:07,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=22.5 2024-06-20 07:20:09,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=157516.33333333334, ans=0.0 2024-06-20 07:20:16,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=157534.66666666666, ans=0.125 2024-06-20 07:20:20,859 INFO [train.py:1028] (1/2) Epoch 9, batch 5000, loss[loss=0.2258, simple_loss=0.2631, pruned_loss=0.09422, over 13183.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2677, pruned_loss=0.09706, over 2573346.14 frames. ], batch size: 95, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:20:28,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=157571.33333333334, ans=0.0 2024-06-20 07:20:31,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=157571.33333333334, ans=0.1 2024-06-20 07:20:39,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=157589.66666666666, ans=0.09899494936611666 2024-06-20 07:20:54,341 INFO [train.py:1028] (1/2) Epoch 9, batch 5050, loss[loss=0.2316, simple_loss=0.2725, pruned_loss=0.09536, over 12918.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2681, pruned_loss=0.09729, over 2571424.99 frames. ], batch size: 36, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:20:55,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.48 vs. limit=22.5 2024-06-20 07:20:56,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-20 07:21:07,629 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.937e+02 2.178e+02 2.505e+02 3.328e+02, threshold=4.356e+02, percent-clipped=0.0 2024-06-20 07:21:10,764 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.12 vs. 
limit=15.0 2024-06-20 07:21:15,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=157699.66666666666, ans=0.2 2024-06-20 07:21:17,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=157699.66666666666, ans=0.025 2024-06-20 07:21:17,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=157699.66666666666, ans=0.125 2024-06-20 07:21:17,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=157699.66666666666, ans=0.125 2024-06-20 07:21:18,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=157699.66666666666, ans=0.0 2024-06-20 07:21:30,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=157736.33333333334, ans=0.0 2024-06-20 07:21:31,270 INFO [train.py:1028] (1/2) Epoch 9, batch 5100, loss[loss=0.2201, simple_loss=0.2623, pruned_loss=0.08901, over 12942.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2675, pruned_loss=0.09719, over 2567739.13 frames. ], batch size: 39, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:21:39,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=157754.66666666666, ans=0.0 2024-06-20 07:21:43,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=157773.0, ans=0.125 2024-06-20 07:22:01,057 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2024-06-20 07:22:07,055 INFO [train.py:1028] (1/2) Epoch 9, batch 5150, loss[loss=0.2262, simple_loss=0.2587, pruned_loss=0.09691, over 13121.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2674, pruned_loss=0.09729, over 2570710.23 frames. ], batch size: 132, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:22:07,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=157828.0, ans=0.125 2024-06-20 07:22:10,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=157828.0, ans=0.0 2024-06-20 07:22:20,610 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.721e+02 1.915e+02 2.125e+02 2.649e+02, threshold=3.831e+02, percent-clipped=0.0 2024-06-20 07:22:39,946 INFO [train.py:1028] (1/2) Epoch 9, batch 5200, loss[loss=0.2273, simple_loss=0.2692, pruned_loss=0.09269, over 13136.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2671, pruned_loss=0.09691, over 2574391.63 frames. 
], batch size: 95, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:22:40,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=157919.66666666666, ans=0.0 2024-06-20 07:22:47,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=157938.0, ans=0.125 2024-06-20 07:22:48,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=157938.0, ans=0.0 2024-06-20 07:22:52,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.83 vs. limit=15.0 2024-06-20 07:22:53,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=157956.33333333334, ans=0.0 2024-06-20 07:23:03,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=157974.66666666666, ans=0.125 2024-06-20 07:23:13,440 INFO [train.py:1028] (1/2) Epoch 9, batch 5250, loss[loss=0.2022, simple_loss=0.2464, pruned_loss=0.07901, over 13229.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2676, pruned_loss=0.0972, over 2572378.14 frames. ], batch size: 52, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:23:17,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=158011.33333333334, ans=0.125 2024-06-20 07:23:28,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=158029.66666666666, ans=0.0 2024-06-20 07:23:29,929 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.761e+02 1.897e+02 2.172e+02 2.687e+02, threshold=3.794e+02, percent-clipped=0.0 2024-06-20 07:23:32,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=158048.0, ans=0.125 2024-06-20 07:23:33,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=158048.0, ans=0.025 2024-06-20 07:23:35,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=158048.0, ans=0.125 2024-06-20 07:23:38,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=158066.33333333334, ans=0.0 2024-06-20 07:23:40,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=158066.33333333334, ans=0.0 2024-06-20 07:23:42,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=158084.66666666666, ans=0.07 2024-06-20 07:23:42,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.32 vs. limit=15.0 2024-06-20 07:23:44,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.58 vs. limit=15.0 2024-06-20 07:23:52,720 INFO [train.py:1028] (1/2) Epoch 9, batch 5300, loss[loss=0.2262, simple_loss=0.2607, pruned_loss=0.09582, over 12991.00 frames. 
], tot_loss[loss=0.23, simple_loss=0.2668, pruned_loss=0.0966, over 2568241.65 frames. ], batch size: 144, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:23:53,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=158103.0, ans=0.2 2024-06-20 07:23:58,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=158103.0, ans=0.2 2024-06-20 07:24:00,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=158121.33333333334, ans=0.125 2024-06-20 07:24:03,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=158121.33333333334, ans=0.125 2024-06-20 07:24:04,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.54 vs. limit=15.0 2024-06-20 07:24:12,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.68 vs. limit=22.5 2024-06-20 07:24:13,755 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=3.760e+01 2024-06-20 07:24:13,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=158158.0, ans=0.125 2024-06-20 07:24:14,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=158158.0, ans=0.125 2024-06-20 07:24:21,937 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.68 vs. limit=15.0 2024-06-20 07:24:23,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=158176.33333333334, ans=0.125 2024-06-20 07:24:25,877 INFO [train.py:1028] (1/2) Epoch 9, batch 5350, loss[loss=0.233, simple_loss=0.2792, pruned_loss=0.09339, over 11280.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2665, pruned_loss=0.09639, over 2574234.23 frames. ], batch size: 16, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:24:26,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=158194.66666666666, ans=0.0 2024-06-20 07:24:36,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=158213.0, ans=0.0 2024-06-20 07:24:38,756 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.732e+02 1.907e+02 2.107e+02 2.808e+02, threshold=3.813e+02, percent-clipped=0.0 2024-06-20 07:24:46,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=158249.66666666666, ans=0.1 2024-06-20 07:24:48,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=158249.66666666666, ans=0.0 2024-06-20 07:24:52,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=27.76 vs. 
limit=15.0 2024-06-20 07:24:54,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158268.0, ans=0.1 2024-06-20 07:24:58,074 INFO [train.py:1028] (1/2) Epoch 9, batch 5400, loss[loss=0.2646, simple_loss=0.2776, pruned_loss=0.1258, over 12227.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2672, pruned_loss=0.09715, over 2566601.28 frames. ], batch size: 240, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:25:03,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-20 07:25:06,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=158304.66666666666, ans=0.125 2024-06-20 07:25:07,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158304.66666666666, ans=0.1 2024-06-20 07:25:22,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.12 vs. limit=15.0 2024-06-20 07:25:22,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=158341.33333333334, ans=0.125 2024-06-20 07:25:31,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=158359.66666666666, ans=0.2 2024-06-20 07:25:33,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=12.43 vs. limit=15.0 2024-06-20 07:25:34,373 INFO [train.py:1028] (1/2) Epoch 9, batch 5450, loss[loss=0.2242, simple_loss=0.2666, pruned_loss=0.09094, over 12376.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2674, pruned_loss=0.09677, over 2571240.98 frames. ], batch size: 25, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:25:50,885 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.747e+02 1.931e+02 2.301e+02 3.615e+02, threshold=3.863e+02, percent-clipped=0.0 2024-06-20 07:25:58,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=158433.0, ans=0.2 2024-06-20 07:25:59,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=158433.0, ans=0.125 2024-06-20 07:26:00,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=158433.0, ans=0.125 2024-06-20 07:26:10,831 INFO [train.py:1028] (1/2) Epoch 9, batch 5500, loss[loss=0.2708, simple_loss=0.2903, pruned_loss=0.1257, over 12237.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2674, pruned_loss=0.09689, over 2564262.52 frames. ], batch size: 240, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:26:11,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.22 vs. limit=22.5 2024-06-20 07:26:12,368 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.27 vs. 
limit=22.5 2024-06-20 07:26:17,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=158488.0, ans=0.0 2024-06-20 07:26:26,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=158506.33333333334, ans=0.2 2024-06-20 07:26:28,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158506.33333333334, ans=0.1 2024-06-20 07:26:35,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158524.66666666666, ans=0.1 2024-06-20 07:26:42,243 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:26:43,971 INFO [train.py:1028] (1/2) Epoch 9, batch 5550, loss[loss=0.231, simple_loss=0.2773, pruned_loss=0.09237, over 13291.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2668, pruned_loss=0.09621, over 2568423.87 frames. ], batch size: 43, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:26:56,998 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.724e+02 1.856e+02 2.053e+02 2.773e+02, threshold=3.713e+02, percent-clipped=0.0 2024-06-20 07:27:02,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=158616.33333333334, ans=0.025 2024-06-20 07:27:06,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=158616.33333333334, ans=0.125 2024-06-20 07:27:15,951 INFO [train.py:1028] (1/2) Epoch 9, batch 5600, loss[loss=0.2041, simple_loss=0.2446, pruned_loss=0.0818, over 13235.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2655, pruned_loss=0.09549, over 2570219.00 frames. ], batch size: 89, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:27:21,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=158653.0, ans=0.09899494936611666 2024-06-20 07:27:24,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=158653.0, ans=0.0 2024-06-20 07:27:32,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=158689.66666666666, ans=0.125 2024-06-20 07:27:34,006 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.79 vs. limit=22.5 2024-06-20 07:27:35,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=158689.66666666666, ans=0.0 2024-06-20 07:27:38,585 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2024-06-20 07:27:42,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=158708.0, ans=0.1 2024-06-20 07:27:46,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=158708.0, ans=0.05 2024-06-20 07:27:55,285 INFO [train.py:1028] (1/2) Epoch 9, batch 5650, loss[loss=0.23, simple_loss=0.2624, pruned_loss=0.09879, over 12584.00 frames. 
], tot_loss[loss=0.2282, simple_loss=0.2656, pruned_loss=0.09538, over 2574259.09 frames. ], batch size: 202, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:28:01,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=158763.0, ans=0.125 2024-06-20 07:28:08,851 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.710e+02 1.801e+02 1.989e+02 2.965e+02, threshold=3.601e+02, percent-clipped=0.0 2024-06-20 07:28:28,562 INFO [train.py:1028] (1/2) Epoch 9, batch 5700, loss[loss=0.2268, simple_loss=0.2692, pruned_loss=0.09222, over 13297.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2649, pruned_loss=0.09496, over 2578578.71 frames. ], batch size: 63, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:28:42,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=158873.0, ans=0.2 2024-06-20 07:28:48,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.02 vs. limit=22.5 2024-06-20 07:28:49,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=158891.33333333334, ans=0.0 2024-06-20 07:28:49,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=158891.33333333334, ans=0.125 2024-06-20 07:28:57,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=158909.66666666666, ans=0.0 2024-06-20 07:28:58,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.22 vs. limit=15.0 2024-06-20 07:28:59,466 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.11 vs. limit=10.0 2024-06-20 07:29:00,334 INFO [train.py:1028] (1/2) Epoch 9, batch 5750, loss[loss=0.2604, simple_loss=0.2884, pruned_loss=0.1162, over 12647.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2661, pruned_loss=0.09574, over 2577974.40 frames. ], batch size: 176, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:29:04,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158928.0, ans=0.1 2024-06-20 07:29:05,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=158928.0, ans=0.0 2024-06-20 07:29:11,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=158946.33333333334, ans=0.0 2024-06-20 07:29:16,747 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.736e+02 1.859e+02 2.024e+02 3.196e+02, threshold=3.719e+02, percent-clipped=0.0 2024-06-20 07:29:24,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=158983.0, ans=0.0 2024-06-20 07:29:24,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=158983.0, ans=0.125 2024-06-20 07:29:39,924 INFO [train.py:1028] (1/2) Epoch 9, batch 5800, loss[loss=0.2466, simple_loss=0.2711, pruned_loss=0.1111, over 12824.00 frames. 
], tot_loss[loss=0.2304, simple_loss=0.2675, pruned_loss=0.0967, over 2576984.24 frames. ], batch size: 177, lr: 6.50e-03, grad_scale: 64.0 2024-06-20 07:29:46,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.01 vs. limit=22.5 2024-06-20 07:29:49,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=159038.0, ans=0.125 2024-06-20 07:29:54,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=159056.33333333334, ans=0.2 2024-06-20 07:29:56,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=159056.33333333334, ans=0.0 2024-06-20 07:30:12,824 INFO [train.py:1028] (1/2) Epoch 9, batch 5850, loss[loss=0.251, simple_loss=0.2827, pruned_loss=0.1097, over 12546.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2702, pruned_loss=0.09826, over 2575957.54 frames. ], batch size: 202, lr: 6.50e-03, grad_scale: 64.0 2024-06-20 07:30:21,761 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.48 vs. limit=15.0 2024-06-20 07:30:22,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=159129.66666666666, ans=0.125 2024-06-20 07:30:26,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159148.0, ans=0.125 2024-06-20 07:30:26,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.70 vs. limit=15.0 2024-06-20 07:30:26,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.856e+02 1.998e+02 2.219e+02 2.902e+02, threshold=3.995e+02, percent-clipped=0.0 2024-06-20 07:30:34,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.75 vs. limit=15.0 2024-06-20 07:30:47,417 INFO [train.py:1028] (1/2) Epoch 9, batch 5900, loss[loss=0.2398, simple_loss=0.2683, pruned_loss=0.1057, over 13109.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2723, pruned_loss=0.0992, over 2577540.41 frames. ], batch size: 121, lr: 6.50e-03, grad_scale: 128.0 2024-06-20 07:30:59,155 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:30:59,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=159221.33333333334, ans=0.2 2024-06-20 07:31:12,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.90 vs. limit=10.0 2024-06-20 07:31:24,816 INFO [train.py:1028] (1/2) Epoch 9, batch 5950, loss[loss=0.2244, simple_loss=0.2564, pruned_loss=0.09623, over 13117.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.2733, pruned_loss=0.0997, over 2582260.13 frames. ], batch size: 121, lr: 6.50e-03, grad_scale: 128.0 2024-06-20 07:31:30,774 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.90 vs. 
limit=6.0 2024-06-20 07:31:31,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=159313.0, ans=0.0 2024-06-20 07:31:32,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=159313.0, ans=0.125 2024-06-20 07:31:38,076 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.771e+02 1.909e+02 2.139e+02 2.823e+02, threshold=3.819e+02, percent-clipped=0.0 2024-06-20 07:31:53,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=159349.66666666666, ans=0.125 2024-06-20 07:31:54,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=159368.0, ans=0.09899494936611666 2024-06-20 07:31:57,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=159368.0, ans=0.125 2024-06-20 07:32:00,998 INFO [train.py:1028] (1/2) Epoch 9, batch 6000, loss[loss=0.2991, simple_loss=0.3165, pruned_loss=0.1409, over 12190.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.2748, pruned_loss=0.1002, over 2575052.75 frames. ], batch size: 241, lr: 6.50e-03, grad_scale: 128.0 2024-06-20 07:32:00,998 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 07:32:08,959 INFO [train.py:1060] (1/2) Epoch 9, validation: loss=0.2023, simple_loss=0.2651, pruned_loss=0.06971, over 351949.00 frames. 2024-06-20 07:32:08,960 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 07:32:24,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159423.0, ans=0.0 2024-06-20 07:32:29,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=159441.33333333334, ans=0.125 2024-06-20 07:32:35,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=159441.33333333334, ans=0.1 2024-06-20 07:32:42,907 INFO [train.py:1028] (1/2) Epoch 9, batch 6050, loss[loss=0.2383, simple_loss=0.2856, pruned_loss=0.09549, over 12950.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.2766, pruned_loss=0.1006, over 2576999.40 frames. ], batch size: 39, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:32:52,354 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2024-06-20 07:32:56,142 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.843e+02 2.047e+02 2.369e+02 3.231e+02, threshold=4.094e+02, percent-clipped=0.0 2024-06-20 07:33:08,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=159551.33333333334, ans=0.125 2024-06-20 07:33:18,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159551.33333333334, ans=0.125 2024-06-20 07:33:19,174 INFO [train.py:1028] (1/2) Epoch 9, batch 6100, loss[loss=0.2446, simple_loss=0.2706, pruned_loss=0.1093, over 13122.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.2783, pruned_loss=0.1015, over 2578988.61 frames. 
], batch size: 121, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:33:40,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2024-06-20 07:33:56,371 INFO [train.py:1028] (1/2) Epoch 9, batch 6150, loss[loss=0.2527, simple_loss=0.2848, pruned_loss=0.1103, over 10841.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.2803, pruned_loss=0.1025, over 2578565.37 frames. ], batch size: 303, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:33:58,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=159661.33333333334, ans=0.0 2024-06-20 07:34:06,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=159679.66666666666, ans=0.0 2024-06-20 07:34:09,724 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.831e+02 1.997e+02 2.236e+02 3.156e+02, threshold=3.994e+02, percent-clipped=0.0 2024-06-20 07:34:10,149 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.12 vs. limit=15.0 2024-06-20 07:34:12,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=159698.0, ans=0.125 2024-06-20 07:34:19,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=159716.33333333334, ans=0.125 2024-06-20 07:34:20,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.66 vs. limit=22.5 2024-06-20 07:34:21,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=159716.33333333334, ans=0.0 2024-06-20 07:34:22,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=159734.66666666666, ans=0.0 2024-06-20 07:34:23,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=159734.66666666666, ans=0.125 2024-06-20 07:34:24,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=159734.66666666666, ans=0.2 2024-06-20 07:34:29,427 INFO [train.py:1028] (1/2) Epoch 9, batch 6200, loss[loss=0.2699, simple_loss=0.3141, pruned_loss=0.1128, over 13251.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.282, pruned_loss=0.1035, over 2576141.61 frames. 
], batch size: 89, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:34:32,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=159753.0, ans=0.0 2024-06-20 07:34:37,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=159771.33333333334, ans=0.125 2024-06-20 07:34:45,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=159789.66666666666, ans=0.125 2024-06-20 07:34:47,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=159789.66666666666, ans=0.125 2024-06-20 07:34:49,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159808.0, ans=0.125 2024-06-20 07:34:49,282 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.59 vs. limit=15.0 2024-06-20 07:34:50,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159808.0, ans=0.1 2024-06-20 07:34:50,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=159808.0, ans=0.125 2024-06-20 07:34:52,672 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0 2024-06-20 07:35:01,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159826.33333333334, ans=0.0 2024-06-20 07:35:02,949 INFO [train.py:1028] (1/2) Epoch 9, batch 6250, loss[loss=0.2542, simple_loss=0.2855, pruned_loss=0.1114, over 13194.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.2836, pruned_loss=0.1044, over 2568177.65 frames. ], batch size: 83, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:35:18,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=159863.0, ans=0.025 2024-06-20 07:35:19,592 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.633e+02 1.839e+02 1.992e+02 2.203e+02 2.987e+02, threshold=3.985e+02, percent-clipped=0.0 2024-06-20 07:35:28,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=159899.66666666666, ans=0.0 2024-06-20 07:35:29,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159899.66666666666, ans=0.0 2024-06-20 07:35:30,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=159899.66666666666, ans=0.05 2024-06-20 07:35:34,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=159918.0, ans=0.125 2024-06-20 07:35:39,189 INFO [train.py:1028] (1/2) Epoch 9, batch 6300, loss[loss=0.2007, simple_loss=0.2543, pruned_loss=0.0735, over 11856.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.2851, pruned_loss=0.1049, over 2564204.91 frames. 
], batch size: 17, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:35:58,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.46 vs. limit=10.0 2024-06-20 07:36:05,180 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.05 vs. limit=10.0 2024-06-20 07:36:11,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=160009.66666666666, ans=0.025 2024-06-20 07:36:15,122 INFO [train.py:1028] (1/2) Epoch 9, batch 6350, loss[loss=0.2977, simple_loss=0.3201, pruned_loss=0.1376, over 12520.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2861, pruned_loss=0.1047, over 2574127.48 frames. ], batch size: 202, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:36:19,242 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:36:28,033 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.856e+02 1.999e+02 2.150e+02 3.070e+02, threshold=3.997e+02, percent-clipped=0.0 2024-06-20 07:36:38,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=160083.0, ans=0.0 2024-06-20 07:36:39,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=160083.0, ans=0.025 2024-06-20 07:36:46,132 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:36:47,961 INFO [train.py:1028] (1/2) Epoch 9, batch 6400, loss[loss=0.2301, simple_loss=0.2825, pruned_loss=0.08884, over 13253.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.2883, pruned_loss=0.1055, over 2574445.12 frames. ], batch size: 67, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:36:52,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.02 vs. limit=15.0 2024-06-20 07:37:01,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=160156.33333333334, ans=0.0 2024-06-20 07:37:02,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=160156.33333333334, ans=0.2 2024-06-20 07:37:03,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=160156.33333333334, ans=0.125 2024-06-20 07:37:10,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=160174.66666666666, ans=0.125 2024-06-20 07:37:23,379 INFO [train.py:1028] (1/2) Epoch 9, batch 6450, loss[loss=0.3042, simple_loss=0.3325, pruned_loss=0.1379, over 12602.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.2901, pruned_loss=0.1063, over 2579676.92 frames. ], batch size: 202, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:37:28,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.14 vs. 
limit=15.0 2024-06-20 07:37:36,759 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.642e+02 1.952e+02 2.128e+02 2.406e+02 3.588e+02, threshold=4.256e+02, percent-clipped=0.0 2024-06-20 07:37:51,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=160284.66666666666, ans=0.125 2024-06-20 07:38:00,753 INFO [train.py:1028] (1/2) Epoch 9, batch 6500, loss[loss=0.2666, simple_loss=0.2886, pruned_loss=0.1222, over 10797.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.2925, pruned_loss=0.1073, over 2583676.13 frames. ], batch size: 304, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:38:00,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=160303.0, ans=0.0 2024-06-20 07:38:00,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=160303.0, ans=0.0 2024-06-20 07:38:14,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. limit=15.0 2024-06-20 07:38:20,733 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:38:26,575 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.85 vs. limit=10.0 2024-06-20 07:38:26,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=160376.33333333334, ans=0.0 2024-06-20 07:38:32,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=160376.33333333334, ans=0.2 2024-06-20 07:38:33,357 INFO [train.py:1028] (1/2) Epoch 9, batch 6550, loss[loss=0.2726, simple_loss=0.3133, pruned_loss=0.116, over 12607.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.2932, pruned_loss=0.1074, over 2587593.63 frames. ], batch size: 22, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:38:34,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160394.66666666666, ans=0.125 2024-06-20 07:38:45,734 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.862e+02 1.995e+02 2.141e+02 2.676e+02, threshold=3.989e+02, percent-clipped=0.0 2024-06-20 07:38:52,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=160449.66666666666, ans=0.0 2024-06-20 07:38:52,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=160449.66666666666, ans=0.0 2024-06-20 07:38:59,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160468.0, ans=0.1 2024-06-20 07:39:04,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=160486.33333333334, ans=0.2 2024-06-20 07:39:04,816 INFO [train.py:1028] (1/2) Epoch 9, batch 6600, loss[loss=0.2255, simple_loss=0.2749, pruned_loss=0.08802, over 13088.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.2925, pruned_loss=0.1066, over 2589185.68 frames. 
], batch size: 71, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:39:13,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=160504.66666666666, ans=0.0 2024-06-20 07:39:17,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=160523.0, ans=0.0 2024-06-20 07:39:18,050 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.70 vs. limit=15.0 2024-06-20 07:39:20,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160523.0, ans=0.1 2024-06-20 07:39:30,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=160541.33333333334, ans=0.025 2024-06-20 07:39:30,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=160541.33333333334, ans=0.2 2024-06-20 07:39:30,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=160541.33333333334, ans=15.0 2024-06-20 07:39:30,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160541.33333333334, ans=0.1 2024-06-20 07:39:37,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=160559.66666666666, ans=0.025 2024-06-20 07:39:40,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=160578.0, ans=0.0 2024-06-20 07:39:40,781 INFO [train.py:1028] (1/2) Epoch 9, batch 6650, loss[loss=0.2576, simple_loss=0.2983, pruned_loss=0.1084, over 12897.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.2949, pruned_loss=0.108, over 2583169.24 frames. ], batch size: 158, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:39:42,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=160578.0, ans=0.05 2024-06-20 07:39:45,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=15.0 2024-06-20 07:39:52,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=160596.33333333334, ans=0.125 2024-06-20 07:39:53,893 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 2.016e+02 2.303e+02 2.686e+02 4.382e+02, threshold=4.607e+02, percent-clipped=3.0 2024-06-20 07:39:54,977 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-20 07:39:55,088 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.23 vs. limit=15.0 2024-06-20 07:39:57,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=160614.66666666666, ans=0.0 2024-06-20 07:40:17,321 INFO [train.py:1028] (1/2) Epoch 9, batch 6700, loss[loss=0.2655, simple_loss=0.2995, pruned_loss=0.1157, over 12822.00 frames. 
], tot_loss[loss=0.2561, simple_loss=0.2956, pruned_loss=0.1083, over 2583382.12 frames. ], batch size: 177, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:40:21,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2024-06-20 07:40:22,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=160669.66666666666, ans=0.2 2024-06-20 07:40:32,420 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:40:33,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=160706.33333333334, ans=0.2 2024-06-20 07:40:34,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=160706.33333333334, ans=0.1 2024-06-20 07:40:39,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=160724.66666666666, ans=0.0 2024-06-20 07:40:48,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=160743.0, ans=0.125 2024-06-20 07:40:50,222 INFO [train.py:1028] (1/2) Epoch 9, batch 6750, loss[loss=0.3198, simple_loss=0.3505, pruned_loss=0.1445, over 12242.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.2977, pruned_loss=0.1098, over 2576864.56 frames. ], batch size: 241, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:40:51,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=160761.33333333334, ans=0.2 2024-06-20 07:40:52,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=160761.33333333334, ans=0.1 2024-06-20 07:40:57,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=160779.66666666666, ans=0.0 2024-06-20 07:41:02,978 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.885e+02 2.082e+02 2.311e+02 3.918e+02, threshold=4.165e+02, percent-clipped=0.0 2024-06-20 07:41:04,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=160798.0, ans=0.125 2024-06-20 07:41:06,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=160798.0, ans=0.125 2024-06-20 07:41:10,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=160816.33333333334, ans=0.125 2024-06-20 07:41:10,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160816.33333333334, ans=0.125 2024-06-20 07:41:12,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=160816.33333333334, ans=0.0 2024-06-20 07:41:13,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=160816.33333333334, ans=10.0 2024-06-20 07:41:19,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=160834.66666666666, ans=0.125 
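A note on decoding the records above. The bracketed per-batch losses are consistent with the combination loss = 0.5 * simple_loss + pruned_loss throughout this excerpt (e.g. for batch 6750: 0.5 * 0.2977 + 0.1098 ≈ 0.2586, the reported total), and every `Clipping_scale=2.0` warning reports a threshold equal to 2.0 times the median gradient norm, i.e. the middle value of the five logged quartiles. Below is a minimal, self-contained check of both relationships against numbers taken verbatim from the records above; the variable names are ours, introduced only for this sketch:

```python
# Sanity checks against values copied from the log records above.

# 1) Loss composition: tot_loss is consistent with
#    0.5 * simple_loss + pruned_loss.
#    From "Epoch 9, batch 6750 ... tot_loss[loss=0.2586,
#    simple_loss=0.2977, pruned_loss=0.1098, ...]":
simple_loss, pruned_loss, reported_total = 0.2977, 0.1098, 0.2586
assert abs(0.5 * simple_loss + pruned_loss - reported_total) < 1e-3

# 2) Clipping threshold: with Clipping_scale=2.0, the reported threshold
#    is twice the median gradient norm (middle of the five quartiles).
#    From "grad-norm quartiles 1.664e+02 1.885e+02 2.082e+02 2.311e+02
#    3.918e+02, threshold=4.165e+02":
quartiles = [1.664e2, 1.885e2, 2.082e2, 2.311e2, 3.918e2]
median, reported_threshold = quartiles[2], 4.165e2
assert abs(2.0 * median - reported_threshold) < 0.5
```

The slow drift of `lr:` from 6.50e-03 toward 6.40e-03 across these batches likewise indicates a learning rate scheduled on batch and epoch counts rather than held fixed.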
2024-06-20 07:41:25,966 INFO [train.py:1028] (1/2) Epoch 9, batch 6800, loss[loss=0.2455, simple_loss=0.2877, pruned_loss=0.1017, over 13230.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.299, pruned_loss=0.1098, over 2579363.07 frames. ], batch size: 67, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:41:29,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=160853.0, ans=0.0 2024-06-20 07:41:36,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.41 vs. limit=15.0 2024-06-20 07:41:40,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=160889.66666666666, ans=0.125 2024-06-20 07:41:50,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160908.0, ans=0.1 2024-06-20 07:41:51,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=160926.33333333334, ans=0.025 2024-06-20 07:41:58,642 INFO [train.py:1028] (1/2) Epoch 9, batch 6850, loss[loss=0.2777, simple_loss=0.3289, pruned_loss=0.1133, over 13317.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3, pruned_loss=0.11, over 2583622.92 frames. ], batch size: 63, lr: 6.46e-03, grad_scale: 128.0 2024-06-20 07:42:15,084 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.845e+02 2.046e+02 2.388e+02 3.410e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 07:42:15,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0 2024-06-20 07:42:24,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=160999.66666666666, ans=0.125 2024-06-20 07:42:24,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=160999.66666666666, ans=10.0 2024-06-20 07:42:34,718 INFO [train.py:1028] (1/2) Epoch 9, batch 6900, loss[loss=0.258, simple_loss=0.2942, pruned_loss=0.1109, over 13264.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3008, pruned_loss=0.1103, over 2585983.85 frames. 
], batch size: 49, lr: 6.46e-03, grad_scale: 128.0 2024-06-20 07:42:39,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=161036.33333333334, ans=15.0 2024-06-20 07:42:45,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=161054.66666666666, ans=0.2 2024-06-20 07:42:46,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=161054.66666666666, ans=0.07 2024-06-20 07:42:48,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=161073.0, ans=0.125 2024-06-20 07:42:48,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=161073.0, ans=0.0 2024-06-20 07:42:49,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=161073.0, ans=0.2 2024-06-20 07:42:50,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=161073.0, ans=0.125 2024-06-20 07:43:03,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161109.66666666666, ans=0.1 2024-06-20 07:43:03,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=161109.66666666666, ans=0.125 2024-06-20 07:43:05,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=161109.66666666666, ans=0.2 2024-06-20 07:43:07,732 INFO [train.py:1028] (1/2) Epoch 9, batch 6950, loss[loss=0.2118, simple_loss=0.266, pruned_loss=0.07882, over 11321.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3012, pruned_loss=0.1101, over 2579152.75 frames. ], batch size: 16, lr: 6.46e-03, grad_scale: 64.0 2024-06-20 07:43:13,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=161128.0, ans=0.0 2024-06-20 07:43:17,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.87 vs. limit=15.0 2024-06-20 07:43:21,675 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.896e+02 2.057e+02 2.321e+02 3.049e+02, threshold=4.115e+02, percent-clipped=0.0 2024-06-20 07:43:25,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=161164.66666666666, ans=0.125 2024-06-20 07:43:44,277 INFO [train.py:1028] (1/2) Epoch 9, batch 7000, loss[loss=0.2763, simple_loss=0.3115, pruned_loss=0.1206, over 13006.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3011, pruned_loss=0.1102, over 2575280.35 frames. ], batch size: 158, lr: 6.46e-03, grad_scale: 64.0 2024-06-20 07:43:48,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=161219.66666666666, ans=0.0 2024-06-20 07:43:51,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.21 vs. 
limit=10.0 2024-06-20 07:44:02,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161256.33333333334, ans=0.1 2024-06-20 07:44:03,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=161256.33333333334, ans=0.125 2024-06-20 07:44:21,925 INFO [train.py:1028] (1/2) Epoch 9, batch 7050, loss[loss=0.2736, simple_loss=0.3069, pruned_loss=0.1202, over 12762.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3014, pruned_loss=0.1102, over 2582877.56 frames. ], batch size: 176, lr: 6.46e-03, grad_scale: 64.0 2024-06-20 07:44:26,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=161311.33333333334, ans=0.0 2024-06-20 07:44:38,490 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.48 vs. limit=15.0 2024-06-20 07:44:39,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=161329.66666666666, ans=0.125 2024-06-20 07:44:39,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=161329.66666666666, ans=0.0 2024-06-20 07:44:40,629 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.896e+02 2.023e+02 2.166e+02 2.591e+02, threshold=4.045e+02, percent-clipped=0.0 2024-06-20 07:44:59,452 INFO [train.py:1028] (1/2) Epoch 9, batch 7100, loss[loss=0.2918, simple_loss=0.3279, pruned_loss=0.1278, over 13197.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3026, pruned_loss=0.111, over 2575053.84 frames. ], batch size: 112, lr: 6.46e-03, grad_scale: 64.0 2024-06-20 07:45:02,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=161403.0, ans=0.125 2024-06-20 07:45:02,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=161403.0, ans=0.09899494936611666 2024-06-20 07:45:12,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=161439.66666666666, ans=0.0 2024-06-20 07:45:14,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=161439.66666666666, ans=0.0 2024-06-20 07:45:25,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=15.0 2024-06-20 07:45:32,237 INFO [train.py:1028] (1/2) Epoch 9, batch 7150, loss[loss=0.2756, simple_loss=0.3067, pruned_loss=0.1223, over 12566.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3035, pruned_loss=0.1112, over 2573668.28 frames. 
], batch size: 202, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:45:33,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=161494.66666666666, ans=0.125 2024-06-20 07:45:49,320 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.902e+02 2.034e+02 2.249e+02 2.810e+02, threshold=4.068e+02, percent-clipped=0.0 2024-06-20 07:45:57,173 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:46:01,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2024-06-20 07:46:08,416 INFO [train.py:1028] (1/2) Epoch 9, batch 7200, loss[loss=0.2865, simple_loss=0.3213, pruned_loss=0.1258, over 13205.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3052, pruned_loss=0.1119, over 2578796.89 frames. ], batch size: 112, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:46:13,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=161586.33333333334, ans=0.2 2024-06-20 07:46:28,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=16.12 vs. limit=15.0 2024-06-20 07:46:29,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=161623.0, ans=0.125 2024-06-20 07:46:33,386 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2024-06-20 07:46:38,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=161659.66666666666, ans=15.0 2024-06-20 07:46:44,610 INFO [train.py:1028] (1/2) Epoch 9, batch 7250, loss[loss=0.2531, simple_loss=0.2998, pruned_loss=0.1032, over 12962.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3056, pruned_loss=0.1117, over 2579849.17 frames. 
], batch size: 36, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:46:46,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=161678.0, ans=0.125 2024-06-20 07:46:58,452 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.912e+02 2.025e+02 2.243e+02 3.186e+02, threshold=4.050e+02, percent-clipped=0.0 2024-06-20 07:46:58,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161714.66666666666, ans=0.1 2024-06-20 07:47:02,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=161714.66666666666, ans=0.025 2024-06-20 07:47:04,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=161733.0, ans=0.2 2024-06-20 07:47:08,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=161733.0, ans=0.125 2024-06-20 07:47:08,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161733.0, ans=0.1 2024-06-20 07:47:18,011 INFO [train.py:1028] (1/2) Epoch 9, batch 7300, loss[loss=0.2328, simple_loss=0.2824, pruned_loss=0.09165, over 12967.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3069, pruned_loss=0.1127, over 2580020.90 frames. ], batch size: 36, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:47:21,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=161769.66666666666, ans=0.125 2024-06-20 07:47:28,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=15.0 2024-06-20 07:47:31,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=161806.33333333334, ans=0.1 2024-06-20 07:47:33,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=161806.33333333334, ans=0.125 2024-06-20 07:47:34,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=161806.33333333334, ans=0.125 2024-06-20 07:47:34,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=161806.33333333334, ans=0.5 2024-06-20 07:47:38,370 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.02 vs. limit=15.0 2024-06-20 07:47:39,876 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.46 vs. limit=15.0 2024-06-20 07:47:53,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2024-06-20 07:47:54,296 INFO [train.py:1028] (1/2) Epoch 9, batch 7350, loss[loss=0.2908, simple_loss=0.3274, pruned_loss=0.1271, over 13274.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3076, pruned_loss=0.113, over 2581680.81 frames. 
], batch size: 46, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:47:55,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=161861.33333333334, ans=0.125 2024-06-20 07:47:56,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=161861.33333333334, ans=0.2 2024-06-20 07:47:56,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=161861.33333333334, ans=0.0 2024-06-20 07:48:02,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161879.66666666666, ans=0.1 2024-06-20 07:48:08,029 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.922e+02 2.132e+02 2.512e+02 3.948e+02, threshold=4.263e+02, percent-clipped=0.0 2024-06-20 07:48:14,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=161916.33333333334, ans=15.0 2024-06-20 07:48:15,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=161916.33333333334, ans=0.025 2024-06-20 07:48:29,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161934.66666666666, ans=0.1 2024-06-20 07:48:29,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=161934.66666666666, ans=0.125 2024-06-20 07:48:30,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.39 vs. limit=22.5 2024-06-20 07:48:31,330 INFO [train.py:1028] (1/2) Epoch 9, batch 7400, loss[loss=0.3018, simple_loss=0.34, pruned_loss=0.1318, over 13267.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3081, pruned_loss=0.1131, over 2586745.78 frames. ], batch size: 63, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:48:54,882 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.84 vs. limit=15.0 2024-06-20 07:48:58,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=162026.33333333334, ans=0.125 2024-06-20 07:49:05,214 INFO [train.py:1028] (1/2) Epoch 9, batch 7450, loss[loss=0.2564, simple_loss=0.2936, pruned_loss=0.1096, over 13051.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3068, pruned_loss=0.112, over 2580250.30 frames. 
], batch size: 30, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:49:06,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=162044.66666666666, ans=0.125 2024-06-20 07:49:09,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=162044.66666666666, ans=0.025 2024-06-20 07:49:19,409 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.889e+02 2.044e+02 2.294e+02 3.470e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 07:49:20,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=162081.33333333334, ans=0.125 2024-06-20 07:49:22,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=162081.33333333334, ans=0.2 2024-06-20 07:49:27,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=162099.66666666666, ans=0.2 2024-06-20 07:49:32,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=162118.0, ans=0.125 2024-06-20 07:49:38,707 INFO [train.py:1028] (1/2) Epoch 9, batch 7500, loss[loss=0.2559, simple_loss=0.2823, pruned_loss=0.1148, over 10614.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3072, pruned_loss=0.1123, over 2578197.23 frames. ], batch size: 304, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:49:40,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=162136.33333333334, ans=0.125 2024-06-20 07:49:49,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=162154.66666666666, ans=0.0 2024-06-20 07:49:49,587 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:49:57,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.82 vs. limit=15.0 2024-06-20 07:50:14,568 INFO [train.py:1028] (1/2) Epoch 9, batch 7550, loss[loss=0.2689, simple_loss=0.3058, pruned_loss=0.116, over 12972.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3083, pruned_loss=0.1132, over 2576889.87 frames. ], batch size: 158, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:50:28,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=162246.33333333334, ans=0.125 2024-06-20 07:50:32,126 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.869e+02 2.051e+02 2.235e+02 2.683e+02, threshold=4.103e+02, percent-clipped=0.0 2024-06-20 07:50:45,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162301.33333333334, ans=0.1 2024-06-20 07:50:46,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=162301.33333333334, ans=0.125 2024-06-20 07:50:51,625 INFO [train.py:1028] (1/2) Epoch 9, batch 7600, loss[loss=0.2539, simple_loss=0.2934, pruned_loss=0.1072, over 13238.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3087, pruned_loss=0.1134, over 2577527.60 frames. 
], batch size: 83, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:50:55,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.14 vs. limit=22.5 2024-06-20 07:51:00,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.53 vs. limit=10.0 2024-06-20 07:51:01,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=162338.0, ans=0.0 2024-06-20 07:51:10,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=162356.33333333334, ans=0.125 2024-06-20 07:51:13,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=162374.66666666666, ans=0.125 2024-06-20 07:51:14,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=162374.66666666666, ans=0.09899494936611666 2024-06-20 07:51:16,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.56 vs. limit=22.5 2024-06-20 07:51:25,166 INFO [train.py:1028] (1/2) Epoch 9, batch 7650, loss[loss=0.2335, simple_loss=0.2779, pruned_loss=0.09455, over 12900.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3089, pruned_loss=0.1135, over 2572801.70 frames. ], batch size: 33, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:51:39,649 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.909e+02 2.071e+02 2.328e+02 3.395e+02, threshold=4.142e+02, percent-clipped=0.0 2024-06-20 07:51:52,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162466.33333333334, ans=0.1 2024-06-20 07:52:01,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162503.0, ans=0.1 2024-06-20 07:52:02,346 INFO [train.py:1028] (1/2) Epoch 9, batch 7700, loss[loss=0.2856, simple_loss=0.3337, pruned_loss=0.1188, over 13234.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3096, pruned_loss=0.1138, over 2569514.96 frames. ], batch size: 63, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:52:25,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=162558.0, ans=0.125 2024-06-20 07:52:31,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=162576.33333333334, ans=0.2 2024-06-20 07:52:37,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=162594.66666666666, ans=0.1 2024-06-20 07:52:38,241 INFO [train.py:1028] (1/2) Epoch 9, batch 7750, loss[loss=0.2818, simple_loss=0.3284, pruned_loss=0.1176, over 13176.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3111, pruned_loss=0.1147, over 2573588.29 frames. 
], batch size: 72, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:52:42,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=162594.66666666666, ans=0.125 2024-06-20 07:52:46,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=162613.0, ans=0.125 2024-06-20 07:52:52,084 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.653e+02 1.904e+02 2.052e+02 2.223e+02 3.325e+02, threshold=4.104e+02, percent-clipped=0.0 2024-06-20 07:52:56,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.67 vs. limit=10.0 2024-06-20 07:52:58,901 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:52:59,103 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.89 vs. limit=15.0 2024-06-20 07:53:04,894 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=15.0 2024-06-20 07:53:11,307 INFO [train.py:1028] (1/2) Epoch 9, batch 7800, loss[loss=0.2786, simple_loss=0.3133, pruned_loss=0.122, over 13152.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3114, pruned_loss=0.1145, over 2577492.27 frames. ], batch size: 95, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:53:21,236 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. limit=10.0 2024-06-20 07:53:28,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=162723.0, ans=0.0 2024-06-20 07:53:39,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=162741.33333333334, ans=0.125 2024-06-20 07:53:48,031 INFO [train.py:1028] (1/2) Epoch 9, batch 7850, loss[loss=0.2771, simple_loss=0.3114, pruned_loss=0.1214, over 12205.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3126, pruned_loss=0.1154, over 2572948.27 frames. ], batch size: 18, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:53:51,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.45 vs. limit=15.0 2024-06-20 07:54:02,174 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.031e+02 2.244e+02 2.640e+02 3.267e+02, threshold=4.488e+02, percent-clipped=0.0 2024-06-20 07:54:02,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=162814.66666666666, ans=0.0 2024-06-20 07:54:04,494 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=15.0 2024-06-20 07:54:11,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=162833.0, ans=0.2 2024-06-20 07:54:24,877 INFO [train.py:1028] (1/2) Epoch 9, batch 7900, loss[loss=0.2525, simple_loss=0.3012, pruned_loss=0.1019, over 13140.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3122, pruned_loss=0.1152, over 2570850.53 frames. 
], batch size: 77, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:54:33,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=162888.0, ans=0.04949747468305833 2024-06-20 07:54:33,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.05 vs. limit=22.5 2024-06-20 07:54:40,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=162906.33333333334, ans=0.125 2024-06-20 07:54:41,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.92 vs. limit=10.0 2024-06-20 07:54:48,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=162924.66666666666, ans=0.125 2024-06-20 07:54:49,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.59 vs. limit=15.0 2024-06-20 07:54:52,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=162943.0, ans=0.125 2024-06-20 07:54:55,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=162943.0, ans=0.0 2024-06-20 07:54:58,207 INFO [train.py:1028] (1/2) Epoch 9, batch 7950, loss[loss=0.2834, simple_loss=0.3055, pruned_loss=0.1306, over 10402.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3125, pruned_loss=0.1152, over 2574645.96 frames. ], batch size: 303, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:55:04,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=162979.66666666666, ans=0.07 2024-06-20 07:55:07,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=162979.66666666666, ans=0.0 2024-06-20 07:55:12,423 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 1.946e+02 2.102e+02 2.334e+02 3.667e+02, threshold=4.204e+02, percent-clipped=0.0 2024-06-20 07:55:27,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=163034.66666666666, ans=0.95 2024-06-20 07:55:31,669 INFO [train.py:1028] (1/2) Epoch 9, batch 8000, loss[loss=0.2559, simple_loss=0.3053, pruned_loss=0.1032, over 12541.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3128, pruned_loss=0.1151, over 2570959.60 frames. 
], batch size: 29, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:55:37,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=163053.0, ans=0.0 2024-06-20 07:55:38,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=163053.0, ans=0.125 2024-06-20 07:55:39,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=163053.0, ans=0.125 2024-06-20 07:55:42,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=163071.33333333334, ans=0.125 2024-06-20 07:55:49,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.33 vs. limit=10.0 2024-06-20 07:55:58,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=163108.0, ans=0.04949747468305833 2024-06-20 07:56:00,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=163126.33333333334, ans=0.125 2024-06-20 07:56:07,775 INFO [train.py:1028] (1/2) Epoch 9, batch 8050, loss[loss=0.2717, simple_loss=0.3168, pruned_loss=0.1133, over 13237.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3124, pruned_loss=0.1146, over 2571192.36 frames. ], batch size: 83, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:56:16,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=163144.66666666666, ans=0.02 2024-06-20 07:56:20,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=163163.0, ans=0.0 2024-06-20 07:56:23,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=163163.0, ans=0.125 2024-06-20 07:56:24,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.911e+02 2.109e+02 2.364e+02 3.890e+02, threshold=4.217e+02, percent-clipped=0.0 2024-06-20 07:56:26,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=163181.33333333334, ans=0.0 2024-06-20 07:56:35,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=163199.66666666666, ans=0.125 2024-06-20 07:56:40,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=163218.0, ans=0.025 2024-06-20 07:56:43,275 INFO [train.py:1028] (1/2) Epoch 9, batch 8100, loss[loss=0.2537, simple_loss=0.3002, pruned_loss=0.1036, over 13125.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3133, pruned_loss=0.1151, over 2575898.95 frames. ], batch size: 112, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:56:43,778 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.80 vs. limit=15.0 2024-06-20 07:56:46,396 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.27 vs. 
limit=15.0 2024-06-20 07:57:01,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163273.0, ans=0.1 2024-06-20 07:57:03,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=163291.33333333334, ans=0.0 2024-06-20 07:57:15,566 INFO [train.py:1028] (1/2) Epoch 9, batch 8150, loss[loss=0.251, simple_loss=0.2914, pruned_loss=0.1053, over 13091.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3127, pruned_loss=0.1144, over 2579706.90 frames. ], batch size: 121, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:57:19,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. limit=6.0 2024-06-20 07:57:27,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=163364.66666666666, ans=0.0 2024-06-20 07:57:28,936 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 1.922e+02 2.171e+02 2.480e+02 3.860e+02, threshold=4.342e+02, percent-clipped=0.0 2024-06-20 07:57:40,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=163383.0, ans=0.125 2024-06-20 07:57:40,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=163383.0, ans=0.2 2024-06-20 07:57:41,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=163383.0, ans=0.0 2024-06-20 07:57:49,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.99 vs. limit=15.0 2024-06-20 07:57:50,891 INFO [train.py:1028] (1/2) Epoch 9, batch 8200, loss[loss=0.2838, simple_loss=0.3262, pruned_loss=0.1207, over 13141.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.313, pruned_loss=0.1144, over 2583017.65 frames. ], batch size: 112, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:57:51,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=163419.66666666666, ans=0.125 2024-06-20 07:58:11,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.53 vs. limit=15.0 2024-06-20 07:58:27,188 INFO [train.py:1028] (1/2) Epoch 9, batch 8250, loss[loss=0.2817, simple_loss=0.332, pruned_loss=0.1157, over 13292.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3133, pruned_loss=0.1146, over 2584209.86 frames. ], batch size: 52, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 07:58:30,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=163511.33333333334, ans=0.1 2024-06-20 07:58:41,076 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.938e+02 2.066e+02 2.346e+02 3.240e+02, threshold=4.132e+02, percent-clipped=0.0 2024-06-20 07:59:00,128 INFO [train.py:1028] (1/2) Epoch 9, batch 8300, loss[loss=0.2848, simple_loss=0.3145, pruned_loss=0.1275, over 13038.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3129, pruned_loss=0.1144, over 2580587.69 frames. 
], batch size: 102, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 07:59:01,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=163603.0, ans=0.125 2024-06-20 07:59:08,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=163621.33333333334, ans=0.025 2024-06-20 07:59:10,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=163621.33333333334, ans=0.1 2024-06-20 07:59:14,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=163639.66666666666, ans=0.125 2024-06-20 07:59:17,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=163639.66666666666, ans=0.125 2024-06-20 07:59:34,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=163676.33333333334, ans=0.0 2024-06-20 07:59:34,928 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.88 vs. limit=22.5 2024-06-20 07:59:36,416 INFO [train.py:1028] (1/2) Epoch 9, batch 8350, loss[loss=0.2751, simple_loss=0.3112, pruned_loss=0.1195, over 13165.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3129, pruned_loss=0.1142, over 2580724.48 frames. ], batch size: 112, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 07:59:43,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=163713.0, ans=0.0 2024-06-20 07:59:48,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=163713.0, ans=0.125 2024-06-20 07:59:50,312 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.869e+02 2.003e+02 2.172e+02 3.219e+02, threshold=4.006e+02, percent-clipped=0.0 2024-06-20 07:59:53,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=163731.33333333334, ans=0.125 2024-06-20 08:00:04,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=23.22 vs. limit=22.5 2024-06-20 08:00:04,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=163768.0, ans=0.125 2024-06-20 08:00:08,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=163768.0, ans=0.0 2024-06-20 08:00:09,560 INFO [train.py:1028] (1/2) Epoch 9, batch 8400, loss[loss=0.2388, simple_loss=0.2835, pruned_loss=0.097, over 12891.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3134, pruned_loss=0.1147, over 2577158.39 frames. 
], batch size: 39, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 08:00:15,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=163786.33333333334, ans=0.025 2024-06-20 08:00:21,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=163804.66666666666, ans=0.125 2024-06-20 08:00:25,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.41 vs. limit=15.0 2024-06-20 08:00:34,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=163841.33333333334, ans=0.05 2024-06-20 08:00:46,234 INFO [train.py:1028] (1/2) Epoch 9, batch 8450, loss[loss=0.2847, simple_loss=0.3273, pruned_loss=0.121, over 13137.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3136, pruned_loss=0.1146, over 2578870.34 frames. ], batch size: 112, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 08:00:50,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.08 vs. limit=15.0 2024-06-20 08:01:00,681 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.902e+02 2.013e+02 2.139e+02 2.893e+02, threshold=4.025e+02, percent-clipped=0.0 2024-06-20 08:01:01,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=15.0 2024-06-20 08:01:01,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=163914.66666666666, ans=0.0 2024-06-20 08:01:02,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=163914.66666666666, ans=0.125 2024-06-20 08:01:09,888 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2024-06-20 08:01:12,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2024-06-20 08:01:20,622 INFO [train.py:1028] (1/2) Epoch 9, batch 8500, loss[loss=0.2664, simple_loss=0.3202, pruned_loss=0.1063, over 12780.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3147, pruned_loss=0.1151, over 2577135.98 frames. ], batch size: 29, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 08:01:28,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2024-06-20 08:01:32,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=163988.0, ans=0.125 2024-06-20 08:01:47,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=164024.66666666666, ans=0.125 2024-06-20 08:01:51,843 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.24 vs. limit=22.5 2024-06-20 08:01:59,512 INFO [train.py:1028] (1/2) Epoch 9, batch 8550, loss[loss=0.2737, simple_loss=0.3263, pruned_loss=0.1105, over 12598.00 frames. 
], tot_loss[loss=0.2717, simple_loss=0.3142, pruned_loss=0.1145, over 2575625.12 frames. ], batch size: 22, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:02:00,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.90 vs. limit=15.0 2024-06-20 08:02:01,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164061.33333333334, ans=0.1 2024-06-20 08:02:02,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=164061.33333333334, ans=0.125 2024-06-20 08:02:03,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=18.39 vs. limit=15.0 2024-06-20 08:02:07,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=164079.66666666666, ans=0.05 2024-06-20 08:02:13,870 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.008e+02 2.155e+02 2.309e+02 3.733e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-20 08:02:22,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=164116.33333333334, ans=0.2 2024-06-20 08:02:36,831 INFO [train.py:1028] (1/2) Epoch 9, batch 8600, loss[loss=0.26, simple_loss=0.3004, pruned_loss=0.1098, over 13130.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3148, pruned_loss=0.1147, over 2574480.49 frames. ], batch size: 121, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:02:45,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=164171.33333333334, ans=0.0 2024-06-20 08:03:07,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=164226.33333333334, ans=0.025 2024-06-20 08:03:10,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=164226.33333333334, ans=0.125 2024-06-20 08:03:11,286 INFO [train.py:1028] (1/2) Epoch 9, batch 8650, loss[loss=0.2628, simple_loss=0.3051, pruned_loss=0.1102, over 13127.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3152, pruned_loss=0.1147, over 2577346.45 frames. ], batch size: 103, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:03:11,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=164244.66666666666, ans=0.2 2024-06-20 08:03:25,107 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 1.961e+02 2.239e+02 2.574e+02 3.794e+02, threshold=4.477e+02, percent-clipped=0.0 2024-06-20 08:03:34,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164299.66666666666, ans=0.1 2024-06-20 08:03:43,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=164318.0, ans=0.125 2024-06-20 08:03:47,807 INFO [train.py:1028] (1/2) Epoch 9, batch 8700, loss[loss=0.2756, simple_loss=0.3251, pruned_loss=0.1131, over 13227.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.316, pruned_loss=0.1153, over 2573639.55 frames. 
], batch size: 59, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:03:53,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164336.33333333334, ans=0.1 2024-06-20 08:03:55,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=164354.66666666666, ans=0.0 2024-06-20 08:03:56,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=164354.66666666666, ans=0.0 2024-06-20 08:04:04,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=164373.0, ans=0.0 2024-06-20 08:04:17,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=164409.66666666666, ans=0.2 2024-06-20 08:04:21,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=164428.0, ans=0.07 2024-06-20 08:04:21,925 INFO [train.py:1028] (1/2) Epoch 9, batch 8750, loss[loss=0.2724, simple_loss=0.3075, pruned_loss=0.1187, over 13081.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3161, pruned_loss=0.1156, over 2569765.08 frames. ], batch size: 121, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:04:32,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=164446.33333333334, ans=0.125 2024-06-20 08:04:38,890 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.872e+02 2.037e+02 2.206e+02 3.116e+02, threshold=4.074e+02, percent-clipped=0.0 2024-06-20 08:04:41,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.54 vs. limit=10.0 2024-06-20 08:04:50,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=164483.0, ans=0.2 2024-06-20 08:04:53,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=164501.33333333334, ans=0.025 2024-06-20 08:04:55,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=12.0 2024-06-20 08:04:58,079 INFO [train.py:1028] (1/2) Epoch 9, batch 8800, loss[loss=0.2877, simple_loss=0.3308, pruned_loss=0.1223, over 13181.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3167, pruned_loss=0.1158, over 2574410.41 frames. ], batch size: 72, lr: 6.39e-03, grad_scale: 64.0 2024-06-20 08:04:58,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=164519.66666666666, ans=0.125 2024-06-20 08:05:02,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=164519.66666666666, ans=0.125 2024-06-20 08:05:13,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=164556.33333333334, ans=0.0 2024-06-20 08:05:24,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.80 vs. limit=15.0 2024-06-20 08:05:31,630 INFO [train.py:1028] (1/2) Epoch 9, batch 8850, loss[loss=0.3129, simple_loss=0.3407, pruned_loss=0.1425, over 12485.00 frames. 
], tot_loss[loss=0.2739, simple_loss=0.3162, pruned_loss=0.1158, over 2563134.67 frames. ], batch size: 202, lr: 6.39e-03, grad_scale: 64.0 2024-06-20 08:05:36,579 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=8.490e+00 2024-06-20 08:05:40,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164611.33333333334, ans=0.1 2024-06-20 08:05:49,006 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.883e+02 2.052e+02 2.301e+02 3.173e+02, threshold=4.104e+02, percent-clipped=0.0 2024-06-20 08:06:01,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=164684.66666666666, ans=0.125 2024-06-20 08:06:04,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=164684.66666666666, ans=0.125 2024-06-20 08:06:04,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=164684.66666666666, ans=0.125 2024-06-20 08:06:08,267 INFO [train.py:1028] (1/2) Epoch 9, batch 8900, loss[loss=0.269, simple_loss=0.3204, pruned_loss=0.1088, over 12967.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3167, pruned_loss=0.1163, over 2560040.87 frames. ], batch size: 33, lr: 6.39e-03, grad_scale: 64.0 2024-06-20 08:06:12,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=164703.0, ans=0.125 2024-06-20 08:06:15,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.12 vs. limit=10.0 2024-06-20 08:06:31,227 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:06:31,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=164758.0, ans=0.0 2024-06-20 08:06:32,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=164758.0, ans=0.035 2024-06-20 08:06:39,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=164776.33333333334, ans=0.125 2024-06-20 08:06:42,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=164776.33333333334, ans=0.07 2024-06-20 08:06:42,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=164776.33333333334, ans=0.0 2024-06-20 08:06:44,433 INFO [train.py:1028] (1/2) Epoch 9, batch 8950, loss[loss=0.2845, simple_loss=0.3244, pruned_loss=0.1223, over 12524.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3166, pruned_loss=0.116, over 2561912.92 frames. 
], batch size: 202, lr: 6.39e-03, grad_scale: 128.0 2024-06-20 08:06:44,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=164794.66666666666, ans=0.125 2024-06-20 08:06:51,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=164813.0, ans=0.2 2024-06-20 08:06:58,675 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 1.919e+02 2.074e+02 2.310e+02 3.014e+02, threshold=4.149e+02, percent-clipped=0.0 2024-06-20 08:07:13,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=164868.0, ans=0.125 2024-06-20 08:07:15,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=164868.0, ans=0.125 2024-06-20 08:07:16,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=164868.0, ans=0.125 2024-06-20 08:07:17,724 INFO [train.py:1028] (1/2) Epoch 9, batch 9000, loss[loss=0.2581, simple_loss=0.3045, pruned_loss=0.1059, over 13258.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3163, pruned_loss=0.1154, over 2568197.17 frames. ], batch size: 46, lr: 6.39e-03, grad_scale: 128.0 2024-06-20 08:07:17,724 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 08:07:22,604 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.5351, 2.8731, 2.8466, 2.8682], device='cuda:1') 2024-06-20 08:07:23,520 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.9849, 4.4013, 2.3872, 4.3390], device='cuda:1') 2024-06-20 08:07:25,756 INFO [train.py:1060] (1/2) Epoch 9, validation: loss=0.2006, simple_loss=0.2641, pruned_loss=0.06858, over 351949.00 frames. 2024-06-20 08:07:25,756 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 08:07:30,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=164886.33333333334, ans=0.0 2024-06-20 08:07:31,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=164886.33333333334, ans=0.125 2024-06-20 08:07:32,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.29 vs. limit=10.0 2024-06-20 08:07:38,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.18 vs. limit=15.0 2024-06-20 08:07:41,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=164923.0, ans=0.125 2024-06-20 08:08:01,925 INFO [train.py:1028] (1/2) Epoch 9, batch 9050, loss[loss=0.2934, simple_loss=0.3266, pruned_loss=0.1301, over 11273.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3174, pruned_loss=0.1162, over 2566670.28 frames. 
], batch size: 17, lr: 6.39e-03, grad_scale: 128.0 2024-06-20 08:08:15,511 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.008e+02 2.224e+02 2.471e+02 3.519e+02, threshold=4.448e+02, percent-clipped=0.0 2024-06-20 08:08:17,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=165014.66666666666, ans=0.025 2024-06-20 08:08:22,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165033.0, ans=0.1 2024-06-20 08:08:28,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=165051.33333333334, ans=0.0 2024-06-20 08:08:30,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=165051.33333333334, ans=0.125 2024-06-20 08:08:31,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.50 vs. limit=15.0 2024-06-20 08:08:34,676 INFO [train.py:1028] (1/2) Epoch 9, batch 9100, loss[loss=0.2594, simple_loss=0.3095, pruned_loss=0.1047, over 13213.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3168, pruned_loss=0.1155, over 2567097.39 frames. ], batch size: 72, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:08:37,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165069.66666666666, ans=0.1 2024-06-20 08:08:43,242 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.21 vs. limit=22.5 2024-06-20 08:08:53,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=165124.66666666666, ans=0.125 2024-06-20 08:08:57,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=165124.66666666666, ans=0.125 2024-06-20 08:09:07,037 INFO [train.py:1028] (1/2) Epoch 9, batch 9150, loss[loss=0.3004, simple_loss=0.3448, pruned_loss=0.1281, over 13165.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3173, pruned_loss=0.1158, over 2567698.36 frames. ], batch size: 77, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:09:11,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.60 vs. limit=15.0 2024-06-20 08:09:12,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=165179.66666666666, ans=0.0 2024-06-20 08:09:16,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=165179.66666666666, ans=0.125 2024-06-20 08:09:23,706 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.910e+02 2.089e+02 2.265e+02 2.863e+02, threshold=4.178e+02, percent-clipped=0.0 2024-06-20 08:09:42,028 INFO [train.py:1028] (1/2) Epoch 9, batch 9200, loss[loss=0.2543, simple_loss=0.3078, pruned_loss=0.1005, over 12908.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3173, pruned_loss=0.1152, over 2570782.19 frames. 
], batch size: 36, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:09:42,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=165253.0, ans=0.0 2024-06-20 08:09:49,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=165271.33333333334, ans=0.125 2024-06-20 08:09:51,282 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.29 vs. limit=22.5 2024-06-20 08:09:54,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=165289.66666666666, ans=0.0 2024-06-20 08:09:56,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=165289.66666666666, ans=0.125 2024-06-20 08:09:56,774 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:10:07,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.03 vs. limit=12.0 2024-06-20 08:10:08,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=165326.33333333334, ans=0.0 2024-06-20 08:10:13,762 INFO [train.py:1028] (1/2) Epoch 9, batch 9250, loss[loss=0.2693, simple_loss=0.32, pruned_loss=0.1093, over 13280.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3173, pruned_loss=0.1151, over 2572813.71 frames. ], batch size: 67, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:10:19,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=165363.0, ans=0.0 2024-06-20 08:10:27,307 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 1.924e+02 2.031e+02 2.209e+02 3.361e+02, threshold=4.062e+02, percent-clipped=0.0 2024-06-20 08:10:30,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.69 vs. limit=12.0 2024-06-20 08:10:35,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165399.66666666666, ans=0.1 2024-06-20 08:10:37,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=165399.66666666666, ans=0.125 2024-06-20 08:10:40,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=165418.0, ans=0.2 2024-06-20 08:10:45,933 INFO [train.py:1028] (1/2) Epoch 9, batch 9300, loss[loss=0.2441, simple_loss=0.2887, pruned_loss=0.09977, over 13009.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3171, pruned_loss=0.1149, over 2569442.79 frames. ], batch size: 39, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:10:46,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=165436.33333333334, ans=0.125 2024-06-20 08:10:47,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=165436.33333333334, ans=0.125 2024-06-20 08:10:48,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.40 vs. 
limit=15.0 2024-06-20 08:10:54,516 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:10:58,295 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:11:04,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.22 vs. limit=15.0 2024-06-20 08:11:08,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=165491.33333333334, ans=0.0 2024-06-20 08:11:11,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=165509.66666666666, ans=0.2 2024-06-20 08:11:17,276 INFO [train.py:1028] (1/2) Epoch 9, batch 9350, loss[loss=0.2078, simple_loss=0.2528, pruned_loss=0.08141, over 12568.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3169, pruned_loss=0.115, over 2567815.41 frames. ], batch size: 22, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:11:17,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=165528.0, ans=0.04949747468305833 2024-06-20 08:11:18,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=165528.0, ans=0.2 2024-06-20 08:11:21,778 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.86 vs. limit=22.5 2024-06-20 08:11:25,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=165546.33333333334, ans=0.125 2024-06-20 08:11:32,485 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.928e+02 2.098e+02 2.270e+02 3.271e+02, threshold=4.197e+02, percent-clipped=0.0 2024-06-20 08:11:40,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=165583.0, ans=0.125 2024-06-20 08:11:48,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=165601.33333333334, ans=0.125 2024-06-20 08:11:50,555 INFO [train.py:1028] (1/2) Epoch 9, batch 9400, loss[loss=0.2857, simple_loss=0.3296, pruned_loss=0.1209, over 13203.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3174, pruned_loss=0.1154, over 2566423.66 frames. ], batch size: 52, lr: 6.37e-03, grad_scale: 128.0 2024-06-20 08:11:57,997 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.09 vs. limit=15.0 2024-06-20 08:12:12,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=165674.66666666666, ans=0.0 2024-06-20 08:12:21,219 INFO [train.py:1028] (1/2) Epoch 9, batch 9450, loss[loss=0.2889, simple_loss=0.3296, pruned_loss=0.1241, over 12532.00 frames. ], tot_loss[loss=0.275, simple_loss=0.318, pruned_loss=0.116, over 2566875.04 frames. 
], batch size: 22, lr: 6.37e-03, grad_scale: 128.0 2024-06-20 08:12:25,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=165711.33333333334, ans=0.125 2024-06-20 08:12:31,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=165729.66666666666, ans=0.2 2024-06-20 08:12:35,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=165748.0, ans=0.09899494936611666 2024-06-20 08:12:36,133 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.939e+02 2.102e+02 2.280e+02 2.892e+02, threshold=4.203e+02, percent-clipped=0.0 2024-06-20 08:12:37,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=165748.0, ans=0.0 2024-06-20 08:12:51,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=165784.66666666666, ans=0.125 2024-06-20 08:12:53,401 INFO [train.py:1028] (1/2) Epoch 9, batch 9500, loss[loss=0.269, simple_loss=0.3181, pruned_loss=0.11, over 13235.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3176, pruned_loss=0.1155, over 2576151.80 frames. ], batch size: 43, lr: 6.37e-03, grad_scale: 128.0 2024-06-20 08:12:58,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=165803.0, ans=0.0 2024-06-20 08:13:04,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=165821.33333333334, ans=0.125 2024-06-20 08:13:07,115 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.68 vs. limit=22.5 2024-06-20 08:13:10,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.99 vs. limit=10.0 2024-06-20 08:13:11,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=165858.0, ans=0.0 2024-06-20 08:13:12,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=165858.0, ans=0.0 2024-06-20 08:13:16,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=165858.0, ans=0.125 2024-06-20 08:13:18,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=165876.33333333334, ans=0.125 2024-06-20 08:13:23,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=165894.66666666666, ans=0.0 2024-06-20 08:13:23,823 INFO [train.py:1028] (1/2) Epoch 9, batch 9550, loss[loss=0.2698, simple_loss=0.3057, pruned_loss=0.117, over 12920.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3174, pruned_loss=0.1156, over 2570868.85 frames. 
], batch size: 39, lr: 6.37e-03, grad_scale: 128.0 2024-06-20 08:13:37,463 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.982e+02 2.166e+02 2.371e+02 3.673e+02, threshold=4.331e+02, percent-clipped=0.0 2024-06-20 08:13:43,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=165949.66666666666, ans=0.125 2024-06-20 08:13:48,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=165968.0, ans=0.0 2024-06-20 08:13:53,065 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2024-06-20 08:13:54,587 INFO [train.py:1028] (1/2) Epoch 9, batch 9600, loss[loss=0.309, simple_loss=0.3332, pruned_loss=0.1424, over 10483.00 frames. ], tot_loss[loss=0.274, simple_loss=0.317, pruned_loss=0.1156, over 2569495.06 frames. ], batch size: 305, lr: 6.37e-03, grad_scale: 64.0 2024-06-20 08:14:02,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.77 vs. limit=6.0 2024-06-20 08:14:10,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=166023.0, ans=0.125 2024-06-20 08:14:10,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=166023.0, ans=0.04949747468305833 2024-06-20 08:14:14,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=166041.33333333334, ans=0.125 2024-06-20 08:14:27,025 INFO [train.py:1028] (1/2) Epoch 9, batch 9650, loss[loss=0.2597, simple_loss=0.3013, pruned_loss=0.109, over 13021.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3175, pruned_loss=0.1163, over 2559602.24 frames. ], batch size: 132, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:14:40,943 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2024-06-20 08:14:41,021 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 2.067e+02 2.177e+02 2.491e+02 3.298e+02, threshold=4.355e+02, percent-clipped=0.0 2024-06-20 08:14:46,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=166133.0, ans=0.0 2024-06-20 08:14:49,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=166133.0, ans=0.0 2024-06-20 08:15:01,147 INFO [train.py:1028] (1/2) Epoch 9, batch 9700, loss[loss=0.2665, simple_loss=0.3017, pruned_loss=0.1157, over 13036.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3163, pruned_loss=0.116, over 2554520.82 frames. ], batch size: 144, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:15:03,805 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.46 vs. 
limit=15.0 2024-06-20 08:15:04,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166169.66666666666, ans=0.1 2024-06-20 08:15:07,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=166188.0, ans=0.1 2024-06-20 08:15:08,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=166188.0, ans=0.0 2024-06-20 08:15:12,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=166188.0, ans=0.1 2024-06-20 08:15:16,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166206.33333333334, ans=0.1 2024-06-20 08:15:17,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=166206.33333333334, ans=0.04949747468305833 2024-06-20 08:15:18,228 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:15:25,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=166243.0, ans=0.0 2024-06-20 08:15:29,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=166243.0, ans=0.0 2024-06-20 08:15:31,407 INFO [train.py:1028] (1/2) Epoch 9, batch 9750, loss[loss=0.2645, simple_loss=0.303, pruned_loss=0.113, over 13133.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3153, pruned_loss=0.1154, over 2552425.30 frames. ], batch size: 132, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:15:35,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=166261.33333333334, ans=0.125 2024-06-20 08:15:35,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=166261.33333333334, ans=0.125 2024-06-20 08:15:45,204 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.940e+02 2.091e+02 2.358e+02 3.881e+02, threshold=4.181e+02, percent-clipped=0.0 2024-06-20 08:15:47,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.40 vs. limit=15.0 2024-06-20 08:15:58,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=166334.66666666666, ans=0.125 2024-06-20 08:15:58,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=166334.66666666666, ans=0.125 2024-06-20 08:16:01,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=166334.66666666666, ans=0.125 2024-06-20 08:16:03,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=166353.0, ans=0.2 2024-06-20 08:16:04,305 INFO [train.py:1028] (1/2) Epoch 9, batch 9800, loss[loss=0.2772, simple_loss=0.3274, pruned_loss=0.1135, over 13200.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3147, pruned_loss=0.1147, over 2544801.70 frames. 
], batch size: 40, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:16:09,598 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.16 vs. limit=22.5 2024-06-20 08:16:12,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.48 vs. limit=15.0 2024-06-20 08:16:20,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=166389.66666666666, ans=0.0 2024-06-20 08:16:21,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=166389.66666666666, ans=0.125 2024-06-20 08:16:22,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=166408.0, ans=0.0 2024-06-20 08:16:26,047 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:16:26,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.98 vs. limit=10.0 2024-06-20 08:16:26,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=166408.0, ans=0.125 2024-06-20 08:16:26,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=166408.0, ans=0.125 2024-06-20 08:16:32,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.71 vs. limit=22.5 2024-06-20 08:16:35,005 INFO [train.py:1028] (1/2) Epoch 9, batch 9850, loss[loss=0.2959, simple_loss=0.3381, pruned_loss=0.1269, over 13044.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3141, pruned_loss=0.1143, over 2539195.18 frames. ], batch size: 102, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:16:35,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.65 vs. limit=15.0 2024-06-20 08:16:36,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166444.66666666666, ans=0.1 2024-06-20 08:16:49,845 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.971e+02 2.232e+02 2.556e+02 3.991e+02, threshold=4.463e+02, percent-clipped=0.0 2024-06-20 08:16:51,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.34 vs. limit=15.0 2024-06-20 08:16:59,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=166499.66666666666, ans=0.035 2024-06-20 08:17:06,350 INFO [train.py:1028] (1/2) Epoch 9, batch 9900, loss[loss=0.2775, simple_loss=0.3154, pruned_loss=0.1198, over 12835.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3137, pruned_loss=0.1147, over 2531398.76 frames. 
], batch size: 39, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:17:09,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=166536.33333333334, ans=0.125 2024-06-20 08:17:17,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=166554.66666666666, ans=0.0 2024-06-20 08:17:27,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=166591.33333333334, ans=0.125 2024-06-20 08:17:38,377 INFO [train.py:1028] (1/2) Epoch 9, batch 9950, loss[loss=0.2856, simple_loss=0.328, pruned_loss=0.1216, over 12483.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3124, pruned_loss=0.1147, over 2524279.07 frames. ], batch size: 29, lr: 6.35e-03, grad_scale: 64.0 2024-06-20 08:17:38,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=166628.0, ans=0.125 2024-06-20 08:17:46,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=166646.33333333334, ans=0.0 2024-06-20 08:17:49,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166646.33333333334, ans=0.1 2024-06-20 08:17:50,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166664.66666666666, ans=0.1 2024-06-20 08:17:52,469 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.911e+02 2.032e+02 2.257e+02 5.473e+02, threshold=4.064e+02, percent-clipped=1.0 2024-06-20 08:17:57,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166683.0, ans=0.1 2024-06-20 08:18:04,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=166701.33333333334, ans=0.125 2024-06-20 08:18:07,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=166701.33333333334, ans=0.07 2024-06-20 08:18:10,412 INFO [train.py:1028] (1/2) Epoch 9, batch 10000, loss[loss=0.2376, simple_loss=0.2993, pruned_loss=0.08796, over 12617.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3126, pruned_loss=0.1151, over 2485483.25 frames. ], batch size: 22, lr: 6.35e-03, grad_scale: 64.0 2024-06-20 08:18:18,926 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.44 vs. limit=22.5 2024-06-20 08:18:29,340 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.16 vs. limit=15.0 2024-06-20 08:18:34,798 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.80 vs. 
limit=22.5 2024-06-20 08:18:35,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=166793.0, ans=0.0 2024-06-20 08:18:40,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=166793.0, ans=0.04949747468305833 2024-06-20 08:18:41,958 INFO [train.py:1028] (1/2) Epoch 9, batch 10050, loss[loss=0.2586, simple_loss=0.3007, pruned_loss=0.1083, over 12815.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3126, pruned_loss=0.1159, over 2442813.12 frames. ], batch size: 22, lr: 6.35e-03, grad_scale: 64.0 2024-06-20 08:18:55,351 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.052e+02 2.246e+02 2.639e+02 5.037e+02, threshold=4.492e+02, percent-clipped=6.0 2024-06-20 08:19:02,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166866.33333333334, ans=0.1 2024-06-20 08:19:09,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=166884.66666666666, ans=0.0 2024-06-20 08:19:12,369 INFO [train.py:1028] (1/2) Epoch 9, batch 10100, loss[loss=0.2356, simple_loss=0.273, pruned_loss=0.09906, over 10732.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.312, pruned_loss=0.1155, over 2423607.97 frames. ], batch size: 16, lr: 6.35e-03, grad_scale: 64.0 2024-06-20 08:19:16,370 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.16 vs. limit=15.0 2024-06-20 08:21:29,056 INFO [train.py:1028] (1/2) Epoch 10, batch 0, loss[loss=0.2405, simple_loss=0.2855, pruned_loss=0.09778, over 12987.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.2855, pruned_loss=0.09778, over 12987.00 frames. ], batch size: 36, lr: 6.04e-03, grad_scale: 64.0 2024-06-20 08:21:29,057 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 08:21:36,997 INFO [train.py:1060] (1/2) Epoch 10, validation: loss=0.2026, simple_loss=0.2664, pruned_loss=0.06938, over 351949.00 frames. 2024-06-20 08:21:36,997 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 08:21:40,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=166934.16666666666, ans=0.0 2024-06-20 08:22:03,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=167007.5, ans=0.0 2024-06-20 08:22:04,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-20 08:22:08,236 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:22:09,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.19 vs. limit=15.0 2024-06-20 08:22:10,860 INFO [train.py:1028] (1/2) Epoch 10, batch 50, loss[loss=0.2545, simple_loss=0.2989, pruned_loss=0.105, over 12561.00 frames. ], tot_loss[loss=0.253, simple_loss=0.2928, pruned_loss=0.1066, over 574936.93 frames. 
], batch size: 29, lr: 6.04e-03, grad_scale: 64.0 2024-06-20 08:22:14,219 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.824e+02 1.970e+02 2.289e+02 3.262e+02, threshold=3.939e+02, percent-clipped=0.0 2024-06-20 08:22:21,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=167044.16666666666, ans=0.0 2024-06-20 08:22:23,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167062.5, ans=0.125 2024-06-20 08:22:30,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=167080.83333333334, ans=0.125 2024-06-20 08:22:31,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=167080.83333333334, ans=0.95 2024-06-20 08:22:38,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=167099.16666666666, ans=0.125 2024-06-20 08:22:45,561 INFO [train.py:1028] (1/2) Epoch 10, batch 100, loss[loss=0.2281, simple_loss=0.2844, pruned_loss=0.08592, over 13378.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.2919, pruned_loss=0.1055, over 1017939.23 frames. ], batch size: 46, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:22:51,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=167135.83333333334, ans=0.025 2024-06-20 08:23:06,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=167172.5, ans=0.05 2024-06-20 08:23:07,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=167172.5, ans=0.2 2024-06-20 08:23:15,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=167190.83333333334, ans=22.5 2024-06-20 08:23:15,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.64 vs. limit=22.5 2024-06-20 08:23:17,289 INFO [train.py:1028] (1/2) Epoch 10, batch 150, loss[loss=0.26, simple_loss=0.3024, pruned_loss=0.1088, over 12635.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2896, pruned_loss=0.1029, over 1365036.11 frames. ], batch size: 29, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:23:20,529 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.834e+02 1.991e+02 2.236e+02 2.869e+02, threshold=3.982e+02, percent-clipped=0.0 2024-06-20 08:23:25,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=167209.16666666666, ans=0.125 2024-06-20 08:23:32,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=167245.83333333334, ans=0.025 2024-06-20 08:23:38,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=167245.83333333334, ans=0.125 2024-06-20 08:23:50,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=167282.5, ans=0.5 2024-06-20 08:23:52,456 INFO [train.py:1028] (1/2) Epoch 10, batch 200, loss[loss=0.2591, simple_loss=0.297, pruned_loss=0.1106, over 12451.00 frames. 
], tot_loss[loss=0.2488, simple_loss=0.2906, pruned_loss=0.1035, over 1634181.55 frames. ], batch size: 202, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:23:56,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=167300.83333333334, ans=10.0 2024-06-20 08:23:58,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=167319.16666666666, ans=0.0 2024-06-20 08:23:59,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.40 vs. limit=15.0 2024-06-20 08:24:02,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.61 vs. limit=15.0 2024-06-20 08:24:03,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=167319.16666666666, ans=0.125 2024-06-20 08:24:06,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=167337.5, ans=0.0 2024-06-20 08:24:15,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=167355.83333333334, ans=0.0 2024-06-20 08:24:16,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=167374.16666666666, ans=0.0 2024-06-20 08:24:20,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=167374.16666666666, ans=0.0 2024-06-20 08:24:21,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=167374.16666666666, ans=0.0 2024-06-20 08:24:23,466 INFO [train.py:1028] (1/2) Epoch 10, batch 250, loss[loss=0.2177, simple_loss=0.2576, pruned_loss=0.08888, over 13007.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.2905, pruned_loss=0.1032, over 1846425.28 frames. ], batch size: 144, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:24:25,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=167392.5, ans=0.0 2024-06-20 08:24:26,796 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.849e+02 1.948e+02 2.083e+02 2.623e+02, threshold=3.896e+02, percent-clipped=0.0 2024-06-20 08:24:27,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.18 vs. 
limit=15.0 2024-06-20 08:24:30,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=167410.83333333334, ans=0.125 2024-06-20 08:24:31,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=167410.83333333334, ans=0.125 2024-06-20 08:24:35,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167410.83333333334, ans=0.125 2024-06-20 08:24:38,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=167429.16666666666, ans=0.0 2024-06-20 08:24:39,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.90 vs. limit=22.5 2024-06-20 08:24:41,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1.whitening_limit, batch_count=167429.16666666666, ans=10.0 2024-06-20 08:24:51,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=167447.5, ans=0.025 2024-06-20 08:24:55,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=167465.83333333334, ans=0.125 2024-06-20 08:25:00,431 INFO [train.py:1028] (1/2) Epoch 10, batch 300, loss[loss=0.2479, simple_loss=0.2829, pruned_loss=0.1065, over 13193.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.2901, pruned_loss=0.103, over 2009620.64 frames. ], batch size: 112, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:25:10,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=167502.5, ans=0.125 2024-06-20 08:25:12,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2024-06-20 08:25:16,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167520.83333333334, ans=0.1 2024-06-20 08:25:20,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=167539.16666666666, ans=0.125 2024-06-20 08:25:35,515 INFO [train.py:1028] (1/2) Epoch 10, batch 350, loss[loss=0.2331, simple_loss=0.2792, pruned_loss=0.0935, over 12875.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.2894, pruned_loss=0.1026, over 2138954.06 frames. ], batch size: 33, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:25:36,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=167575.83333333334, ans=0.125 2024-06-20 08:25:36,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.09 vs. 
limit=15.0 2024-06-20 08:25:38,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=167575.83333333334, ans=0.2 2024-06-20 08:25:38,696 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.826e+02 2.050e+02 2.217e+02 2.972e+02, threshold=4.100e+02, percent-clipped=0.0 2024-06-20 08:25:44,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=167594.16666666666, ans=0.05 2024-06-20 08:25:49,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=167612.5, ans=0.125 2024-06-20 08:25:50,032 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:26:04,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2024-06-20 08:26:06,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=167649.16666666666, ans=0.09899494936611666 2024-06-20 08:26:07,715 INFO [train.py:1028] (1/2) Epoch 10, batch 400, loss[loss=0.2463, simple_loss=0.2859, pruned_loss=0.1034, over 13296.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2892, pruned_loss=0.1022, over 2239347.04 frames. ], batch size: 63, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:26:09,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=167667.5, ans=0.125 2024-06-20 08:26:12,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=167667.5, ans=0.125 2024-06-20 08:26:12,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=167667.5, ans=0.125 2024-06-20 08:26:16,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=167685.83333333334, ans=0.0 2024-06-20 08:26:39,214 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.44 vs. limit=10.0 2024-06-20 08:26:39,457 INFO [train.py:1028] (1/2) Epoch 10, batch 450, loss[loss=0.2165, simple_loss=0.2648, pruned_loss=0.08403, over 13213.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.2894, pruned_loss=0.1022, over 2313394.38 frames. ], batch size: 67, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:26:42,535 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.743e+02 1.889e+02 2.077e+02 2.798e+02, threshold=3.777e+02, percent-clipped=0.0 2024-06-20 08:26:47,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.40 vs. limit=15.0 2024-06-20 08:26:54,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.95 vs. limit=15.0 2024-06-20 08:26:57,390 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.65 vs. 
limit=15.0 2024-06-20 08:27:01,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=167814.16666666666, ans=0.025 2024-06-20 08:27:08,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=167832.5, ans=0.0 2024-06-20 08:27:14,478 INFO [train.py:1028] (1/2) Epoch 10, batch 500, loss[loss=0.2437, simple_loss=0.2838, pruned_loss=0.1018, over 13103.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.2898, pruned_loss=0.1024, over 2375563.70 frames. ], batch size: 121, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:27:15,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=167850.83333333334, ans=0.125 2024-06-20 08:27:16,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=167850.83333333334, ans=0.125 2024-06-20 08:27:21,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=15.0 2024-06-20 08:27:22,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=167850.83333333334, ans=0.5 2024-06-20 08:27:26,928 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.93 vs. limit=6.0 2024-06-20 08:27:32,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=167887.5, ans=0.0 2024-06-20 08:27:35,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=167887.5, ans=0.0 2024-06-20 08:27:36,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=167905.83333333334, ans=0.2 2024-06-20 08:27:40,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=167905.83333333334, ans=0.125 2024-06-20 08:27:45,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=167924.16666666666, ans=0.2 2024-06-20 08:27:49,516 INFO [train.py:1028] (1/2) Epoch 10, batch 550, loss[loss=0.2501, simple_loss=0.2834, pruned_loss=0.1084, over 12900.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2902, pruned_loss=0.1026, over 2420920.13 frames. 
], batch size: 158, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:27:49,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167942.5, ans=0.125 2024-06-20 08:27:52,837 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.858e+02 2.019e+02 2.174e+02 3.635e+02, threshold=4.039e+02, percent-clipped=0.0 2024-06-20 08:27:54,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167942.5, ans=0.1 2024-06-20 08:27:58,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=167960.83333333334, ans=0.125 2024-06-20 08:27:59,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167960.83333333334, ans=0.1 2024-06-20 08:28:14,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=168015.83333333334, ans=0.125 2024-06-20 08:28:20,936 INFO [train.py:1028] (1/2) Epoch 10, batch 600, loss[loss=0.2568, simple_loss=0.2912, pruned_loss=0.1112, over 13020.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.2898, pruned_loss=0.1024, over 2458514.78 frames. ], batch size: 144, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:28:29,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2024-06-20 08:28:29,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=168052.5, ans=0.125 2024-06-20 08:28:30,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.81 vs. limit=15.0 2024-06-20 08:28:32,737 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.87 vs. limit=15.0 2024-06-20 08:28:53,169 INFO [train.py:1028] (1/2) Epoch 10, batch 650, loss[loss=0.2566, simple_loss=0.3083, pruned_loss=0.1025, over 13130.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.2896, pruned_loss=0.1021, over 2490013.13 frames. ], batch size: 59, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:28:53,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168125.83333333334, ans=0.1 2024-06-20 08:28:59,381 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.796e+02 1.891e+02 2.172e+02 2.849e+02, threshold=3.782e+02, percent-clipped=0.0 2024-06-20 08:29:01,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168125.83333333334, ans=0.1 2024-06-20 08:29:02,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=168144.16666666666, ans=0.0 2024-06-20 08:29:20,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=168180.83333333334, ans=0.125 2024-06-20 08:29:24,684 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.96 vs. 
limit=15.0 2024-06-20 08:29:25,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.73 vs. limit=22.5 2024-06-20 08:29:27,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2024-06-20 08:29:28,008 INFO [train.py:1028] (1/2) Epoch 10, batch 700, loss[loss=0.253, simple_loss=0.2938, pruned_loss=0.1061, over 13296.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.2896, pruned_loss=0.1025, over 2512837.00 frames. ], batch size: 46, lr: 6.01e-03, grad_scale: 64.0 2024-06-20 08:29:28,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=168217.5, ans=0.2 2024-06-20 08:29:38,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=168235.83333333334, ans=0.125 2024-06-20 08:29:47,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.14 vs. limit=15.0 2024-06-20 08:29:49,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168254.16666666666, ans=0.1 2024-06-20 08:29:50,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=168254.16666666666, ans=0.125 2024-06-20 08:29:53,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2024-06-20 08:30:04,910 INFO [train.py:1028] (1/2) Epoch 10, batch 750, loss[loss=0.2256, simple_loss=0.2732, pruned_loss=0.08902, over 13258.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.2898, pruned_loss=0.1024, over 2527963.48 frames. ], batch size: 63, lr: 6.01e-03, grad_scale: 64.0 2024-06-20 08:30:08,268 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.803e+02 1.930e+02 2.091e+02 2.746e+02, threshold=3.860e+02, percent-clipped=0.0 2024-06-20 08:30:08,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=168309.16666666666, ans=0.125 2024-06-20 08:30:12,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.44 vs. limit=15.0 2024-06-20 08:30:18,057 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.61 vs. limit=10.0 2024-06-20 08:30:24,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0 2024-06-20 08:30:26,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=168364.16666666666, ans=0.2 2024-06-20 08:30:39,094 INFO [train.py:1028] (1/2) Epoch 10, batch 800, loss[loss=0.2539, simple_loss=0.2925, pruned_loss=0.1077, over 12860.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2893, pruned_loss=0.1022, over 2541008.59 frames. 
], batch size: 36, lr: 6.01e-03, grad_scale: 64.0 2024-06-20 08:30:54,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=168437.5, ans=0.0 2024-06-20 08:31:15,912 INFO [train.py:1028] (1/2) Epoch 10, batch 850, loss[loss=0.2229, simple_loss=0.2711, pruned_loss=0.08729, over 13144.00 frames. ], tot_loss[loss=0.246, simple_loss=0.2887, pruned_loss=0.1016, over 2551418.48 frames. ], batch size: 95, lr: 6.01e-03, grad_scale: 64.0 2024-06-20 08:31:19,064 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.787e+02 1.922e+02 2.052e+02 2.636e+02, threshold=3.844e+02, percent-clipped=0.0 2024-06-20 08:31:23,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=168510.83333333334, ans=0.2 2024-06-20 08:31:26,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=168510.83333333334, ans=0.0 2024-06-20 08:31:27,140 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.13 vs. limit=15.0 2024-06-20 08:31:27,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=8.0 2024-06-20 08:31:29,761 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.56 vs. limit=15.0 2024-06-20 08:31:31,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168529.16666666666, ans=0.1 2024-06-20 08:31:50,920 INFO [train.py:1028] (1/2) Epoch 10, batch 900, loss[loss=0.2467, simple_loss=0.2933, pruned_loss=0.1, over 12886.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.2884, pruned_loss=0.1017, over 2556924.15 frames. ], batch size: 36, lr: 6.01e-03, grad_scale: 64.0 2024-06-20 08:31:53,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=168584.16666666666, ans=0.125 2024-06-20 08:32:05,819 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:32:16,016 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.82 vs. limit=22.5 2024-06-20 08:32:27,225 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.32 vs. limit=15.0 2024-06-20 08:32:28,887 INFO [train.py:1028] (1/2) Epoch 10, batch 950, loss[loss=0.2547, simple_loss=0.3049, pruned_loss=0.1022, over 12915.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.288, pruned_loss=0.1013, over 2560666.90 frames. 
], batch size: 39, lr: 6.01e-03, grad_scale: 64.0 2024-06-20 08:32:32,006 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.875e+02 2.028e+02 2.246e+02 3.316e+02, threshold=4.056e+02, percent-clipped=0.0 2024-06-20 08:32:58,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=168749.16666666666, ans=0.0 2024-06-20 08:32:58,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=168749.16666666666, ans=0.04949747468305833 2024-06-20 08:32:59,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=168749.16666666666, ans=0.125 2024-06-20 08:33:04,225 INFO [train.py:1028] (1/2) Epoch 10, batch 1000, loss[loss=0.267, simple_loss=0.3098, pruned_loss=0.1121, over 13055.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.2888, pruned_loss=0.102, over 2562359.07 frames. ], batch size: 48, lr: 6.00e-03, grad_scale: 64.0 2024-06-20 08:33:09,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=168767.5, ans=0.125 2024-06-20 08:33:12,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=168785.83333333334, ans=0.0 2024-06-20 08:33:14,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=168785.83333333334, ans=0.125 2024-06-20 08:33:25,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=168822.5, ans=0.2 2024-06-20 08:33:28,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=168822.5, ans=0.95 2024-06-20 08:33:30,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=168840.83333333334, ans=0.125 2024-06-20 08:33:37,591 INFO [train.py:1028] (1/2) Epoch 10, batch 1050, loss[loss=0.2438, simple_loss=0.2927, pruned_loss=0.09745, over 13200.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2894, pruned_loss=0.1021, over 2565374.10 frames. ], batch size: 77, lr: 6.00e-03, grad_scale: 64.0 2024-06-20 08:33:41,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=168859.16666666666, ans=0.0 2024-06-20 08:33:43,905 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.775e+02 1.895e+02 2.104e+02 2.890e+02, threshold=3.790e+02, percent-clipped=0.0 2024-06-20 08:33:45,354 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:33:51,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.13 vs. limit=22.5 2024-06-20 08:34:02,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.17 vs. 
limit=15.0 2024-06-20 08:34:05,132 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.561e+01 2024-06-20 08:34:05,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=168932.5, ans=0.1 2024-06-20 08:34:12,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168932.5, ans=0.1 2024-06-20 08:34:13,579 INFO [train.py:1028] (1/2) Epoch 10, batch 1100, loss[loss=0.255, simple_loss=0.3001, pruned_loss=0.105, over 13232.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2902, pruned_loss=0.1026, over 2570042.88 frames. ], batch size: 52, lr: 6.00e-03, grad_scale: 64.0 2024-06-20 08:34:14,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=168950.83333333334, ans=0.0 2024-06-20 08:34:17,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=168950.83333333334, ans=0.125 2024-06-20 08:34:33,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.03 vs. limit=15.0 2024-06-20 08:34:41,278 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.336e+01 2024-06-20 08:34:46,463 INFO [train.py:1028] (1/2) Epoch 10, batch 1150, loss[loss=0.2648, simple_loss=0.3053, pruned_loss=0.1121, over 13279.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.2901, pruned_loss=0.1025, over 2570899.73 frames. ], batch size: 52, lr: 6.00e-03, grad_scale: 64.0 2024-06-20 08:34:49,633 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.813e+02 1.939e+02 2.084e+02 2.757e+02, threshold=3.879e+02, percent-clipped=0.0 2024-06-20 08:34:51,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2024-06-20 08:34:54,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=169060.83333333334, ans=0.125 2024-06-20 08:35:02,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=169079.16666666666, ans=0.0 2024-06-20 08:35:08,158 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.59 vs. limit=15.0 2024-06-20 08:35:15,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=169115.83333333334, ans=0.125 2024-06-20 08:35:21,015 INFO [train.py:1028] (1/2) Epoch 10, batch 1200, loss[loss=0.2318, simple_loss=0.2843, pruned_loss=0.08968, over 13161.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.2901, pruned_loss=0.1025, over 2572908.28 frames. 
], batch size: 77, lr: 6.00e-03, grad_scale: 64.0 2024-06-20 08:35:28,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169152.5, ans=0.1 2024-06-20 08:35:49,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=169207.5, ans=0.0 2024-06-20 08:35:50,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169207.5, ans=0.1 2024-06-20 08:35:55,239 INFO [train.py:1028] (1/2) Epoch 10, batch 1250, loss[loss=0.2323, simple_loss=0.2699, pruned_loss=0.09738, over 13199.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2894, pruned_loss=0.1021, over 2583215.53 frames. ], batch size: 112, lr: 6.00e-03, grad_scale: 64.0 2024-06-20 08:35:58,492 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.812e+02 1.934e+02 2.118e+02 3.061e+02, threshold=3.869e+02, percent-clipped=0.0 2024-06-20 08:36:02,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=169244.16666666666, ans=15.0 2024-06-20 08:36:12,117 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.39 vs. limit=10.0 2024-06-20 08:36:21,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. limit=10.0 2024-06-20 08:36:27,398 INFO [train.py:1028] (1/2) Epoch 10, batch 1300, loss[loss=0.2511, simple_loss=0.2869, pruned_loss=0.1076, over 12793.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2906, pruned_loss=0.103, over 2584096.50 frames. ], batch size: 177, lr: 5.99e-03, grad_scale: 64.0 2024-06-20 08:36:29,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=169317.5, ans=0.0 2024-06-20 08:36:33,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=169335.83333333334, ans=0.0 2024-06-20 08:36:34,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.86 vs. limit=15.0 2024-06-20 08:36:36,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=169335.83333333334, ans=0.1 2024-06-20 08:36:36,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=169335.83333333334, ans=0.125 2024-06-20 08:36:37,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=169335.83333333334, ans=0.0 2024-06-20 08:36:39,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=169354.16666666666, ans=0.125 2024-06-20 08:36:49,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.07 vs. limit=22.5 2024-06-20 08:36:51,882 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.09 vs. 
limit=12.0 2024-06-20 08:36:57,771 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2024-06-20 08:36:59,989 INFO [train.py:1028] (1/2) Epoch 10, batch 1350, loss[loss=0.2695, simple_loss=0.319, pruned_loss=0.11, over 13164.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.291, pruned_loss=0.1031, over 2585399.00 frames. ], batch size: 59, lr: 5.99e-03, grad_scale: 64.0 2024-06-20 08:37:01,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=169409.16666666666, ans=0.125 2024-06-20 08:37:03,106 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.822e+02 2.004e+02 2.196e+02 3.076e+02, threshold=4.008e+02, percent-clipped=0.0 2024-06-20 08:37:07,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=169427.5, ans=0.0 2024-06-20 08:37:08,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=169427.5, ans=22.5 2024-06-20 08:37:11,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169427.5, ans=0.1 2024-06-20 08:37:17,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169445.83333333334, ans=0.1 2024-06-20 08:37:17,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=169445.83333333334, ans=0.0 2024-06-20 08:37:28,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=169464.16666666666, ans=0.125 2024-06-20 08:37:35,746 INFO [train.py:1028] (1/2) Epoch 10, batch 1400, loss[loss=0.2434, simple_loss=0.2914, pruned_loss=0.09772, over 12473.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.2908, pruned_loss=0.103, over 2587130.44 frames. ], batch size: 25, lr: 5.99e-03, grad_scale: 64.0 2024-06-20 08:37:45,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=169519.16666666666, ans=0.0 2024-06-20 08:37:47,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.60 vs. limit=12.0 2024-06-20 08:37:55,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=169537.5, ans=0.125 2024-06-20 08:37:59,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=169555.83333333334, ans=0.125 2024-06-20 08:38:01,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=169555.83333333334, ans=0.0 2024-06-20 08:38:06,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=169574.16666666666, ans=0.125 2024-06-20 08:38:10,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=169592.5, ans=0.125 2024-06-20 08:38:11,367 INFO [train.py:1028] (1/2) Epoch 10, batch 1450, loss[loss=0.2427, simple_loss=0.2782, pruned_loss=0.1037, over 13073.00 frames. 
], tot_loss[loss=0.2473, simple_loss=0.29, pruned_loss=0.1023, over 2586726.99 frames. ], batch size: 121, lr: 5.99e-03, grad_scale: 128.0 2024-06-20 08:38:14,530 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.827e+02 1.984e+02 2.162e+02 3.144e+02, threshold=3.967e+02, percent-clipped=0.0 2024-06-20 08:38:20,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=169610.83333333334, ans=0.0 2024-06-20 08:38:27,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=169629.16666666666, ans=0.125 2024-06-20 08:38:27,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=169629.16666666666, ans=0.125 2024-06-20 08:38:33,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169647.5, ans=0.1 2024-06-20 08:38:38,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169665.83333333334, ans=0.1 2024-06-20 08:38:41,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=169665.83333333334, ans=0.125 2024-06-20 08:38:43,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=169684.16666666666, ans=0.2 2024-06-20 08:38:43,905 INFO [train.py:1028] (1/2) Epoch 10, batch 1500, loss[loss=0.2467, simple_loss=0.2902, pruned_loss=0.1017, over 13223.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.2902, pruned_loss=0.1025, over 2589438.53 frames. ], batch size: 83, lr: 5.99e-03, grad_scale: 128.0 2024-06-20 08:38:48,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=169684.16666666666, ans=0.025 2024-06-20 08:38:49,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=169684.16666666666, ans=0.125 2024-06-20 08:38:51,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169702.5, ans=0.1 2024-06-20 08:38:57,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=169720.83333333334, ans=0.95 2024-06-20 08:38:59,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=169720.83333333334, ans=0.025 2024-06-20 08:39:00,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=169720.83333333334, ans=0.125 2024-06-20 08:39:15,581 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:39:19,231 INFO [train.py:1028] (1/2) Epoch 10, batch 1550, loss[loss=0.2459, simple_loss=0.2776, pruned_loss=0.1071, over 13162.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.29, pruned_loss=0.1026, over 2585222.13 frames. 
], batch size: 103, lr: 5.99e-03, grad_scale: 128.0 2024-06-20 08:39:22,289 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.848e+02 1.985e+02 2.158e+02 2.791e+02, threshold=3.970e+02, percent-clipped=0.0 2024-06-20 08:39:22,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=169775.83333333334, ans=0.125 2024-06-20 08:39:31,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=169812.5, ans=0.0 2024-06-20 08:39:33,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169812.5, ans=0.1 2024-06-20 08:39:40,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=169812.5, ans=0.125 2024-06-20 08:39:41,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=169830.83333333334, ans=0.0 2024-06-20 08:39:46,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=169830.83333333334, ans=0.0 2024-06-20 08:39:46,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=169830.83333333334, ans=0.125 2024-06-20 08:39:54,320 INFO [train.py:1028] (1/2) Epoch 10, batch 1600, loss[loss=0.2321, simple_loss=0.2754, pruned_loss=0.09438, over 13218.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.2904, pruned_loss=0.1027, over 2581047.85 frames. ], batch size: 77, lr: 5.99e-03, grad_scale: 128.0 2024-06-20 08:39:54,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=169867.5, ans=0.0 2024-06-20 08:39:59,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=169867.5, ans=0.0 2024-06-20 08:40:05,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=169885.83333333334, ans=0.025 2024-06-20 08:40:07,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=169904.16666666666, ans=0.125 2024-06-20 08:40:08,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=169904.16666666666, ans=12.0 2024-06-20 08:40:09,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=169904.16666666666, ans=0.2 2024-06-20 08:40:16,569 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.73 vs. limit=22.5 2024-06-20 08:40:17,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.48 vs. 
limit=22.5 2024-06-20 08:40:18,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=169922.5, ans=0.125 2024-06-20 08:40:24,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=169940.83333333334, ans=0.0 2024-06-20 08:40:26,522 INFO [train.py:1028] (1/2) Epoch 10, batch 1650, loss[loss=0.2592, simple_loss=0.2957, pruned_loss=0.1113, over 13124.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2906, pruned_loss=0.103, over 2576999.18 frames. ], batch size: 95, lr: 5.98e-03, grad_scale: 128.0 2024-06-20 08:40:29,821 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.825e+02 1.963e+02 2.252e+02 2.989e+02, threshold=3.926e+02, percent-clipped=0.0 2024-06-20 08:40:35,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=169977.5, ans=0.0 2024-06-20 08:40:40,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=169995.83333333334, ans=0.125 2024-06-20 08:40:41,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=169995.83333333334, ans=0.5 2024-06-20 08:40:44,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169995.83333333334, ans=0.1 2024-06-20 08:40:59,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170050.83333333334, ans=0.1 2024-06-20 08:40:59,493 INFO [train.py:1028] (1/2) Epoch 10, batch 1700, loss[loss=0.213, simple_loss=0.2667, pruned_loss=0.0796, over 12535.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.2901, pruned_loss=0.1025, over 2581175.97 frames. ], batch size: 25, lr: 5.98e-03, grad_scale: 128.0 2024-06-20 08:41:10,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=170069.16666666666, ans=0.0 2024-06-20 08:41:12,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=170069.16666666666, ans=0.025 2024-06-20 08:41:28,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2024-06-20 08:41:34,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=170142.5, ans=0.125 2024-06-20 08:41:34,570 INFO [train.py:1028] (1/2) Epoch 10, batch 1750, loss[loss=0.2285, simple_loss=0.282, pruned_loss=0.08749, over 12271.00 frames. ], tot_loss[loss=0.248, simple_loss=0.2906, pruned_loss=0.1027, over 2581026.55 frames. ], batch size: 22, lr: 5.98e-03, grad_scale: 128.0 2024-06-20 08:41:37,855 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.799e+02 1.879e+02 2.025e+02 2.418e+02, threshold=3.758e+02, percent-clipped=0.0 2024-06-20 08:41:42,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170142.5, ans=0.125 2024-06-20 08:41:45,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.37 vs. 
limit=15.0 2024-06-20 08:41:46,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170160.83333333334, ans=0.125 2024-06-20 08:41:50,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.24 vs. limit=22.5 2024-06-20 08:41:56,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=170197.5, ans=0.125 2024-06-20 08:42:09,638 INFO [train.py:1028] (1/2) Epoch 10, batch 1800, loss[loss=0.2448, simple_loss=0.2871, pruned_loss=0.1013, over 13212.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.2909, pruned_loss=0.103, over 2582693.10 frames. ], batch size: 67, lr: 5.98e-03, grad_scale: 128.0 2024-06-20 08:42:18,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=170252.5, ans=0.0 2024-06-20 08:42:29,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=170289.16666666666, ans=0.2 2024-06-20 08:42:36,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=170307.5, ans=0.125 2024-06-20 08:42:39,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2024-06-20 08:42:42,121 INFO [train.py:1028] (1/2) Epoch 10, batch 1850, loss[loss=0.2309, simple_loss=0.2751, pruned_loss=0.09341, over 13158.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2907, pruned_loss=0.1025, over 2584017.50 frames. ], batch size: 83, lr: 5.98e-03, grad_scale: 128.0 2024-06-20 08:42:45,466 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.844e+02 1.955e+02 2.174e+02 3.212e+02, threshold=3.911e+02, percent-clipped=0.0 2024-06-20 08:42:56,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=170362.5, ans=0.125 2024-06-20 08:42:56,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.59 vs. limit=15.0 2024-06-20 08:43:03,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2024-06-20 08:43:09,690 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2024-06-20 08:43:11,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=170399.16666666666, ans=0.0 2024-06-20 08:43:16,938 INFO [train.py:1028] (1/2) Epoch 10, batch 1900, loss[loss=0.2436, simple_loss=0.285, pruned_loss=0.1011, over 13154.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.2899, pruned_loss=0.1025, over 2586367.47 frames. 
], batch size: 95, lr: 5.98e-03, grad_scale: 128.0 2024-06-20 08:43:18,437 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.240e+01 2024-06-20 08:43:19,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=170417.5, ans=0.125 2024-06-20 08:43:21,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=170417.5, ans=0.0 2024-06-20 08:43:30,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=170454.16666666666, ans=0.125 2024-06-20 08:43:38,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170454.16666666666, ans=0.1 2024-06-20 08:43:40,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=170472.5, ans=0.0 2024-06-20 08:43:51,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=170490.83333333334, ans=0.125 2024-06-20 08:43:52,453 INFO [train.py:1028] (1/2) Epoch 10, batch 1950, loss[loss=0.2339, simple_loss=0.2794, pruned_loss=0.09417, over 13262.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.2894, pruned_loss=0.1026, over 2592029.40 frames. ], batch size: 52, lr: 5.97e-03, grad_scale: 128.0 2024-06-20 08:43:55,779 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.784e+02 1.907e+02 2.063e+02 2.719e+02, threshold=3.814e+02, percent-clipped=0.0 2024-06-20 08:43:56,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.81 vs. limit=10.0 2024-06-20 08:43:57,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=170509.16666666666, ans=0.0 2024-06-20 08:44:00,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.45 vs. limit=22.5 2024-06-20 08:44:01,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=170527.5, ans=0.0 2024-06-20 08:44:03,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=170527.5, ans=0.125 2024-06-20 08:44:08,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=170545.83333333334, ans=0.125 2024-06-20 08:44:23,077 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:44:25,811 INFO [train.py:1028] (1/2) Epoch 10, batch 2000, loss[loss=0.2432, simple_loss=0.2926, pruned_loss=0.09689, over 12678.00 frames. ], tot_loss[loss=0.247, simple_loss=0.2891, pruned_loss=0.1024, over 2587381.16 frames. 
], batch size: 22, lr: 5.97e-03, grad_scale: 128.0 2024-06-20 08:44:25,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=170600.83333333334, ans=0.2 2024-06-20 08:44:30,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170600.83333333334, ans=0.1 2024-06-20 08:44:34,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=170619.16666666666, ans=0.0 2024-06-20 08:44:47,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=170655.83333333334, ans=0.125 2024-06-20 08:44:58,861 INFO [train.py:1028] (1/2) Epoch 10, batch 2050, loss[loss=0.2534, simple_loss=0.2907, pruned_loss=0.1081, over 12639.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.2901, pruned_loss=0.1031, over 2583363.86 frames. ], batch size: 29, lr: 5.97e-03, grad_scale: 128.0 2024-06-20 08:44:58,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=170692.5, ans=0.015 2024-06-20 08:44:59,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.94 vs. limit=10.0 2024-06-20 08:45:02,021 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.832e+02 1.978e+02 2.110e+02 3.070e+02, threshold=3.956e+02, percent-clipped=0.0 2024-06-20 08:45:12,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170710.83333333334, ans=0.1 2024-06-20 08:45:18,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170729.16666666666, ans=0.1 2024-06-20 08:45:22,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=170747.5, ans=0.0 2024-06-20 08:45:29,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=170765.83333333334, ans=0.125 2024-06-20 08:45:30,535 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2024-06-20 08:45:34,142 INFO [train.py:1028] (1/2) Epoch 10, batch 2100, loss[loss=0.251, simple_loss=0.2969, pruned_loss=0.1026, over 13181.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2901, pruned_loss=0.1027, over 2585685.43 frames. ], batch size: 59, lr: 5.97e-03, grad_scale: 128.0 2024-06-20 08:45:38,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=170784.16666666666, ans=0.125 2024-06-20 08:45:38,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.25 vs. 
limit=22.5 2024-06-20 08:45:40,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=170784.16666666666, ans=0.0 2024-06-20 08:45:53,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=170820.83333333334, ans=0.0 2024-06-20 08:45:56,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=170839.16666666666, ans=0.2 2024-06-20 08:45:57,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=170839.16666666666, ans=0.1 2024-06-20 08:46:05,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170857.5, ans=0.1 2024-06-20 08:46:06,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.59 vs. limit=15.0 2024-06-20 08:46:09,767 INFO [train.py:1028] (1/2) Epoch 10, batch 2150, loss[loss=0.2382, simple_loss=0.2885, pruned_loss=0.09388, over 13253.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.2896, pruned_loss=0.1024, over 2588893.05 frames. ], batch size: 52, lr: 5.97e-03, grad_scale: 128.0 2024-06-20 08:46:12,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=170875.83333333334, ans=0.125 2024-06-20 08:46:13,008 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.937e+02 2.163e+02 2.379e+02 3.253e+02, threshold=4.326e+02, percent-clipped=0.0 2024-06-20 08:46:13,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=170875.83333333334, ans=0.125 2024-06-20 08:46:16,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=170894.16666666666, ans=0.125 2024-06-20 08:46:22,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=170894.16666666666, ans=0.2 2024-06-20 08:46:23,531 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2024-06-20 08:46:28,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=170912.5, ans=0.125 2024-06-20 08:46:32,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=170930.83333333334, ans=0.1 2024-06-20 08:46:33,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=170930.83333333334, ans=0.5 2024-06-20 08:46:42,389 INFO [train.py:1028] (1/2) Epoch 10, batch 2200, loss[loss=0.2393, simple_loss=0.2875, pruned_loss=0.09555, over 13224.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.29, pruned_loss=0.1024, over 2589137.78 frames. 
], batch size: 83, lr: 5.97e-03, grad_scale: 128.0 2024-06-20 08:46:53,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=170985.83333333334, ans=0.125 2024-06-20 08:47:00,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=171004.16666666666, ans=0.125 2024-06-20 08:47:00,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=171022.5, ans=0.02 2024-06-20 08:47:06,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=171022.5, ans=0.125 2024-06-20 08:47:07,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=171040.83333333334, ans=0.0 2024-06-20 08:47:07,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171040.83333333334, ans=0.1 2024-06-20 08:47:17,092 INFO [train.py:1028] (1/2) Epoch 10, batch 2250, loss[loss=0.2583, simple_loss=0.2954, pruned_loss=0.1106, over 13243.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.29, pruned_loss=0.1021, over 2588197.58 frames. ], batch size: 63, lr: 5.96e-03, grad_scale: 128.0 2024-06-20 08:47:20,210 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.790e+02 1.929e+02 2.146e+02 3.033e+02, threshold=3.859e+02, percent-clipped=0.0 2024-06-20 08:47:37,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=171114.16666666666, ans=0.125 2024-06-20 08:47:46,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=171132.5, ans=0.05 2024-06-20 08:47:48,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=171132.5, ans=0.125 2024-06-20 08:47:52,143 INFO [train.py:1028] (1/2) Epoch 10, batch 2300, loss[loss=0.2528, simple_loss=0.2989, pruned_loss=0.1034, over 12981.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.2911, pruned_loss=0.103, over 2581301.01 frames. ], batch size: 33, lr: 5.96e-03, grad_scale: 64.0 2024-06-20 08:47:53,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=171150.83333333334, ans=0.025 2024-06-20 08:47:58,499 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:48:05,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=171187.5, ans=0.0 2024-06-20 08:48:06,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.34 vs. 
limit=12.0 2024-06-20 08:48:07,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171187.5, ans=0.1 2024-06-20 08:48:09,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171187.5, ans=0.1 2024-06-20 08:48:12,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171205.83333333334, ans=0.1 2024-06-20 08:48:20,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=171224.16666666666, ans=0.125 2024-06-20 08:48:25,360 INFO [train.py:1028] (1/2) Epoch 10, batch 2350, loss[loss=0.2675, simple_loss=0.312, pruned_loss=0.1115, over 13238.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.2918, pruned_loss=0.1034, over 2584908.96 frames. ], batch size: 67, lr: 5.96e-03, grad_scale: 64.0 2024-06-20 08:48:29,474 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.886e+02 2.084e+02 2.413e+02 3.562e+02, threshold=4.169e+02, percent-clipped=0.0 2024-06-20 08:48:34,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=171260.83333333334, ans=0.125 2024-06-20 08:48:41,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=171279.16666666666, ans=0.0 2024-06-20 08:48:52,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=171315.83333333334, ans=0.125 2024-06-20 08:48:58,565 INFO [train.py:1028] (1/2) Epoch 10, batch 2400, loss[loss=0.2512, simple_loss=0.295, pruned_loss=0.1037, over 13293.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2907, pruned_loss=0.103, over 2587552.69 frames. ], batch size: 46, lr: 5.96e-03, grad_scale: 64.0 2024-06-20 08:49:18,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171370.83333333334, ans=0.1 2024-06-20 08:49:19,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=171370.83333333334, ans=0.125 2024-06-20 08:49:34,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=171407.5, ans=0.125 2024-06-20 08:49:35,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=171425.83333333334, ans=0.125 2024-06-20 08:49:35,894 INFO [train.py:1028] (1/2) Epoch 10, batch 2450, loss[loss=0.2331, simple_loss=0.2749, pruned_loss=0.09561, over 13317.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.2891, pruned_loss=0.1026, over 2584571.41 frames. ], batch size: 63, lr: 5.96e-03, grad_scale: 64.0 2024-06-20 08:49:39,744 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.830e+02 2.043e+02 2.240e+02 3.065e+02, threshold=4.086e+02, percent-clipped=0.0 2024-06-20 08:49:50,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.23 vs. 
limit=22.5 2024-06-20 08:49:56,418 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.51 vs. limit=22.5 2024-06-20 08:49:56,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171480.83333333334, ans=0.1 2024-06-20 08:49:57,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=171480.83333333334, ans=0.125 2024-06-20 08:49:59,502 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.08 vs. limit=10.0 2024-06-20 08:49:59,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.24 vs. limit=22.5 2024-06-20 08:50:02,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171499.16666666666, ans=0.1 2024-06-20 08:50:08,503 INFO [train.py:1028] (1/2) Epoch 10, batch 2500, loss[loss=0.246, simple_loss=0.287, pruned_loss=0.1025, over 13243.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.288, pruned_loss=0.1022, over 2588752.55 frames. ], batch size: 83, lr: 5.96e-03, grad_scale: 64.0 2024-06-20 08:50:17,241 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.64 vs. limit=15.0 2024-06-20 08:50:34,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=171590.83333333334, ans=0.2 2024-06-20 08:50:40,586 INFO [train.py:1028] (1/2) Epoch 10, batch 2550, loss[loss=0.2653, simple_loss=0.3151, pruned_loss=0.1078, over 12455.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.2871, pruned_loss=0.1017, over 2587357.96 frames. ], batch size: 22, lr: 5.95e-03, grad_scale: 64.0 2024-06-20 08:50:41,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=171609.16666666666, ans=0.125 2024-06-20 08:50:42,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=171609.16666666666, ans=0.125 2024-06-20 08:50:44,497 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.810e+02 1.946e+02 2.108e+02 2.911e+02, threshold=3.891e+02, percent-clipped=0.0 2024-06-20 08:50:47,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=171627.5, ans=0.125 2024-06-20 08:50:57,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2024-06-20 08:51:04,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=171664.16666666666, ans=0.0 2024-06-20 08:51:12,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=171682.5, ans=0.2 2024-06-20 08:51:16,545 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.31 vs. 
limit=15.0 2024-06-20 08:51:17,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=171700.83333333334, ans=15.0 2024-06-20 08:51:17,482 INFO [train.py:1028] (1/2) Epoch 10, batch 2600, loss[loss=0.2402, simple_loss=0.2826, pruned_loss=0.0989, over 13292.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2855, pruned_loss=0.1012, over 2587482.83 frames. ], batch size: 52, lr: 5.95e-03, grad_scale: 64.0 2024-06-20 08:51:25,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=171700.83333333334, ans=0.025 2024-06-20 08:51:25,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=171700.83333333334, ans=0.035 2024-06-20 08:51:34,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=171737.5, ans=0.125 2024-06-20 08:51:36,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=171737.5, ans=0.0 2024-06-20 08:51:46,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.85 vs. limit=15.0 2024-06-20 08:51:47,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=171774.16666666666, ans=0.0 2024-06-20 08:51:52,766 INFO [train.py:1028] (1/2) Epoch 10, batch 2650, loss[loss=0.2387, simple_loss=0.2757, pruned_loss=0.1008, over 13016.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2844, pruned_loss=0.1009, over 2587751.66 frames. ], batch size: 144, lr: 5.95e-03, grad_scale: 64.0 2024-06-20 08:51:56,700 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.788e+02 1.930e+02 2.141e+02 3.105e+02, threshold=3.859e+02, percent-clipped=0.0 2024-06-20 08:52:00,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171810.83333333334, ans=0.1 2024-06-20 08:52:00,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=171810.83333333334, ans=0.2 2024-06-20 08:52:04,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=171810.83333333334, ans=0.2 2024-06-20 08:52:21,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0 2024-06-20 08:52:25,079 INFO [train.py:1028] (1/2) Epoch 10, batch 2700, loss[loss=0.2336, simple_loss=0.2673, pruned_loss=0.09992, over 13208.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.283, pruned_loss=0.1005, over 2584124.15 frames. ], batch size: 89, lr: 5.95e-03, grad_scale: 64.0 2024-06-20 08:52:28,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=171884.16666666666, ans=0.125 2024-06-20 08:52:31,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=171902.5, ans=0.125 2024-06-20 08:52:32,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.17 vs. 
limit=15.0 2024-06-20 08:52:50,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=171957.5, ans=0.0 2024-06-20 08:52:55,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171957.5, ans=0.1 2024-06-20 08:52:59,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=171957.5, ans=0.125 2024-06-20 08:53:02,299 INFO [train.py:1028] (1/2) Epoch 10, batch 2750, loss[loss=0.2637, simple_loss=0.3017, pruned_loss=0.1129, over 13247.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2821, pruned_loss=0.09996, over 2580836.44 frames. ], batch size: 43, lr: 5.95e-03, grad_scale: 64.0 2024-06-20 08:53:03,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171975.83333333334, ans=0.1 2024-06-20 08:53:04,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.44 vs. limit=6.0 2024-06-20 08:53:04,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.95 vs. limit=22.5 2024-06-20 08:53:06,277 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.753e+02 1.889e+02 2.145e+02 2.804e+02, threshold=3.779e+02, percent-clipped=0.0 2024-06-20 08:53:07,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=171975.83333333334, ans=0.2 2024-06-20 08:53:08,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=171994.16666666666, ans=0.0 2024-06-20 08:53:23,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=172012.5, ans=0.125 2024-06-20 08:53:31,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.65 vs. limit=15.0 2024-06-20 08:53:37,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=172049.16666666666, ans=0.025 2024-06-20 08:53:38,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=172067.5, ans=0.04949747468305833 2024-06-20 08:53:38,552 INFO [train.py:1028] (1/2) Epoch 10, batch 2800, loss[loss=0.2407, simple_loss=0.2747, pruned_loss=0.1033, over 10882.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.281, pruned_loss=0.09968, over 2578924.15 frames. 
], batch size: 303, lr: 5.95e-03, grad_scale: 64.0 2024-06-20 08:53:41,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172067.5, ans=0.0 2024-06-20 08:53:42,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=172067.5, ans=0.05 2024-06-20 08:53:46,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=172085.83333333334, ans=0.0 2024-06-20 08:53:55,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=172104.16666666666, ans=0.125 2024-06-20 08:53:56,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=172104.16666666666, ans=0.125 2024-06-20 08:53:58,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=172122.5, ans=0.125 2024-06-20 08:54:00,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.32 vs. limit=22.5 2024-06-20 08:54:05,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=172140.83333333334, ans=0.025 2024-06-20 08:54:11,104 INFO [train.py:1028] (1/2) Epoch 10, batch 2850, loss[loss=0.2163, simple_loss=0.266, pruned_loss=0.08336, over 13256.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.2801, pruned_loss=0.09927, over 2576844.94 frames. ], batch size: 49, lr: 5.95e-03, grad_scale: 64.0 2024-06-20 08:54:11,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=172159.16666666666, ans=0.125 2024-06-20 08:54:14,968 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.821e+02 1.960e+02 2.143e+02 3.198e+02, threshold=3.920e+02, percent-clipped=0.0 2024-06-20 08:54:15,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=172159.16666666666, ans=0.2 2024-06-20 08:54:34,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=172214.16666666666, ans=0.2 2024-06-20 08:54:34,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=172214.16666666666, ans=0.125 2024-06-20 08:54:39,298 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.94 vs. limit=15.0 2024-06-20 08:54:45,557 INFO [train.py:1028] (1/2) Epoch 10, batch 2900, loss[loss=0.2262, simple_loss=0.2722, pruned_loss=0.0901, over 13143.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.2779, pruned_loss=0.09847, over 2584514.62 frames. 
], batch size: 55, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:54:45,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=172250.83333333334, ans=0.125 2024-06-20 08:55:03,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=172287.5, ans=0.025 2024-06-20 08:55:09,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=172305.83333333334, ans=0.2 2024-06-20 08:55:12,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.18 vs. limit=15.0 2024-06-20 08:55:22,489 INFO [train.py:1028] (1/2) Epoch 10, batch 2950, loss[loss=0.2152, simple_loss=0.2626, pruned_loss=0.08395, over 13200.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.278, pruned_loss=0.09846, over 2579304.40 frames. ], batch size: 43, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:55:22,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=172342.5, ans=0.07 2024-06-20 08:55:24,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=172342.5, ans=0.0 2024-06-20 08:55:24,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=172342.5, ans=0.125 2024-06-20 08:55:26,595 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.735e+02 1.860e+02 2.020e+02 2.681e+02, threshold=3.720e+02, percent-clipped=0.0 2024-06-20 08:55:26,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=172342.5, ans=0.125 2024-06-20 08:55:26,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=172342.5, ans=0.0 2024-06-20 08:55:27,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=172342.5, ans=0.0 2024-06-20 08:55:29,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=172360.83333333334, ans=0.025 2024-06-20 08:55:55,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=172434.16666666666, ans=0.1 2024-06-20 08:55:56,312 INFO [train.py:1028] (1/2) Epoch 10, batch 3000, loss[loss=0.2566, simple_loss=0.2942, pruned_loss=0.1095, over 13217.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.2767, pruned_loss=0.09773, over 2578017.72 frames. ], batch size: 59, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:55:56,313 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 08:56:03,920 INFO [train.py:1060] (1/2) Epoch 10, validation: loss=0.1983, simple_loss=0.2621, pruned_loss=0.06725, over 351949.00 frames. 
2024-06-20 08:56:03,921 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 08:56:11,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=172452.5, ans=0.0 2024-06-20 08:56:18,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=172470.83333333334, ans=0.0 2024-06-20 08:56:18,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=172470.83333333334, ans=0.125 2024-06-20 08:56:22,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=12.0 2024-06-20 08:56:34,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=172507.5, ans=0.125 2024-06-20 08:56:37,231 INFO [train.py:1028] (1/2) Epoch 10, batch 3050, loss[loss=0.2565, simple_loss=0.2939, pruned_loss=0.1096, over 13271.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.2761, pruned_loss=0.09786, over 2578837.18 frames. ], batch size: 46, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:56:40,988 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.763e+02 1.874e+02 2.072e+02 2.669e+02, threshold=3.747e+02, percent-clipped=0.0 2024-06-20 08:56:46,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.72 vs. limit=5.0 2024-06-20 08:56:50,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=172544.16666666666, ans=0.0 2024-06-20 08:57:14,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=172617.5, ans=0.025 2024-06-20 08:57:15,183 INFO [train.py:1028] (1/2) Epoch 10, batch 3100, loss[loss=0.2128, simple_loss=0.2514, pruned_loss=0.08711, over 13063.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.2749, pruned_loss=0.09718, over 2579854.64 frames. ], batch size: 144, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:57:18,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=172617.5, ans=0.125 2024-06-20 08:57:21,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=172635.83333333334, ans=0.125 2024-06-20 08:57:24,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=172635.83333333334, ans=0.125 2024-06-20 08:57:26,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=172635.83333333334, ans=0.125 2024-06-20 08:57:31,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=172654.16666666666, ans=0.125 2024-06-20 08:57:40,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.64 vs. limit=15.0 2024-06-20 08:57:48,289 INFO [train.py:1028] (1/2) Epoch 10, batch 3150, loss[loss=0.227, simple_loss=0.2601, pruned_loss=0.09695, over 12925.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2736, pruned_loss=0.09678, over 2582360.21 frames. 
], batch size: 158, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:57:52,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.754e+02 1.887e+02 2.149e+02 3.244e+02, threshold=3.775e+02, percent-clipped=0.0 2024-06-20 08:57:53,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=172709.16666666666, ans=0.125 2024-06-20 08:57:54,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=172727.5, ans=0.025 2024-06-20 08:58:13,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=172782.5, ans=0.0 2024-06-20 08:58:19,042 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.97 vs. limit=22.5 2024-06-20 08:58:19,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2024-06-20 08:58:20,493 INFO [train.py:1028] (1/2) Epoch 10, batch 3200, loss[loss=0.2331, simple_loss=0.2759, pruned_loss=0.09518, over 13212.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.273, pruned_loss=0.09641, over 2582260.02 frames. ], batch size: 55, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 08:58:21,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=172800.83333333334, ans=0.0 2024-06-20 08:58:25,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=172800.83333333334, ans=0.125 2024-06-20 08:58:46,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=172855.83333333334, ans=0.04949747468305833 2024-06-20 08:58:46,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172855.83333333334, ans=0.1 2024-06-20 08:58:50,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.15 vs. limit=6.0 2024-06-20 08:58:54,775 INFO [train.py:1028] (1/2) Epoch 10, batch 3250, loss[loss=0.2338, simple_loss=0.2743, pruned_loss=0.09664, over 13287.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.2723, pruned_loss=0.09624, over 2587639.69 frames. ], batch size: 72, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 08:58:58,901 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.724e+02 1.883e+02 2.092e+02 2.896e+02, threshold=3.765e+02, percent-clipped=0.0 2024-06-20 08:59:05,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=172910.83333333334, ans=0.125 2024-06-20 08:59:06,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=172910.83333333334, ans=0.2 2024-06-20 08:59:09,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.26 vs. 
limit=10.0 2024-06-20 08:59:30,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=172965.83333333334, ans=0.025 2024-06-20 08:59:31,272 INFO [train.py:1028] (1/2) Epoch 10, batch 3300, loss[loss=0.255, simple_loss=0.2875, pruned_loss=0.1113, over 12729.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2724, pruned_loss=0.09632, over 2584759.84 frames. ], batch size: 176, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 08:59:33,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172984.16666666666, ans=0.1 2024-06-20 08:59:39,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=173002.5, ans=0.125 2024-06-20 08:59:48,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=173020.83333333334, ans=0.0 2024-06-20 08:59:51,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=173039.16666666666, ans=0.0 2024-06-20 08:59:55,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=173039.16666666666, ans=0.125 2024-06-20 09:00:03,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=173075.83333333334, ans=0.0 2024-06-20 09:00:03,536 INFO [train.py:1028] (1/2) Epoch 10, batch 3350, loss[loss=0.2463, simple_loss=0.2772, pruned_loss=0.1077, over 12944.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2719, pruned_loss=0.09611, over 2579242.94 frames. ], batch size: 158, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 09:00:04,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=173075.83333333334, ans=0.125 2024-06-20 09:00:07,457 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.830e+02 2.006e+02 2.365e+02 3.421e+02, threshold=4.013e+02, percent-clipped=0.0 2024-06-20 09:00:17,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.86 vs. limit=10.0 2024-06-20 09:00:26,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=173130.83333333334, ans=0.0 2024-06-20 09:00:36,855 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.40 vs. limit=15.0 2024-06-20 09:00:40,215 INFO [train.py:1028] (1/2) Epoch 10, batch 3400, loss[loss=0.2411, simple_loss=0.2805, pruned_loss=0.1009, over 12441.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2709, pruned_loss=0.09601, over 2576058.10 frames. ], batch size: 22, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 09:00:43,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173167.5, ans=0.1 2024-06-20 09:00:54,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173185.83333333334, ans=0.1 2024-06-20 09:01:06,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.28 vs. 
limit=15.0 2024-06-20 09:01:13,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=173240.83333333334, ans=0.125 2024-06-20 09:01:16,823 INFO [train.py:1028] (1/2) Epoch 10, batch 3450, loss[loss=0.2454, simple_loss=0.2797, pruned_loss=0.1056, over 12696.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2693, pruned_loss=0.09501, over 2577548.88 frames. ], batch size: 176, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 09:01:16,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=173259.16666666666, ans=0.125 2024-06-20 09:01:20,747 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.755e+02 1.900e+02 2.101e+02 2.856e+02, threshold=3.800e+02, percent-clipped=0.0 2024-06-20 09:01:20,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=173259.16666666666, ans=0.0 2024-06-20 09:01:21,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=173259.16666666666, ans=22.5 2024-06-20 09:01:22,817 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:01:30,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1.whitening_limit, batch_count=173295.83333333334, ans=10.0 2024-06-20 09:01:32,882 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.13 vs. limit=15.0 2024-06-20 09:01:37,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173314.16666666666, ans=0.1 2024-06-20 09:01:39,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.60 vs. limit=22.5 2024-06-20 09:01:50,091 INFO [train.py:1028] (1/2) Epoch 10, batch 3500, loss[loss=0.2174, simple_loss=0.2597, pruned_loss=0.08755, over 12960.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2687, pruned_loss=0.09425, over 2576390.94 frames. ], batch size: 33, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 09:01:52,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=173350.83333333334, ans=0.0 2024-06-20 09:01:52,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=173350.83333333334, ans=0.125 2024-06-20 09:01:52,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173350.83333333334, ans=0.1 2024-06-20 09:01:58,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=173369.16666666666, ans=0.05 2024-06-20 09:02:06,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.80 vs. 
limit=15.0 2024-06-20 09:02:09,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=173387.5, ans=0.0 2024-06-20 09:02:15,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=173405.83333333334, ans=0.025 2024-06-20 09:02:24,180 INFO [train.py:1028] (1/2) Epoch 10, batch 3550, loss[loss=0.2362, simple_loss=0.2706, pruned_loss=0.101, over 13151.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2686, pruned_loss=0.09419, over 2576796.29 frames. ], batch size: 95, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:02:26,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=173442.5, ans=0.2 2024-06-20 09:02:28,090 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.734e+02 1.852e+02 2.000e+02 2.779e+02, threshold=3.705e+02, percent-clipped=0.0 2024-06-20 09:02:41,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=15.0 2024-06-20 09:02:41,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=173479.16666666666, ans=0.0 2024-06-20 09:02:42,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173479.16666666666, ans=0.1 2024-06-20 09:02:55,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=173515.83333333334, ans=0.5 2024-06-20 09:02:56,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=173515.83333333334, ans=0.0 2024-06-20 09:03:03,162 INFO [train.py:1028] (1/2) Epoch 10, batch 3600, loss[loss=0.2239, simple_loss=0.2723, pruned_loss=0.08776, over 13243.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2681, pruned_loss=0.09389, over 2580389.60 frames. ], batch size: 49, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:03:11,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=15.0 2024-06-20 09:03:15,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=173552.5, ans=0.2 2024-06-20 09:03:15,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=173552.5, ans=0.2 2024-06-20 09:03:36,385 INFO [train.py:1028] (1/2) Epoch 10, batch 3650, loss[loss=0.2357, simple_loss=0.2763, pruned_loss=0.09756, over 13186.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2678, pruned_loss=0.09355, over 2578578.24 frames. 
], batch size: 103, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:03:40,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.752e+02 1.874e+02 2.055e+02 2.647e+02, threshold=3.749e+02, percent-clipped=0.0 2024-06-20 09:03:44,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=173644.16666666666, ans=0.025 2024-06-20 09:03:46,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=173644.16666666666, ans=0.1 2024-06-20 09:03:52,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=173662.5, ans=0.125 2024-06-20 09:03:53,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.00 vs. limit=10.0 2024-06-20 09:04:07,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=173699.16666666666, ans=0.0 2024-06-20 09:04:09,468 INFO [train.py:1028] (1/2) Epoch 10, batch 3700, loss[loss=0.2289, simple_loss=0.2676, pruned_loss=0.09508, over 13265.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2677, pruned_loss=0.09343, over 2583274.27 frames. ], batch size: 72, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:04:11,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2024-06-20 09:04:20,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173735.83333333334, ans=0.1 2024-06-20 09:04:24,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=173754.16666666666, ans=0.125 2024-06-20 09:04:26,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.60 vs. limit=15.0 2024-06-20 09:04:45,236 INFO [train.py:1028] (1/2) Epoch 10, batch 3750, loss[loss=0.2611, simple_loss=0.3004, pruned_loss=0.1109, over 12490.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2665, pruned_loss=0.09266, over 2585466.71 frames. 
], batch size: 22, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:04:49,198 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.748e+02 1.869e+02 2.059e+02 2.560e+02, threshold=3.738e+02, percent-clipped=0.0 2024-06-20 09:04:49,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=173809.16666666666, ans=0.5 2024-06-20 09:04:53,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=173827.5, ans=0.125 2024-06-20 09:04:55,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=173827.5, ans=0.125 2024-06-20 09:04:57,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=173845.83333333334, ans=0.125 2024-06-20 09:05:13,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=173864.16666666666, ans=0.125 2024-06-20 09:05:18,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.79 vs. limit=10.0 2024-06-20 09:05:20,878 INFO [train.py:1028] (1/2) Epoch 10, batch 3800, loss[loss=0.2154, simple_loss=0.2523, pruned_loss=0.08922, over 13170.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2668, pruned_loss=0.09287, over 2583216.05 frames. ], batch size: 83, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:05:25,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=173900.83333333334, ans=0.95 2024-06-20 09:05:32,389 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.592e+01 2024-06-20 09:05:39,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=173937.5, ans=0.125 2024-06-20 09:05:53,881 INFO [train.py:1028] (1/2) Epoch 10, batch 3850, loss[loss=0.2073, simple_loss=0.2446, pruned_loss=0.08497, over 13015.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2666, pruned_loss=0.0927, over 2583987.40 frames. 
], batch size: 144, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:05:57,756 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.701e+02 1.841e+02 2.007e+02 2.704e+02, threshold=3.682e+02, percent-clipped=0.0 2024-06-20 09:06:03,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=174010.83333333334, ans=0.125 2024-06-20 09:06:06,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=174029.16666666666, ans=0.05 2024-06-20 09:06:12,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=174047.5, ans=0.0 2024-06-20 09:06:22,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=174065.83333333334, ans=0.0 2024-06-20 09:06:22,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=174065.83333333334, ans=0.2 2024-06-20 09:06:26,256 INFO [train.py:1028] (1/2) Epoch 10, batch 3900, loss[loss=0.2289, simple_loss=0.2705, pruned_loss=0.09364, over 13241.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2667, pruned_loss=0.09303, over 2586917.84 frames. ], batch size: 83, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:06:27,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=174084.16666666666, ans=0.2 2024-06-20 09:06:30,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=174084.16666666666, ans=0.2 2024-06-20 09:06:31,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2024-06-20 09:06:32,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=174102.5, ans=0.125 2024-06-20 09:06:35,542 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.027e+01 2024-06-20 09:06:41,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=174120.83333333334, ans=0.125 2024-06-20 09:06:49,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174139.16666666666, ans=0.1 2024-06-20 09:06:54,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=174139.16666666666, ans=0.125 2024-06-20 09:06:55,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=174157.5, ans=0.0 2024-06-20 09:07:00,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=174157.5, ans=0.2 2024-06-20 09:07:02,618 INFO [train.py:1028] (1/2) Epoch 10, batch 3950, loss[loss=0.2267, simple_loss=0.2589, pruned_loss=0.09728, over 13128.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2653, pruned_loss=0.0921, over 2589312.62 frames. 
], batch size: 132, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:07:06,560 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.805e+02 1.974e+02 2.277e+02 3.132e+02, threshold=3.949e+02, percent-clipped=0.0 2024-06-20 09:07:06,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=174175.83333333334, ans=0.125 2024-06-20 09:07:30,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=174230.83333333334, ans=0.125 2024-06-20 09:07:35,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=174249.16666666666, ans=0.05 2024-06-20 09:07:38,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.72 vs. limit=15.0 2024-06-20 09:07:39,501 INFO [train.py:1028] (1/2) Epoch 10, batch 4000, loss[loss=0.2284, simple_loss=0.2751, pruned_loss=0.09092, over 12870.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2656, pruned_loss=0.0923, over 2583746.40 frames. ], batch size: 39, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:07:40,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=174267.5, ans=0.0 2024-06-20 09:07:43,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=174267.5, ans=0.05 2024-06-20 09:07:47,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=174285.83333333334, ans=0.125 2024-06-20 09:07:50,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.79 vs. limit=22.5 2024-06-20 09:08:05,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=174322.5, ans=0.125 2024-06-20 09:08:12,815 INFO [train.py:1028] (1/2) Epoch 10, batch 4050, loss[loss=0.2353, simple_loss=0.2627, pruned_loss=0.1039, over 10963.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2658, pruned_loss=0.09275, over 2581843.76 frames. 
], batch size: 304, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:08:12,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=174359.16666666666, ans=0.125 2024-06-20 09:08:16,630 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.748e+02 1.860e+02 2.069e+02 2.653e+02, threshold=3.719e+02, percent-clipped=0.0 2024-06-20 09:08:21,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=174377.5, ans=0.0 2024-06-20 09:08:32,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=174414.16666666666, ans=0.0 2024-06-20 09:08:33,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=174414.16666666666, ans=0.125 2024-06-20 09:08:36,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=174414.16666666666, ans=0.125 2024-06-20 09:08:42,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=174432.5, ans=0.025 2024-06-20 09:08:44,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.16 vs. limit=6.0 2024-06-20 09:08:45,142 INFO [train.py:1028] (1/2) Epoch 10, batch 4100, loss[loss=0.2307, simple_loss=0.2592, pruned_loss=0.1011, over 13055.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2653, pruned_loss=0.09269, over 2577841.98 frames. ], batch size: 102, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:08:45,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=174450.83333333334, ans=0.125 2024-06-20 09:08:46,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=174450.83333333334, ans=0.025 2024-06-20 09:08:59,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174469.16666666666, ans=0.1 2024-06-20 09:09:24,794 INFO [train.py:1028] (1/2) Epoch 10, batch 4150, loss[loss=0.2371, simple_loss=0.2701, pruned_loss=0.1021, over 13157.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2648, pruned_loss=0.09228, over 2576551.90 frames. ], batch size: 55, lr: 5.90e-03, grad_scale: 64.0 2024-06-20 09:09:28,832 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.772e+02 1.903e+02 2.039e+02 2.663e+02, threshold=3.806e+02, percent-clipped=0.0 2024-06-20 09:09:35,191 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.39 vs. limit=15.0 2024-06-20 09:09:36,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=174560.83333333334, ans=0.125 2024-06-20 09:09:49,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=174597.5, ans=0.125 2024-06-20 09:09:56,938 INFO [train.py:1028] (1/2) Epoch 10, batch 4200, loss[loss=0.2287, simple_loss=0.2601, pruned_loss=0.09863, over 13050.00 frames. 
], tot_loss[loss=0.2238, simple_loss=0.2639, pruned_loss=0.09186, over 2579231.82 frames. ], batch size: 102, lr: 5.90e-03, grad_scale: 64.0 2024-06-20 09:10:07,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=174652.5, ans=0.125 2024-06-20 09:10:07,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=174652.5, ans=0.125 2024-06-20 09:10:08,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=174652.5, ans=0.125 2024-06-20 09:10:16,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174689.16666666666, ans=0.1 2024-06-20 09:10:24,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2024-06-20 09:10:26,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=15.0 2024-06-20 09:10:29,698 INFO [train.py:1028] (1/2) Epoch 10, batch 4250, loss[loss=0.2299, simple_loss=0.2795, pruned_loss=0.09011, over 13364.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2636, pruned_loss=0.09143, over 2582355.47 frames. ], batch size: 46, lr: 5.90e-03, grad_scale: 64.0 2024-06-20 09:10:33,627 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.735e+02 1.895e+02 2.118e+02 3.777e+02, threshold=3.789e+02, percent-clipped=0.0 2024-06-20 09:10:36,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=174744.16666666666, ans=0.0 2024-06-20 09:10:38,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.68 vs. limit=15.0 2024-06-20 09:10:41,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.81 vs. limit=10.0 2024-06-20 09:10:42,998 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=15.0 2024-06-20 09:10:44,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=174762.5, ans=0.125 2024-06-20 09:10:56,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=174780.83333333334, ans=0.1 2024-06-20 09:10:59,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=174799.16666666666, ans=0.1 2024-06-20 09:11:08,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=174799.16666666666, ans=0.125 2024-06-20 09:11:09,764 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=15.0 2024-06-20 09:11:09,979 INFO [train.py:1028] (1/2) Epoch 10, batch 4300, loss[loss=0.2189, simple_loss=0.2599, pruned_loss=0.08896, over 13210.00 frames. 
], tot_loss[loss=0.2234, simple_loss=0.2636, pruned_loss=0.09164, over 2581973.61 frames. ], batch size: 59, lr: 5.90e-03, grad_scale: 128.0 2024-06-20 09:11:15,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=174817.5, ans=0.125 2024-06-20 09:11:22,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=174854.16666666666, ans=0.1 2024-06-20 09:11:24,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=174854.16666666666, ans=0.125 2024-06-20 09:11:25,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.71 vs. limit=22.5 2024-06-20 09:11:29,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=174872.5, ans=0.035 2024-06-20 09:11:30,592 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.350e+01 2024-06-20 09:11:41,698 INFO [train.py:1028] (1/2) Epoch 10, batch 4350, loss[loss=0.2171, simple_loss=0.2665, pruned_loss=0.08384, over 13230.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2635, pruned_loss=0.09204, over 2586076.75 frames. ], batch size: 59, lr: 5.90e-03, grad_scale: 128.0 2024-06-20 09:11:42,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=174909.16666666666, ans=0.0 2024-06-20 09:11:45,510 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.733e+02 1.892e+02 2.085e+02 2.858e+02, threshold=3.785e+02, percent-clipped=0.0 2024-06-20 09:11:48,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=174927.5, ans=0.125 2024-06-20 09:11:49,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=12.42 vs. limit=12.0 2024-06-20 09:12:02,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=174964.16666666666, ans=0.04949747468305833 2024-06-20 09:12:14,116 INFO [train.py:1028] (1/2) Epoch 10, batch 4400, loss[loss=0.2059, simple_loss=0.251, pruned_loss=0.08039, over 13208.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.263, pruned_loss=0.09169, over 2585763.23 frames. ], batch size: 83, lr: 5.90e-03, grad_scale: 128.0 2024-06-20 09:12:16,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=175000.83333333334, ans=0.0 2024-06-20 09:12:29,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=12.0 2024-06-20 09:12:42,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=175074.16666666666, ans=0.125 2024-06-20 09:12:47,610 INFO [train.py:1028] (1/2) Epoch 10, batch 4450, loss[loss=0.2323, simple_loss=0.273, pruned_loss=0.09577, over 12946.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.263, pruned_loss=0.09164, over 2581139.28 frames. 
], batch size: 33, lr: 5.90e-03, grad_scale: 128.0 2024-06-20 09:12:47,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=175092.5, ans=0.0 2024-06-20 09:12:47,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175092.5, ans=0.1 2024-06-20 09:12:54,734 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.725e+02 1.854e+02 2.027e+02 2.661e+02, threshold=3.709e+02, percent-clipped=0.0 2024-06-20 09:12:59,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=175110.83333333334, ans=0.2 2024-06-20 09:13:00,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=175110.83333333334, ans=0.0 2024-06-20 09:13:02,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=175110.83333333334, ans=0.125 2024-06-20 09:13:12,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=175129.16666666666, ans=0.0 2024-06-20 09:13:27,136 INFO [train.py:1028] (1/2) Epoch 10, batch 4500, loss[loss=0.2242, simple_loss=0.2605, pruned_loss=0.09399, over 13160.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2628, pruned_loss=0.09146, over 2584993.70 frames. ], batch size: 89, lr: 5.89e-03, grad_scale: 128.0 2024-06-20 09:13:35,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=175202.5, ans=0.1 2024-06-20 09:13:38,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=175202.5, ans=0.07 2024-06-20 09:13:45,293 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:13:46,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=175239.16666666666, ans=0.2 2024-06-20 09:13:52,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=175239.16666666666, ans=0.125 2024-06-20 09:13:54,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.55 vs. limit=15.0 2024-06-20 09:13:55,275 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.10 vs. limit=22.5 2024-06-20 09:13:56,949 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2024-06-20 09:14:00,525 INFO [train.py:1028] (1/2) Epoch 10, batch 4550, loss[loss=0.2118, simple_loss=0.2555, pruned_loss=0.08406, over 13285.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.263, pruned_loss=0.09168, over 2588255.28 frames. 
], batch size: 52, lr: 5.89e-03, grad_scale: 128.0 2024-06-20 09:14:01,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=175275.83333333334, ans=0.2 2024-06-20 09:14:03,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=175275.83333333334, ans=0.125 2024-06-20 09:14:04,498 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.742e+02 1.850e+02 2.081e+02 2.683e+02, threshold=3.699e+02, percent-clipped=0.0 2024-06-20 09:14:05,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175275.83333333334, ans=0.1 2024-06-20 09:14:09,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175294.16666666666, ans=0.1 2024-06-20 09:14:15,492 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.59 vs. limit=10.0 2024-06-20 09:14:28,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0 2024-06-20 09:14:30,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=175349.16666666666, ans=0.05 2024-06-20 09:14:31,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=175349.16666666666, ans=0.125 2024-06-20 09:14:33,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.01 vs. limit=10.0 2024-06-20 09:14:33,573 INFO [train.py:1028] (1/2) Epoch 10, batch 4600, loss[loss=0.2484, simple_loss=0.2801, pruned_loss=0.1084, over 12595.00 frames. ], tot_loss[loss=0.223, simple_loss=0.263, pruned_loss=0.09154, over 2584920.72 frames. ], batch size: 202, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:14:48,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=175404.16666666666, ans=0.125 2024-06-20 09:14:48,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.96 vs. limit=22.5 2024-06-20 09:14:48,991 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=12.0 2024-06-20 09:14:51,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.96 vs. limit=10.0 2024-06-20 09:14:53,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=175422.5, ans=0.2 2024-06-20 09:15:01,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=175422.5, ans=0.025 2024-06-20 09:15:13,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.89 vs. limit=15.0 2024-06-20 09:15:14,547 INFO [train.py:1028] (1/2) Epoch 10, batch 4650, loss[loss=0.221, simple_loss=0.2573, pruned_loss=0.09232, over 13106.00 frames. 
], tot_loss[loss=0.2223, simple_loss=0.2621, pruned_loss=0.09129, over 2588489.71 frames. ], batch size: 132, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:15:14,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=175459.16666666666, ans=0.015 2024-06-20 09:15:19,062 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.743e+02 1.905e+02 2.063e+02 3.392e+02, threshold=3.811e+02, percent-clipped=0.0 2024-06-20 09:15:37,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=175514.16666666666, ans=0.0 2024-06-20 09:15:40,318 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.70 vs. limit=12.0 2024-06-20 09:15:47,797 INFO [train.py:1028] (1/2) Epoch 10, batch 4700, loss[loss=0.2186, simple_loss=0.2616, pruned_loss=0.08781, over 12374.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2624, pruned_loss=0.0918, over 2584129.47 frames. ], batch size: 25, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:15:52,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.34 vs. limit=15.0 2024-06-20 09:15:53,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=175550.83333333334, ans=0.0 2024-06-20 09:15:57,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=175569.16666666666, ans=0.125 2024-06-20 09:16:05,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=175587.5, ans=0.125 2024-06-20 09:16:06,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=175605.83333333334, ans=0.125 2024-06-20 09:16:09,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=175605.83333333334, ans=0.95 2024-06-20 09:16:10,290 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.29 vs. limit=15.0 2024-06-20 09:16:20,600 INFO [train.py:1028] (1/2) Epoch 10, batch 4750, loss[loss=0.2621, simple_loss=0.2896, pruned_loss=0.1173, over 12498.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2622, pruned_loss=0.09194, over 2580872.44 frames. ], batch size: 202, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:16:25,509 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.770e+02 1.885e+02 2.068e+02 2.454e+02, threshold=3.770e+02, percent-clipped=0.0 2024-06-20 09:16:39,216 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:16:52,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=175715.83333333334, ans=0.0 2024-06-20 09:16:57,783 INFO [train.py:1028] (1/2) Epoch 10, batch 4800, loss[loss=0.2167, simple_loss=0.2542, pruned_loss=0.08961, over 13263.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2623, pruned_loss=0.09209, over 2577022.07 frames. 
], batch size: 63, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:16:58,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=175734.16666666666, ans=0.2 2024-06-20 09:16:59,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=175734.16666666666, ans=0.0 2024-06-20 09:17:05,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=175734.16666666666, ans=0.0 2024-06-20 09:17:21,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=175789.16666666666, ans=0.0 2024-06-20 09:17:22,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175789.16666666666, ans=0.1 2024-06-20 09:17:25,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=175789.16666666666, ans=0.2 2024-06-20 09:17:28,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=175807.5, ans=0.2 2024-06-20 09:17:34,456 INFO [train.py:1028] (1/2) Epoch 10, batch 4850, loss[loss=0.2339, simple_loss=0.2723, pruned_loss=0.09777, over 13245.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2618, pruned_loss=0.09167, over 2574094.91 frames. ], batch size: 89, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:17:39,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=175825.83333333334, ans=0.125 2024-06-20 09:17:39,381 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.695e+02 1.831e+02 1.973e+02 3.963e+02, threshold=3.661e+02, percent-clipped=1.0 2024-06-20 09:17:43,668 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:17:44,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=175844.16666666666, ans=0.125 2024-06-20 09:17:56,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.59 vs. limit=22.5 2024-06-20 09:17:56,837 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.60 vs. limit=22.5 2024-06-20 09:17:59,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=175880.83333333334, ans=0.0 2024-06-20 09:18:01,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=175899.16666666666, ans=0.125 2024-06-20 09:18:08,518 INFO [train.py:1028] (1/2) Epoch 10, batch 4900, loss[loss=0.2072, simple_loss=0.2507, pruned_loss=0.08181, over 13196.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2623, pruned_loss=0.092, over 2574649.20 frames. 
], batch size: 59, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:18:15,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=175935.83333333334, ans=0.125 2024-06-20 09:18:17,921 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:18:23,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=175954.16666666666, ans=0.2 2024-06-20 09:18:29,237 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:18:29,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=175972.5, ans=0.0 2024-06-20 09:18:29,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=175972.5, ans=0.125 2024-06-20 09:18:31,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=175972.5, ans=0.025 2024-06-20 09:18:31,562 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.59 vs. limit=22.5 2024-06-20 09:18:45,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175990.83333333334, ans=0.1 2024-06-20 09:18:45,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=175990.83333333334, ans=0.0 2024-06-20 09:18:47,000 INFO [train.py:1028] (1/2) Epoch 10, batch 4950, loss[loss=0.2222, simple_loss=0.252, pruned_loss=0.0962, over 10983.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2623, pruned_loss=0.0924, over 2569421.10 frames. ], batch size: 303, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:18:48,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.05 vs. limit=22.5 2024-06-20 09:18:51,511 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.750e+02 1.864e+02 2.118e+02 3.091e+02, threshold=3.728e+02, percent-clipped=0.0 2024-06-20 09:19:09,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=176045.83333333334, ans=0.2 2024-06-20 09:19:14,808 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.41 vs. limit=15.0 2024-06-20 09:19:15,495 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.68 vs. limit=15.0 2024-06-20 09:19:17,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=176064.16666666666, ans=0.0 2024-06-20 09:19:23,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=176082.5, ans=0.125 2024-06-20 09:19:25,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=12.0 2024-06-20 09:19:25,938 INFO [train.py:1028] (1/2) Epoch 10, batch 5000, loss[loss=0.1945, simple_loss=0.2343, pruned_loss=0.07732, over 13137.00 frames. 
], tot_loss[loss=0.2224, simple_loss=0.2616, pruned_loss=0.09161, over 2573098.67 frames. ], batch size: 95, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:19:30,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=176100.83333333334, ans=0.0 2024-06-20 09:19:33,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=176119.16666666666, ans=0.0 2024-06-20 09:19:36,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.66 vs. limit=15.0 2024-06-20 09:19:36,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=176119.16666666666, ans=0.125 2024-06-20 09:19:45,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=176155.83333333334, ans=0.0 2024-06-20 09:19:48,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=176155.83333333334, ans=0.0 2024-06-20 09:19:50,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=176155.83333333334, ans=0.0 2024-06-20 09:19:50,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=176155.83333333334, ans=0.2 2024-06-20 09:19:52,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=176174.16666666666, ans=0.04949747468305833 2024-06-20 09:19:52,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=176174.16666666666, ans=0.0 2024-06-20 09:19:53,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.48 vs. limit=15.0 2024-06-20 09:19:59,267 INFO [train.py:1028] (1/2) Epoch 10, batch 5050, loss[loss=0.2092, simple_loss=0.2514, pruned_loss=0.0835, over 12850.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2612, pruned_loss=0.09106, over 2571318.66 frames. ], batch size: 36, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:20:02,006 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:20:03,808 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=12.0 2024-06-20 09:20:03,891 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.693e+02 1.827e+02 1.983e+02 2.942e+02, threshold=3.653e+02, percent-clipped=0.0 2024-06-20 09:20:04,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=176192.5, ans=0.2 2024-06-20 09:20:09,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=176210.83333333334, ans=0.1 2024-06-20 09:20:13,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=176229.16666666666, ans=0.125 2024-06-20 09:20:16,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.78 vs. 
limit=15.0 2024-06-20 09:20:22,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=176247.5, ans=0.125 2024-06-20 09:20:22,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=176247.5, ans=0.05 2024-06-20 09:20:24,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.39 vs. limit=15.0 2024-06-20 09:20:28,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176265.83333333334, ans=0.1 2024-06-20 09:20:32,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=176284.16666666666, ans=0.09899494936611666 2024-06-20 09:20:32,787 INFO [train.py:1028] (1/2) Epoch 10, batch 5100, loss[loss=0.199, simple_loss=0.2474, pruned_loss=0.07533, over 12900.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2613, pruned_loss=0.09158, over 2567675.70 frames. ], batch size: 39, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:20:33,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.55 vs. limit=15.0 2024-06-20 09:20:34,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=176284.16666666666, ans=0.2 2024-06-20 09:20:43,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=176302.5, ans=0.125 2024-06-20 09:20:58,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=176339.16666666666, ans=0.125 2024-06-20 09:21:12,883 INFO [train.py:1028] (1/2) Epoch 10, batch 5150, loss[loss=0.2108, simple_loss=0.2414, pruned_loss=0.09009, over 13083.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2607, pruned_loss=0.0913, over 2569185.77 frames. ], batch size: 132, lr: 5.87e-03, grad_scale: 64.0 2024-06-20 09:21:17,602 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.752e+02 1.891e+02 2.163e+02 3.075e+02, threshold=3.783e+02, percent-clipped=0.0 2024-06-20 09:21:18,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=176375.83333333334, ans=0.025 2024-06-20 09:21:21,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=176394.16666666666, ans=0.0 2024-06-20 09:21:23,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=176394.16666666666, ans=0.0 2024-06-20 09:21:26,435 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.23 vs. limit=15.0 2024-06-20 09:21:29,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=176412.5, ans=0.2 2024-06-20 09:21:30,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=176412.5, ans=0.0 2024-06-20 09:21:31,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.78 vs. 
limit=22.5 2024-06-20 09:21:33,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=176430.83333333334, ans=0.05 2024-06-20 09:21:37,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=176430.83333333334, ans=0.125 2024-06-20 09:21:45,476 INFO [train.py:1028] (1/2) Epoch 10, batch 5200, loss[loss=0.2322, simple_loss=0.26, pruned_loss=0.1022, over 13189.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2612, pruned_loss=0.09126, over 2573553.28 frames. ], batch size: 95, lr: 5.87e-03, grad_scale: 64.0 2024-06-20 09:21:50,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=176467.5, ans=0.1 2024-06-20 09:21:58,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.59 vs. limit=15.0 2024-06-20 09:22:06,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.39 vs. limit=15.0 2024-06-20 09:22:07,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2024-06-20 09:22:10,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=176522.5, ans=0.025 2024-06-20 09:22:13,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=176540.83333333334, ans=0.125 2024-06-20 09:22:14,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=176540.83333333334, ans=0.0 2024-06-20 09:22:19,025 INFO [train.py:1028] (1/2) Epoch 10, batch 5250, loss[loss=0.2151, simple_loss=0.2592, pruned_loss=0.08553, over 13233.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2616, pruned_loss=0.0915, over 2570240.89 frames. ], batch size: 52, lr: 5.87e-03, grad_scale: 64.0 2024-06-20 09:22:19,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=176559.16666666666, ans=0.05 2024-06-20 09:22:23,408 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.727e+02 1.851e+02 2.005e+02 3.149e+02, threshold=3.701e+02, percent-clipped=0.0 2024-06-20 09:22:28,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=176577.5, ans=0.125 2024-06-20 09:22:32,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.92 vs. limit=15.0 2024-06-20 09:22:34,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.30 vs. 
limit=22.5 2024-06-20 09:22:35,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=176595.83333333334, ans=0.0 2024-06-20 09:22:35,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=176595.83333333334, ans=0.0 2024-06-20 09:22:40,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=176614.16666666666, ans=0.2 2024-06-20 09:22:41,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=176614.16666666666, ans=0.125 2024-06-20 09:22:43,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=176614.16666666666, ans=0.1 2024-06-20 09:22:54,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.49 vs. limit=22.5 2024-06-20 09:22:55,248 INFO [train.py:1028] (1/2) Epoch 10, batch 5300, loss[loss=0.2384, simple_loss=0.2643, pruned_loss=0.1062, over 13051.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2615, pruned_loss=0.0916, over 2567127.89 frames. ], batch size: 144, lr: 5.87e-03, grad_scale: 64.0 2024-06-20 09:22:58,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=176650.83333333334, ans=0.0 2024-06-20 09:23:16,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=176687.5, ans=0.125 2024-06-20 09:23:20,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176705.83333333334, ans=0.1 2024-06-20 09:23:20,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=176705.83333333334, ans=0.0 2024-06-20 09:23:26,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=176724.16666666666, ans=0.125 2024-06-20 09:23:32,653 INFO [train.py:1028] (1/2) Epoch 10, batch 5350, loss[loss=0.2487, simple_loss=0.2928, pruned_loss=0.1023, over 11090.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2603, pruned_loss=0.09096, over 2572893.25 frames. 
], batch size: 16, lr: 5.87e-03, grad_scale: 64.0 2024-06-20 09:23:37,501 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.712e+02 1.829e+02 1.978e+02 3.041e+02, threshold=3.658e+02, percent-clipped=0.0 2024-06-20 09:23:43,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=176760.83333333334, ans=0.125 2024-06-20 09:23:44,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176760.83333333334, ans=0.1 2024-06-20 09:23:44,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=176760.83333333334, ans=0.0 2024-06-20 09:23:45,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=176779.16666666666, ans=0.0 2024-06-20 09:23:46,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=176779.16666666666, ans=0.0 2024-06-20 09:23:49,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=176779.16666666666, ans=0.125 2024-06-20 09:23:50,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=176779.16666666666, ans=0.09899494936611666 2024-06-20 09:24:05,183 INFO [train.py:1028] (1/2) Epoch 10, batch 5400, loss[loss=0.2409, simple_loss=0.2688, pruned_loss=0.1065, over 12228.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2605, pruned_loss=0.09127, over 2566689.20 frames. ], batch size: 240, lr: 5.87e-03, grad_scale: 64.0 2024-06-20 09:24:06,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=176834.16666666666, ans=0.1 2024-06-20 09:24:07,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.82 vs. limit=10.0 2024-06-20 09:24:08,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=176834.16666666666, ans=0.125 2024-06-20 09:24:15,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176852.5, ans=0.1 2024-06-20 09:24:16,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=176852.5, ans=0.125 2024-06-20 09:24:32,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=176907.5, ans=0.035 2024-06-20 09:24:36,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=176907.5, ans=0.0 2024-06-20 09:24:38,132 INFO [train.py:1028] (1/2) Epoch 10, batch 5450, loss[loss=0.2144, simple_loss=0.2646, pruned_loss=0.0821, over 12923.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2601, pruned_loss=0.09083, over 2571102.17 frames. ], batch size: 26, lr: 5.87e-03, grad_scale: 64.0 2024-06-20 09:24:44,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=12.92 vs. 
limit=15.0 2024-06-20 09:24:46,176 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.752e+02 1.878e+02 2.110e+02 3.255e+02, threshold=3.755e+02, percent-clipped=0.0 2024-06-20 09:25:00,083 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:25:01,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=176962.5, ans=10.0 2024-06-20 09:25:08,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=176980.83333333334, ans=0.125 2024-06-20 09:25:13,079 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.34 vs. limit=22.5 2024-06-20 09:25:16,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.01 vs. limit=15.0 2024-06-20 09:25:18,122 INFO [train.py:1028] (1/2) Epoch 10, batch 5500, loss[loss=0.2616, simple_loss=0.285, pruned_loss=0.1191, over 12210.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2605, pruned_loss=0.09085, over 2563886.02 frames. ], batch size: 241, lr: 5.86e-03, grad_scale: 64.0 2024-06-20 09:25:21,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=177017.5, ans=0.025 2024-06-20 09:25:23,055 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.89 vs. limit=22.5 2024-06-20 09:25:25,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.05 vs. limit=22.5 2024-06-20 09:25:30,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=177054.16666666666, ans=0.025 2024-06-20 09:25:51,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=177109.16666666666, ans=0.125 2024-06-20 09:25:51,593 INFO [train.py:1028] (1/2) Epoch 10, batch 5550, loss[loss=0.2266, simple_loss=0.2687, pruned_loss=0.09227, over 13230.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2601, pruned_loss=0.09055, over 2567157.98 frames. ], batch size: 43, lr: 5.86e-03, grad_scale: 64.0 2024-06-20 09:25:54,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177109.16666666666, ans=0.1 2024-06-20 09:25:56,205 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.758e+02 1.936e+02 2.203e+02 3.256e+02, threshold=3.872e+02, percent-clipped=0.0 2024-06-20 09:26:04,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=177145.83333333334, ans=0.2 2024-06-20 09:26:16,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=177164.16666666666, ans=0.025 2024-06-20 09:26:21,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.35 vs. 
limit=10.0 2024-06-20 09:26:24,306 INFO [train.py:1028] (1/2) Epoch 10, batch 5600, loss[loss=0.2133, simple_loss=0.2547, pruned_loss=0.08594, over 13280.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2596, pruned_loss=0.09048, over 2569662.01 frames. ], batch size: 89, lr: 5.86e-03, grad_scale: 64.0 2024-06-20 09:26:26,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177200.83333333334, ans=0.1 2024-06-20 09:26:28,069 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.89 vs. limit=6.0 2024-06-20 09:26:36,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=177219.16666666666, ans=0.0 2024-06-20 09:26:37,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=177237.5, ans=0.2 2024-06-20 09:26:38,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=177237.5, ans=0.2 2024-06-20 09:26:46,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=177237.5, ans=0.125 2024-06-20 09:26:50,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.72 vs. limit=15.0 2024-06-20 09:26:51,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=177255.83333333334, ans=0.2 2024-06-20 09:26:52,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=177255.83333333334, ans=0.0 2024-06-20 09:27:00,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177274.16666666666, ans=0.1 2024-06-20 09:27:05,759 INFO [train.py:1028] (1/2) Epoch 10, batch 5650, loss[loss=0.2517, simple_loss=0.276, pruned_loss=0.1137, over 12611.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2599, pruned_loss=0.09043, over 2574757.63 frames. ], batch size: 202, lr: 5.86e-03, grad_scale: 64.0 2024-06-20 09:27:06,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=177292.5, ans=0.04949747468305833 2024-06-20 09:27:10,393 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.728e+02 1.873e+02 2.122e+02 3.234e+02, threshold=3.746e+02, percent-clipped=0.0 2024-06-20 09:27:11,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=177292.5, ans=0.0 2024-06-20 09:27:15,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=177310.83333333334, ans=0.2 2024-06-20 09:27:16,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177310.83333333334, ans=0.1 2024-06-20 09:27:24,778 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.37 vs. 
limit=15.0 2024-06-20 09:27:26,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.53 vs. limit=15.0 2024-06-20 09:27:31,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=177347.5, ans=0.0 2024-06-20 09:27:35,281 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:27:36,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177365.83333333334, ans=0.125 2024-06-20 09:27:39,099 INFO [train.py:1028] (1/2) Epoch 10, batch 5700, loss[loss=0.2151, simple_loss=0.2668, pruned_loss=0.08168, over 13258.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2598, pruned_loss=0.0904, over 2578600.63 frames. ], batch size: 63, lr: 5.86e-03, grad_scale: 64.0 2024-06-20 09:27:49,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=177402.5, ans=0.0 2024-06-20 09:27:51,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.54 vs. limit=15.0 2024-06-20 09:27:56,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.38 vs. limit=22.5 2024-06-20 09:28:00,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177439.16666666666, ans=0.1 2024-06-20 09:28:04,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=177439.16666666666, ans=0.0 2024-06-20 09:28:11,560 INFO [train.py:1028] (1/2) Epoch 10, batch 5750, loss[loss=0.2368, simple_loss=0.2701, pruned_loss=0.1017, over 12783.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2608, pruned_loss=0.09079, over 2579065.71 frames. ], batch size: 177, lr: 5.86e-03, grad_scale: 64.0 2024-06-20 09:28:13,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=177475.83333333334, ans=0.0 2024-06-20 09:28:13,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=177475.83333333334, ans=0.0 2024-06-20 09:28:16,158 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.742e+02 1.909e+02 2.080e+02 3.192e+02, threshold=3.818e+02, percent-clipped=0.0 2024-06-20 09:28:18,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=177494.16666666666, ans=0.125 2024-06-20 09:28:28,414 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:28:30,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=177530.83333333334, ans=0.2 2024-06-20 09:28:32,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=177530.83333333334, ans=0.0 2024-06-20 09:28:39,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. 
limit=6.0 2024-06-20 09:28:43,523 INFO [train.py:1028] (1/2) Epoch 10, batch 5800, loss[loss=0.2303, simple_loss=0.2641, pruned_loss=0.09824, over 12738.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2626, pruned_loss=0.09196, over 2579031.80 frames. ], batch size: 176, lr: 5.85e-03, grad_scale: 64.0 2024-06-20 09:28:52,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=177567.5, ans=0.0 2024-06-20 09:28:57,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=177585.83333333334, ans=0.125 2024-06-20 09:29:07,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=177604.16666666666, ans=0.1 2024-06-20 09:29:10,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=177622.5, ans=0.125 2024-06-20 09:29:16,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=177640.83333333334, ans=0.0 2024-06-20 09:29:23,115 INFO [train.py:1028] (1/2) Epoch 10, batch 5850, loss[loss=0.2548, simple_loss=0.2856, pruned_loss=0.1121, over 12561.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2647, pruned_loss=0.09314, over 2577228.44 frames. ], batch size: 202, lr: 5.85e-03, grad_scale: 64.0 2024-06-20 09:29:23,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=177659.16666666666, ans=0.0 2024-06-20 09:29:23,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=177659.16666666666, ans=0.1 2024-06-20 09:29:27,558 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.832e+02 1.973e+02 2.190e+02 2.931e+02, threshold=3.947e+02, percent-clipped=0.0 2024-06-20 09:29:29,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=177677.5, ans=0.0 2024-06-20 09:29:29,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=177677.5, ans=0.0 2024-06-20 09:29:54,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=177732.5, ans=0.125 2024-06-20 09:29:55,874 INFO [train.py:1028] (1/2) Epoch 10, batch 5900, loss[loss=0.2031, simple_loss=0.2482, pruned_loss=0.079, over 13115.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2665, pruned_loss=0.09369, over 2577304.30 frames. ], batch size: 121, lr: 5.85e-03, grad_scale: 64.0 2024-06-20 09:29:59,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.16 vs. limit=10.0 2024-06-20 09:30:01,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.07 vs. limit=22.5 2024-06-20 09:30:06,348 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=14.49 vs. 
limit=15.0 2024-06-20 09:30:17,612 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:30:18,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=177805.83333333334, ans=0.125 2024-06-20 09:30:30,564 INFO [train.py:1028] (1/2) Epoch 10, batch 5950, loss[loss=0.2135, simple_loss=0.2539, pruned_loss=0.08656, over 13096.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2682, pruned_loss=0.09459, over 2581887.92 frames. ], batch size: 121, lr: 5.85e-03, grad_scale: 64.0 2024-06-20 09:30:34,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=177842.5, ans=0.125 2024-06-20 09:30:35,383 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.814e+02 1.976e+02 2.492e+02 3.976e+02, threshold=3.953e+02, percent-clipped=1.0 2024-06-20 09:30:35,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.98 vs. limit=6.0 2024-06-20 09:30:43,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=12.0 2024-06-20 09:30:44,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177879.16666666666, ans=0.125 2024-06-20 09:30:45,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=177879.16666666666, ans=0.125 2024-06-20 09:31:12,749 INFO [train.py:1028] (1/2) Epoch 10, batch 6000, loss[loss=0.2773, simple_loss=0.3046, pruned_loss=0.125, over 12220.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2693, pruned_loss=0.09502, over 2573715.83 frames. ], batch size: 241, lr: 5.85e-03, grad_scale: 64.0 2024-06-20 09:31:12,750 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 09:31:20,484 INFO [train.py:1060] (1/2) Epoch 10, validation: loss=0.1984, simple_loss=0.262, pruned_loss=0.06739, over 351949.00 frames. 2024-06-20 09:31:20,485 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 09:31:21,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=177934.16666666666, ans=0.125 2024-06-20 09:31:40,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=177989.16666666666, ans=0.125 2024-06-20 09:31:47,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=178007.5, ans=0.125 2024-06-20 09:31:50,503 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.052e+01 2024-06-20 09:31:54,231 INFO [train.py:1028] (1/2) Epoch 10, batch 6050, loss[loss=0.2353, simple_loss=0.2817, pruned_loss=0.09447, over 12944.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2711, pruned_loss=0.09568, over 2576998.68 frames. 
], batch size: 39, lr: 5.85e-03, grad_scale: 64.0 2024-06-20 09:31:59,093 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.821e+02 2.033e+02 2.288e+02 2.953e+02, threshold=4.066e+02, percent-clipped=0.0 2024-06-20 09:32:03,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=178044.16666666666, ans=0.2 2024-06-20 09:32:06,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2024-06-20 09:32:11,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=178062.5, ans=0.0 2024-06-20 09:32:12,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=178062.5, ans=0.09899494936611666 2024-06-20 09:32:12,633 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.90 vs. limit=15.0 2024-06-20 09:32:28,541 INFO [train.py:1028] (1/2) Epoch 10, batch 6100, loss[loss=0.2424, simple_loss=0.2766, pruned_loss=0.1042, over 13113.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.2727, pruned_loss=0.09619, over 2579104.64 frames. ], batch size: 121, lr: 5.85e-03, grad_scale: 64.0 2024-06-20 09:32:30,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=178117.5, ans=0.2 2024-06-20 09:32:35,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=14.14 vs. limit=15.0 2024-06-20 09:32:41,759 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.906e+01 2024-06-20 09:32:43,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=178154.16666666666, ans=0.2 2024-06-20 09:32:46,473 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:32:51,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=178172.5, ans=0.125 2024-06-20 09:32:52,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=178172.5, ans=0.1 2024-06-20 09:33:09,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=178209.16666666666, ans=10.0 2024-06-20 09:33:10,075 INFO [train.py:1028] (1/2) Epoch 10, batch 6150, loss[loss=0.2462, simple_loss=0.2812, pruned_loss=0.1056, over 10880.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.2745, pruned_loss=0.09713, over 2577984.57 frames. 
], batch size: 304, lr: 5.84e-03, grad_scale: 64.0 2024-06-20 09:33:14,883 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.823e+02 1.982e+02 2.110e+02 2.998e+02, threshold=3.964e+02, percent-clipped=0.0 2024-06-20 09:33:15,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=178209.16666666666, ans=0.1 2024-06-20 09:33:20,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.01 vs. limit=10.0 2024-06-20 09:33:37,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.55 vs. limit=15.0 2024-06-20 09:33:39,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=178282.5, ans=0.2 2024-06-20 09:33:43,798 INFO [train.py:1028] (1/2) Epoch 10, batch 6200, loss[loss=0.252, simple_loss=0.2987, pruned_loss=0.1027, over 13242.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2762, pruned_loss=0.09804, over 2575974.95 frames. ], batch size: 89, lr: 5.84e-03, grad_scale: 64.0 2024-06-20 09:33:50,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.70 vs. limit=6.0 2024-06-20 09:34:07,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178355.83333333334, ans=0.1 2024-06-20 09:34:09,493 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.40 vs. limit=22.5 2024-06-20 09:34:17,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.75 vs. limit=12.0 2024-06-20 09:34:17,449 INFO [train.py:1028] (1/2) Epoch 10, batch 6250, loss[loss=0.2389, simple_loss=0.2804, pruned_loss=0.09866, over 13189.00 frames. ], tot_loss[loss=0.237, simple_loss=0.2773, pruned_loss=0.09834, over 2567709.89 frames. ], batch size: 83, lr: 5.84e-03, grad_scale: 64.0 2024-06-20 09:34:19,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=178392.5, ans=0.0 2024-06-20 09:34:20,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=178392.5, ans=0.0 2024-06-20 09:34:22,144 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.816e+02 1.936e+02 2.193e+02 3.296e+02, threshold=3.873e+02, percent-clipped=0.0 2024-06-20 09:34:22,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=178392.5, ans=0.07 2024-06-20 09:34:24,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=178410.83333333334, ans=0.125 2024-06-20 09:34:27,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.60 vs. limit=15.0 2024-06-20 09:34:28,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.26 vs. 
limit=15.0 2024-06-20 09:34:36,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=178447.5, ans=0.125 2024-06-20 09:34:44,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178465.83333333334, ans=0.1 2024-06-20 09:34:48,297 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2024-06-20 09:34:49,717 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.30 vs. limit=15.0 2024-06-20 09:34:50,005 INFO [train.py:1028] (1/2) Epoch 10, batch 6300, loss[loss=0.2165, simple_loss=0.2614, pruned_loss=0.0858, over 11851.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2783, pruned_loss=0.09864, over 2563134.27 frames. ], batch size: 17, lr: 5.84e-03, grad_scale: 64.0 2024-06-20 09:34:51,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=178484.16666666666, ans=0.125 2024-06-20 09:35:07,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=178502.5, ans=0.0 2024-06-20 09:35:08,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=178502.5, ans=0.0 2024-06-20 09:35:09,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=178502.5, ans=0.125 2024-06-20 09:35:14,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=178520.83333333334, ans=0.0 2024-06-20 09:35:22,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=178539.16666666666, ans=0.0 2024-06-20 09:35:25,652 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.47 vs. limit=15.0 2024-06-20 09:35:30,159 INFO [train.py:1028] (1/2) Epoch 10, batch 6350, loss[loss=0.2639, simple_loss=0.2975, pruned_loss=0.1151, over 12481.00 frames. ], tot_loss[loss=0.239, simple_loss=0.28, pruned_loss=0.09906, over 2573119.50 frames. ], batch size: 202, lr: 5.84e-03, grad_scale: 64.0 2024-06-20 09:35:30,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=178575.83333333334, ans=0.2 2024-06-20 09:35:35,032 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.850e+02 2.070e+02 2.243e+02 2.833e+02, threshold=4.139e+02, percent-clipped=0.0 2024-06-20 09:35:47,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=178612.5, ans=10.0 2024-06-20 09:35:51,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2024-06-20 09:36:03,152 INFO [train.py:1028] (1/2) Epoch 10, batch 6400, loss[loss=0.2283, simple_loss=0.2742, pruned_loss=0.0912, over 13309.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.2822, pruned_loss=0.1002, over 2574424.07 frames. 
], batch size: 67, lr: 5.84e-03, grad_scale: 64.0 2024-06-20 09:36:14,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=178685.83333333334, ans=0.0 2024-06-20 09:36:17,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=178704.16666666666, ans=0.125 2024-06-20 09:36:19,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=178704.16666666666, ans=0.0 2024-06-20 09:36:20,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=178704.16666666666, ans=0.125 2024-06-20 09:36:26,303 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.39 vs. limit=10.0 2024-06-20 09:36:35,540 INFO [train.py:1028] (1/2) Epoch 10, batch 6450, loss[loss=0.2802, simple_loss=0.3186, pruned_loss=0.1209, over 12483.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.2839, pruned_loss=0.101, over 2580492.75 frames. ], batch size: 202, lr: 5.84e-03, grad_scale: 64.0 2024-06-20 09:36:39,555 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.88 vs. limit=22.5 2024-06-20 09:36:40,412 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.860e+02 2.039e+02 2.300e+02 3.323e+02, threshold=4.077e+02, percent-clipped=0.0 2024-06-20 09:36:45,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=178777.5, ans=0.125 2024-06-20 09:36:46,372 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:36:48,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=178795.83333333334, ans=0.125 2024-06-20 09:36:50,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=178795.83333333334, ans=0.025 2024-06-20 09:36:52,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=15.0 2024-06-20 09:36:52,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=178795.83333333334, ans=0.0 2024-06-20 09:37:07,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=178832.5, ans=0.0 2024-06-20 09:37:15,199 INFO [train.py:1028] (1/2) Epoch 10, batch 6500, loss[loss=0.2783, simple_loss=0.3014, pruned_loss=0.1276, over 11081.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.2866, pruned_loss=0.1021, over 2584205.96 frames. ], batch size: 304, lr: 5.83e-03, grad_scale: 64.0 2024-06-20 09:37:16,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=178850.83333333334, ans=0.025 2024-06-20 09:37:21,069 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.69 vs. 
limit=22.5 2024-06-20 09:37:21,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=178869.16666666666, ans=0.025 2024-06-20 09:37:36,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=178905.83333333334, ans=0.015 2024-06-20 09:37:46,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=178924.16666666666, ans=0.0 2024-06-20 09:37:48,013 INFO [train.py:1028] (1/2) Epoch 10, batch 6550, loss[loss=0.2194, simple_loss=0.2673, pruned_loss=0.0857, over 12676.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.2867, pruned_loss=0.1018, over 2588209.08 frames. ], batch size: 22, lr: 5.83e-03, grad_scale: 64.0 2024-06-20 09:37:48,172 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:37:52,479 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.821e+02 2.005e+02 2.181e+02 3.127e+02, threshold=4.010e+02, percent-clipped=0.0 2024-06-20 09:38:04,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=178979.16666666666, ans=0.035 2024-06-20 09:38:09,496 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.030e+01 2024-06-20 09:38:11,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=178997.5, ans=0.125 2024-06-20 09:38:19,115 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.42 vs. limit=15.0 2024-06-20 09:38:20,653 INFO [train.py:1028] (1/2) Epoch 10, batch 6600, loss[loss=0.2499, simple_loss=0.2967, pruned_loss=0.1016, over 13308.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.2874, pruned_loss=0.1021, over 2590307.80 frames. ], batch size: 72, lr: 5.83e-03, grad_scale: 128.0 2024-06-20 09:38:24,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179034.16666666666, ans=0.1 2024-06-20 09:38:29,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=179052.5, ans=0.2 2024-06-20 09:38:29,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=179052.5, ans=0.1 2024-06-20 09:38:32,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=179052.5, ans=0.125 2024-06-20 09:38:38,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=179070.83333333334, ans=0.125 2024-06-20 09:38:41,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=179089.16666666666, ans=0.125 2024-06-20 09:38:42,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=179089.16666666666, ans=0.0 2024-06-20 09:38:54,005 INFO [train.py:1028] (1/2) Epoch 10, batch 6650, loss[loss=0.2852, simple_loss=0.319, pruned_loss=0.1257, over 12895.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.2894, pruned_loss=0.1028, over 2583744.26 frames. 
], batch size: 158, lr: 5.83e-03, grad_scale: 128.0 2024-06-20 09:38:58,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=179125.83333333334, ans=0.125 2024-06-20 09:38:58,805 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.901e+02 2.122e+02 2.447e+02 4.029e+02, threshold=4.243e+02, percent-clipped=1.0 2024-06-20 09:39:03,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.45 vs. limit=22.5 2024-06-20 09:39:05,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.40 vs. limit=15.0 2024-06-20 09:39:07,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.21 vs. limit=22.5 2024-06-20 09:39:19,292 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.62 vs. limit=22.5 2024-06-20 09:39:20,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.92 vs. limit=15.0 2024-06-20 09:39:27,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=179199.16666666666, ans=0.0 2024-06-20 09:39:27,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=179199.16666666666, ans=0.2 2024-06-20 09:39:34,173 INFO [train.py:1028] (1/2) Epoch 10, batch 6700, loss[loss=0.2601, simple_loss=0.3061, pruned_loss=0.107, over 12751.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.2912, pruned_loss=0.1038, over 2584407.94 frames. ], batch size: 176, lr: 5.83e-03, grad_scale: 128.0 2024-06-20 09:39:49,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=179254.16666666666, ans=0.1 2024-06-20 09:40:00,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=179290.83333333334, ans=0.125 2024-06-20 09:40:01,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.71 vs. limit=12.0 2024-06-20 09:40:01,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.63 vs. limit=10.0 2024-06-20 09:40:07,509 INFO [train.py:1028] (1/2) Epoch 10, batch 6750, loss[loss=0.3191, simple_loss=0.3459, pruned_loss=0.1462, over 12217.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.2918, pruned_loss=0.104, over 2579049.89 frames. 
], batch size: 241, lr: 5.83e-03, grad_scale: 128.0 2024-06-20 09:40:11,803 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.807e+02 1.958e+02 2.132e+02 2.856e+02, threshold=3.916e+02, percent-clipped=0.0 2024-06-20 09:40:15,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=179327.5, ans=0.125 2024-06-20 09:40:17,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=179327.5, ans=0.1 2024-06-20 09:40:19,164 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2024-06-20 09:40:22,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=179345.83333333334, ans=0.125 2024-06-20 09:40:22,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=22.5 2024-06-20 09:40:26,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=179364.16666666666, ans=0.125 2024-06-20 09:40:32,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=179364.16666666666, ans=0.0 2024-06-20 09:40:37,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=179382.5, ans=0.2 2024-06-20 09:40:40,379 INFO [train.py:1028] (1/2) Epoch 10, batch 6800, loss[loss=0.2231, simple_loss=0.2707, pruned_loss=0.0877, over 13248.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.2935, pruned_loss=0.1046, over 2579973.84 frames. ], batch size: 67, lr: 5.82e-03, grad_scale: 128.0 2024-06-20 09:40:52,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=179419.16666666666, ans=12.0 2024-06-20 09:40:53,302 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.37 vs. limit=15.0 2024-06-20 09:41:02,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=179455.83333333334, ans=0.0 2024-06-20 09:41:05,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=179455.83333333334, ans=0.035 2024-06-20 09:41:07,660 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.37 vs. limit=22.5 2024-06-20 09:41:09,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=179474.16666666666, ans=0.125 2024-06-20 09:41:17,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=179492.5, ans=0.025 2024-06-20 09:41:21,636 INFO [train.py:1028] (1/2) Epoch 10, batch 6850, loss[loss=0.281, simple_loss=0.3241, pruned_loss=0.119, over 13247.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.2938, pruned_loss=0.1043, over 2583983.86 frames. 
], batch size: 63, lr: 5.82e-03, grad_scale: 128.0 2024-06-20 09:41:26,281 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.898e+02 2.017e+02 2.185e+02 3.488e+02, threshold=4.034e+02, percent-clipped=0.0 2024-06-20 09:41:33,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=179510.83333333334, ans=0.125 2024-06-20 09:41:37,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=179529.16666666666, ans=0.125 2024-06-20 09:41:54,489 INFO [train.py:1028] (1/2) Epoch 10, batch 6900, loss[loss=0.258, simple_loss=0.3043, pruned_loss=0.1058, over 13255.00 frames. ], tot_loss[loss=0.251, simple_loss=0.2938, pruned_loss=0.1042, over 2585267.98 frames. ], batch size: 49, lr: 5.82e-03, grad_scale: 128.0 2024-06-20 09:41:59,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=179584.16666666666, ans=0.2 2024-06-20 09:42:00,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=15.0 2024-06-20 09:42:00,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=179602.5, ans=0.1 2024-06-20 09:42:01,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=179602.5, ans=0.0 2024-06-20 09:42:07,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=179620.83333333334, ans=0.125 2024-06-20 09:42:07,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=15.0 2024-06-20 09:42:27,716 INFO [train.py:1028] (1/2) Epoch 10, batch 6950, loss[loss=0.2505, simple_loss=0.3069, pruned_loss=0.09701, over 11574.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.2937, pruned_loss=0.1037, over 2578817.92 frames. ], batch size: 17, lr: 5.82e-03, grad_scale: 128.0 2024-06-20 09:42:32,249 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.859e+02 2.083e+02 2.369e+02 3.741e+02, threshold=4.165e+02, percent-clipped=0.0 2024-06-20 09:42:46,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179730.83333333334, ans=0.1 2024-06-20 09:42:47,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=179730.83333333334, ans=0.05 2024-06-20 09:42:47,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179730.83333333334, ans=0.1 2024-06-20 09:42:52,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=179730.83333333334, ans=0.125 2024-06-20 09:42:56,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=179749.16666666666, ans=0.125 2024-06-20 09:43:00,388 INFO [train.py:1028] (1/2) Epoch 10, batch 7000, loss[loss=0.2582, simple_loss=0.2949, pruned_loss=0.1107, over 12869.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.2938, pruned_loss=0.1036, over 2574603.25 frames. 
], batch size: 158, lr: 5.82e-03, grad_scale: 128.0 2024-06-20 09:43:01,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.72 vs. limit=6.0 2024-06-20 09:43:03,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.18 vs. limit=10.0 2024-06-20 09:43:04,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=179767.5, ans=0.0 2024-06-20 09:43:10,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=12.0 2024-06-20 09:43:16,169 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.52 vs. limit=15.0 2024-06-20 09:43:27,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2024-06-20 09:43:37,412 INFO [train.py:1028] (1/2) Epoch 10, batch 7050, loss[loss=0.2817, simple_loss=0.3189, pruned_loss=0.1223, over 12718.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.2951, pruned_loss=0.1041, over 2581925.49 frames. ], batch size: 176, lr: 5.82e-03, grad_scale: 128.0 2024-06-20 09:43:42,147 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.936e+02 2.184e+02 2.541e+02 3.673e+02, threshold=4.367e+02, percent-clipped=0.0 2024-06-20 09:43:47,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=179877.5, ans=0.05 2024-06-20 09:43:51,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=179895.83333333334, ans=0.125 2024-06-20 09:44:00,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=179914.16666666666, ans=0.125 2024-06-20 09:44:04,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=179932.5, ans=0.0 2024-06-20 09:44:05,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=179932.5, ans=0.125 2024-06-20 09:44:10,340 INFO [train.py:1028] (1/2) Epoch 10, batch 7100, loss[loss=0.2702, simple_loss=0.3188, pruned_loss=0.1108, over 13190.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.2956, pruned_loss=0.1046, over 2573702.42 frames. ], batch size: 112, lr: 5.82e-03, grad_scale: 128.0 2024-06-20 09:44:30,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=180005.83333333334, ans=0.1 2024-06-20 09:44:30,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=180005.83333333334, ans=0.0 2024-06-20 09:44:30,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.61 vs. 
limit=15.0 2024-06-20 09:44:32,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=180005.83333333334, ans=0.1 2024-06-20 09:44:42,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=180024.16666666666, ans=0.125 2024-06-20 09:44:43,771 INFO [train.py:1028] (1/2) Epoch 10, batch 7150, loss[loss=0.3036, simple_loss=0.3318, pruned_loss=0.1377, over 12413.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.2959, pruned_loss=0.1048, over 2572746.55 frames. ], batch size: 202, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:44:48,356 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.819e+02 1.982e+02 2.257e+02 3.628e+02, threshold=3.965e+02, percent-clipped=0.0 2024-06-20 09:44:51,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=180060.83333333334, ans=0.035 2024-06-20 09:45:04,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=180097.5, ans=0.125 2024-06-20 09:45:06,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=180097.5, ans=0.125 2024-06-20 09:45:08,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=180097.5, ans=0.025 2024-06-20 09:45:13,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=180115.83333333334, ans=0.5 2024-06-20 09:45:16,069 INFO [train.py:1028] (1/2) Epoch 10, batch 7200, loss[loss=0.2825, simple_loss=0.3273, pruned_loss=0.1189, over 13139.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.2966, pruned_loss=0.1048, over 2577753.55 frames. ], batch size: 112, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:45:17,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=180134.16666666666, ans=0.0 2024-06-20 09:45:23,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=180134.16666666666, ans=0.0 2024-06-20 09:45:35,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=180170.83333333334, ans=8.0 2024-06-20 09:45:39,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=180170.83333333334, ans=0.025 2024-06-20 09:45:48,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=180207.5, ans=0.0 2024-06-20 09:45:49,067 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.98 vs. limit=10.0 2024-06-20 09:45:52,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=180207.5, ans=0.125 2024-06-20 09:45:55,988 INFO [train.py:1028] (1/2) Epoch 10, batch 7250, loss[loss=0.2317, simple_loss=0.2884, pruned_loss=0.08745, over 12890.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.2975, pruned_loss=0.1048, over 2578652.31 frames. 
], batch size: 36, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:46:00,485 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.885e+02 2.035e+02 2.226e+02 3.312e+02, threshold=4.069e+02, percent-clipped=0.0 2024-06-20 09:46:06,569 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:46:14,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=180262.5, ans=0.0 2024-06-20 09:46:14,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=180262.5, ans=0.125 2024-06-20 09:46:28,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=180317.5, ans=0.2 2024-06-20 09:46:29,014 INFO [train.py:1028] (1/2) Epoch 10, batch 7300, loss[loss=0.2499, simple_loss=0.2899, pruned_loss=0.1049, over 12902.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.299, pruned_loss=0.1059, over 2579347.49 frames. ], batch size: 36, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:46:31,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=180317.5, ans=0.125 2024-06-20 09:46:39,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=180335.83333333334, ans=0.0 2024-06-20 09:46:51,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=180372.5, ans=0.125 2024-06-20 09:46:53,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=180372.5, ans=0.2 2024-06-20 09:46:55,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=180390.83333333334, ans=0.125 2024-06-20 09:47:01,486 INFO [train.py:1028] (1/2) Epoch 10, batch 7350, loss[loss=0.2638, simple_loss=0.3141, pruned_loss=0.1067, over 13339.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3, pruned_loss=0.1062, over 2580807.81 frames. ], batch size: 46, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:47:05,878 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.896e+02 2.114e+02 2.329e+02 3.154e+02, threshold=4.228e+02, percent-clipped=0.0 2024-06-20 09:47:13,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=180445.83333333334, ans=0.125 2024-06-20 09:47:14,168 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2024-06-20 09:47:19,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=180445.83333333334, ans=0.125 2024-06-20 09:47:20,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=180464.16666666666, ans=0.04949747468305833 2024-06-20 09:47:41,150 INFO [train.py:1028] (1/2) Epoch 10, batch 7400, loss[loss=0.2687, simple_loss=0.3151, pruned_loss=0.1112, over 13241.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.2995, pruned_loss=0.1059, over 2585542.55 frames. 
], batch size: 63, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:47:41,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=180500.83333333334, ans=0.125 2024-06-20 09:47:42,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=180500.83333333334, ans=0.2 2024-06-20 09:47:51,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=180519.16666666666, ans=0.125 2024-06-20 09:47:57,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2024-06-20 09:48:02,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=180555.83333333334, ans=0.1 2024-06-20 09:48:13,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=180574.16666666666, ans=0.0 2024-06-20 09:48:14,457 INFO [train.py:1028] (1/2) Epoch 10, batch 7450, loss[loss=0.2482, simple_loss=0.2951, pruned_loss=0.1007, over 12599.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.2994, pruned_loss=0.1056, over 2579479.76 frames. ], batch size: 29, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:48:19,879 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.904e+02 2.062e+02 2.264e+02 2.863e+02, threshold=4.124e+02, percent-clipped=0.0 2024-06-20 09:48:26,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=180610.83333333334, ans=0.125 2024-06-20 09:48:27,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=180629.16666666666, ans=0.125 2024-06-20 09:48:30,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.98 vs. limit=22.5 2024-06-20 09:48:35,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=180647.5, ans=0.0 2024-06-20 09:48:40,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=180647.5, ans=0.1 2024-06-20 09:48:47,771 INFO [train.py:1028] (1/2) Epoch 10, batch 7500, loss[loss=0.2812, simple_loss=0.3043, pruned_loss=0.1291, over 10755.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3009, pruned_loss=0.1066, over 2577413.93 frames. ], batch size: 304, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:48:48,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2024-06-20 09:48:51,483 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0 2024-06-20 09:48:59,620 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.33 vs. limit=22.5 2024-06-20 09:49:20,627 INFO [train.py:1028] (1/2) Epoch 10, batch 7550, loss[loss=0.2607, simple_loss=0.3011, pruned_loss=0.1102, over 12886.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3018, pruned_loss=0.1075, over 2577155.54 frames. 
], batch size: 158, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:49:22,960 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.74 vs. limit=10.0 2024-06-20 09:49:29,598 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.923e+02 2.156e+02 2.444e+02 3.380e+02, threshold=4.312e+02, percent-clipped=0.0 2024-06-20 09:49:32,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=180794.16666666666, ans=0.0 2024-06-20 09:49:34,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=180794.16666666666, ans=0.2 2024-06-20 09:49:47,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=180830.83333333334, ans=0.1 2024-06-20 09:49:50,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.79 vs. limit=10.0 2024-06-20 09:49:57,389 INFO [train.py:1028] (1/2) Epoch 10, batch 7600, loss[loss=0.2457, simple_loss=0.2942, pruned_loss=0.09853, over 13242.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3019, pruned_loss=0.1077, over 2576752.39 frames. ], batch size: 83, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:50:09,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=180885.83333333334, ans=0.0 2024-06-20 09:50:31,355 INFO [train.py:1028] (1/2) Epoch 10, batch 7650, loss[loss=0.2877, simple_loss=0.3306, pruned_loss=0.1224, over 12933.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.302, pruned_loss=0.1077, over 2572887.23 frames. ], batch size: 33, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:50:34,979 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:50:36,866 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.894e+02 2.012e+02 2.184e+02 3.015e+02, threshold=4.025e+02, percent-clipped=0.0 2024-06-20 09:50:43,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=180977.5, ans=0.125 2024-06-20 09:50:54,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.73 vs. limit=15.0 2024-06-20 09:50:56,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=181014.16666666666, ans=0.125 2024-06-20 09:50:56,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=181014.16666666666, ans=0.07 2024-06-20 09:51:05,901 INFO [train.py:1028] (1/2) Epoch 10, batch 7700, loss[loss=0.2847, simple_loss=0.3319, pruned_loss=0.1187, over 13267.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3029, pruned_loss=0.1077, over 2568724.93 frames. 
], batch size: 63, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:51:21,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=181087.5, ans=0.125 2024-06-20 09:51:41,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=181124.16666666666, ans=0.0 2024-06-20 09:51:45,672 INFO [train.py:1028] (1/2) Epoch 10, batch 7750, loss[loss=0.2636, simple_loss=0.3141, pruned_loss=0.1065, over 13284.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3043, pruned_loss=0.1084, over 2572986.02 frames. ], batch size: 72, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:51:49,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=181142.5, ans=0.025 2024-06-20 09:51:50,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.54 vs. limit=15.0 2024-06-20 09:51:51,146 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.877e+02 2.043e+02 2.298e+02 3.130e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 09:51:52,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181160.83333333334, ans=0.1 2024-06-20 09:51:59,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=181179.16666666666, ans=0.0 2024-06-20 09:52:13,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=181215.83333333334, ans=0.025 2024-06-20 09:52:14,302 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-20 09:52:14,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=181215.83333333334, ans=0.125 2024-06-20 09:52:18,571 INFO [train.py:1028] (1/2) Epoch 10, batch 7800, loss[loss=0.2805, simple_loss=0.32, pruned_loss=0.1205, over 13133.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.305, pruned_loss=0.1083, over 2578667.54 frames. ], batch size: 95, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:52:38,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=181289.16666666666, ans=0.125 2024-06-20 09:52:39,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=181289.16666666666, ans=0.025 2024-06-20 09:52:42,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=181289.16666666666, ans=0.0 2024-06-20 09:52:45,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=181307.5, ans=0.2 2024-06-20 09:52:52,124 INFO [train.py:1028] (1/2) Epoch 10, batch 7850, loss[loss=0.2202, simple_loss=0.2712, pruned_loss=0.08461, over 11767.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3051, pruned_loss=0.1088, over 2573889.56 frames. 
], batch size: 17, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:52:53,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=181325.83333333334, ans=0.125 2024-06-20 09:52:57,652 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 1.908e+02 2.046e+02 2.194e+02 3.225e+02, threshold=4.093e+02, percent-clipped=0.0 2024-06-20 09:53:01,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=181344.16666666666, ans=0.125 2024-06-20 09:53:06,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=181362.5, ans=0.035 2024-06-20 09:53:14,012 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:53:15,399 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.33 vs. limit=15.0 2024-06-20 09:53:29,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181399.16666666666, ans=0.125 2024-06-20 09:53:31,799 INFO [train.py:1028] (1/2) Epoch 10, batch 7900, loss[loss=0.2702, simple_loss=0.3158, pruned_loss=0.1123, over 13172.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3054, pruned_loss=0.1093, over 2572779.13 frames. ], batch size: 77, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:53:32,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2024-06-20 09:53:37,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=181435.83333333334, ans=0.125 2024-06-20 09:53:40,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=181435.83333333334, ans=0.125 2024-06-20 09:53:40,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=181435.83333333334, ans=0.125 2024-06-20 09:53:42,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=181435.83333333334, ans=0.0 2024-06-20 09:53:42,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=181435.83333333334, ans=0.0 2024-06-20 09:53:46,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=181454.16666666666, ans=0.125 2024-06-20 09:53:49,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=181454.16666666666, ans=0.035 2024-06-20 09:53:52,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=181472.5, ans=0.0 2024-06-20 09:54:01,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=181490.83333333334, ans=0.05 2024-06-20 09:54:02,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181490.83333333334, ans=0.1 2024-06-20 09:54:04,714 INFO [train.py:1028] (1/2) Epoch 10, batch 7950, loss[loss=0.2696, 
simple_loss=0.304, pruned_loss=0.1176, over 10431.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3059, pruned_loss=0.1095, over 2575251.99 frames. ], batch size: 303, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:54:09,744 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 1.896e+02 2.024e+02 2.307e+02 3.595e+02, threshold=4.048e+02, percent-clipped=0.0 2024-06-20 09:54:10,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=181527.5, ans=0.0 2024-06-20 09:54:13,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=181527.5, ans=12.0 2024-06-20 09:54:13,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=181527.5, ans=0.125 2024-06-20 09:54:18,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=181545.83333333334, ans=0.0 2024-06-20 09:54:18,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=181545.83333333334, ans=0.0 2024-06-20 09:54:19,668 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2024-06-20 09:54:21,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=181545.83333333334, ans=0.025 2024-06-20 09:54:31,939 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:54:37,690 INFO [train.py:1028] (1/2) Epoch 10, batch 8000, loss[loss=0.2395, simple_loss=0.2927, pruned_loss=0.09312, over 12619.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3066, pruned_loss=0.1097, over 2571810.15 frames. ], batch size: 29, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:54:47,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=181619.16666666666, ans=0.2 2024-06-20 09:55:03,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=181674.16666666666, ans=0.125 2024-06-20 09:55:07,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=181674.16666666666, ans=0.125 2024-06-20 09:55:14,264 INFO [train.py:1028] (1/2) Epoch 10, batch 8050, loss[loss=0.2602, simple_loss=0.3091, pruned_loss=0.1056, over 13182.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3063, pruned_loss=0.1095, over 2572878.09 frames. ], batch size: 83, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:55:22,937 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.959e+02 2.156e+02 2.331e+02 3.503e+02, threshold=4.312e+02, percent-clipped=0.0 2024-06-20 09:55:25,318 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.59 vs. 
limit=15.0 2024-06-20 09:55:33,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=181729.16666666666, ans=0.025 2024-06-20 09:55:35,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=181729.16666666666, ans=0.0 2024-06-20 09:55:45,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.80 vs. limit=15.0 2024-06-20 09:55:45,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=181765.83333333334, ans=0.125 2024-06-20 09:55:48,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=181765.83333333334, ans=0.125 2024-06-20 09:55:49,899 INFO [train.py:1028] (1/2) Epoch 10, batch 8100, loss[loss=0.2462, simple_loss=0.2922, pruned_loss=0.1001, over 13177.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3064, pruned_loss=0.1093, over 2577643.91 frames. ], batch size: 112, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:56:06,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=181820.83333333334, ans=0.07 2024-06-20 09:56:08,512 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.97 vs. limit=22.5 2024-06-20 09:56:10,606 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.03 vs. limit=15.0 2024-06-20 09:56:17,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=181857.5, ans=0.025 2024-06-20 09:56:19,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=181857.5, ans=0.125 2024-06-20 09:56:20,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181857.5, ans=0.1 2024-06-20 09:56:22,084 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.10 vs. limit=15.0 2024-06-20 09:56:23,673 INFO [train.py:1028] (1/2) Epoch 10, batch 8150, loss[loss=0.2588, simple_loss=0.2917, pruned_loss=0.1129, over 13123.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3073, pruned_loss=0.1094, over 2580705.30 frames. ], batch size: 121, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:56:29,231 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.891e+02 1.996e+02 2.154e+02 2.544e+02, threshold=3.991e+02, percent-clipped=0.0 2024-06-20 09:56:48,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=181930.83333333334, ans=0.1 2024-06-20 09:56:53,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=181949.16666666666, ans=0.125 2024-06-20 09:56:56,694 INFO [train.py:1028] (1/2) Epoch 10, batch 8200, loss[loss=0.2856, simple_loss=0.3215, pruned_loss=0.1248, over 13168.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3074, pruned_loss=0.1092, over 2583459.56 frames. 
], batch size: 112, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:56:56,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=181967.5, ans=0.1 2024-06-20 09:57:03,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.83 vs. limit=22.5 2024-06-20 09:57:07,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=181985.83333333334, ans=0.07 2024-06-20 09:57:07,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=181985.83333333334, ans=0.2 2024-06-20 09:57:16,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=182004.16666666666, ans=0.0 2024-06-20 09:57:19,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=182022.5, ans=0.0 2024-06-20 09:57:33,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2024-06-20 09:57:36,082 INFO [train.py:1028] (1/2) Epoch 10, batch 8250, loss[loss=0.2384, simple_loss=0.2993, pruned_loss=0.08874, over 13200.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3083, pruned_loss=0.1098, over 2583473.51 frames. ], batch size: 52, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:57:41,428 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.873e+02 2.016e+02 2.203e+02 3.042e+02, threshold=4.032e+02, percent-clipped=0.0 2024-06-20 09:57:55,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=182114.16666666666, ans=0.125 2024-06-20 09:58:07,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=182150.83333333334, ans=0.0 2024-06-20 09:58:08,417 INFO [train.py:1028] (1/2) Epoch 10, batch 8300, loss[loss=0.2743, simple_loss=0.3158, pruned_loss=0.1164, over 13150.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3069, pruned_loss=0.1088, over 2581463.58 frames. ], batch size: 103, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:58:15,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=182169.16666666666, ans=0.125 2024-06-20 09:58:16,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182169.16666666666, ans=0.1 2024-06-20 09:58:17,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=182169.16666666666, ans=12.0 2024-06-20 09:58:22,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=182187.5, ans=0.2 2024-06-20 09:58:28,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.63 vs. 
limit=12.0 2024-06-20 09:58:28,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.whiten.whitening_limit, batch_count=182205.83333333334, ans=12.0 2024-06-20 09:58:40,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182242.5, ans=0.1 2024-06-20 09:58:41,224 INFO [train.py:1028] (1/2) Epoch 10, batch 8350, loss[loss=0.2654, simple_loss=0.3107, pruned_loss=0.11, over 13177.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3065, pruned_loss=0.1082, over 2580634.29 frames. ], batch size: 112, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:58:46,291 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.908e+02 2.012e+02 2.176e+02 3.228e+02, threshold=4.025e+02, percent-clipped=0.0 2024-06-20 09:58:54,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=182279.16666666666, ans=0.125 2024-06-20 09:58:57,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=182279.16666666666, ans=0.0 2024-06-20 09:59:12,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=182315.83333333334, ans=0.0 2024-06-20 09:59:16,778 INFO [train.py:1028] (1/2) Epoch 10, batch 8400, loss[loss=0.2307, simple_loss=0.274, pruned_loss=0.09371, over 12929.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3063, pruned_loss=0.1082, over 2578324.68 frames. ], batch size: 39, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:59:34,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=182370.83333333334, ans=0.125 2024-06-20 09:59:35,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.32 vs. limit=15.0 2024-06-20 09:59:35,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.40 vs. limit=15.0 2024-06-20 09:59:40,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=182389.16666666666, ans=0.0 2024-06-20 09:59:52,078 INFO [train.py:1028] (1/2) Epoch 10, batch 8450, loss[loss=0.2941, simple_loss=0.3465, pruned_loss=0.1208, over 13151.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3078, pruned_loss=0.1089, over 2580713.68 frames. 
], batch size: 112, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:59:53,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=182425.83333333334, ans=0.125 2024-06-20 09:59:54,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182425.83333333334, ans=0.1 2024-06-20 09:59:57,191 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.954e+02 2.082e+02 2.368e+02 3.764e+02, threshold=4.163e+02, percent-clipped=0.0 2024-06-20 09:59:58,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=182444.16666666666, ans=0.125 2024-06-20 10:00:00,777 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.322e+01 2024-06-20 10:00:18,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=182499.16666666666, ans=0.2 2024-06-20 10:00:19,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=182499.16666666666, ans=0.0 2024-06-20 10:00:25,532 INFO [train.py:1028] (1/2) Epoch 10, batch 8500, loss[loss=0.2737, simple_loss=0.3144, pruned_loss=0.1165, over 12647.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3092, pruned_loss=0.1095, over 2578783.61 frames. ], batch size: 29, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 10:00:35,282 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.22 vs. limit=15.0 2024-06-20 10:00:37,311 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.18 vs. limit=22.5 2024-06-20 10:00:42,858 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.96 vs. limit=22.5 2024-06-20 10:00:47,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=182572.5, ans=0.025 2024-06-20 10:00:51,847 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2024-06-20 10:00:54,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=182590.83333333334, ans=0.125 2024-06-20 10:00:57,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=182590.83333333334, ans=0.2 2024-06-20 10:00:59,666 INFO [train.py:1028] (1/2) Epoch 10, batch 8550, loss[loss=0.2495, simple_loss=0.3079, pruned_loss=0.0956, over 12431.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3091, pruned_loss=0.1094, over 2576561.35 frames. 
], batch size: 22, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:01:04,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182609.16666666666, ans=0.1 2024-06-20 10:01:05,334 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.924e+02 2.058e+02 2.229e+02 2.913e+02, threshold=4.116e+02, percent-clipped=0.0 2024-06-20 10:01:24,820 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.80 vs. limit=15.0 2024-06-20 10:01:33,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=182682.5, ans=0.2 2024-06-20 10:01:40,303 INFO [train.py:1028] (1/2) Epoch 10, batch 8600, loss[loss=0.2563, simple_loss=0.3009, pruned_loss=0.1059, over 13114.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3091, pruned_loss=0.1094, over 2574023.02 frames. ], batch size: 121, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:01:40,671 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.99 vs. limit=12.0 2024-06-20 10:01:46,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=182719.16666666666, ans=0.125 2024-06-20 10:01:49,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0 2024-06-20 10:01:55,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=182737.5, ans=0.125 2024-06-20 10:02:10,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182774.16666666666, ans=0.1 2024-06-20 10:02:14,453 INFO [train.py:1028] (1/2) Epoch 10, batch 8650, loss[loss=0.2496, simple_loss=0.2966, pruned_loss=0.1013, over 13010.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3095, pruned_loss=0.1093, over 2577270.38 frames. ], batch size: 102, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:02:19,514 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.919e+02 2.057e+02 2.186e+02 2.789e+02, threshold=4.114e+02, percent-clipped=0.0 2024-06-20 10:02:21,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=182810.83333333334, ans=0.125 2024-06-20 10:02:26,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=182829.16666666666, ans=0.125 2024-06-20 10:02:30,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.07 vs. limit=10.0 2024-06-20 10:02:34,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=182847.5, ans=0.125 2024-06-20 10:02:41,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.18 vs. limit=15.0 2024-06-20 10:02:43,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.05 vs. 
limit=22.5 2024-06-20 10:02:43,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=182865.83333333334, ans=0.0 2024-06-20 10:02:45,277 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-20 10:02:46,974 INFO [train.py:1028] (1/2) Epoch 10, batch 8700, loss[loss=0.2466, simple_loss=0.3039, pruned_loss=0.09469, over 13230.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3099, pruned_loss=0.1102, over 2574769.04 frames. ], batch size: 59, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:02:57,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=182902.5, ans=0.04949747468305833 2024-06-20 10:02:57,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=182902.5, ans=0.0 2024-06-20 10:02:57,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=182902.5, ans=0.125 2024-06-20 10:02:59,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=182902.5, ans=0.2 2024-06-20 10:03:14,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=182939.16666666666, ans=0.1 2024-06-20 10:03:15,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=12.0 2024-06-20 10:03:16,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=182957.5, ans=0.07 2024-06-20 10:03:23,550 INFO [train.py:1028] (1/2) Epoch 10, batch 8750, loss[loss=0.236, simple_loss=0.2782, pruned_loss=0.0969, over 13141.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.309, pruned_loss=0.1093, over 2570347.33 frames. ], batch size: 121, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:03:23,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=182975.83333333334, ans=0.125 2024-06-20 10:03:23,970 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=15.0 2024-06-20 10:03:28,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=182975.83333333334, ans=0.025 2024-06-20 10:03:31,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=182975.83333333334, ans=0.1 2024-06-20 10:03:32,461 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.933e+02 2.061e+02 2.236e+02 3.961e+02, threshold=4.122e+02, percent-clipped=0.0 2024-06-20 10:03:32,976 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.55 vs. limit=15.0 2024-06-20 10:03:56,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=183049.16666666666, ans=0.2 2024-06-20 10:04:01,172 INFO [train.py:1028] (1/2) Epoch 10, batch 8800, loss[loss=0.2698, simple_loss=0.317, pruned_loss=0.1113, over 13248.00 frames. 
], tot_loss[loss=0.2651, simple_loss=0.31, pruned_loss=0.1101, over 2575145.55 frames. ], batch size: 72, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:04:04,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=183067.5, ans=0.0 2024-06-20 10:04:06,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=183067.5, ans=0.0 2024-06-20 10:04:09,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=183085.83333333334, ans=0.125 2024-06-20 10:04:15,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=183104.16666666666, ans=6.0 2024-06-20 10:04:17,381 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.59 vs. limit=12.0 2024-06-20 10:04:18,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.84 vs. limit=15.0 2024-06-20 10:04:35,060 INFO [train.py:1028] (1/2) Epoch 10, batch 8850, loss[loss=0.2791, simple_loss=0.3133, pruned_loss=0.1225, over 12560.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3101, pruned_loss=0.1102, over 2563272.01 frames. ], batch size: 202, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:04:40,604 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.906e+02 2.033e+02 2.167e+02 2.777e+02, threshold=4.066e+02, percent-clipped=0.0 2024-06-20 10:04:42,247 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2024-06-20 10:04:46,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=183177.5, ans=0.0 2024-06-20 10:04:49,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=183195.83333333334, ans=10.0 2024-06-20 10:04:50,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=183195.83333333334, ans=0.2 2024-06-20 10:04:58,142 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=22.5 2024-06-20 10:05:11,870 INFO [train.py:1028] (1/2) Epoch 10, batch 8900, loss[loss=0.3019, simple_loss=0.3467, pruned_loss=0.1285, over 12957.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.311, pruned_loss=0.1109, over 2561399.80 frames. ], batch size: 33, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:05:31,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.66 vs. 
limit=15.0 2024-06-20 10:05:36,478 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:05:51,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=183324.16666666666, ans=0.025 2024-06-20 10:05:51,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183324.16666666666, ans=0.1 2024-06-20 10:05:53,800 INFO [train.py:1028] (1/2) Epoch 10, batch 8950, loss[loss=0.3191, simple_loss=0.3457, pruned_loss=0.1463, over 12539.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3107, pruned_loss=0.1103, over 2561760.59 frames. ], batch size: 202, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:05:56,242 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=12.0 2024-06-20 10:05:59,226 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 1.935e+02 2.117e+02 2.442e+02 3.061e+02, threshold=4.234e+02, percent-clipped=0.0 2024-06-20 10:06:04,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=183360.83333333334, ans=0.2 2024-06-20 10:06:09,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=183379.16666666666, ans=0.125 2024-06-20 10:06:27,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=183434.16666666666, ans=0.125 2024-06-20 10:06:27,675 INFO [train.py:1028] (1/2) Epoch 10, batch 9000, loss[loss=0.2497, simple_loss=0.297, pruned_loss=0.1012, over 13320.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3112, pruned_loss=0.1103, over 2567319.39 frames. ], batch size: 46, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:06:27,676 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 10:06:35,713 INFO [train.py:1060] (1/2) Epoch 10, validation: loss=0.1973, simple_loss=0.261, pruned_loss=0.06683, over 351949.00 frames. 2024-06-20 10:06:35,714 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 10:06:35,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=183434.16666666666, ans=0.1 2024-06-20 10:06:36,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=183434.16666666666, ans=0.05 2024-06-20 10:06:52,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=183470.83333333334, ans=0.0 2024-06-20 10:06:56,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=183489.16666666666, ans=0.0 2024-06-20 10:07:08,492 INFO [train.py:1028] (1/2) Epoch 10, batch 9050, loss[loss=0.2552, simple_loss=0.2992, pruned_loss=0.1056, over 11485.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3112, pruned_loss=0.1103, over 2566835.37 frames. 
], batch size: 17, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:07:09,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=183525.83333333334, ans=0.0 2024-06-20 10:07:11,650 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.25 vs. limit=15.0 2024-06-20 10:07:13,627 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 1.975e+02 2.103e+02 2.317e+02 3.069e+02, threshold=4.207e+02, percent-clipped=0.0 2024-06-20 10:07:15,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=183544.16666666666, ans=0.125 2024-06-20 10:07:20,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=183544.16666666666, ans=0.125 2024-06-20 10:07:25,639 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:07:31,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=183580.83333333334, ans=0.125 2024-06-20 10:07:33,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=183580.83333333334, ans=0.125 2024-06-20 10:07:41,850 INFO [train.py:1028] (1/2) Epoch 10, batch 9100, loss[loss=0.2637, simple_loss=0.3169, pruned_loss=0.1053, over 13019.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3104, pruned_loss=0.1098, over 2567409.67 frames. ], batch size: 71, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:07:49,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=183635.83333333334, ans=0.125 2024-06-20 10:07:58,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=183654.16666666666, ans=0.025 2024-06-20 10:08:01,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.40 vs. limit=6.0 2024-06-20 10:08:19,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2024-06-20 10:08:19,887 INFO [train.py:1028] (1/2) Epoch 10, batch 9150, loss[loss=0.2531, simple_loss=0.3053, pruned_loss=0.1005, over 13184.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3106, pruned_loss=0.1097, over 2567935.31 frames. 
], batch size: 77, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:08:21,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=183709.16666666666, ans=0.0 2024-06-20 10:08:23,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=183709.16666666666, ans=0.1 2024-06-20 10:08:24,846 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.906e+02 2.110e+02 2.376e+02 3.577e+02, threshold=4.219e+02, percent-clipped=0.0 2024-06-20 10:08:33,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=183745.83333333334, ans=0.1 2024-06-20 10:08:36,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=183745.83333333334, ans=0.2 2024-06-20 10:08:43,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=183764.16666666666, ans=0.07 2024-06-20 10:08:43,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.72 vs. limit=15.0 2024-06-20 10:08:51,579 INFO [train.py:1028] (1/2) Epoch 10, batch 9200, loss[loss=0.2677, simple_loss=0.312, pruned_loss=0.1116, over 12991.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3112, pruned_loss=0.1095, over 2572110.33 frames. ], batch size: 36, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:09:03,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=183837.5, ans=0.2 2024-06-20 10:09:04,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=183837.5, ans=0.0 2024-06-20 10:09:08,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.12 vs. limit=15.0 2024-06-20 10:09:14,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=183855.83333333334, ans=0.0 2024-06-20 10:09:21,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=183874.16666666666, ans=0.0 2024-06-20 10:09:23,270 INFO [train.py:1028] (1/2) Epoch 10, batch 9250, loss[loss=0.2546, simple_loss=0.3011, pruned_loss=0.1041, over 13206.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3101, pruned_loss=0.1087, over 2573897.10 frames. 
], batch size: 67, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:09:28,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=183892.5, ans=0.2 2024-06-20 10:09:28,607 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.937e+02 2.120e+02 2.256e+02 2.902e+02, threshold=4.239e+02, percent-clipped=0.0 2024-06-20 10:09:30,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=183910.83333333334, ans=0.2 2024-06-20 10:09:33,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183910.83333333334, ans=0.1 2024-06-20 10:09:38,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183929.16666666666, ans=0.1 2024-06-20 10:09:43,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=183947.5, ans=0.125 2024-06-20 10:09:55,670 INFO [train.py:1028] (1/2) Epoch 10, batch 9300, loss[loss=0.263, simple_loss=0.3096, pruned_loss=0.1082, over 12857.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3104, pruned_loss=0.1088, over 2570088.69 frames. ], batch size: 39, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:09:58,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=12.0 2024-06-20 10:09:59,361 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0 2024-06-20 10:10:00,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=183984.16666666666, ans=0.05 2024-06-20 10:10:21,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.91 vs. limit=15.0 2024-06-20 10:10:23,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184057.5, ans=0.1 2024-06-20 10:10:23,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=184057.5, ans=0.025 2024-06-20 10:10:27,109 INFO [train.py:1028] (1/2) Epoch 10, batch 9350, loss[loss=0.2353, simple_loss=0.2837, pruned_loss=0.09345, over 12386.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3102, pruned_loss=0.1089, over 2566827.61 frames. ], batch size: 22, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:10:28,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=184075.83333333334, ans=0.2 2024-06-20 10:10:30,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.81 vs. 
limit=10.0 2024-06-20 10:10:32,014 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.938e+02 2.096e+02 2.300e+02 2.887e+02, threshold=4.192e+02, percent-clipped=0.0 2024-06-20 10:10:42,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184112.5, ans=0.1 2024-06-20 10:10:42,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=184112.5, ans=0.125 2024-06-20 10:10:48,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=184130.83333333334, ans=0.0 2024-06-20 10:10:49,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=184130.83333333334, ans=0.125 2024-06-20 10:10:51,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=184149.16666666666, ans=0.1 2024-06-20 10:10:54,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=184149.16666666666, ans=0.0 2024-06-20 10:10:54,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=184149.16666666666, ans=0.125 2024-06-20 10:10:58,287 INFO [train.py:1028] (1/2) Epoch 10, batch 9400, loss[loss=0.2664, simple_loss=0.3168, pruned_loss=0.1081, over 13286.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3106, pruned_loss=0.1092, over 2566864.53 frames. ], batch size: 52, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:11:05,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=184185.83333333334, ans=0.125 2024-06-20 10:11:08,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=184185.83333333334, ans=0.125 2024-06-20 10:11:09,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=184185.83333333334, ans=0.125 2024-06-20 10:11:24,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184240.83333333334, ans=0.1 2024-06-20 10:11:26,042 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.40 vs. limit=15.0 2024-06-20 10:11:29,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.70 vs. limit=15.0 2024-06-20 10:11:31,262 INFO [train.py:1028] (1/2) Epoch 10, batch 9450, loss[loss=0.2419, simple_loss=0.2952, pruned_loss=0.09434, over 12734.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3117, pruned_loss=0.11, over 2566976.82 frames. 
], batch size: 22, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:11:36,174 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 1.947e+02 2.091e+02 2.298e+02 3.082e+02, threshold=4.182e+02, percent-clipped=0.0 2024-06-20 10:11:43,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=184295.83333333334, ans=0.125 2024-06-20 10:11:49,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=184295.83333333334, ans=0.0 2024-06-20 10:11:49,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=184295.83333333334, ans=0.125 2024-06-20 10:12:04,299 INFO [train.py:1028] (1/2) Epoch 10, batch 9500, loss[loss=0.2421, simple_loss=0.295, pruned_loss=0.09459, over 13259.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.311, pruned_loss=0.1092, over 2575272.22 frames. ], batch size: 43, lr: 5.75e-03, grad_scale: 128.0 2024-06-20 10:12:06,312 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.65 vs. limit=15.0 2024-06-20 10:12:17,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=184387.5, ans=0.125 2024-06-20 10:12:33,541 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=22.71 vs. limit=15.0 2024-06-20 10:12:35,099 INFO [train.py:1028] (1/2) Epoch 10, batch 9550, loss[loss=0.2704, simple_loss=0.3095, pruned_loss=0.1157, over 12895.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3105, pruned_loss=0.1092, over 2569928.52 frames. ], batch size: 39, lr: 5.75e-03, grad_scale: 128.0 2024-06-20 10:12:35,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=184442.5, ans=0.0 2024-06-20 10:12:37,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=184442.5, ans=0.125 2024-06-20 10:12:40,227 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.919e+02 2.058e+02 2.238e+02 3.785e+02, threshold=4.115e+02, percent-clipped=0.0 2024-06-20 10:12:49,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=184479.16666666666, ans=0.05 2024-06-20 10:12:52,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=184479.16666666666, ans=0.0 2024-06-20 10:12:53,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=184497.5, ans=0.125 2024-06-20 10:13:03,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=184515.83333333334, ans=0.0 2024-06-20 10:13:06,170 INFO [train.py:1028] (1/2) Epoch 10, batch 9600, loss[loss=0.2793, simple_loss=0.3125, pruned_loss=0.123, over 10497.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3103, pruned_loss=0.1094, over 2569899.97 frames. 
], batch size: 304, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:13:11,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=184552.5, ans=0.95 2024-06-20 10:13:13,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=184552.5, ans=0.125 2024-06-20 10:13:14,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=184552.5, ans=0.0 2024-06-20 10:13:18,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=184570.83333333334, ans=0.125 2024-06-20 10:13:23,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=184570.83333333334, ans=0.0 2024-06-20 10:13:23,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184570.83333333334, ans=0.1 2024-06-20 10:13:24,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=184589.16666666666, ans=0.0 2024-06-20 10:13:25,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=184589.16666666666, ans=0.125 2024-06-20 10:13:29,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=184589.16666666666, ans=0.04949747468305833 2024-06-20 10:13:36,841 INFO [train.py:1028] (1/2) Epoch 10, batch 9650, loss[loss=0.2506, simple_loss=0.2965, pruned_loss=0.1024, over 13075.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3097, pruned_loss=0.1095, over 2558699.48 frames. ], batch size: 132, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:13:37,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=184625.83333333334, ans=0.125 2024-06-20 10:13:41,814 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 2.022e+02 2.213e+02 2.513e+02 3.683e+02, threshold=4.426e+02, percent-clipped=0.0 2024-06-20 10:13:42,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=184644.16666666666, ans=0.07 2024-06-20 10:13:44,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=184644.16666666666, ans=0.0 2024-06-20 10:13:45,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=184644.16666666666, ans=0.125 2024-06-20 10:13:45,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=184644.16666666666, ans=15.0 2024-06-20 10:13:53,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184662.5, ans=0.1 2024-06-20 10:14:03,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.96 vs. 
limit=15.0 2024-06-20 10:14:07,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184699.16666666666, ans=0.1 2024-06-20 10:14:11,879 INFO [train.py:1028] (1/2) Epoch 10, batch 9700, loss[loss=0.2677, simple_loss=0.3039, pruned_loss=0.1157, over 13042.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3097, pruned_loss=0.1098, over 2554307.42 frames. ], batch size: 144, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:14:22,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=184735.83333333334, ans=0.2 2024-06-20 10:14:23,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=184754.16666666666, ans=0.0 2024-06-20 10:14:25,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=184754.16666666666, ans=0.125 2024-06-20 10:14:31,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.61 vs. limit=15.0 2024-06-20 10:14:42,999 INFO [train.py:1028] (1/2) Epoch 10, batch 9750, loss[loss=0.255, simple_loss=0.2966, pruned_loss=0.1067, over 13074.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3087, pruned_loss=0.1092, over 2551067.79 frames. ], batch size: 132, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:14:47,861 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.920e+02 2.127e+02 2.396e+02 3.268e+02, threshold=4.254e+02, percent-clipped=0.0 2024-06-20 10:14:48,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=184809.16666666666, ans=0.1 2024-06-20 10:14:50,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184827.5, ans=0.1 2024-06-20 10:14:54,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=184827.5, ans=0.1 2024-06-20 10:15:05,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=184864.16666666666, ans=0.025 2024-06-20 10:15:06,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.04 vs. limit=15.0 2024-06-20 10:15:09,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=184882.5, ans=0.025 2024-06-20 10:15:11,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184882.5, ans=0.1 2024-06-20 10:15:13,906 INFO [train.py:1028] (1/2) Epoch 10, batch 9800, loss[loss=0.225, simple_loss=0.2747, pruned_loss=0.08763, over 12880.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3076, pruned_loss=0.1086, over 2544505.30 frames. 
], batch size: 39, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:15:21,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=184919.16666666666, ans=0.1 2024-06-20 10:15:29,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=184937.5, ans=0.125 2024-06-20 10:15:29,631 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.19 vs. limit=15.0 2024-06-20 10:15:42,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.53 vs. limit=15.0 2024-06-20 10:15:45,961 INFO [train.py:1028] (1/2) Epoch 10, batch 9850, loss[loss=0.276, simple_loss=0.3187, pruned_loss=0.1167, over 13023.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3072, pruned_loss=0.1086, over 2538606.85 frames. ], batch size: 102, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:15:46,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=184992.5, ans=0.0 2024-06-20 10:15:49,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=12.0 2024-06-20 10:15:50,778 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 1.935e+02 2.098e+02 2.260e+02 3.496e+02, threshold=4.195e+02, percent-clipped=0.0 2024-06-20 10:15:52,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.29 vs. limit=12.0 2024-06-20 10:15:54,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185010.83333333334, ans=0.1 2024-06-20 10:16:17,437 INFO [train.py:1028] (1/2) Epoch 10, batch 9900, loss[loss=0.2591, simple_loss=0.303, pruned_loss=0.1077, over 13262.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3069, pruned_loss=0.1088, over 2532916.30 frames. ], batch size: 40, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:16:30,146 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.68 vs. limit=15.0 2024-06-20 10:16:35,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=185139.16666666666, ans=0.125 2024-06-20 10:16:37,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=185139.16666666666, ans=0.0 2024-06-20 10:16:37,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=185139.16666666666, ans=0.0 2024-06-20 10:16:39,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=185139.16666666666, ans=0.025 2024-06-20 10:16:40,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=185139.16666666666, ans=0.125 2024-06-20 10:16:48,643 INFO [train.py:1028] (1/2) Epoch 10, batch 9950, loss[loss=0.2643, simple_loss=0.3082, pruned_loss=0.1102, over 12648.00 frames. 
], tot_loss[loss=0.2625, simple_loss=0.3067, pruned_loss=0.1092, over 2525881.37 frames. ], batch size: 29, lr: 5.73e-03, grad_scale: 128.0 2024-06-20 10:16:49,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=185175.83333333334, ans=0.125 2024-06-20 10:16:50,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=185175.83333333334, ans=0.0 2024-06-20 10:16:53,323 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.899e+02 2.047e+02 2.266e+02 2.909e+02, threshold=4.093e+02, percent-clipped=0.0 2024-06-20 10:17:15,671 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.51 vs. limit=15.0 2024-06-20 10:17:16,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=185249.16666666666, ans=0.07 2024-06-20 10:17:20,216 INFO [train.py:1028] (1/2) Epoch 10, batch 10000, loss[loss=0.2738, simple_loss=0.3178, pruned_loss=0.1149, over 12559.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3082, pruned_loss=0.1104, over 2488050.38 frames. ], batch size: 22, lr: 5.73e-03, grad_scale: 128.0 2024-06-20 10:17:25,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.99 vs. limit=15.0 2024-06-20 10:17:35,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=185304.16666666666, ans=0.125 2024-06-20 10:17:52,371 INFO [train.py:1028] (1/2) Epoch 10, batch 10050, loss[loss=0.2452, simple_loss=0.2924, pruned_loss=0.09903, over 12369.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.308, pruned_loss=0.1112, over 2446876.99 frames. ], batch size: 22, lr: 5.73e-03, grad_scale: 128.0 2024-06-20 10:17:52,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=185359.16666666666, ans=0.125 2024-06-20 10:17:56,852 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.050e+02 2.263e+02 2.609e+02 3.503e+02, threshold=4.525e+02, percent-clipped=0.0 2024-06-20 10:18:09,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=185414.16666666666, ans=0.0 2024-06-20 10:18:22,263 INFO [train.py:1028] (1/2) Epoch 10, batch 10100, loss[loss=0.2513, simple_loss=0.2962, pruned_loss=0.1033, over 11630.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3065, pruned_loss=0.1098, over 2426339.65 frames. ], batch size: 17, lr: 5.73e-03, grad_scale: 128.0 2024-06-20 10:18:22,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=185450.83333333334, ans=0.2 2024-06-20 10:18:24,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=185450.83333333334, ans=0.0 2024-06-20 10:18:28,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.35 vs. 
limit=10.0 2024-06-20 10:18:29,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=185469.16666666666, ans=0.05 2024-06-20 10:20:35,856 INFO [train.py:1028] (1/2) Epoch 11, batch 0, loss[loss=0.2391, simple_loss=0.2861, pruned_loss=0.09602, over 12930.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.2861, pruned_loss=0.09602, over 12930.00 frames. ], batch size: 36, lr: 5.47e-03, grad_scale: 128.0 2024-06-20 10:20:35,857 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 10:20:42,752 INFO [train.py:1060] (1/2) Epoch 11, validation: loss=0.199, simple_loss=0.2631, pruned_loss=0.06746, over 351949.00 frames. 2024-06-20 10:20:42,753 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 10:20:49,474 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.48 vs. limit=15.0 2024-06-20 10:20:55,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=185520.5, ans=0.125 2024-06-20 10:20:56,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=185520.5, ans=0.0 2024-06-20 10:20:57,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=185520.5, ans=0.125 2024-06-20 10:21:03,966 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.72 vs. limit=22.5 2024-06-20 10:21:04,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=185520.5, ans=0.125 2024-06-20 10:21:06,795 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.59 vs. limit=22.5 2024-06-20 10:21:08,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.55 vs. limit=15.0 2024-06-20 10:21:12,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=185557.16666666666, ans=0.0 2024-06-20 10:21:12,873 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.810e+02 1.985e+02 2.238e+02 3.284e+02, threshold=3.969e+02, percent-clipped=0.0 2024-06-20 10:21:19,358 INFO [train.py:1028] (1/2) Epoch 11, batch 50, loss[loss=0.249, simple_loss=0.29, pruned_loss=0.104, over 12703.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.288, pruned_loss=0.1012, over 573831.94 frames. ], batch size: 29, lr: 5.47e-03, grad_scale: 128.0 2024-06-20 10:21:22,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=185575.5, ans=0.125 2024-06-20 10:21:22,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=185575.5, ans=0.2 2024-06-20 10:21:24,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185575.5, ans=0.1 2024-06-20 10:21:27,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.24 vs. 
limit=15.0 2024-06-20 10:21:29,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=185593.83333333334, ans=0.0 2024-06-20 10:21:31,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185593.83333333334, ans=0.1 2024-06-20 10:21:36,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=185612.16666666666, ans=0.125 2024-06-20 10:21:42,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=185630.5, ans=0.09899494936611666 2024-06-20 10:21:42,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=185630.5, ans=0.2 2024-06-20 10:21:45,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.64 vs. limit=15.0 2024-06-20 10:21:51,140 INFO [train.py:1028] (1/2) Epoch 11, batch 100, loss[loss=0.2234, simple_loss=0.2781, pruned_loss=0.08432, over 13253.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.2863, pruned_loss=0.1003, over 1017012.35 frames. ], batch size: 46, lr: 5.47e-03, grad_scale: 128.0 2024-06-20 10:22:02,330 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:22:03,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.70 vs. limit=15.0 2024-06-20 10:22:19,397 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.879e+02 2.057e+02 2.246e+02 2.900e+02, threshold=4.113e+02, percent-clipped=0.0 2024-06-20 10:22:23,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.59 vs. limit=10.0 2024-06-20 10:22:24,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.93 vs. limit=22.5 2024-06-20 10:22:25,866 INFO [train.py:1028] (1/2) Epoch 11, batch 150, loss[loss=0.2086, simple_loss=0.26, pruned_loss=0.0786, over 12737.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.285, pruned_loss=0.0982, over 1365054.41 frames. ], batch size: 29, lr: 5.47e-03, grad_scale: 128.0 2024-06-20 10:22:33,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=185777.16666666666, ans=0.1 2024-06-20 10:22:39,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=185795.5, ans=0.125 2024-06-20 10:22:42,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=185795.5, ans=0.025 2024-06-20 10:22:42,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=185795.5, ans=0.2 2024-06-20 10:22:42,837 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.58 vs. 
limit=15.0 2024-06-20 10:22:45,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=185813.83333333334, ans=0.2 2024-06-20 10:22:49,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=185813.83333333334, ans=0.1 2024-06-20 10:22:50,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.08 vs. limit=15.0 2024-06-20 10:22:55,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=185832.16666666666, ans=0.125 2024-06-20 10:23:00,935 INFO [train.py:1028] (1/2) Epoch 11, batch 200, loss[loss=0.2539, simple_loss=0.2855, pruned_loss=0.1111, over 12549.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2848, pruned_loss=0.0978, over 1634751.10 frames. ], batch size: 202, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:23:11,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.34 vs. limit=10.0 2024-06-20 10:23:12,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185868.83333333334, ans=0.1 2024-06-20 10:23:15,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=185887.16666666666, ans=0.0 2024-06-20 10:23:15,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185887.16666666666, ans=0.1 2024-06-20 10:23:16,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=185887.16666666666, ans=0.125 2024-06-20 10:23:26,997 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.759e+02 1.866e+02 2.012e+02 2.445e+02, threshold=3.731e+02, percent-clipped=0.0 2024-06-20 10:23:33,558 INFO [train.py:1028] (1/2) Epoch 11, batch 250, loss[loss=0.2322, simple_loss=0.2742, pruned_loss=0.09504, over 13039.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2843, pruned_loss=0.09764, over 1845790.64 frames. ], batch size: 144, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:23:40,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=185960.5, ans=0.0 2024-06-20 10:23:40,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=185960.5, ans=0.125 2024-06-20 10:23:41,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=185960.5, ans=0.125 2024-06-20 10:24:08,577 INFO [train.py:1028] (1/2) Epoch 11, batch 300, loss[loss=0.2376, simple_loss=0.2772, pruned_loss=0.09902, over 13198.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2846, pruned_loss=0.09788, over 2008867.02 frames. 
], batch size: 112, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:24:21,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=186070.5, ans=0.125 2024-06-20 10:24:34,182 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.791e+02 1.930e+02 2.071e+02 2.838e+02, threshold=3.859e+02, percent-clipped=0.0 2024-06-20 10:24:40,434 INFO [train.py:1028] (1/2) Epoch 11, batch 350, loss[loss=0.2332, simple_loss=0.2822, pruned_loss=0.09207, over 12938.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2842, pruned_loss=0.09751, over 2138593.60 frames. ], batch size: 33, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:24:42,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=186125.5, ans=0.0 2024-06-20 10:24:47,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=186143.83333333334, ans=0.125 2024-06-20 10:24:51,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=186143.83333333334, ans=0.2 2024-06-20 10:25:05,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=186180.5, ans=0.125 2024-06-20 10:25:08,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=186180.5, ans=0.0 2024-06-20 10:25:14,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=186198.83333333334, ans=0.0 2024-06-20 10:25:15,894 INFO [train.py:1028] (1/2) Epoch 11, batch 400, loss[loss=0.263, simple_loss=0.3102, pruned_loss=0.1079, over 13238.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2842, pruned_loss=0.09748, over 2239072.89 frames. ], batch size: 63, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:25:16,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=186217.16666666666, ans=0.125 2024-06-20 10:25:17,594 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=15.0 2024-06-20 10:25:19,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=186217.16666666666, ans=0.0 2024-06-20 10:25:21,955 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.68 vs. limit=22.5 2024-06-20 10:25:41,760 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.798e+02 1.897e+02 2.080e+02 3.041e+02, threshold=3.795e+02, percent-clipped=0.0 2024-06-20 10:25:46,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=186290.5, ans=0.05 2024-06-20 10:25:48,430 INFO [train.py:1028] (1/2) Epoch 11, batch 450, loss[loss=0.2058, simple_loss=0.2603, pruned_loss=0.07564, over 13155.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2843, pruned_loss=0.09724, over 2313302.56 frames. 
], batch size: 67, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:26:01,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=186345.5, ans=0.125 2024-06-20 10:26:10,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.41 vs. limit=22.5 2024-06-20 10:26:21,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.71 vs. limit=15.0 2024-06-20 10:26:24,180 INFO [train.py:1028] (1/2) Epoch 11, batch 500, loss[loss=0.2265, simple_loss=0.2666, pruned_loss=0.09323, over 13100.00 frames. ], tot_loss[loss=0.239, simple_loss=0.284, pruned_loss=0.09698, over 2375463.10 frames. ], batch size: 121, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:26:29,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=186400.5, ans=0.0 2024-06-20 10:26:47,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=186455.5, ans=0.0 2024-06-20 10:26:48,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=186455.5, ans=0.2 2024-06-20 10:26:49,923 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.839e+02 2.032e+02 2.323e+02 3.466e+02, threshold=4.064e+02, percent-clipped=0.0 2024-06-20 10:26:53,206 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2024-06-20 10:26:54,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=186473.83333333334, ans=0.125 2024-06-20 10:26:57,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=186473.83333333334, ans=0.025 2024-06-20 10:26:58,817 INFO [train.py:1028] (1/2) Epoch 11, batch 550, loss[loss=0.2499, simple_loss=0.2883, pruned_loss=0.1057, over 12952.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2845, pruned_loss=0.09748, over 2420530.94 frames. ], batch size: 158, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:26:59,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=186492.16666666666, ans=10.0 2024-06-20 10:27:00,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=186492.16666666666, ans=0.125 2024-06-20 10:27:02,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=186492.16666666666, ans=0.125 2024-06-20 10:27:02,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=186492.16666666666, ans=0.125 2024-06-20 10:27:07,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=186510.5, ans=0.125 2024-06-20 10:27:30,210 INFO [train.py:1028] (1/2) Epoch 11, batch 600, loss[loss=0.2255, simple_loss=0.2705, pruned_loss=0.09029, over 13054.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.284, pruned_loss=0.09715, over 2458238.39 frames. 
], batch size: 144, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:27:38,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=186602.16666666666, ans=0.0 2024-06-20 10:27:43,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. limit=6.0 2024-06-20 10:27:48,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186638.83333333334, ans=0.1 2024-06-20 10:27:50,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=186638.83333333334, ans=0.1 2024-06-20 10:27:55,707 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.861e+02 2.031e+02 2.323e+02 3.478e+02, threshold=4.063e+02, percent-clipped=0.0 2024-06-20 10:27:59,360 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.92 vs. limit=15.0 2024-06-20 10:28:01,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=186675.5, ans=0.125 2024-06-20 10:28:02,306 INFO [train.py:1028] (1/2) Epoch 11, batch 650, loss[loss=0.2258, simple_loss=0.2762, pruned_loss=0.08771, over 13192.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2845, pruned_loss=0.09711, over 2489729.52 frames. ], batch size: 59, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:28:20,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=186712.16666666666, ans=0.0 2024-06-20 10:28:25,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=186730.5, ans=0.0 2024-06-20 10:28:29,066 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.24 vs. limit=12.0 2024-06-20 10:28:29,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.47 vs. limit=15.0 2024-06-20 10:28:37,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=186748.83333333334, ans=0.125 2024-06-20 10:28:38,124 INFO [train.py:1028] (1/2) Epoch 11, batch 700, loss[loss=0.2359, simple_loss=0.2889, pruned_loss=0.09138, over 13301.00 frames. ], tot_loss[loss=0.239, simple_loss=0.2839, pruned_loss=0.09702, over 2512881.91 frames. ], batch size: 46, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:28:42,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=186767.16666666666, ans=0.09899494936611666 2024-06-20 10:28:43,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=186785.5, ans=0.0 2024-06-20 10:28:54,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=186803.83333333334, ans=0.0 2024-06-20 10:28:57,144 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.49 vs. 
limit=15.0 2024-06-20 10:29:02,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=186822.16666666666, ans=0.2 2024-06-20 10:29:06,592 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.834e+02 1.996e+02 2.197e+02 3.613e+02, threshold=3.993e+02, percent-clipped=0.0 2024-06-20 10:29:12,925 INFO [train.py:1028] (1/2) Epoch 11, batch 750, loss[loss=0.2331, simple_loss=0.2822, pruned_loss=0.09201, over 13284.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.2839, pruned_loss=0.097, over 2527831.61 frames. ], batch size: 63, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:29:19,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=186877.16666666666, ans=0.125 2024-06-20 10:29:22,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.60 vs. limit=22.5 2024-06-20 10:29:25,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=186895.5, ans=0.125 2024-06-20 10:29:33,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=186913.83333333334, ans=0.0 2024-06-20 10:29:36,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.28 vs. limit=22.5 2024-06-20 10:29:36,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=186913.83333333334, ans=0.2 2024-06-20 10:29:45,065 INFO [train.py:1028] (1/2) Epoch 11, batch 800, loss[loss=0.2357, simple_loss=0.2873, pruned_loss=0.09201, over 13278.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.2839, pruned_loss=0.09712, over 2541016.95 frames. ], batch size: 37, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:29:47,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=186950.5, ans=0.025 2024-06-20 10:29:55,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=186968.83333333334, ans=0.125 2024-06-20 10:30:06,616 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.83 vs. limit=22.5 2024-06-20 10:30:13,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=187023.83333333334, ans=0.0 2024-06-20 10:30:13,971 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.759e+02 1.880e+02 2.028e+02 2.505e+02, threshold=3.760e+02, percent-clipped=0.0 2024-06-20 10:30:21,052 INFO [train.py:1028] (1/2) Epoch 11, batch 850, loss[loss=0.2321, simple_loss=0.282, pruned_loss=0.0911, over 13108.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.2829, pruned_loss=0.0964, over 2550524.69 frames. 
], batch size: 95, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:30:21,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=187042.16666666666, ans=0.125 2024-06-20 10:30:23,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=187042.16666666666, ans=0.2 2024-06-20 10:30:25,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=15.0 2024-06-20 10:30:27,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=187060.5, ans=0.1 2024-06-20 10:30:30,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=187060.5, ans=0.0 2024-06-20 10:30:35,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.26 vs. limit=15.0 2024-06-20 10:30:41,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=187097.16666666666, ans=0.07 2024-06-20 10:30:44,993 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.68 vs. limit=15.0 2024-06-20 10:30:46,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=187097.16666666666, ans=0.0 2024-06-20 10:30:48,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=187115.5, ans=0.025 2024-06-20 10:30:54,387 INFO [train.py:1028] (1/2) Epoch 11, batch 900, loss[loss=0.2438, simple_loss=0.2921, pruned_loss=0.09778, over 12883.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2826, pruned_loss=0.09645, over 2555350.51 frames. ], batch size: 36, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:31:10,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=187152.16666666666, ans=0.07 2024-06-20 10:31:13,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187170.5, ans=0.1 2024-06-20 10:31:16,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=187170.5, ans=0.125 2024-06-20 10:31:24,683 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.811e+02 1.950e+02 2.150e+02 2.737e+02, threshold=3.899e+02, percent-clipped=0.0 2024-06-20 10:31:31,517 INFO [train.py:1028] (1/2) Epoch 11, batch 950, loss[loss=0.2463, simple_loss=0.2948, pruned_loss=0.09894, over 12970.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.2833, pruned_loss=0.09691, over 2558687.49 frames. ], batch size: 39, lr: 5.44e-03, grad_scale: 128.0 2024-06-20 10:31:31,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=187225.5, ans=0.025 2024-06-20 10:31:32,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.28 vs. 
limit=15.0 2024-06-20 10:31:41,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.28 vs. limit=15.0 2024-06-20 10:31:53,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=187280.5, ans=0.0 2024-06-20 10:31:54,832 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.50 vs. limit=10.0 2024-06-20 10:31:58,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.77 vs. limit=15.0 2024-06-20 10:32:02,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=187298.83333333334, ans=0.1 2024-06-20 10:32:03,868 INFO [train.py:1028] (1/2) Epoch 11, batch 1000, loss[loss=0.2703, simple_loss=0.3064, pruned_loss=0.1171, over 13308.00 frames. ], tot_loss[loss=0.239, simple_loss=0.2835, pruned_loss=0.09727, over 2560953.57 frames. ], batch size: 49, lr: 5.44e-03, grad_scale: 128.0 2024-06-20 10:32:08,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=187317.16666666666, ans=0.0 2024-06-20 10:32:11,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.57 vs. limit=15.0 2024-06-20 10:32:14,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=187335.5, ans=0.1 2024-06-20 10:32:20,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=187353.83333333334, ans=0.0 2024-06-20 10:32:32,507 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.857e+02 2.015e+02 2.276e+02 3.043e+02, threshold=4.029e+02, percent-clipped=0.0 2024-06-20 10:32:38,407 INFO [train.py:1028] (1/2) Epoch 11, batch 1050, loss[loss=0.2281, simple_loss=0.2753, pruned_loss=0.09048, over 13204.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2841, pruned_loss=0.09735, over 2564141.16 frames. ], batch size: 77, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:32:43,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=187408.83333333334, ans=0.125 2024-06-20 10:32:56,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=187445.5, ans=0.0 2024-06-20 10:33:07,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=187482.16666666666, ans=0.125 2024-06-20 10:33:14,470 INFO [train.py:1028] (1/2) Epoch 11, batch 1100, loss[loss=0.2712, simple_loss=0.3175, pruned_loss=0.1125, over 13272.00 frames. ], tot_loss[loss=0.24, simple_loss=0.2847, pruned_loss=0.09763, over 2569750.93 frames. 
], batch size: 52, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:33:14,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=187500.5, ans=0.125 2024-06-20 10:33:23,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=187518.83333333334, ans=0.1 2024-06-20 10:33:25,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=187518.83333333334, ans=0.125 2024-06-20 10:33:25,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=187518.83333333334, ans=0.125 2024-06-20 10:33:25,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=187518.83333333334, ans=0.125 2024-06-20 10:33:25,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=187518.83333333334, ans=0.2 2024-06-20 10:33:40,740 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.772e+02 1.919e+02 2.075e+02 2.696e+02, threshold=3.838e+02, percent-clipped=0.0 2024-06-20 10:33:42,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. limit=6.0 2024-06-20 10:33:46,361 INFO [train.py:1028] (1/2) Epoch 11, batch 1150, loss[loss=0.2353, simple_loss=0.2858, pruned_loss=0.09246, over 13251.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.2848, pruned_loss=0.09774, over 2571637.93 frames. ], batch size: 52, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:33:48,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=187592.16666666666, ans=0.07 2024-06-20 10:33:49,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=187592.16666666666, ans=0.125 2024-06-20 10:33:52,449 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:33:58,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.30 vs. limit=22.5 2024-06-20 10:34:14,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=187665.5, ans=0.125 2024-06-20 10:34:18,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=187665.5, ans=0.2 2024-06-20 10:34:21,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=15.0 2024-06-20 10:34:21,254 INFO [train.py:1028] (1/2) Epoch 11, batch 1200, loss[loss=0.2588, simple_loss=0.3034, pruned_loss=0.1071, over 13144.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2841, pruned_loss=0.09773, over 2573817.04 frames. ], batch size: 77, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:34:27,556 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.38 vs. 
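Each optim.py WARNING above prints five grad-norm statistics followed by a threshold; the threshold is exactly Clipping_scale (2.0) times the middle statistic (e.g. 2 x 1.919e+02 = 3.838e+02 in the warning just above), and percent-clipped=0.0 means no recent update exceeded it. A hedged sketch of that kind of statistics-based clipping, assuming a sliding window of recent global norms; the optimizer's real bookkeeping is per-parameter-group and more involved:

```python
# Illustrative quantile-based gradient clipping: keep a window of recent
# global grad norms, set threshold = clipping_scale * median, and rescale
# any gradient whose norm exceeds it. Not the library's actual code.
import torch

class QuantileClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms = []  # recent global grad norms

    def clip_(self, params) -> float:
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms = (self.norms + [norm])[-self.window:]
        threshold = self.clipping_scale * float(torch.tensor(self.norms).median())
        if norm > threshold:
            for g in grads:
                g.mul_(threshold / norm)  # rescale in place, like clip_grad_norm_
        return norm
```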
limit=22.5 2024-06-20 10:34:28,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.25 vs. limit=15.0 2024-06-20 10:34:29,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=187702.16666666666, ans=0.025 2024-06-20 10:34:42,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=187738.83333333334, ans=0.125 2024-06-20 10:34:47,767 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.801e+02 1.947e+02 2.075e+02 2.851e+02, threshold=3.895e+02, percent-clipped=0.0 2024-06-20 10:34:51,317 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.43 vs. limit=15.0 2024-06-20 10:34:53,331 INFO [train.py:1028] (1/2) Epoch 11, batch 1250, loss[loss=0.2105, simple_loss=0.2565, pruned_loss=0.0823, over 13177.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.2836, pruned_loss=0.09735, over 2583516.80 frames. ], batch size: 112, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:34:54,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=187775.5, ans=0.125 2024-06-20 10:34:59,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=187793.83333333334, ans=0.0 2024-06-20 10:35:14,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=187830.5, ans=0.0 2024-06-20 10:35:25,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=187848.83333333334, ans=0.125 2024-06-20 10:35:28,013 INFO [train.py:1028] (1/2) Epoch 11, batch 1300, loss[loss=0.2556, simple_loss=0.2908, pruned_loss=0.1102, over 12838.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2841, pruned_loss=0.0972, over 2583222.77 frames. ], batch size: 177, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:35:33,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=187867.16666666666, ans=0.125 2024-06-20 10:35:34,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=187885.5, ans=0.1 2024-06-20 10:35:44,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=187903.83333333334, ans=0.0 2024-06-20 10:35:51,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=187922.16666666666, ans=0.1 2024-06-20 10:35:54,578 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.806e+02 1.912e+02 2.070e+02 2.814e+02, threshold=3.824e+02, percent-clipped=0.0 2024-06-20 10:36:00,761 INFO [train.py:1028] (1/2) Epoch 11, batch 1350, loss[loss=0.2484, simple_loss=0.2947, pruned_loss=0.101, over 13213.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.284, pruned_loss=0.09684, over 2585052.26 frames. 
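The three loss columns in every train.py line satisfy one fixed linear relation. Taking the batch 1300 entry above: 0.5 x 0.2908 + 0.1102 = 0.2556, matching the reported loss exactly, so loss = 0.5 * simple_loss + pruned_loss throughout this log. This is consistent with a pruned-transducer objective whose cheap "simple" alignment loss is down-weighted by 0.5 relative to the pruned full loss. A one-line check:

```python
# Verify the linear relation between the logged losses.
def combined_loss(simple_loss, pruned_loss, simple_scale=0.5):
    return simple_scale * simple_loss + pruned_loss

# batch 1300 entry above: 0.5 * 0.2908 + 0.1102 = 0.2556
assert abs(combined_loss(0.2908, 0.1102) - 0.2556) < 5e-4
# batch 900 entry earlier: 0.5 * 0.2921 + 0.09778 = 0.24383 ~ 0.2438
assert abs(combined_loss(0.2921, 0.09778) - 0.2438) < 5e-4
```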
], batch size: 59, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:36:02,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187958.83333333334, ans=0.1 2024-06-20 10:36:07,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=187977.16666666666, ans=0.2 2024-06-20 10:36:17,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=187995.5, ans=0.125 2024-06-20 10:36:17,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=187995.5, ans=0.0 2024-06-20 10:36:18,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=187995.5, ans=0.025 2024-06-20 10:36:21,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=187995.5, ans=0.125 2024-06-20 10:36:23,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=187995.5, ans=0.125 2024-06-20 10:36:25,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.37 vs. limit=10.0 2024-06-20 10:36:29,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=188013.83333333334, ans=0.025 2024-06-20 10:36:32,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188032.16666666666, ans=0.1 2024-06-20 10:36:37,685 INFO [train.py:1028] (1/2) Epoch 11, batch 1400, loss[loss=0.2609, simple_loss=0.2998, pruned_loss=0.111, over 12292.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.2843, pruned_loss=0.09718, over 2586452.96 frames. ], batch size: 25, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:36:48,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=188068.83333333334, ans=0.025 2024-06-20 10:36:49,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.96 vs. limit=15.0 2024-06-20 10:36:50,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=12.0 2024-06-20 10:37:04,416 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.835e+02 1.959e+02 2.201e+02 3.148e+02, threshold=3.917e+02, percent-clipped=0.0 2024-06-20 10:37:12,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188123.83333333334, ans=0.1 2024-06-20 10:37:13,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=188123.83333333334, ans=0.125 2024-06-20 10:37:15,159 INFO [train.py:1028] (1/2) Epoch 11, batch 1450, loss[loss=0.2241, simple_loss=0.2685, pruned_loss=0.0898, over 12994.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.2838, pruned_loss=0.09689, over 2586908.48 frames. 
], batch size: 121, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:37:15,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=188142.16666666666, ans=0.0 2024-06-20 10:37:27,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=188160.5, ans=0.125 2024-06-20 10:37:37,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=188197.16666666666, ans=0.0 2024-06-20 10:37:39,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2024-06-20 10:37:48,540 INFO [train.py:1028] (1/2) Epoch 11, batch 1500, loss[loss=0.2303, simple_loss=0.279, pruned_loss=0.09075, over 13208.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.2833, pruned_loss=0.09694, over 2589260.33 frames. ], batch size: 83, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:37:48,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=188233.83333333334, ans=0.125 2024-06-20 10:37:49,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=188233.83333333334, ans=0.125 2024-06-20 10:38:07,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=188288.83333333334, ans=0.125 2024-06-20 10:38:17,896 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.799e+02 1.888e+02 2.058e+02 2.888e+02, threshold=3.776e+02, percent-clipped=0.0 2024-06-20 10:38:18,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=188307.16666666666, ans=0.025 2024-06-20 10:38:21,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.65 vs. limit=15.0 2024-06-20 10:38:21,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188307.16666666666, ans=0.1 2024-06-20 10:38:23,765 INFO [train.py:1028] (1/2) Epoch 11, batch 1550, loss[loss=0.236, simple_loss=0.2794, pruned_loss=0.09626, over 12981.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.2838, pruned_loss=0.09719, over 2584951.55 frames. ], batch size: 102, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:38:24,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188325.5, ans=0.1 2024-06-20 10:38:29,380 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.02 vs. limit=22.5 2024-06-20 10:38:38,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=188362.16666666666, ans=0.0 2024-06-20 10:38:42,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=188380.5, ans=0.025 2024-06-20 10:38:53,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=22.35 vs. 
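The Whitening diagnostics compare a per-module metric against a limit (e.g. metric=22.35 vs. limit=15.0 just above); when the metric exceeds the limit, the module pushes its activations back toward whiteness. A plausible metric with exactly this behavior is the eigenvalue ratio E[lambda^2] / E[lambda]^2 of the feature covariance: it equals 1.0 for perfectly white (isotropic) features and grows as the covariance becomes ill-conditioned. A sketch under that assumed definition:

```python
# Whiteness metric sketch: ratio of the mean squared eigenvalue to the
# squared mean eigenvalue of the feature covariance, computed without an
# eigendecomposition via trace identities (C symmetric => trace(C @ C) == (C * C).sum()).
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations for one whitening group."""
    x = x - x.mean(dim=0)                  # center the features
    cov = (x.T @ x) / x.shape[0]           # (C, C) covariance estimate
    d = cov.shape[0]
    mean_eig = torch.diagonal(cov).mean()  # E[lambda]   = trace(C) / d
    mean_eig_sq = (cov * cov).sum() / d    # E[lambda^2] = trace(C @ C) / d
    return float(mean_eig_sq / (mean_eig ** 2 + 1e-20))
```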
limit=15.0 2024-06-20 10:38:55,959 INFO [train.py:1028] (1/2) Epoch 11, batch 1600, loss[loss=0.2227, simple_loss=0.2735, pruned_loss=0.08595, over 13162.00 frames. ], tot_loss[loss=0.239, simple_loss=0.2839, pruned_loss=0.09703, over 2581725.60 frames. ], batch size: 77, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:39:23,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.43 vs. limit=22.5 2024-06-20 10:39:24,746 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.840e+02 1.996e+02 2.166e+02 3.343e+02, threshold=3.991e+02, percent-clipped=0.0 2024-06-20 10:39:28,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2024-06-20 10:39:29,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=188490.5, ans=0.025 2024-06-20 10:39:30,678 INFO [train.py:1028] (1/2) Epoch 11, batch 1650, loss[loss=0.2428, simple_loss=0.2836, pruned_loss=0.101, over 13169.00 frames. ], tot_loss[loss=0.24, simple_loss=0.2844, pruned_loss=0.09774, over 2577016.73 frames. ], batch size: 95, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:39:30,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=188508.83333333334, ans=0.125 2024-06-20 10:39:48,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=188545.5, ans=15.0 2024-06-20 10:39:53,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.39 vs. limit=15.0 2024-06-20 10:40:03,389 INFO [train.py:1028] (1/2) Epoch 11, batch 1700, loss[loss=0.228, simple_loss=0.2777, pruned_loss=0.08916, over 12456.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2845, pruned_loss=0.09736, over 2581400.53 frames. ], batch size: 25, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:40:03,560 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:40:07,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.34 vs. limit=15.0 2024-06-20 10:40:11,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=188618.83333333334, ans=0.0 2024-06-20 10:40:16,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.20 vs. limit=6.0 2024-06-20 10:40:27,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=188655.5, ans=0.0 2024-06-20 10:40:27,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=188655.5, ans=0.025 2024-06-20 10:40:29,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.39 vs. 
limit=10.0 2024-06-20 10:40:34,641 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.768e+02 1.843e+02 2.059e+02 3.009e+02, threshold=3.686e+02, percent-clipped=0.0 2024-06-20 10:40:40,764 INFO [train.py:1028] (1/2) Epoch 11, batch 1750, loss[loss=0.2394, simple_loss=0.2918, pruned_loss=0.09352, over 12657.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.2842, pruned_loss=0.097, over 2582446.17 frames. ], batch size: 22, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:40:59,965 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.75 vs. limit=22.5 2024-06-20 10:41:01,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=188747.16666666666, ans=0.0 2024-06-20 10:41:08,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=188765.5, ans=0.0 2024-06-20 10:41:13,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=188765.5, ans=0.2 2024-06-20 10:41:15,738 INFO [train.py:1028] (1/2) Epoch 11, batch 1800, loss[loss=0.2278, simple_loss=0.2739, pruned_loss=0.09089, over 13243.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2848, pruned_loss=0.09773, over 2582295.59 frames. ], batch size: 67, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:41:15,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=188783.83333333334, ans=0.0 2024-06-20 10:41:20,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=188783.83333333334, ans=0.125 2024-06-20 10:41:22,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.94 vs. limit=15.0 2024-06-20 10:41:25,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=188802.16666666666, ans=0.125 2024-06-20 10:41:40,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.77 vs. limit=10.0 2024-06-20 10:41:42,668 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.882e+02 2.024e+02 2.262e+02 3.124e+02, threshold=4.047e+02, percent-clipped=0.0 2024-06-20 10:41:48,478 INFO [train.py:1028] (1/2) Epoch 11, batch 1850, loss[loss=0.2445, simple_loss=0.2858, pruned_loss=0.1016, over 13218.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2846, pruned_loss=0.09742, over 2583795.39 frames. ], batch size: 83, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:41:54,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=188875.5, ans=0.02 2024-06-20 10:41:58,029 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.57 vs. limit=15.0 2024-06-20 10:42:04,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.40 vs. 
limit=12.0 2024-06-20 10:42:07,314 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.58 vs. limit=15.0 2024-06-20 10:42:10,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.15 vs. limit=22.5 2024-06-20 10:42:11,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=188930.5, ans=0.2 2024-06-20 10:42:24,408 INFO [train.py:1028] (1/2) Epoch 11, batch 1900, loss[loss=0.2493, simple_loss=0.2888, pruned_loss=0.1049, over 13180.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2844, pruned_loss=0.09776, over 2586560.72 frames. ], batch size: 95, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:42:26,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=188967.16666666666, ans=0.125 2024-06-20 10:42:29,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.56 vs. limit=22.5 2024-06-20 10:42:30,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=188985.5, ans=0.2 2024-06-20 10:42:35,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=188985.5, ans=0.0 2024-06-20 10:42:44,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=15.0 2024-06-20 10:42:51,479 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.805e+02 1.918e+02 2.088e+02 2.984e+02, threshold=3.837e+02, percent-clipped=0.0 2024-06-20 10:42:52,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=189040.5, ans=0.125 2024-06-20 10:42:53,852 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2024-06-20 10:42:57,364 INFO [train.py:1028] (1/2) Epoch 11, batch 1950, loss[loss=0.2301, simple_loss=0.2824, pruned_loss=0.08894, over 13251.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2836, pruned_loss=0.09737, over 2592290.70 frames. ], batch size: 52, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:43:12,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=189095.5, ans=0.125 2024-06-20 10:43:18,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189095.5, ans=0.1 2024-06-20 10:43:29,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=189132.16666666666, ans=0.2 2024-06-20 10:43:32,160 INFO [train.py:1028] (1/2) Epoch 11, batch 2000, loss[loss=0.249, simple_loss=0.3036, pruned_loss=0.09722, over 12575.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2837, pruned_loss=0.09736, over 2588106.20 frames. 
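In each train.py line, `loss[...]` covers one batch (~13k frames here) while `tot_loss[...]` is reported over a nearly constant ~2.59M frames. That is the signature of an exponentially decayed, frame-weighted running average: with decay 1 - 1/200 and ~12.9k frames per batch, the steady-state frame count is 200 x 12.9k ~ 2.59M, matching the numbers above. A sketch under that assumed decay constant:

```python
# Running frame-weighted loss with exponential decay; the reported
# "tot_loss ... over N frames" corresponds to loss_sum / frames with both
# quantities decayed by the same factor each batch. The 1/200 decay is a
# guess that reproduces the ~2.59M steady-state frame count seen in the log.
class RunningLoss:
    def __init__(self, decay: float = 1.0 - 1.0 / 200):
        self.decay = decay
        self.loss_sum = 0.0  # decayed sum of per-frame losses
        self.frames = 0.0    # matching decayed frame count

    def update(self, batch_loss_sum: float, batch_frames: float) -> float:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
        self.frames = self.frames * self.decay + batch_frames
        return self.loss_sum / self.frames  # the tot_loss that gets logged
```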
], batch size: 22, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:43:36,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=189150.5, ans=0.125 2024-06-20 10:43:37,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=189150.5, ans=0.125 2024-06-20 10:43:38,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=189168.83333333334, ans=0.125 2024-06-20 10:43:43,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.12 vs. limit=22.5 2024-06-20 10:43:47,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=189187.16666666666, ans=0.2 2024-06-20 10:43:59,062 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.867e+02 1.988e+02 2.158e+02 2.974e+02, threshold=3.975e+02, percent-clipped=0.0 2024-06-20 10:43:59,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=189223.83333333334, ans=0.125 2024-06-20 10:44:00,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.58 vs. limit=15.0 2024-06-20 10:44:04,772 INFO [train.py:1028] (1/2) Epoch 11, batch 2050, loss[loss=0.2335, simple_loss=0.2802, pruned_loss=0.09341, over 12488.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.2844, pruned_loss=0.09795, over 2582332.69 frames. ], batch size: 29, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:44:12,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189260.5, ans=0.1 2024-06-20 10:44:19,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=189278.83333333334, ans=0.1 2024-06-20 10:44:28,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=189297.16666666666, ans=0.025 2024-06-20 10:44:30,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=189297.16666666666, ans=0.125 2024-06-20 10:44:34,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189315.5, ans=0.1 2024-06-20 10:44:36,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=189315.5, ans=0.0 2024-06-20 10:44:37,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=189315.5, ans=0.0 2024-06-20 10:44:38,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=189315.5, ans=0.0 2024-06-20 10:44:39,111 INFO [train.py:1028] (1/2) Epoch 11, batch 2100, loss[loss=0.2321, simple_loss=0.2832, pruned_loss=0.09051, over 13204.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2845, pruned_loss=0.09766, over 2584509.29 frames. 
], batch size: 59, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:44:39,480 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.03 vs. limit=6.0 2024-06-20 10:44:44,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=189333.83333333334, ans=0.2 2024-06-20 10:44:47,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=189352.16666666666, ans=0.1 2024-06-20 10:44:47,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=189352.16666666666, ans=0.125 2024-06-20 10:44:49,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=189352.16666666666, ans=0.125 2024-06-20 10:44:54,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=189370.5, ans=0.025 2024-06-20 10:44:55,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=189370.5, ans=0.1 2024-06-20 10:44:55,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.74 vs. limit=15.0 2024-06-20 10:45:08,789 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.802e+02 1.959e+02 2.149e+02 2.935e+02, threshold=3.918e+02, percent-clipped=0.0 2024-06-20 10:45:13,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.01 vs. limit=15.0 2024-06-20 10:45:13,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0 2024-06-20 10:45:14,702 INFO [train.py:1028] (1/2) Epoch 11, batch 2150, loss[loss=0.22, simple_loss=0.2686, pruned_loss=0.08565, over 13192.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2838, pruned_loss=0.09726, over 2588243.50 frames. ], batch size: 52, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:45:30,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=189462.16666666666, ans=0.125 2024-06-20 10:45:35,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=189480.5, ans=0.0 2024-06-20 10:45:47,660 INFO [train.py:1028] (1/2) Epoch 11, batch 2200, loss[loss=0.2485, simple_loss=0.2941, pruned_loss=0.1014, over 13232.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.284, pruned_loss=0.09742, over 2588400.09 frames. 
], batch size: 83, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:45:51,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=189517.16666666666, ans=15.0 2024-06-20 10:45:51,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=189517.16666666666, ans=0.04949747468305833 2024-06-20 10:46:05,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=189553.83333333334, ans=0.125 2024-06-20 10:46:14,552 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.823e+02 1.935e+02 2.111e+02 2.773e+02, threshold=3.870e+02, percent-clipped=0.0 2024-06-20 10:46:15,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.57 vs. limit=15.0 2024-06-20 10:46:20,326 INFO [train.py:1028] (1/2) Epoch 11, batch 2250, loss[loss=0.2392, simple_loss=0.2905, pruned_loss=0.09397, over 13236.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2839, pruned_loss=0.09762, over 2587045.34 frames. ], batch size: 63, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:46:35,453 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=15.0 2024-06-20 10:46:38,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.39 vs. limit=22.5 2024-06-20 10:46:41,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=189645.5, ans=0.0 2024-06-20 10:46:52,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.89 vs. limit=15.0 2024-06-20 10:46:56,466 INFO [train.py:1028] (1/2) Epoch 11, batch 2300, loss[loss=0.2729, simple_loss=0.3127, pruned_loss=0.1165, over 12857.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2845, pruned_loss=0.09752, over 2581472.01 frames. ], batch size: 33, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:47:21,689 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.17 vs. limit=15.0 2024-06-20 10:47:25,823 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.835e+02 1.966e+02 2.134e+02 3.036e+02, threshold=3.933e+02, percent-clipped=0.0 2024-06-20 10:47:25,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=189773.83333333334, ans=0.125 2024-06-20 10:47:26,241 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.14 vs. limit=15.0 2024-06-20 10:47:31,467 INFO [train.py:1028] (1/2) Epoch 11, batch 2350, loss[loss=0.2285, simple_loss=0.2742, pruned_loss=0.09137, over 13238.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2843, pruned_loss=0.09768, over 2585648.41 frames. ], batch size: 67, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:47:35,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. 
limit=15.0 2024-06-20 10:47:44,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.49 vs. limit=15.0 2024-06-20 10:47:48,057 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.52 vs. limit=15.0 2024-06-20 10:47:48,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=189828.83333333334, ans=0.125 2024-06-20 10:47:50,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=189847.16666666666, ans=0.0 2024-06-20 10:47:56,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.49 vs. limit=15.0 2024-06-20 10:47:57,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=189865.5, ans=0.125 2024-06-20 10:48:03,712 INFO [train.py:1028] (1/2) Epoch 11, batch 2400, loss[loss=0.2482, simple_loss=0.2954, pruned_loss=0.1005, over 13343.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2838, pruned_loss=0.09766, over 2588816.07 frames. ], batch size: 46, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:48:03,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=189883.83333333334, ans=0.0 2024-06-20 10:48:08,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189883.83333333334, ans=0.1 2024-06-20 10:48:08,930 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=8.757e+01 2024-06-20 10:48:09,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.74 vs. limit=15.0 2024-06-20 10:48:12,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=189902.16666666666, ans=0.0 2024-06-20 10:48:12,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=189902.16666666666, ans=0.0 2024-06-20 10:48:13,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=189902.16666666666, ans=0.125 2024-06-20 10:48:25,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=189938.83333333334, ans=0.025 2024-06-20 10:48:25,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.61 vs. 
limit=15.0 2024-06-20 10:48:29,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=189938.83333333334, ans=0.125 2024-06-20 10:48:33,119 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.825e+02 1.933e+02 2.100e+02 2.976e+02, threshold=3.866e+02, percent-clipped=0.0 2024-06-20 10:48:35,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189957.16666666666, ans=0.1 2024-06-20 10:48:36,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=189957.16666666666, ans=0.05 2024-06-20 10:48:39,019 INFO [train.py:1028] (1/2) Epoch 11, batch 2450, loss[loss=0.2276, simple_loss=0.2745, pruned_loss=0.09031, over 13291.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2832, pruned_loss=0.0978, over 2584834.38 frames. ], batch size: 63, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:48:59,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.01 vs. limit=22.5 2024-06-20 10:49:03,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.53 vs. limit=10.0 2024-06-20 10:49:14,486 INFO [train.py:1028] (1/2) Epoch 11, batch 2500, loss[loss=0.2271, simple_loss=0.2651, pruned_loss=0.0946, over 13174.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.2821, pruned_loss=0.09739, over 2586702.62 frames. ], batch size: 83, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:49:14,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=190067.16666666666, ans=0.125 2024-06-20 10:49:31,088 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.21 vs. limit=22.5 2024-06-20 10:49:39,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2024-06-20 10:49:39,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=190140.5, ans=0.035 2024-06-20 10:49:41,072 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.835e+02 2.008e+02 2.321e+02 3.089e+02, threshold=4.016e+02, percent-clipped=0.0 2024-06-20 10:49:43,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=190140.5, ans=0.125 2024-06-20 10:49:46,948 INFO [train.py:1028] (1/2) Epoch 11, batch 2550, loss[loss=0.2484, simple_loss=0.2999, pruned_loss=0.09847, over 12614.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.2816, pruned_loss=0.09706, over 2586638.90 frames. ], batch size: 22, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:49:47,959 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.54 vs. 
limit=22.5 2024-06-20 10:49:53,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=190177.16666666666, ans=0.125 2024-06-20 10:49:55,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=190177.16666666666, ans=0.2 2024-06-20 10:49:56,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=190177.16666666666, ans=0.0 2024-06-20 10:50:20,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=190232.16666666666, ans=0.0 2024-06-20 10:50:22,004 INFO [train.py:1028] (1/2) Epoch 11, batch 2600, loss[loss=0.2478, simple_loss=0.2986, pruned_loss=0.09851, over 13271.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.2801, pruned_loss=0.09656, over 2585934.48 frames. ], batch size: 52, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:50:36,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=190287.16666666666, ans=0.0 2024-06-20 10:50:51,019 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.769e+02 1.984e+02 2.157e+02 2.753e+02, threshold=3.968e+02, percent-clipped=0.0 2024-06-20 10:50:57,013 INFO [train.py:1028] (1/2) Epoch 11, batch 2650, loss[loss=0.2346, simple_loss=0.2714, pruned_loss=0.09893, over 13063.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.2783, pruned_loss=0.09566, over 2585855.93 frames. ], batch size: 144, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:50:59,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=190342.16666666666, ans=0.0 2024-06-20 10:50:59,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=190342.16666666666, ans=0.125 2024-06-20 10:51:01,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=190342.16666666666, ans=0.5 2024-06-20 10:51:03,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=190360.5, ans=0.2 2024-06-20 10:51:14,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=190378.83333333334, ans=0.125 2024-06-20 10:51:18,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.05 vs. limit=22.5 2024-06-20 10:51:20,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=190397.16666666666, ans=0.125 2024-06-20 10:51:24,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=190415.5, ans=0.125 2024-06-20 10:51:29,415 INFO [train.py:1028] (1/2) Epoch 11, batch 2700, loss[loss=0.2378, simple_loss=0.2743, pruned_loss=0.1006, over 13209.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.2769, pruned_loss=0.09534, over 2584553.80 frames. 
], batch size: 89, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:51:37,457 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=3.524e+00 2024-06-20 10:51:50,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=190488.83333333334, ans=0.2 2024-06-20 10:51:52,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=190488.83333333334, ans=0.025 2024-06-20 10:51:55,794 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.881e+02 2.051e+02 2.264e+02 3.364e+02, threshold=4.103e+02, percent-clipped=0.0 2024-06-20 10:51:57,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=190507.16666666666, ans=0.125 2024-06-20 10:51:57,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=190507.16666666666, ans=0.1 2024-06-20 10:52:00,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=190507.16666666666, ans=0.0 2024-06-20 10:52:01,910 INFO [train.py:1028] (1/2) Epoch 11, batch 2750, loss[loss=0.2397, simple_loss=0.2802, pruned_loss=0.09959, over 13205.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2756, pruned_loss=0.09442, over 2581119.17 frames. ], batch size: 43, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:52:07,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=190525.5, ans=0.0 2024-06-20 10:52:10,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=190525.5, ans=0.0 2024-06-20 10:52:15,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=190543.83333333334, ans=0.0 2024-06-20 10:52:15,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=190543.83333333334, ans=0.125 2024-06-20 10:52:15,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=190543.83333333334, ans=0.0 2024-06-20 10:52:27,468 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.65 vs. limit=6.0 2024-06-20 10:52:39,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=190598.83333333334, ans=0.125 2024-06-20 10:52:41,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=190598.83333333334, ans=0.125 2024-06-20 10:52:42,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=190617.16666666666, ans=0.125 2024-06-20 10:52:43,026 INFO [train.py:1028] (1/2) Epoch 11, batch 2800, loss[loss=0.2375, simple_loss=0.2748, pruned_loss=0.1001, over 10843.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2753, pruned_loss=0.0947, over 2579196.44 frames. 
], batch size: 305, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:52:48,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=190617.16666666666, ans=0.0 2024-06-20 10:52:48,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=190617.16666666666, ans=0.0 2024-06-20 10:52:48,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=190617.16666666666, ans=0.125 2024-06-20 10:52:52,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=12.0 2024-06-20 10:52:53,697 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:53:13,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=190690.5, ans=0.125 2024-06-20 10:53:14,453 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.802e+02 1.931e+02 2.136e+02 2.816e+02, threshold=3.861e+02, percent-clipped=0.0 2024-06-20 10:53:20,507 INFO [train.py:1028] (1/2) Epoch 11, batch 2850, loss[loss=0.2149, simple_loss=0.2662, pruned_loss=0.08182, over 13328.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.274, pruned_loss=0.09415, over 2576795.63 frames. ], batch size: 49, lr: 5.39e-03, grad_scale: 64.0 2024-06-20 10:53:24,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=190708.83333333334, ans=0.125 2024-06-20 10:53:26,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.79 vs. limit=22.5 2024-06-20 10:53:34,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=190745.5, ans=0.125 2024-06-20 10:53:35,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=190745.5, ans=0.5 2024-06-20 10:53:41,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2024-06-20 10:53:45,460 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:53:49,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=190782.16666666666, ans=0.025 2024-06-20 10:53:49,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.63 vs. limit=15.0 2024-06-20 10:53:52,338 INFO [train.py:1028] (1/2) Epoch 11, batch 2900, loss[loss=0.2113, simple_loss=0.2597, pruned_loss=0.08146, over 13105.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2722, pruned_loss=0.09322, over 2584780.59 frames. ], batch size: 55, lr: 5.39e-03, grad_scale: 64.0 2024-06-20 10:53:54,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.90 vs. 
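Batch sizes in this stretch swing from 22 (batch 2550, 12614 frames) to 305 (batch 2800, 10843 frames) while the frame totals stay comparable: duration-capped bucketed sampling fills each batch up to a fixed total duration, so a batch holds many short utterances or a few long ones. A toy packer illustrating the idea; the 550-second cap and the global sort stand in for the sampler's real bucketing logic and are assumptions, not read from any config:

```python
# Toy duration-capped batching: sort utterances so similar lengths land
# together (the "bucketing"), then greedily fill batches up to a duration cap.
def pack_by_duration(durations_sec, max_duration_sec=550.0):
    batches, current, total = [], [], 0.0
    for d in sorted(durations_sec):
        if current and total + d > max_duration_sec:
            batches.append(current)   # batch is full: start a new one
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches

# Short cuts -> large batches, long cuts -> small batches:
print([len(b) for b in pack_by_duration([2.0] * 400 + [25.0] * 40)])
# -> [275, 137, 22, 6]
```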
limit=15.0 2024-06-20 10:54:06,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=190818.83333333334, ans=0.125 2024-06-20 10:54:10,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=190837.16666666666, ans=0.025 2024-06-20 10:54:14,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=190855.5, ans=0.2 2024-06-20 10:54:15,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2024-06-20 10:54:20,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=190855.5, ans=0.2 2024-06-20 10:54:21,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190873.83333333334, ans=0.1 2024-06-20 10:54:21,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=190873.83333333334, ans=0.2 2024-06-20 10:54:22,527 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.741e+02 1.853e+02 2.002e+02 2.630e+02, threshold=3.707e+02, percent-clipped=0.0 2024-06-20 10:54:23,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190873.83333333334, ans=0.1 2024-06-20 10:54:25,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=190873.83333333334, ans=0.0 2024-06-20 10:54:26,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=190873.83333333334, ans=0.125 2024-06-20 10:54:28,373 INFO [train.py:1028] (1/2) Epoch 11, batch 2950, loss[loss=0.2309, simple_loss=0.2761, pruned_loss=0.09283, over 13253.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.272, pruned_loss=0.09313, over 2577937.55 frames. ], batch size: 43, lr: 5.39e-03, grad_scale: 64.0 2024-06-20 10:54:32,632 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=6.988e+00 2024-06-20 10:54:49,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=190928.83333333334, ans=0.125 2024-06-20 10:54:51,129 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:54:53,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=190947.16666666666, ans=0.2 2024-06-20 10:54:59,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.76 vs. limit=12.0 2024-06-20 10:55:05,018 INFO [train.py:1028] (1/2) Epoch 11, batch 3000, loss[loss=0.2512, simple_loss=0.2952, pruned_loss=0.1036, over 13231.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2711, pruned_loss=0.0924, over 2577378.41 frames. 
], batch size: 59, lr: 5.39e-03, grad_scale: 64.0 2024-06-20 10:55:05,019 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 10:55:10,513 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.0581, 3.5835, 4.0008, 3.7081], device='cuda:1') 2024-06-20 10:55:11,907 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.0180, 1.6165, 1.6365, 2.0783], device='cuda:1') 2024-06-20 10:55:12,925 INFO [train.py:1060] (1/2) Epoch 11, validation: loss=0.1958, simple_loss=0.2602, pruned_loss=0.06568, over 351949.00 frames. 2024-06-20 10:55:12,925 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 10:55:36,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=191038.83333333334, ans=0.1 2024-06-20 10:55:40,817 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.809e+02 1.927e+02 2.146e+02 2.811e+02, threshold=3.855e+02, percent-clipped=0.0 2024-06-20 10:55:46,904 INFO [train.py:1028] (1/2) Epoch 11, batch 3050, loss[loss=0.2017, simple_loss=0.2492, pruned_loss=0.07707, over 13267.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2707, pruned_loss=0.09267, over 2576413.79 frames. ], batch size: 46, lr: 5.39e-03, grad_scale: 128.0 2024-06-20 10:55:50,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.38 vs. limit=12.0 2024-06-20 10:55:52,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191093.83333333334, ans=0.125 2024-06-20 10:55:59,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=191093.83333333334, ans=0.125 2024-06-20 10:56:02,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=191112.16666666666, ans=0.0 2024-06-20 10:56:10,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=191130.5, ans=0.0 2024-06-20 10:56:17,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=191148.83333333334, ans=0.0 2024-06-20 10:56:23,237 INFO [train.py:1028] (1/2) Epoch 11, batch 3100, loss[loss=0.2504, simple_loss=0.2862, pruned_loss=0.1074, over 13016.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2702, pruned_loss=0.09258, over 2577555.00 frames. ], batch size: 144, lr: 5.39e-03, grad_scale: 128.0 2024-06-20 10:56:24,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. limit=5.0 2024-06-20 10:56:29,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=191185.5, ans=0.125 2024-06-20 10:56:30,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=191185.5, ans=0.2 2024-06-20 10:56:30,837 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.88 vs. limit=10.0 2024-06-20 10:56:47,019 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.25 vs. 
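During the validation pass above, zipformer.py prints the entropy of the attention weights for selected modules, one value per head (e.g. tensor([4.0581, 3.5835, 4.0008, 3.7081]) for a 4-head layer). Low entropy means a head concentrates on few frames; high entropy means diffuse attention. A sketch of that diagnostic, assuming the attention rows are softmax-normalized:

```python
# Mean per-head entropy (in nats) of an attention weight tensor, the
# quantity reported as attn_weights_entropy in the diagnostics above.
import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_heads, tgt_len, src_len), each row a probability distribution."""
    entropy = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (num_heads, tgt_len)
    return entropy.mean(dim=-1)                           # one value per head

# Uniform attention over 64 frames gives entropy log(64) ~ 4.16 per head,
# the same ballpark as the values printed above:
uniform = torch.full((4, 10, 64), 1.0 / 64)
print(attn_weights_entropy(uniform))
```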
limit=10.0 2024-06-20 10:56:48,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=191222.16666666666, ans=0.0 2024-06-20 10:56:53,349 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.796e+02 1.917e+02 2.100e+02 2.638e+02, threshold=3.834e+02, percent-clipped=0.0 2024-06-20 10:56:54,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.88 vs. limit=15.0 2024-06-20 10:56:58,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=191240.5, ans=0.2 2024-06-20 10:56:59,206 INFO [train.py:1028] (1/2) Epoch 11, batch 3150, loss[loss=0.2316, simple_loss=0.2667, pruned_loss=0.09826, over 12919.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2689, pruned_loss=0.09194, over 2579688.47 frames. ], batch size: 158, lr: 5.39e-03, grad_scale: 128.0 2024-06-20 10:57:04,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=191258.83333333334, ans=0.0 2024-06-20 10:57:08,870 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.409e+00 2024-06-20 10:57:11,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=191277.16666666666, ans=0.125 2024-06-20 10:57:11,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=191277.16666666666, ans=0.125 2024-06-20 10:57:12,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=191295.5, ans=0.0 2024-06-20 10:57:12,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=191295.5, ans=0.125 2024-06-20 10:57:15,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191295.5, ans=0.125 2024-06-20 10:57:22,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191313.83333333334, ans=0.1 2024-06-20 10:57:29,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=191332.16666666666, ans=0.2 2024-06-20 10:57:31,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=191350.5, ans=0.07 2024-06-20 10:57:32,415 INFO [train.py:1028] (1/2) Epoch 11, batch 3200, loss[loss=0.1885, simple_loss=0.2367, pruned_loss=0.07015, over 13157.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2681, pruned_loss=0.09145, over 2579574.69 frames. 
], batch size: 55, lr: 5.39e-03, grad_scale: 128.0 2024-06-20 10:57:40,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=191368.83333333334, ans=0.0 2024-06-20 10:57:46,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=191387.16666666666, ans=0.125 2024-06-20 10:57:49,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=191387.16666666666, ans=0.125 2024-06-20 10:57:57,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=191405.5, ans=0.0 2024-06-20 10:58:00,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.87 vs. limit=10.0 2024-06-20 10:58:01,495 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.791e+02 1.907e+02 2.153e+02 2.812e+02, threshold=3.815e+02, percent-clipped=0.0 2024-06-20 10:58:06,880 INFO [train.py:1028] (1/2) Epoch 11, batch 3250, loss[loss=0.2118, simple_loss=0.2507, pruned_loss=0.0864, over 13244.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2681, pruned_loss=0.09166, over 2584564.89 frames. ], batch size: 72, lr: 5.38e-03, grad_scale: 128.0 2024-06-20 10:58:08,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191442.16666666666, ans=0.1 2024-06-20 10:58:11,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=191442.16666666666, ans=0.2 2024-06-20 10:58:11,494 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.57 vs. limit=22.5 2024-06-20 10:58:25,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=191478.83333333334, ans=0.125 2024-06-20 10:58:26,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=191497.16666666666, ans=0.125 2024-06-20 10:58:31,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.50 vs. limit=15.0 2024-06-20 10:58:34,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=191515.5, ans=0.0 2024-06-20 10:58:44,315 INFO [train.py:1028] (1/2) Epoch 11, batch 3300, loss[loss=0.2347, simple_loss=0.2764, pruned_loss=0.09647, over 12726.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2671, pruned_loss=0.09081, over 2581747.11 frames. 
], batch size: 176, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 10:58:59,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191570.5, ans=0.1 2024-06-20 10:59:04,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=191588.83333333334, ans=0.125 2024-06-20 10:59:10,974 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.800e+02 1.946e+02 2.154e+02 3.177e+02, threshold=3.891e+02, percent-clipped=0.0 2024-06-20 10:59:14,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=191607.16666666666, ans=0.0 2024-06-20 10:59:15,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2024-06-20 10:59:16,116 INFO [train.py:1028] (1/2) Epoch 11, batch 3350, loss[loss=0.2461, simple_loss=0.279, pruned_loss=0.1066, over 12948.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2668, pruned_loss=0.0911, over 2576827.06 frames. ], batch size: 158, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 10:59:18,381 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2024-06-20 10:59:27,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=191643.83333333334, ans=0.125 2024-06-20 10:59:29,484 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.46 vs. limit=15.0 2024-06-20 10:59:30,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=191662.16666666666, ans=0.125 2024-06-20 10:59:38,625 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2024-06-20 10:59:41,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=191698.83333333334, ans=0.125 2024-06-20 10:59:43,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.06 vs. limit=10.0 2024-06-20 10:59:51,616 INFO [train.py:1028] (1/2) Epoch 11, batch 3400, loss[loss=0.2196, simple_loss=0.2752, pruned_loss=0.08198, over 12663.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2665, pruned_loss=0.09133, over 2574923.00 frames. ], batch size: 22, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 10:59:58,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.22 vs. limit=22.5 2024-06-20 10:59:59,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=191735.5, ans=0.0 2024-06-20 11:00:00,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.98 vs. limit=6.0 2024-06-20 11:00:00,956 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. 
limit=12.0 2024-06-20 11:00:05,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.02 vs. limit=15.0 2024-06-20 11:00:08,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=191753.83333333334, ans=0.0 2024-06-20 11:00:13,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=191772.16666666666, ans=0.125 2024-06-20 11:00:15,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=191772.16666666666, ans=0.0 2024-06-20 11:00:18,191 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:00:19,750 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.751e+02 1.900e+02 2.077e+02 2.634e+02, threshold=3.801e+02, percent-clipped=0.0 2024-06-20 11:00:28,441 INFO [train.py:1028] (1/2) Epoch 11, batch 3450, loss[loss=0.2379, simple_loss=0.2776, pruned_loss=0.09916, over 12720.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2653, pruned_loss=0.09061, over 2575361.05 frames. ], batch size: 176, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 11:00:39,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=191827.16666666666, ans=0.125 2024-06-20 11:00:39,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=191827.16666666666, ans=0.0 2024-06-20 11:00:41,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-20 11:00:44,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191845.5, ans=0.1 2024-06-20 11:00:47,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.32 vs. limit=15.0 2024-06-20 11:00:51,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=191863.83333333334, ans=0.04949747468305833 2024-06-20 11:00:56,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191882.16666666666, ans=0.1 2024-06-20 11:00:58,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=191882.16666666666, ans=0.125 2024-06-20 11:01:02,166 INFO [train.py:1028] (1/2) Epoch 11, batch 3500, loss[loss=0.2248, simple_loss=0.2637, pruned_loss=0.09298, over 12916.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2654, pruned_loss=0.09059, over 2574100.97 frames. ], batch size: 33, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 11:01:04,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.48 vs. 
limit=12.0 2024-06-20 11:01:07,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=191900.5, ans=0.0 2024-06-20 11:01:10,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=191918.83333333334, ans=0.0 2024-06-20 11:01:10,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=191918.83333333334, ans=0.125 2024-06-20 11:01:11,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=191918.83333333334, ans=0.0 2024-06-20 11:01:23,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=191955.5, ans=0.125 2024-06-20 11:01:24,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=191955.5, ans=0.5 2024-06-20 11:01:24,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=191955.5, ans=0.125 2024-06-20 11:01:26,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191955.5, ans=0.1 2024-06-20 11:01:27,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=191955.5, ans=0.0 2024-06-20 11:01:30,662 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.714e+02 1.844e+02 1.989e+02 2.720e+02, threshold=3.687e+02, percent-clipped=0.0 2024-06-20 11:01:36,090 INFO [train.py:1028] (1/2) Epoch 11, batch 3550, loss[loss=0.2113, simple_loss=0.2459, pruned_loss=0.08835, over 13149.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2643, pruned_loss=0.08995, over 2576888.02 frames. ], batch size: 95, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 11:01:44,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=192010.5, ans=0.125 2024-06-20 11:01:52,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=192028.83333333334, ans=0.125 2024-06-20 11:01:56,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.73 vs. limit=10.0 2024-06-20 11:01:56,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=192047.16666666666, ans=0.025 2024-06-20 11:01:56,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.66 vs. limit=15.0 2024-06-20 11:02:02,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=192047.16666666666, ans=0.025 2024-06-20 11:02:08,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=192065.5, ans=0.125 2024-06-20 11:02:12,391 INFO [train.py:1028] (1/2) Epoch 11, batch 3600, loss[loss=0.2182, simple_loss=0.2669, pruned_loss=0.08468, over 13281.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2642, pruned_loss=0.09019, over 2580146.62 frames. 
2024-06-20 11:02:17,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=192083.83333333334, ans=0.125
2024-06-20 11:02:27,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=192120.5, ans=0.125
2024-06-20 11:02:29,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=192120.5, ans=0.0
2024-06-20 11:02:41,084 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.65 vs. limit=22.5
2024-06-20 11:02:43,231 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.766e+02 1.959e+02 2.210e+02 3.050e+02, threshold=3.917e+02, percent-clipped=0.0
2024-06-20 11:02:45,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=192157.16666666666, ans=0.0
2024-06-20 11:02:48,507 INFO [train.py:1028] (1/2) Epoch 11, batch 3650, loss[loss=0.2091, simple_loss=0.2433, pruned_loss=0.08752, over 12997.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2639, pruned_loss=0.0897, over 2578501.10 frames. ], batch size: 102, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:02:52,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192175.5, ans=0.1
2024-06-20 11:03:00,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=192212.16666666666, ans=0.125
2024-06-20 11:03:02,586 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.63 vs. limit=15.0
2024-06-20 11:03:06,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=192212.16666666666, ans=0.125
2024-06-20 11:03:13,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.56 vs. limit=6.0
2024-06-20 11:03:14,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=192248.83333333334, ans=0.125
2024-06-20 11:03:17,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=192248.83333333334, ans=0.125
2024-06-20 11:03:21,537 INFO [train.py:1028] (1/2) Epoch 11, batch 3700, loss[loss=0.2261, simple_loss=0.2717, pruned_loss=0.09029, over 13221.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2631, pruned_loss=0.0892, over 2583643.80 frames. ], batch size: 72, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:03:32,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=12.0
2024-06-20 11:03:43,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.60 vs. limit=15.0
2024-06-20 11:03:44,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=192322.16666666666, ans=0.125
2024-06-20 11:03:47,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.33 vs. limit=22.5
2024-06-20 11:03:49,150 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.686e+02 1.812e+02 1.906e+02 2.838e+02, threshold=3.625e+02, percent-clipped=0.0
2024-06-20 11:03:54,489 INFO [train.py:1028] (1/2) Epoch 11, batch 3750, loss[loss=0.23, simple_loss=0.2646, pruned_loss=0.0977, over 12679.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2626, pruned_loss=0.08904, over 2585998.75 frames. ], batch size: 22, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:04:11,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=192395.5, ans=0.125
2024-06-20 11:04:11,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=192395.5, ans=0.0
2024-06-20 11:04:23,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=192413.83333333334, ans=0.125
2024-06-20 11:04:27,080 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.74 vs. limit=15.0
2024-06-20 11:04:31,830 INFO [train.py:1028] (1/2) Epoch 11, batch 3800, loss[loss=0.2181, simple_loss=0.2592, pruned_loss=0.0885, over 13200.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2629, pruned_loss=0.08916, over 2583765.06 frames. ], batch size: 83, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:04:37,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=192450.5, ans=0.125
2024-06-20 11:04:49,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=192487.16666666666, ans=0.2
2024-06-20 11:04:49,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=192487.16666666666, ans=0.2
2024-06-20 11:04:52,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0
2024-06-20 11:04:59,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=15.0
2024-06-20 11:05:00,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=192505.5, ans=0.125
2024-06-20 11:05:04,067 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.774e+02 1.905e+02 2.091e+02 2.703e+02, threshold=3.810e+02, percent-clipped=0.0
2024-06-20 11:05:09,710 INFO [train.py:1028] (1/2) Epoch 11, batch 3850, loss[loss=0.2148, simple_loss=0.2548, pruned_loss=0.08738, over 13035.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2628, pruned_loss=0.08919, over 2584176.45 frames. ], batch size: 144, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:05:15,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=192542.16666666666, ans=0.125
2024-06-20 11:05:39,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=192615.5, ans=0.125
2024-06-20 11:05:42,655 INFO [train.py:1028] (1/2) Epoch 11, batch 3900, loss[loss=0.2011, simple_loss=0.2435, pruned_loss=0.07934, over 13207.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2623, pruned_loss=0.08903, over 2586830.38 frames. ], batch size: 83, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:05:44,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=192633.83333333334, ans=0.1
2024-06-20 11:05:47,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.05 vs. limit=15.0
2024-06-20 11:05:52,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=192652.16666666666, ans=0.125
2024-06-20 11:05:54,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=192670.5, ans=0.125
2024-06-20 11:05:57,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=192670.5, ans=0.0
2024-06-20 11:05:58,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=192670.5, ans=0.125
2024-06-20 11:06:09,881 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.796e+02 1.976e+02 2.264e+02 3.821e+02, threshold=3.951e+02, percent-clipped=1.0
2024-06-20 11:06:10,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=192707.16666666666, ans=0.125
2024-06-20 11:06:16,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=192707.16666666666, ans=0.125
2024-06-20 11:06:16,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=192707.16666666666, ans=0.125
2024-06-20 11:06:18,606 INFO [train.py:1028] (1/2) Epoch 11, batch 3950, loss[loss=0.2345, simple_loss=0.266, pruned_loss=0.1015, over 13093.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2621, pruned_loss=0.08871, over 2589082.35 frames. ], batch size: 132, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:06:26,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=192743.83333333334, ans=0.125
2024-06-20 11:06:35,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192762.16666666666, ans=0.1
2024-06-20 11:06:35,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=192762.16666666666, ans=0.05
2024-06-20 11:06:37,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=192762.16666666666, ans=0.0
2024-06-20 11:06:41,147 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 11:06:54,439 INFO [train.py:1028] (1/2) Epoch 11, batch 4000, loss[loss=0.2298, simple_loss=0.2737, pruned_loss=0.09293, over 12913.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2613, pruned_loss=0.08845, over 2583252.80 frames. ], batch size: 39, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:06:57,625 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0
2024-06-20 11:06:58,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192817.16666666666, ans=0.1
2024-06-20 11:07:02,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=192835.5, ans=0.025
2024-06-20 11:07:05,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.47 vs. limit=22.5
2024-06-20 11:07:09,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=192853.83333333334, ans=0.0
2024-06-20 11:07:11,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=192853.83333333334, ans=0.125
2024-06-20 11:07:12,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192853.83333333334, ans=0.1
2024-06-20 11:07:15,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.97 vs. limit=10.0
2024-06-20 11:07:15,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=192872.16666666666, ans=0.035
2024-06-20 11:07:22,676 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.781e+02 1.904e+02 2.079e+02 2.447e+02, threshold=3.808e+02, percent-clipped=0.0
2024-06-20 11:07:22,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=192890.5, ans=0.125
2024-06-20 11:07:27,921 INFO [train.py:1028] (1/2) Epoch 11, batch 4050, loss[loss=0.233, simple_loss=0.2621, pruned_loss=0.1019, over 11080.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2614, pruned_loss=0.08882, over 2581635.02 frames. ], batch size: 304, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:07:37,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192927.16666666666, ans=0.1
2024-06-20 11:08:00,950 INFO [train.py:1028] (1/2) Epoch 11, batch 4100, loss[loss=0.2364, simple_loss=0.2685, pruned_loss=0.1022, over 13018.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2613, pruned_loss=0.08896, over 2578738.08 frames. ], batch size: 102, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:08:08,555 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.54 vs. limit=15.0
2024-06-20 11:08:31,794 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.776e+02 1.880e+02 2.082e+02 2.955e+02, threshold=3.761e+02, percent-clipped=0.0
2024-06-20 11:08:37,265 INFO [train.py:1028] (1/2) Epoch 11, batch 4150, loss[loss=0.2124, simple_loss=0.2548, pruned_loss=0.08497, over 13135.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2605, pruned_loss=0.08853, over 2575503.78 frames. ], batch size: 55, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:08:40,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=193092.16666666666, ans=0.1
2024-06-20 11:08:41,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=193092.16666666666, ans=0.2
2024-06-20 11:08:50,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193110.5, ans=0.1
2024-06-20 11:08:51,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=193110.5, ans=0.05
2024-06-20 11:08:53,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=193128.83333333334, ans=0.125
2024-06-20 11:08:56,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=193128.83333333334, ans=0.125
2024-06-20 11:09:13,161 INFO [train.py:1028] (1/2) Epoch 11, batch 4200, loss[loss=0.1976, simple_loss=0.2408, pruned_loss=0.07716, over 13131.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2598, pruned_loss=0.08833, over 2577588.42 frames. ], batch size: 103, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:09:17,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=22.5
2024-06-20 11:09:22,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.01 vs. limit=22.5
2024-06-20 11:09:34,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=193238.83333333334, ans=15.0
2024-06-20 11:09:35,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=193238.83333333334, ans=0.0
2024-06-20 11:09:41,024 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.714e+02 1.886e+02 2.127e+02 3.060e+02, threshold=3.772e+02, percent-clipped=0.0
2024-06-20 11:09:46,461 INFO [train.py:1028] (1/2) Epoch 11, batch 4250, loss[loss=0.1933, simple_loss=0.2447, pruned_loss=0.07096, over 13263.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2595, pruned_loss=0.08794, over 2580398.34 frames. ], batch size: 46, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:09:46,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=193275.5, ans=0.2
2024-06-20 11:09:52,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=193293.83333333334, ans=0.0
2024-06-20 11:09:53,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=193293.83333333334, ans=0.2
2024-06-20 11:09:53,266 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.70 vs. limit=12.0
2024-06-20 11:09:54,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193293.83333333334, ans=0.1
2024-06-20 11:10:10,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=193330.5, ans=0.125
2024-06-20 11:10:11,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=193348.83333333334, ans=0.125
2024-06-20 11:10:19,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=193348.83333333334, ans=0.07
2024-06-20 11:10:22,215 INFO [train.py:1028] (1/2) Epoch 11, batch 4300, loss[loss=0.225, simple_loss=0.2728, pruned_loss=0.08864, over 13216.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2587, pruned_loss=0.08766, over 2580127.52 frames. ], batch size: 59, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:10:29,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.28 vs. limit=22.5
2024-06-20 11:10:31,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=193385.5, ans=0.0
2024-06-20 11:10:34,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=193403.83333333334, ans=0.125
2024-06-20 11:10:46,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193422.16666666666, ans=0.1
2024-06-20 11:10:51,764 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.726e+02 1.861e+02 2.085e+02 2.586e+02, threshold=3.721e+02, percent-clipped=0.0
2024-06-20 11:10:54,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=193440.5, ans=0.1
2024-06-20 11:10:56,972 INFO [train.py:1028] (1/2) Epoch 11, batch 4350, loss[loss=0.2233, simple_loss=0.2716, pruned_loss=0.08745, over 13188.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2584, pruned_loss=0.08752, over 2585709.56 frames. ], batch size: 59, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:11:14,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.84 vs. limit=15.0
2024-06-20 11:11:20,145 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.29 vs. limit=6.0
2024-06-20 11:11:25,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=193532.16666666666, ans=0.0
2024-06-20 11:11:29,749 INFO [train.py:1028] (1/2) Epoch 11, batch 4400, loss[loss=0.21, simple_loss=0.2528, pruned_loss=0.08357, over 13257.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2582, pruned_loss=0.08754, over 2585888.62 frames. ], batch size: 83, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:11:57,195 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.735e+02 1.867e+02 2.033e+02 2.727e+02, threshold=3.735e+02, percent-clipped=0.0
2024-06-20 11:12:02,710 INFO [train.py:1028] (1/2) Epoch 11, batch 4450, loss[loss=0.2025, simple_loss=0.2436, pruned_loss=0.08066, over 13020.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2586, pruned_loss=0.08783, over 2580411.90 frames. ], batch size: 33, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:12:35,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=193715.5, ans=0.125
2024-06-20 11:12:37,569 INFO [train.py:1028] (1/2) Epoch 11, batch 4500, loss[loss=0.2245, simple_loss=0.2577, pruned_loss=0.09559, over 13244.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2574, pruned_loss=0.0872, over 2585053.93 frames. ], batch size: 89, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:12:37,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=193733.83333333334, ans=0.125
2024-06-20 11:12:42,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=193733.83333333334, ans=0.125
2024-06-20 11:12:50,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=193752.16666666666, ans=0.125
2024-06-20 11:12:50,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=193752.16666666666, ans=0.0
2024-06-20 11:12:56,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=193770.5, ans=0.125
2024-06-20 11:13:00,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=193770.5, ans=0.0
2024-06-20 11:13:02,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=193788.83333333334, ans=0.125
2024-06-20 11:13:09,474 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.709e+02 1.848e+02 1.984e+02 3.055e+02, threshold=3.695e+02, percent-clipped=0.0
2024-06-20 11:13:11,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=193807.16666666666, ans=0.2
2024-06-20 11:13:14,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=193825.5, ans=0.0
2024-06-20 11:13:14,761 INFO [train.py:1028] (1/2) Epoch 11, batch 4550, loss[loss=0.1908, simple_loss=0.2373, pruned_loss=0.07212, over 13220.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2575, pruned_loss=0.08727, over 2588637.36 frames. ], batch size: 52, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:13:17,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=193825.5, ans=0.0
2024-06-20 11:13:28,951 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 11:13:47,331 INFO [train.py:1028] (1/2) Epoch 11, batch 4600, loss[loss=0.2373, simple_loss=0.2715, pruned_loss=0.1015, over 12594.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2579, pruned_loss=0.08748, over 2585786.63 frames. ], batch size: 202, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:14:19,651 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.727e+02 1.828e+02 1.972e+02 2.584e+02, threshold=3.656e+02, percent-clipped=0.0
2024-06-20 11:14:21,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=193990.5, ans=0.07
2024-06-20 11:14:24,722 INFO [train.py:1028] (1/2) Epoch 11, batch 4650, loss[loss=0.2208, simple_loss=0.2553, pruned_loss=0.09312, over 13147.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2573, pruned_loss=0.0875, over 2587758.55 frames. ], batch size: 132, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:14:24,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=194008.83333333334, ans=0.04949747468305833
2024-06-20 11:14:31,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=194027.16666666666, ans=0.0
2024-06-20 11:14:32,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=194027.16666666666, ans=0.0
2024-06-20 11:14:41,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=194045.5, ans=0.0
2024-06-20 11:14:58,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=194082.16666666666, ans=0.125
2024-06-20 11:14:59,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0
2024-06-20 11:15:00,800 INFO [train.py:1028] (1/2) Epoch 11, batch 4700, loss[loss=0.2122, simple_loss=0.261, pruned_loss=0.08173, over 12467.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2578, pruned_loss=0.08782, over 2583189.01 frames. ], batch size: 25, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:15:04,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=194100.5, ans=0.0
2024-06-20 11:15:07,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=194118.83333333334, ans=0.2
2024-06-20 11:15:25,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=194155.5, ans=0.035
2024-06-20 11:15:28,314 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.745e+02 1.870e+02 2.132e+02 2.626e+02, threshold=3.741e+02, percent-clipped=0.0
2024-06-20 11:15:29,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=194173.83333333334, ans=0.125
2024-06-20 11:15:31,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=194173.83333333334, ans=0.0
2024-06-20 11:15:32,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.92 vs. limit=15.0
2024-06-20 11:15:33,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=194192.16666666666, ans=0.125
2024-06-20 11:15:33,805 INFO [train.py:1028] (1/2) Epoch 11, batch 4750, loss[loss=0.2486, simple_loss=0.2804, pruned_loss=0.1084, over 12628.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2571, pruned_loss=0.08736, over 2582155.03 frames. ], batch size: 204, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:15:37,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=194192.16666666666, ans=0.125
2024-06-20 11:15:53,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=194247.16666666666, ans=0.125
2024-06-20 11:15:54,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=194247.16666666666, ans=0.125
2024-06-20 11:15:54,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=194247.16666666666, ans=0.125
2024-06-20 11:16:00,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=194265.5, ans=0.0
2024-06-20 11:16:01,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194265.5, ans=0.1
2024-06-20 11:16:02,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=194265.5, ans=0.2
2024-06-20 11:16:03,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194265.5, ans=0.1
2024-06-20 11:16:07,228 INFO [train.py:1028] (1/2) Epoch 11, batch 4800, loss[loss=0.2291, simple_loss=0.2682, pruned_loss=0.095, over 13329.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2568, pruned_loss=0.08697, over 2579314.02 frames. ], batch size: 63, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:16:20,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.94 vs. limit=15.0
2024-06-20 11:16:27,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=194320.5, ans=0.0
2024-06-20 11:16:38,821 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.753e+02 1.925e+02 2.184e+02 3.304e+02, threshold=3.850e+02, percent-clipped=0.0
2024-06-20 11:16:45,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.42 vs. limit=22.5
2024-06-20 11:16:47,540 INFO [train.py:1028] (1/2) Epoch 11, batch 4850, loss[loss=0.2184, simple_loss=0.2537, pruned_loss=0.09161, over 13253.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2569, pruned_loss=0.08682, over 2575966.09 frames. ], batch size: 89, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:16:56,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=194393.83333333334, ans=0.125
2024-06-20 11:16:56,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194393.83333333334, ans=0.1
2024-06-20 11:17:03,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=194412.16666666666, ans=0.125
2024-06-20 11:17:16,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=194448.83333333334, ans=0.2
2024-06-20 11:17:21,270 INFO [train.py:1028] (1/2) Epoch 11, batch 4900, loss[loss=0.188, simple_loss=0.2347, pruned_loss=0.07061, over 13168.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2571, pruned_loss=0.08697, over 2577442.60 frames. ], batch size: 59, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:17:22,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=194467.16666666666, ans=0.1
2024-06-20 11:17:25,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=194467.16666666666, ans=0.125
2024-06-20 11:17:25,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=194467.16666666666, ans=0.125
2024-06-20 11:17:27,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=194485.5, ans=0.125
2024-06-20 11:17:30,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=194485.5, ans=0.025
2024-06-20 11:17:33,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=194485.5, ans=0.0
2024-06-20 11:17:34,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=194503.83333333334, ans=0.125
2024-06-20 11:17:38,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=194503.83333333334, ans=0.2
2024-06-20 11:17:49,010 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.737e+02 1.830e+02 2.030e+02 3.131e+02, threshold=3.660e+02, percent-clipped=0.0
2024-06-20 11:17:49,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=194540.5, ans=0.07
2024-06-20 11:17:50,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=194540.5, ans=0.0
2024-06-20 11:17:54,459 INFO [train.py:1028] (1/2) Epoch 11, batch 4950, loss[loss=0.2439, simple_loss=0.2715, pruned_loss=0.1081, over 10989.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2571, pruned_loss=0.08733, over 2570546.96 frames. ], batch size: 304, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:18:02,629 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.92 vs. limit=15.0
2024-06-20 11:18:06,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=194577.16666666666, ans=0.125
2024-06-20 11:18:17,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=194613.83333333334, ans=0.0
2024-06-20 11:18:19,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=194613.83333333334, ans=0.0
2024-06-20 11:18:24,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=194632.16666666666, ans=0.125
2024-06-20 11:18:30,311 INFO [train.py:1028] (1/2) Epoch 11, batch 5000, loss[loss=0.2174, simple_loss=0.2599, pruned_loss=0.08748, over 13183.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2565, pruned_loss=0.0869, over 2575853.10 frames. ], batch size: 95, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:19:44,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=194833.83333333334, ans=0.125
2024-06-20 11:19:44,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=194833.83333333334, ans=0.0
2024-06-20 11:19:44,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0
2024-06-20 11:19:59,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=194888.83333333334, ans=0.2
2024-06-20 11:20:08,174 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.791e+02 1.947e+02 2.161e+02 2.735e+02, threshold=3.895e+02, percent-clipped=0.0
2024-06-20 11:20:15,959 INFO [train.py:1028] (1/2) Epoch 11, batch 5150, loss[loss=0.2179, simple_loss=0.2514, pruned_loss=0.09215, over 13112.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2563, pruned_loss=0.08721, over 2572132.65 frames. ], batch size: 132, lr: 5.34e-03, grad_scale: 32.0
2024-06-20 11:20:35,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=194962.16666666666, ans=0.0
2024-06-20 11:20:36,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=194962.16666666666, ans=0.125
2024-06-20 11:20:46,294 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.04 vs. limit=15.0
2024-06-20 11:20:51,432 INFO [train.py:1028] (1/2) Epoch 11, batch 5200, loss[loss=0.2306, simple_loss=0.2659, pruned_loss=0.09765, over 13146.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2563, pruned_loss=0.08704, over 2575348.62 frames. ], batch size: 95, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:20:56,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=195017.16666666666, ans=0.0
2024-06-20 11:20:58,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=195035.5, ans=0.07
2024-06-20 11:21:20,016 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.726e+02 1.910e+02 2.059e+02 2.832e+02, threshold=3.820e+02, percent-clipped=0.0
2024-06-20 11:21:24,650 INFO [train.py:1028] (1/2) Epoch 11, batch 5250, loss[loss=0.2304, simple_loss=0.2709, pruned_loss=0.09494, over 13251.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2572, pruned_loss=0.08754, over 2570347.56 frames. ], batch size: 52, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:21:24,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=195108.83333333334, ans=0.125
2024-06-20 11:21:29,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=195108.83333333334, ans=0.2
2024-06-20 11:21:51,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=195182.16666666666, ans=0.025
2024-06-20 11:21:55,495 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.20 vs. limit=22.5
2024-06-20 11:21:55,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=195182.16666666666, ans=0.0
2024-06-20 11:21:57,626 INFO [train.py:1028] (1/2) Epoch 11, batch 5300, loss[loss=0.2273, simple_loss=0.2678, pruned_loss=0.09341, over 13034.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2569, pruned_loss=0.08732, over 2567207.07 frames. ], batch size: 144, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:21:59,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=195200.5, ans=0.0
2024-06-20 11:22:12,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=195218.83333333334, ans=0.0
2024-06-20 11:22:13,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=195237.16666666666, ans=0.0
2024-06-20 11:22:15,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195237.16666666666, ans=0.1
2024-06-20 11:22:22,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=195255.5, ans=0.125
2024-06-20 11:22:22,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=195255.5, ans=0.1
2024-06-20 11:22:32,041 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.753e+02 1.866e+02 2.076e+02 2.891e+02, threshold=3.732e+02, percent-clipped=0.0
2024-06-20 11:22:33,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=195273.83333333334, ans=0.0
2024-06-20 11:22:37,174 INFO [train.py:1028] (1/2) Epoch 11, batch 5350, loss[loss=0.2172, simple_loss=0.2718, pruned_loss=0.0813, over 11250.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2565, pruned_loss=0.08719, over 2573777.21 frames. ], batch size: 16, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:22:38,919 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0
2024-06-20 11:22:39,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=195292.16666666666, ans=0.0
2024-06-20 11:22:41,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=195292.16666666666, ans=0.2
2024-06-20 11:22:42,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=195292.16666666666, ans=0.0
2024-06-20 11:22:46,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=195310.5, ans=0.125
2024-06-20 11:22:58,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=15.0
2024-06-20 11:23:00,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195347.16666666666, ans=0.1
2024-06-20 11:23:07,396 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.41 vs. limit=15.0
2024-06-20 11:23:08,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=195365.5, ans=0.125
2024-06-20 11:23:09,540 INFO [train.py:1028] (1/2) Epoch 11, batch 5400, loss[loss=0.2608, simple_loss=0.2866, pruned_loss=0.1176, over 12232.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.257, pruned_loss=0.08763, over 2566702.18 frames. ], batch size: 240, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:23:09,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=195383.83333333334, ans=0.125
2024-06-20 11:23:18,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=195402.16666666666, ans=0.125
2024-06-20 11:23:23,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=195420.5, ans=0.125
2024-06-20 11:23:28,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=195420.5, ans=0.125
2024-06-20 11:23:32,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=195438.83333333334, ans=0.0
2024-06-20 11:23:33,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=195438.83333333334, ans=0.1
2024-06-20 11:23:34,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=195438.83333333334, ans=0.1
2024-06-20 11:23:36,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=195457.16666666666, ans=0.0
2024-06-20 11:23:38,564 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.795e+02 1.962e+02 2.264e+02 3.593e+02, threshold=3.924e+02, percent-clipped=0.0
2024-06-20 11:23:43,063 INFO [train.py:1028] (1/2) Epoch 11, batch 5450, loss[loss=0.2485, simple_loss=0.2746, pruned_loss=0.1112, over 12460.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2571, pruned_loss=0.08775, over 2569732.98 frames. ], batch size: 25, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:23:43,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=195475.5, ans=0.125
2024-06-20 11:23:49,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=195493.83333333334, ans=0.1
2024-06-20 11:23:51,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=195493.83333333334, ans=0.125
2024-06-20 11:23:54,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195493.83333333334, ans=0.1
2024-06-20 11:23:57,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.60 vs. limit=22.5
2024-06-20 11:24:03,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=195512.16666666666, ans=0.125
2024-06-20 11:24:10,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=195530.5, ans=0.125
2024-06-20 11:24:18,844 INFO [train.py:1028] (1/2) Epoch 11, batch 5500, loss[loss=0.238, simple_loss=0.2677, pruned_loss=0.1041, over 12253.00 frames. ], tot_loss[loss=0.216, simple_loss=0.257, pruned_loss=0.08747, over 2563415.67 frames. ], batch size: 240, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:24:19,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=195567.16666666666, ans=0.025
2024-06-20 11:24:24,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=195567.16666666666, ans=0.125
2024-06-20 11:24:35,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=195603.83333333334, ans=0.2
2024-06-20 11:24:45,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten.whitening_limit, batch_count=195622.16666666666, ans=22.5
2024-06-20 11:24:48,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=195640.5, ans=0.0
2024-06-20 11:24:50,702 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.743e+02 1.905e+02 2.292e+02 3.353e+02, threshold=3.809e+02, percent-clipped=0.0
2024-06-20 11:24:50,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195640.5, ans=0.1
2024-06-20 11:24:55,250 INFO [train.py:1028] (1/2) Epoch 11, batch 5550, loss[loss=0.2116, simple_loss=0.255, pruned_loss=0.08411, over 13221.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2562, pruned_loss=0.08674, over 2568179.90 frames. ], batch size: 43, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:24:55,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=195658.83333333334, ans=0.125
2024-06-20 11:24:58,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.98 vs. limit=10.0
2024-06-20 11:25:02,201 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.97 vs. limit=15.0
2024-06-20 11:25:03,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.89 vs. limit=15.0
2024-06-20 11:25:06,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.89 vs. limit=15.0
2024-06-20 11:25:12,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=195695.5, ans=0.125
2024-06-20 11:25:14,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=195713.83333333334, ans=0.125
2024-06-20 11:25:23,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=195732.16666666666, ans=0.0
2024-06-20 11:25:26,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=195732.16666666666, ans=0.035
2024-06-20 11:25:26,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=195750.5, ans=0.0
2024-06-20 11:25:27,390 INFO [train.py:1028] (1/2) Epoch 11, batch 5600, loss[loss=0.211, simple_loss=0.256, pruned_loss=0.08299, over 13247.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2556, pruned_loss=0.08648, over 2569271.20 frames. ], batch size: 89, lr: 5.32e-03, grad_scale: 32.0
2024-06-20 11:25:28,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=195750.5, ans=0.2
2024-06-20 11:25:28,380 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5
2024-06-20 11:25:31,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=195750.5, ans=0.2
2024-06-20 11:25:39,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=195768.83333333334, ans=0.0
2024-06-20 11:25:42,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=195787.16666666666, ans=0.07
2024-06-20 11:25:47,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=195805.5, ans=0.125
2024-06-20 11:25:48,250 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 11:25:51,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=195805.5, ans=0.0
2024-06-20 11:25:55,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=195823.83333333334, ans=0.025
2024-06-20 11:25:56,182 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.836e+02 1.971e+02 2.207e+02 3.418e+02, threshold=3.942e+02, percent-clipped=0.0
2024-06-20 11:26:00,607 INFO [train.py:1028] (1/2) Epoch 11, batch 5650, loss[loss=0.2411, simple_loss=0.2729, pruned_loss=0.1046, over 12495.00 frames.
], tot_loss[loss=0.2142, simple_loss=0.2555, pruned_loss=0.08646, over 2574644.31 frames. ], batch size: 202, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:26:00,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=195842.16666666666, ans=0.125 2024-06-20 11:26:01,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.79 vs. limit=15.0 2024-06-20 11:26:08,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195842.16666666666, ans=0.1 2024-06-20 11:26:10,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195860.5, ans=0.1 2024-06-20 11:26:15,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=195860.5, ans=0.0 2024-06-20 11:26:15,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=195860.5, ans=0.025 2024-06-20 11:26:16,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.14 vs. limit=15.0 2024-06-20 11:26:18,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=195878.83333333334, ans=0.125 2024-06-20 11:26:26,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.60 vs. limit=15.0 2024-06-20 11:26:27,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=195897.16666666666, ans=0.0 2024-06-20 11:26:39,568 INFO [train.py:1028] (1/2) Epoch 11, batch 5700, loss[loss=0.2157, simple_loss=0.2641, pruned_loss=0.08362, over 13259.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.255, pruned_loss=0.08612, over 2578427.26 frames. ], batch size: 63, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:27:04,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=196007.16666666666, ans=0.0 2024-06-20 11:27:05,002 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.01 vs. limit=15.0 2024-06-20 11:27:06,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=196007.16666666666, ans=0.125 2024-06-20 11:27:07,241 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.747e+02 1.894e+02 2.154e+02 3.380e+02, threshold=3.789e+02, percent-clipped=0.0 2024-06-20 11:27:09,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=196007.16666666666, ans=0.0 2024-06-20 11:27:11,592 INFO [train.py:1028] (1/2) Epoch 11, batch 5750, loss[loss=0.21, simple_loss=0.2487, pruned_loss=0.08561, over 12794.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2558, pruned_loss=0.08644, over 2579403.92 frames. 
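], batch size: 176, lr: 5.32e-03, grad_scale: 32.0

The "lr:" column decays only slowly here (5.33e-03 at batch 5300, 5.28e-03 by batch 7450) because the scheduler is an Eden-style schedule that is polynomial in both the optimizer step count and the epoch index. A hedged sketch of that formula follows; the hyperparameter values and the step accounting are assumptions, and icefall's optim.py holds the authoritative version:

```python
def eden_lr(base_lr: float, step: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Both factors decay roughly like x**-0.5 once x is far past its "knee".
    step_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * step_factor * epoch_factor

# With a step count near 97.5k and epoch index 11 (plausible for this point
# in the run), the sketch lands on the logged magnitude:
print(eden_lr(0.035, 97_500, 11.0))  # ~5.3e-03, matching the "lr:" column
```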
2024-06-20 11:27:12,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=196025.5, ans=0.125 2024-06-20 11:27:16,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=196025.5, ans=0.125 2024-06-20 11:27:18,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=196043.83333333334, ans=0.125 2024-06-20 11:27:23,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=196043.83333333334, ans=0.125 2024-06-20 11:27:29,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=196062.16666666666, ans=0.0 2024-06-20 11:27:40,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.93 vs. limit=22.5 2024-06-20 11:27:44,491 INFO [train.py:1028] (1/2) Epoch 11, batch 5800, loss[loss=0.2071, simple_loss=0.2447, pruned_loss=0.08473, over 12818.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2569, pruned_loss=0.08721, over 2578661.49 frames. ], batch size: 176, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:27:53,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=15.0 2024-06-20 11:28:05,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196153.83333333334, ans=0.1 2024-06-20 11:28:19,611 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.768e+02 1.870e+02 1.994e+02 2.604e+02, threshold=3.740e+02, percent-clipped=0.0 2024-06-20 11:28:23,445 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.69 vs. limit=15.0 2024-06-20 11:28:24,328 INFO [train.py:1028] (1/2) Epoch 11, batch 5850, loss[loss=0.2557, simple_loss=0.2869, pruned_loss=0.1122, over 12537.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2589, pruned_loss=0.08801, over 2576772.35 frames. ], batch size: 202, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:28:26,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=196208.83333333334, ans=0.125 2024-06-20 11:28:32,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=196227.16666666666, ans=0.0 2024-06-20 11:28:41,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=196245.5, ans=0.2 2024-06-20 11:28:43,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.55 vs.
limit=15.0 2024-06-20 11:28:47,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=196263.83333333334, ans=0.2 2024-06-20 11:28:50,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=196282.16666666666, ans=0.125 2024-06-20 11:28:55,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.32 vs. limit=15.0 2024-06-20 11:28:57,305 INFO [train.py:1028] (1/2) Epoch 11, batch 5900, loss[loss=0.2192, simple_loss=0.2569, pruned_loss=0.09077, over 13057.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2604, pruned_loss=0.08855, over 2577495.58 frames. ], batch size: 121, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:29:08,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.02 vs. limit=15.0 2024-06-20 11:29:22,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=196373.83333333334, ans=0.0 2024-06-20 11:29:25,558 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.771e+02 1.959e+02 2.095e+02 3.057e+02, threshold=3.919e+02, percent-clipped=0.0 2024-06-20 11:29:30,210 INFO [train.py:1028] (1/2) Epoch 11, batch 5950, loss[loss=0.2171, simple_loss=0.2536, pruned_loss=0.09028, over 13143.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2628, pruned_loss=0.08969, over 2582303.60 frames. ], batch size: 121, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:29:39,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=196410.5, ans=0.0 2024-06-20 11:29:53,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=196447.16666666666, ans=0.1 2024-06-20 11:29:53,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.48 vs. limit=15.0 2024-06-20 11:30:05,971 INFO [train.py:1028] (1/2) Epoch 11, batch 6000, loss[loss=0.256, simple_loss=0.2871, pruned_loss=0.1124, over 12185.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2643, pruned_loss=0.09038, over 2574563.48 frames. ], batch size: 241, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:30:05,972 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 11:30:14,267 INFO [train.py:1060] (1/2) Epoch 11, validation: loss=0.1963, simple_loss=0.2602, pruned_loss=0.06621, over 351949.00 frames. 
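The three validation numbers just reported are internally consistent: the logged "loss" is evidently a weighted combination of the simple loss (a trivial-joiner RNN-T loss that also supplies pruning bounds) and the pruned loss (the full joiner evaluated only inside those bounds), with weights consistent with 0.5 * simple + 1.0 * pruned, since 0.5 * 0.2602 + 0.06621 ≈ 0.1963. A minimal sketch of that combination, with the 0.5 weight inferred from the arithmetic rather than taken from the recipe:

```python
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5,
                   pruned_loss_scale: float = 1.0) -> float:
    # Post-warmup combination; during warmup the scales are typically ramped.
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

# Validation entry above: loss=0.1963, simple_loss=0.2602, pruned_loss=0.06621
assert abs(combine_losses(0.2602, 0.06621) - 0.1963) < 1e-3
```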
2024-06-20 11:30:14,267 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 11:30:16,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=196483.83333333334, ans=0.125 2024-06-20 11:30:17,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=196483.83333333334, ans=0.125 2024-06-20 11:30:25,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=196502.16666666666, ans=0.125 2024-06-20 11:30:34,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=196520.5, ans=0.0 2024-06-20 11:30:39,019 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=15.0 2024-06-20 11:30:45,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=196557.16666666666, ans=0.1 2024-06-20 11:30:48,455 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.815e+02 1.948e+02 2.204e+02 3.333e+02, threshold=3.897e+02, percent-clipped=0.0 2024-06-20 11:30:49,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=196557.16666666666, ans=0.2 2024-06-20 11:30:51,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=196557.16666666666, ans=0.125 2024-06-20 11:30:53,431 INFO [train.py:1028] (1/2) Epoch 11, batch 6050, loss[loss=0.2068, simple_loss=0.2532, pruned_loss=0.08015, over 12946.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2656, pruned_loss=0.09113, over 2577601.04 frames. ], batch size: 39, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:30:58,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=196575.5, ans=0.125 2024-06-20 11:31:01,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=196593.83333333334, ans=0.125 2024-06-20 11:31:04,116 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:31:04,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=196593.83333333334, ans=0.0 2024-06-20 11:31:09,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=196612.16666666666, ans=0.125 2024-06-20 11:31:11,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.76 vs. limit=15.0 2024-06-20 11:31:13,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=196612.16666666666, ans=0.1 2024-06-20 11:31:24,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.52 vs. 
limit=22.5 2024-06-20 11:31:25,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.75 vs. limit=15.0 2024-06-20 11:31:28,052 INFO [train.py:1028] (1/2) Epoch 11, batch 6100, loss[loss=0.2161, simple_loss=0.252, pruned_loss=0.09008, over 13119.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2673, pruned_loss=0.09167, over 2579831.07 frames. ], batch size: 121, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:31:31,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=196667.16666666666, ans=0.125 2024-06-20 11:31:32,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.34 vs. limit=10.0 2024-06-20 11:31:33,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=196667.16666666666, ans=0.125 2024-06-20 11:31:37,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.15 vs. limit=22.5 2024-06-20 11:31:40,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=196685.5, ans=0.0 2024-06-20 11:31:52,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=12.0 2024-06-20 11:31:52,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.87 vs. limit=22.5 2024-06-20 11:31:57,687 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.778e+02 1.877e+02 2.160e+02 3.097e+02, threshold=3.754e+02, percent-clipped=0.0 2024-06-20 11:31:59,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=196740.5, ans=0.025 2024-06-20 11:32:02,551 INFO [train.py:1028] (1/2) Epoch 11, batch 6150, loss[loss=0.2325, simple_loss=0.2622, pruned_loss=0.1014, over 10859.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2698, pruned_loss=0.09306, over 2577471.62 frames. ], batch size: 304, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:32:03,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.80 vs. limit=10.0 2024-06-20 11:32:07,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.37 vs. limit=15.0 2024-06-20 11:32:17,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=196777.16666666666, ans=0.0 2024-06-20 11:32:37,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=196832.16666666666, ans=0.125 2024-06-20 11:32:40,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=196832.16666666666, ans=0.125 2024-06-20 11:32:42,690 INFO [train.py:1028] (1/2) Epoch 11, batch 6200, loss[loss=0.2419, simple_loss=0.2925, pruned_loss=0.09562, over 13274.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.271, pruned_loss=0.09358, over 2574563.22 frames. 
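], batch size: 89, lr: 5.31e-03, grad_scale: 32.0

The "batch size" field swings widely in this section (16 cuts at batch 5350, 304 at batch 6150) even though the frame counts per batch stay comparable: batches are packed to a fixed total duration rather than a fixed number of utterances, so buckets of long cuts yield few cuts per batch. A hedged usage sketch of duration-based bucketing with lhotse follows; the path and argument values are illustrative, not this run's exact settings:

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # hypothetical path
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=550.0,  # seconds of audio per batch, per GPU
    num_buckets=30,      # cuts of similar length are batched together
    shuffle=True,
    drop_last=True,
)
for batch_cuts in sampler:
    # short-cut buckets give batches of hundreds of cuts; long-cut buckets, ~16
    pass
```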
2024-06-20 11:32:44,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=196850.5, ans=0.125 2024-06-20 11:32:46,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=196850.5, ans=0.125 2024-06-20 11:32:50,455 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.63 vs. limit=22.5 2024-06-20 11:32:52,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196868.83333333334, ans=0.1 2024-06-20 11:33:09,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=196923.83333333334, ans=0.125 2024-06-20 11:33:11,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=196923.83333333334, ans=0.125 2024-06-20 11:33:11,634 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.777e+02 1.930e+02 2.112e+02 2.896e+02, threshold=3.860e+02, percent-clipped=0.0 2024-06-20 11:33:16,530 INFO [train.py:1028] (1/2) Epoch 11, batch 6250, loss[loss=0.2204, simple_loss=0.2681, pruned_loss=0.08631, over 13240.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2715, pruned_loss=0.09349, over 2567877.85 frames. ], batch size: 83, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:33:24,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=196960.5, ans=0.1 2024-06-20 11:33:28,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=196960.5, ans=0.0 2024-06-20 11:33:30,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=196978.83333333334, ans=0.125 2024-06-20 11:33:35,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=196978.83333333334, ans=0.0 2024-06-20 11:33:41,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=196997.16666666666, ans=0.1 2024-06-20 11:33:49,061 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.51 vs. limit=22.5 2024-06-20 11:33:49,909 INFO [train.py:1028] (1/2) Epoch 11, batch 6300, loss[loss=0.2512, simple_loss=0.2916, pruned_loss=0.1054, over 11641.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.273, pruned_loss=0.09423, over 2563103.32 frames. ], batch size: 17, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:33:49,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=197033.83333333334, ans=0.2 2024-06-20 11:33:51,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.40 vs. limit=22.5 2024-06-20 11:33:55,283 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.18 vs.
limit=22.5 2024-06-20 11:34:03,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=197070.5, ans=0.025 2024-06-20 11:34:05,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=197070.5, ans=0.125 2024-06-20 11:34:16,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=197088.83333333334, ans=0.0 2024-06-20 11:34:17,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=197088.83333333334, ans=0.125 2024-06-20 11:34:17,899 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.32 vs. limit=22.5 2024-06-20 11:34:18,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.07 vs. limit=15.0 2024-06-20 11:34:21,985 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.793e+02 1.925e+02 2.118e+02 3.266e+02, threshold=3.850e+02, percent-clipped=0.0 2024-06-20 11:34:26,616 INFO [train.py:1028] (1/2) Epoch 11, batch 6350, loss[loss=0.2752, simple_loss=0.3109, pruned_loss=0.1197, over 12590.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.275, pruned_loss=0.09473, over 2572398.80 frames. ], batch size: 202, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:34:37,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=197143.83333333334, ans=0.05 2024-06-20 11:34:38,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=197143.83333333334, ans=0.2 2024-06-20 11:34:40,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=197143.83333333334, ans=0.0 2024-06-20 11:34:42,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=197162.16666666666, ans=0.125 2024-06-20 11:34:48,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=197180.5, ans=0.035 2024-06-20 11:34:48,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=197180.5, ans=0.025 2024-06-20 11:35:03,005 INFO [train.py:1028] (1/2) Epoch 11, batch 6400, loss[loss=0.2279, simple_loss=0.2691, pruned_loss=0.09332, over 13207.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.2773, pruned_loss=0.09596, over 2574250.79 frames. ], batch size: 67, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:35:05,279 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.46 vs. limit=15.0 2024-06-20 11:35:16,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.10 vs. 
limit=6.0 2024-06-20 11:35:22,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=197272.16666666666, ans=0.5 2024-06-20 11:35:26,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=197272.16666666666, ans=0.1 2024-06-20 11:35:30,745 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.869e+02 2.017e+02 2.148e+02 2.915e+02, threshold=4.035e+02, percent-clipped=0.0 2024-06-20 11:35:33,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=197290.5, ans=0.125 2024-06-20 11:35:35,133 INFO [train.py:1028] (1/2) Epoch 11, batch 6450, loss[loss=0.2762, simple_loss=0.3076, pruned_loss=0.1224, over 12575.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.2788, pruned_loss=0.09654, over 2580531.55 frames. ], batch size: 202, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:35:38,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=197308.83333333334, ans=0.0 2024-06-20 11:35:43,315 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.93 vs. limit=15.0 2024-06-20 11:35:43,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=197327.16666666666, ans=0.09899494936611666 2024-06-20 11:35:56,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=197363.83333333334, ans=0.125 2024-06-20 11:36:07,324 INFO [train.py:1028] (1/2) Epoch 11, batch 6500, loss[loss=0.2448, simple_loss=0.2703, pruned_loss=0.1097, over 10922.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2802, pruned_loss=0.09683, over 2584438.85 frames. ], batch size: 304, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:36:07,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=197400.5, ans=0.07 2024-06-20 11:36:13,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=197418.83333333334, ans=0.125 2024-06-20 11:36:14,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.93 vs. limit=15.0 2024-06-20 11:36:29,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=197455.5, ans=0.125 2024-06-20 11:36:35,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=197455.5, ans=0.125 2024-06-20 11:36:41,248 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.890e+02 2.094e+02 2.274e+02 3.406e+02, threshold=4.189e+02, percent-clipped=0.0 2024-06-20 11:36:45,927 INFO [train.py:1028] (1/2) Epoch 11, batch 6550, loss[loss=0.2501, simple_loss=0.2968, pruned_loss=0.1017, over 12484.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.2812, pruned_loss=0.09676, over 2587341.13 frames. 
], batch size: 22, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:36:48,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=197492.16666666666, ans=0.2 2024-06-20 11:36:49,556 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.90 vs. limit=22.5 2024-06-20 11:36:52,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=197510.5, ans=0.0 2024-06-20 11:36:54,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=197510.5, ans=0.125 2024-06-20 11:37:00,385 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:37:06,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=197547.16666666666, ans=0.125 2024-06-20 11:37:11,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=197565.5, ans=0.0 2024-06-20 11:37:13,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=197565.5, ans=0.07 2024-06-20 11:37:18,822 INFO [train.py:1028] (1/2) Epoch 11, batch 6600, loss[loss=0.2399, simple_loss=0.286, pruned_loss=0.09688, over 13217.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.2815, pruned_loss=0.0967, over 2589940.63 frames. ], batch size: 72, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:37:32,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=197620.5, ans=0.0 2024-06-20 11:37:33,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.31 vs. limit=22.5 2024-06-20 11:37:35,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197620.5, ans=0.1 2024-06-20 11:37:44,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=197638.83333333334, ans=0.125 2024-06-20 11:37:47,192 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.877e+02 1.977e+02 2.147e+02 2.728e+02, threshold=3.953e+02, percent-clipped=0.0 2024-06-20 11:37:51,594 INFO [train.py:1028] (1/2) Epoch 11, batch 6650, loss[loss=0.2548, simple_loss=0.2956, pruned_loss=0.107, over 12912.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2839, pruned_loss=0.09794, over 2583911.60 frames. ], batch size: 158, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:38:07,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=197712.16666666666, ans=0.0 2024-06-20 11:38:07,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=197712.16666666666, ans=0.125 2024-06-20 11:38:07,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=197712.16666666666, ans=0.2 2024-06-20 11:38:08,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.37 vs. 
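limit=15.0

(Here the whitening metric 16.37 exceeds its limit 15.0, which is exactly when the corrective penalty engages.) Most lines in this log are "ScheduledFloat" entries: module hyperparameters such as skip rates, dropout probabilities, balancer probabilities, and bypass scales whose values are piecewise-linear functions of the batch count, with "ans=" the value currently in effect. A minimal sketch of that idea, not the icefall class itself (which adds details such as optional randomization around the scheduled value):

```python
class ScheduledFloat:
    """Piecewise-linear schedule over batch count, given (count, value) points."""

    def __init__(self, *points):
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# e.g. a skip rate annealed to zero early in training: by the batch counts
# seen here (~197k) it reads 0.0, as the ans=0.0 entries above do.
skip_rate = ScheduledFloat((0.0, 0.5), (4000.0, 0.05), (20000.0, 0.0))
print(skip_rate.value(197_620.5))  # -> 0.0
```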
2024-06-20 11:38:19,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=197748.83333333334, ans=0.07 2024-06-20 11:38:28,340 INFO [train.py:1028] (1/2) Epoch 11, batch 6700, loss[loss=0.2638, simple_loss=0.3009, pruned_loss=0.1134, over 12792.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.2852, pruned_loss=0.09867, over 2583363.65 frames. ], batch size: 176, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:38:32,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=197767.16666666666, ans=0.2 2024-06-20 11:38:52,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=197822.16666666666, ans=0.2 2024-06-20 11:39:00,030 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.924e+02 2.089e+02 2.464e+02 3.621e+02, threshold=4.179e+02, percent-clipped=0.0 2024-06-20 11:39:04,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=197858.83333333334, ans=0.0 2024-06-20 11:39:04,655 INFO [train.py:1028] (1/2) Epoch 11, batch 6750, loss[loss=0.3095, simple_loss=0.3415, pruned_loss=0.1387, over 12144.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.2862, pruned_loss=0.09907, over 2578221.76 frames. ], batch size: 240, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:39:06,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=197858.83333333334, ans=0.0 2024-06-20 11:39:12,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=197877.16666666666, ans=0.125 2024-06-20 11:39:14,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=197877.16666666666, ans=0.0 2024-06-20 11:39:19,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=197895.5, ans=0.125 2024-06-20 11:39:26,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0 2024-06-20 11:39:32,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=197932.16666666666, ans=0.025 2024-06-20 11:39:34,185 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.17 vs. limit=15.0 2024-06-20 11:39:35,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=197932.16666666666, ans=0.125 2024-06-20 11:39:37,142 INFO [train.py:1028] (1/2) Epoch 11, batch 6800, loss[loss=0.2426, simple_loss=0.2903, pruned_loss=0.09742, over 13225.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.2877, pruned_loss=0.09943, over 2580781.65 frames.
], batch size: 67, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:39:37,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=197950.5, ans=0.125 2024-06-20 11:39:44,426 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:39:49,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=197987.16666666666, ans=0.0 2024-06-20 11:39:52,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=197987.16666666666, ans=0.035 2024-06-20 11:39:52,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=197987.16666666666, ans=0.2 2024-06-20 11:40:00,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=197987.16666666666, ans=0.05 2024-06-20 11:40:04,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=198005.5, ans=0.035 2024-06-20 11:40:05,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.03 vs. limit=15.0 2024-06-20 11:40:09,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=198023.83333333334, ans=0.0 2024-06-20 11:40:10,874 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.969e+02 2.155e+02 2.386e+02 4.167e+02, threshold=4.309e+02, percent-clipped=0.0 2024-06-20 11:40:13,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=198023.83333333334, ans=0.2 2024-06-20 11:40:15,588 INFO [train.py:1028] (1/2) Epoch 11, batch 6850, loss[loss=0.2647, simple_loss=0.3179, pruned_loss=0.1057, over 13245.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2879, pruned_loss=0.09914, over 2584307.42 frames. ], batch size: 63, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:40:17,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=198042.16666666666, ans=0.2 2024-06-20 11:40:24,724 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:40:36,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.72 vs. 
limit=15.0 2024-06-20 11:40:42,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=198097.16666666666, ans=0.0 2024-06-20 11:40:48,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=198115.5, ans=0.2 2024-06-20 11:40:51,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=198115.5, ans=0.125 2024-06-20 11:40:55,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=198133.83333333334, ans=0.125 2024-06-20 11:40:55,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=198133.83333333334, ans=0.125 2024-06-20 11:40:56,119 INFO [train.py:1028] (1/2) Epoch 11, batch 6900, loss[loss=0.2525, simple_loss=0.3046, pruned_loss=0.1002, over 13325.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.2882, pruned_loss=0.09903, over 2586146.73 frames. ], batch size: 49, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:41:03,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=198152.16666666666, ans=0.0 2024-06-20 11:41:10,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=198170.5, ans=0.035 2024-06-20 11:41:17,034 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.55 vs. limit=22.5 2024-06-20 11:41:21,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=198207.16666666666, ans=0.125 2024-06-20 11:41:24,022 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.838e+02 1.970e+02 2.138e+02 3.106e+02, threshold=3.939e+02, percent-clipped=0.0 2024-06-20 11:41:24,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=198207.16666666666, ans=0.125 2024-06-20 11:41:28,671 INFO [train.py:1028] (1/2) Epoch 11, batch 6950, loss[loss=0.2409, simple_loss=0.2802, pruned_loss=0.1008, over 11120.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.2888, pruned_loss=0.09936, over 2580335.76 frames. ], batch size: 16, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:41:34,865 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.41 vs. limit=22.5 2024-06-20 11:41:39,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=198243.83333333334, ans=0.125 2024-06-20 11:41:42,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.88 vs. limit=22.5 2024-06-20 11:41:52,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=198280.5, ans=0.2 2024-06-20 11:42:01,424 INFO [train.py:1028] (1/2) Epoch 11, batch 7000, loss[loss=0.265, simple_loss=0.2999, pruned_loss=0.115, over 12979.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2893, pruned_loss=0.09933, over 2576206.17 frames. 
], batch size: 158, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:42:02,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.22 vs. limit=10.0 2024-06-20 11:42:16,839 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.23 vs. limit=15.0 2024-06-20 11:42:20,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=198353.83333333334, ans=0.125 2024-06-20 11:42:25,885 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.26 vs. limit=15.0 2024-06-20 11:42:31,036 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.888e+02 2.046e+02 2.225e+02 3.105e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 11:42:39,220 INFO [train.py:1028] (1/2) Epoch 11, batch 7050, loss[loss=0.2725, simple_loss=0.3138, pruned_loss=0.1156, over 12787.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.2908, pruned_loss=0.09989, over 2581803.35 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:42:40,297 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.99 vs. limit=15.0 2024-06-20 11:42:44,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=198408.83333333334, ans=0.0 2024-06-20 11:42:53,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.21 vs. limit=15.0 2024-06-20 11:42:59,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=198445.5, ans=0.125 2024-06-20 11:43:08,489 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.69 vs. limit=10.0 2024-06-20 11:43:16,014 INFO [train.py:1028] (1/2) Epoch 11, batch 7100, loss[loss=0.258, simple_loss=0.3031, pruned_loss=0.1065, over 13215.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.2912, pruned_loss=0.1005, over 2574083.82 frames. 
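], batch size: 112, lr: 5.29e-03, grad_scale: 64.0

Note the "grad_scale" column doubling from 32.0 (batch 7050) to 64.0 here: with fp16 training, a dynamic loss scaler multiplies the loss before backward to keep small gradients representable in half precision, halves the scale when it sees inf/nan gradients, and grows it again after a long run of clean steps. Icefall wires up its own scaler, so treat the following standard-PyTorch sketch as an analogy rather than this recipe's code:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,
    growth_factor=2.0,     # double the scale ...
    growth_interval=2000,  # ... after 2000 overflow-free steps
    backoff_factor=0.5,    # ... and halve it on inf/nan gradients
)

# Typical training-loop usage:
#   with torch.cuda.amp.autocast():
#       loss = compute_loss(batch)   # hypothetical helper
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()
#   scaler.get_scale()  # 32.0 now, 64.0 once enough clean steps accumulate
```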
2024-06-20 11:43:26,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=198518.83333333334, ans=0.125 2024-06-20 11:43:42,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=198573.83333333334, ans=0.125 2024-06-20 11:43:45,292 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.901e+02 2.035e+02 2.226e+02 2.939e+02, threshold=4.070e+02, percent-clipped=0.0 2024-06-20 11:43:45,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=198573.83333333334, ans=0.125 2024-06-20 11:43:47,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=198573.83333333334, ans=0.0 2024-06-20 11:43:49,997 INFO [train.py:1028] (1/2) Epoch 11, batch 7150, loss[loss=0.2564, simple_loss=0.2974, pruned_loss=0.1077, over 12579.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2921, pruned_loss=0.1007, over 2572052.14 frames. ], batch size: 202, lr: 5.29e-03, grad_scale: 64.0 2024-06-20 11:43:50,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=198592.16666666666, ans=0.2 2024-06-20 11:43:56,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=198610.5, ans=0.125 2024-06-20 11:44:02,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=198628.83333333334, ans=0.125 2024-06-20 11:44:10,914 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.16 vs. limit=15.0 2024-06-20 11:44:14,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=198647.16666666666, ans=0.125 2024-06-20 11:44:16,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=198665.5, ans=0.0 2024-06-20 11:44:19,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.82 vs. limit=10.0 2024-06-20 11:44:22,966 INFO [train.py:1028] (1/2) Epoch 11, batch 7200, loss[loss=0.2824, simple_loss=0.3224, pruned_loss=0.1213, over 13180.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.2927, pruned_loss=0.1008, over 2576382.75 frames. ], batch size: 112, lr: 5.29e-03, grad_scale: 64.0 2024-06-20 11:44:27,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=198683.83333333334, ans=0.125 2024-06-20 11:44:28,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=198683.83333333334, ans=0.0 2024-06-20 11:44:28,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=198683.83333333334, ans=0.0 2024-06-20 11:44:34,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.63 vs.
limit=15.0 2024-06-20 11:44:38,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=198720.5, ans=0.125 2024-06-20 11:44:39,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.00 vs. limit=22.5 2024-06-20 11:44:45,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.01 vs. limit=10.0 2024-06-20 11:44:53,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=198757.16666666666, ans=0.125 2024-06-20 11:44:55,324 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.906e+02 2.094e+02 2.287e+02 3.088e+02, threshold=4.189e+02, percent-clipped=0.0 2024-06-20 11:44:59,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=198775.5, ans=0.0 2024-06-20 11:45:04,186 INFO [train.py:1028] (1/2) Epoch 11, batch 7250, loss[loss=0.2253, simple_loss=0.2761, pruned_loss=0.08727, over 13028.00 frames. ], tot_loss[loss=0.247, simple_loss=0.2931, pruned_loss=0.1005, over 2577327.91 frames. ], batch size: 36, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:45:05,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0 2024-06-20 11:45:06,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=198775.5, ans=0.0 2024-06-20 11:45:06,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=198775.5, ans=0.0 2024-06-20 11:45:08,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=198775.5, ans=0.1 2024-06-20 11:45:13,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.07 vs. limit=15.0 2024-06-20 11:45:20,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=198812.16666666666, ans=0.05 2024-06-20 11:45:21,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=198812.16666666666, ans=0.015 2024-06-20 11:45:22,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=198812.16666666666, ans=0.125 2024-06-20 11:45:27,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=198830.5, ans=0.0 2024-06-20 11:45:28,362 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:45:35,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=198848.83333333334, ans=10.0 2024-06-20 11:45:37,355 INFO [train.py:1028] (1/2) Epoch 11, batch 7300, loss[loss=0.2327, simple_loss=0.2852, pruned_loss=0.09006, over 12913.00 frames. ], tot_loss[loss=0.248, simple_loss=0.2942, pruned_loss=0.1009, over 2577989.58 frames. 
], batch size: 36, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:45:45,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=198885.5, ans=0.0 2024-06-20 11:45:46,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.84 vs. limit=10.0 2024-06-20 11:45:57,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=198922.16666666666, ans=0.0 2024-06-20 11:45:59,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.59 vs. limit=15.0 2024-06-20 11:46:05,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 1.864e+02 2.012e+02 2.157e+02 3.018e+02, threshold=4.024e+02, percent-clipped=0.0 2024-06-20 11:46:10,543 INFO [train.py:1028] (1/2) Epoch 11, batch 7350, loss[loss=0.2652, simple_loss=0.3132, pruned_loss=0.1086, over 13301.00 frames. ], tot_loss[loss=0.249, simple_loss=0.2952, pruned_loss=0.1014, over 2580670.01 frames. ], batch size: 46, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:46:10,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=198958.83333333334, ans=0.125 2024-06-20 11:46:15,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=198958.83333333334, ans=0.07 2024-06-20 11:46:21,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198977.16666666666, ans=0.1 2024-06-20 11:46:31,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.59 vs. limit=10.0 2024-06-20 11:46:43,848 INFO [train.py:1028] (1/2) Epoch 11, batch 7400, loss[loss=0.2599, simple_loss=0.3127, pruned_loss=0.1035, over 13291.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.2956, pruned_loss=0.1013, over 2586764.68 frames. ], batch size: 63, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:46:48,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=199050.5, ans=0.0 2024-06-20 11:47:00,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.87 vs. limit=15.0 2024-06-20 11:47:02,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=199087.16666666666, ans=0.125 2024-06-20 11:47:05,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=199087.16666666666, ans=0.125 2024-06-20 11:47:10,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=199105.5, ans=0.5 2024-06-20 11:47:12,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.86 vs. 
limit=22.5 2024-06-20 11:47:19,882 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.904e+02 1.986e+02 2.156e+02 2.713e+02, threshold=3.972e+02, percent-clipped=0.0 2024-06-20 11:47:24,722 INFO [train.py:1028] (1/2) Epoch 11, batch 7450, loss[loss=0.2275, simple_loss=0.2766, pruned_loss=0.08914, over 12732.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2956, pruned_loss=0.1011, over 2580609.41 frames. ], batch size: 29, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:47:25,535 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:47:26,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=199142.16666666666, ans=0.125 2024-06-20 11:47:32,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=199160.5, ans=0.0 2024-06-20 11:47:41,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=199178.83333333334, ans=0.125 2024-06-20 11:47:48,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=199197.16666666666, ans=0.1 2024-06-20 11:47:50,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.79 vs. limit=22.5 2024-06-20 11:47:52,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=199215.5, ans=0.1 2024-06-20 11:47:58,983 INFO [train.py:1028] (1/2) Epoch 11, batch 7500, loss[loss=0.2765, simple_loss=0.3071, pruned_loss=0.1229, over 10780.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.2968, pruned_loss=0.1021, over 2578191.98 frames. ], batch size: 303, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:48:06,489 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.98 vs. limit=12.0 2024-06-20 11:48:10,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=199252.16666666666, ans=0.0 2024-06-20 11:48:15,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. limit=6.0 2024-06-20 11:48:17,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=199270.5, ans=10.0 2024-06-20 11:48:27,629 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.992e+02 2.217e+02 2.632e+02 3.692e+02, threshold=4.433e+02, percent-clipped=0.0 2024-06-20 11:48:29,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=199307.16666666666, ans=0.125 2024-06-20 11:48:30,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.91 vs. limit=10.0 2024-06-20 11:48:32,326 INFO [train.py:1028] (1/2) Epoch 11, batch 7550, loss[loss=0.2378, simple_loss=0.281, pruned_loss=0.09727, over 12960.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.2973, pruned_loss=0.1026, over 2576367.96 frames. 
], batch size: 158, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:48:44,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=199343.83333333334, ans=0.2 2024-06-20 11:48:48,909 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.99 vs. limit=22.5 2024-06-20 11:48:49,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=199362.16666666666, ans=0.0 2024-06-20 11:48:54,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=199362.16666666666, ans=0.2 2024-06-20 11:48:58,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=199380.5, ans=0.0 2024-06-20 11:49:11,812 INFO [train.py:1028] (1/2) Epoch 11, batch 7600, loss[loss=0.2343, simple_loss=0.2761, pruned_loss=0.09621, over 13186.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.2973, pruned_loss=0.1027, over 2576045.80 frames. ], batch size: 83, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:49:18,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=199435.5, ans=0.125 2024-06-20 11:49:20,508 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.76 vs. limit=12.0 2024-06-20 11:49:21,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=199435.5, ans=0.0 2024-06-20 11:49:22,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=199435.5, ans=0.0 2024-06-20 11:49:25,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=199453.83333333334, ans=0.0 2024-06-20 11:49:26,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=199453.83333333334, ans=0.0 2024-06-20 11:49:29,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=199453.83333333334, ans=0.025 2024-06-20 11:49:31,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.63 vs. limit=15.0 2024-06-20 11:49:36,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=199472.16666666666, ans=0.025 2024-06-20 11:49:39,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=199490.5, ans=0.125 2024-06-20 11:49:40,807 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.886e+02 2.069e+02 2.331e+02 3.178e+02, threshold=4.138e+02, percent-clipped=0.0 2024-06-20 11:49:45,789 INFO [train.py:1028] (1/2) Epoch 11, batch 7650, loss[loss=0.2244, simple_loss=0.274, pruned_loss=0.08736, over 13289.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.2972, pruned_loss=0.1026, over 2572892.88 frames. 
], batch size: 34, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:49:49,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=199508.83333333334, ans=0.0 2024-06-20 11:50:02,369 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.82 vs. limit=12.0 2024-06-20 11:50:13,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=199582.16666666666, ans=0.125 2024-06-20 11:50:18,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=199582.16666666666, ans=0.0 2024-06-20 11:50:19,912 INFO [train.py:1028] (1/2) Epoch 11, batch 7700, loss[loss=0.265, simple_loss=0.3192, pruned_loss=0.1054, over 13230.00 frames. ], tot_loss[loss=0.252, simple_loss=0.2979, pruned_loss=0.1031, over 2569467.28 frames. ], batch size: 63, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:50:23,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=199600.5, ans=0.125 2024-06-20 11:50:24,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=199600.5, ans=0.07 2024-06-20 11:50:30,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=199618.83333333334, ans=0.125 2024-06-20 11:50:34,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=199637.16666666666, ans=0.125 2024-06-20 11:50:39,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=199655.5, ans=0.0 2024-06-20 11:50:43,160 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:50:44,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=199655.5, ans=0.2 2024-06-20 11:50:46,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=199673.83333333334, ans=0.0 2024-06-20 11:50:47,406 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.944e+02 2.183e+02 2.511e+02 4.277e+02, threshold=4.366e+02, percent-clipped=1.0 2024-06-20 11:50:55,392 INFO [train.py:1028] (1/2) Epoch 11, batch 7750, loss[loss=0.225, simple_loss=0.2794, pruned_loss=0.08532, over 13045.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.299, pruned_loss=0.104, over 2573718.23 frames. 
], batch size: 71, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:51:08,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=199710.5, ans=0.0 2024-06-20 11:51:22,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=199747.16666666666, ans=0.0 2024-06-20 11:51:23,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=199747.16666666666, ans=0.0 2024-06-20 11:51:24,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199765.5, ans=0.1 2024-06-20 11:51:27,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.29 vs. limit=15.0 2024-06-20 11:51:31,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=199783.83333333334, ans=0.0 2024-06-20 11:51:32,116 INFO [train.py:1028] (1/2) Epoch 11, batch 7800, loss[loss=0.238, simple_loss=0.2904, pruned_loss=0.09281, over 13183.00 frames. ], tot_loss[loss=0.254, simple_loss=0.2998, pruned_loss=0.1041, over 2578236.24 frames. ], batch size: 95, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:51:34,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=199783.83333333334, ans=0.125 2024-06-20 11:51:34,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=199783.83333333334, ans=0.125 2024-06-20 11:51:34,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=199783.83333333334, ans=0.125 2024-06-20 11:51:40,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=199802.16666666666, ans=0.125 2024-06-20 11:51:43,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2024-06-20 11:51:44,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=199802.16666666666, ans=0.125 2024-06-20 11:51:47,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=199820.5, ans=0.04949747468305833 2024-06-20 11:51:51,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=199820.5, ans=0.125 2024-06-20 11:52:01,386 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 1.911e+02 2.073e+02 2.295e+02 3.294e+02, threshold=4.146e+02, percent-clipped=0.0 2024-06-20 11:52:06,078 INFO [train.py:1028] (1/2) Epoch 11, batch 7850, loss[loss=0.2293, simple_loss=0.274, pruned_loss=0.09232, over 10814.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3005, pruned_loss=0.1046, over 2572837.68 frames. 
], batch size: 16, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:52:14,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=199893.83333333334, ans=0.125 2024-06-20 11:52:23,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=199912.16666666666, ans=0.1 2024-06-20 11:52:27,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=199930.5, ans=0.125 2024-06-20 11:52:30,070 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.16 vs. limit=15.0 2024-06-20 11:52:30,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199930.5, ans=0.1 2024-06-20 11:52:38,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.25 vs. limit=15.0 2024-06-20 11:52:38,889 INFO [train.py:1028] (1/2) Epoch 11, batch 7900, loss[loss=0.2563, simple_loss=0.3087, pruned_loss=0.102, over 13182.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3001, pruned_loss=0.1043, over 2571471.22 frames. ], batch size: 77, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:52:46,148 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=15.0 2024-06-20 11:52:47,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.83 vs. limit=12.0 2024-06-20 11:52:49,202 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2024-06-20 11:52:54,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=199985.5, ans=0.0 2024-06-20 11:52:54,881 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2024-06-20 11:53:11,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=200040.5, ans=0.125 2024-06-20 11:53:14,131 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.938e+02 2.170e+02 2.552e+02 4.164e+02, threshold=4.340e+02, percent-clipped=1.0 2024-06-20 11:53:16,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=200040.5, ans=0.0 2024-06-20 11:53:16,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=15.0 2024-06-20 11:53:18,677 INFO [train.py:1028] (1/2) Epoch 11, batch 7950, loss[loss=0.2556, simple_loss=0.2884, pruned_loss=0.1113, over 10633.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3005, pruned_loss=0.1042, over 2574299.96 frames. 
], batch size: 304, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:53:23,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=200058.83333333334, ans=0.125 2024-06-20 11:53:26,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=200077.16666666666, ans=0.125 2024-06-20 11:53:33,956 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.72 vs. limit=10.0 2024-06-20 11:53:38,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=200113.83333333334, ans=0.125 2024-06-20 11:53:51,206 INFO [train.py:1028] (1/2) Epoch 11, batch 8000, loss[loss=0.2225, simple_loss=0.2827, pruned_loss=0.08121, over 12738.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3009, pruned_loss=0.1042, over 2572022.33 frames. ], batch size: 29, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:53:58,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=200168.83333333334, ans=0.0 2024-06-20 11:54:08,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=200187.16666666666, ans=0.125 2024-06-20 11:54:11,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=200205.5, ans=0.125 2024-06-20 11:54:14,978 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.62 vs. limit=15.0 2024-06-20 11:54:15,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=200205.5, ans=0.0 2024-06-20 11:54:19,761 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.934e+02 2.202e+02 2.647e+02 3.478e+02, threshold=4.405e+02, percent-clipped=0.0 2024-06-20 11:54:21,560 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.75 vs. limit=22.5 2024-06-20 11:54:22,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=200223.83333333334, ans=0.125 2024-06-20 11:54:24,676 INFO [train.py:1028] (1/2) Epoch 11, batch 8050, loss[loss=0.2572, simple_loss=0.3002, pruned_loss=0.1071, over 13246.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3005, pruned_loss=0.104, over 2570853.45 frames. 
], batch size: 83, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:54:26,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=200242.16666666666, ans=0.125 2024-06-20 11:54:37,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=200278.83333333334, ans=0.95 2024-06-20 11:54:47,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=200297.16666666666, ans=0.125 2024-06-20 11:54:53,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=200297.16666666666, ans=0.0 2024-06-20 11:55:03,434 INFO [train.py:1028] (1/2) Epoch 11, batch 8100, loss[loss=0.2772, simple_loss=0.32, pruned_loss=0.1172, over 13109.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3016, pruned_loss=0.1047, over 2575376.04 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:55:05,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=200333.83333333334, ans=0.2 2024-06-20 11:55:05,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=200333.83333333334, ans=15.0 2024-06-20 11:55:05,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=200333.83333333334, ans=0.125 2024-06-20 11:55:19,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0 2024-06-20 11:55:20,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=200370.5, ans=0.125 2024-06-20 11:55:21,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=200370.5, ans=0.125 2024-06-20 11:55:23,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=200388.83333333334, ans=0.125 2024-06-20 11:55:28,833 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.56 vs. limit=12.0 2024-06-20 11:55:30,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=200407.16666666666, ans=0.125 2024-06-20 11:55:32,539 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.886e+02 2.007e+02 2.208e+02 2.846e+02, threshold=4.015e+02, percent-clipped=0.0 2024-06-20 11:55:34,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=200407.16666666666, ans=0.125 2024-06-20 11:55:37,399 INFO [train.py:1028] (1/2) Epoch 11, batch 8150, loss[loss=0.2428, simple_loss=0.2887, pruned_loss=0.09844, over 13120.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3007, pruned_loss=0.1037, over 2579250.43 frames. 
], batch size: 121, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:55:38,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=200425.5, ans=0.1 2024-06-20 11:55:49,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=200443.83333333334, ans=0.125 2024-06-20 11:55:49,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=200443.83333333334, ans=0.2 2024-06-20 11:55:58,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=200480.5, ans=0.125 2024-06-20 11:55:58,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=200480.5, ans=0.125 2024-06-20 11:56:01,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=200480.5, ans=0.0 2024-06-20 11:56:03,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=200498.83333333334, ans=0.95 2024-06-20 11:56:05,782 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:56:10,037 INFO [train.py:1028] (1/2) Epoch 11, batch 8200, loss[loss=0.2561, simple_loss=0.3012, pruned_loss=0.1056, over 13151.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3005, pruned_loss=0.1037, over 2582260.54 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:56:11,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.10 vs. limit=15.0 2024-06-20 11:56:12,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=200517.16666666666, ans=0.125 2024-06-20 11:56:23,115 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.81 vs. limit=10.0 2024-06-20 11:56:24,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.36 vs. limit=15.0 2024-06-20 11:56:32,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200572.16666666666, ans=0.1 2024-06-20 11:56:39,267 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.676e+02 1.939e+02 2.107e+02 2.313e+02 2.950e+02, threshold=4.214e+02, percent-clipped=0.0 2024-06-20 11:56:41,606 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2024-06-20 11:56:43,875 INFO [train.py:1028] (1/2) Epoch 11, batch 8250, loss[loss=0.2565, simple_loss=0.3146, pruned_loss=0.09926, over 13232.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3015, pruned_loss=0.1042, over 2583006.74 frames. 
], batch size: 52, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:57:10,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=200663.83333333334, ans=0.0 2024-06-20 11:57:19,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=200682.16666666666, ans=0.125 2024-06-20 11:57:23,970 INFO [train.py:1028] (1/2) Epoch 11, batch 8300, loss[loss=0.2696, simple_loss=0.3106, pruned_loss=0.1143, over 13101.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3013, pruned_loss=0.1041, over 2580277.82 frames. ], batch size: 103, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:57:24,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=200700.5, ans=0.5 2024-06-20 11:57:25,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=200700.5, ans=0.125 2024-06-20 11:57:31,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=200718.83333333334, ans=0.125 2024-06-20 11:57:38,369 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.39 vs. limit=15.0 2024-06-20 11:57:52,844 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.902e+02 2.062e+02 2.207e+02 2.964e+02, threshold=4.124e+02, percent-clipped=0.0 2024-06-20 11:57:53,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=200773.83333333334, ans=0.1 2024-06-20 11:57:57,337 INFO [train.py:1028] (1/2) Epoch 11, batch 8350, loss[loss=0.2539, simple_loss=0.3012, pruned_loss=0.1033, over 13212.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3016, pruned_loss=0.1041, over 2580077.73 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:58:06,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=200810.5, ans=0.0 2024-06-20 11:58:09,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=200810.5, ans=0.025 2024-06-20 11:58:20,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=200847.16666666666, ans=22.5 2024-06-20 11:58:21,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200847.16666666666, ans=0.1 2024-06-20 11:58:21,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=200847.16666666666, ans=0.2 2024-06-20 11:58:24,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.44 vs. limit=5.0 2024-06-20 11:58:25,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=200865.5, ans=0.025 2024-06-20 11:58:31,066 INFO [train.py:1028] (1/2) Epoch 11, batch 8400, loss[loss=0.2277, simple_loss=0.2746, pruned_loss=0.09036, over 12859.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3015, pruned_loss=0.104, over 2575586.78 frames. 
], batch size: 39, lr: 5.26e-03, grad_scale: 32.0 2024-06-20 11:58:41,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=200902.16666666666, ans=0.0 2024-06-20 11:58:48,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=200920.5, ans=0.0 2024-06-20 11:59:03,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=200938.83333333334, ans=0.0 2024-06-20 11:59:07,469 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 2.076e+02 2.218e+02 2.468e+02 3.872e+02, threshold=4.437e+02, percent-clipped=0.0 2024-06-20 11:59:11,306 INFO [train.py:1028] (1/2) Epoch 11, batch 8450, loss[loss=0.2763, simple_loss=0.3242, pruned_loss=0.1142, over 13179.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3026, pruned_loss=0.1043, over 2578008.16 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 32.0 2024-06-20 11:59:12,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=200975.5, ans=0.0 2024-06-20 11:59:13,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=200975.5, ans=0.0 2024-06-20 11:59:14,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=200975.5, ans=0.125 2024-06-20 11:59:20,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=200993.83333333334, ans=0.0 2024-06-20 11:59:27,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=201012.16666666666, ans=0.125 2024-06-20 11:59:35,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=201030.5, ans=0.125 2024-06-20 11:59:40,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=201048.83333333334, ans=0.125 2024-06-20 11:59:41,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=201048.83333333334, ans=0.125 2024-06-20 11:59:43,654 INFO [train.py:1028] (1/2) Epoch 11, batch 8500, loss[loss=0.2479, simple_loss=0.296, pruned_loss=0.09994, over 12559.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3035, pruned_loss=0.1046, over 2576926.84 frames. 
], batch size: 29, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 11:59:45,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=201067.16666666666, ans=0.0 2024-06-20 11:59:50,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=201085.5, ans=0.0 2024-06-20 11:59:59,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201103.83333333334, ans=0.1 2024-06-20 12:00:01,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=201103.83333333334, ans=0.125 2024-06-20 12:00:02,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=201122.16666666666, ans=0.0 2024-06-20 12:00:12,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201140.5, ans=0.1 2024-06-20 12:00:12,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=201140.5, ans=0.025 2024-06-20 12:00:13,491 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 2.056e+02 2.251e+02 2.543e+02 3.167e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-20 12:00:15,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=201140.5, ans=0.07 2024-06-20 12:00:17,409 INFO [train.py:1028] (1/2) Epoch 11, batch 8550, loss[loss=0.2751, simple_loss=0.3227, pruned_loss=0.1138, over 12582.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3036, pruned_loss=0.1044, over 2575120.88 frames. ], batch size: 22, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:00:17,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=201158.83333333334, ans=0.05 2024-06-20 12:00:17,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=201158.83333333334, ans=0.125 2024-06-20 12:00:19,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=201158.83333333334, ans=0.125 2024-06-20 12:00:35,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=201195.5, ans=0.0 2024-06-20 12:00:36,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=201195.5, ans=0.1 2024-06-20 12:00:49,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=201232.16666666666, ans=0.0 2024-06-20 12:00:56,465 INFO [train.py:1028] (1/2) Epoch 11, batch 8600, loss[loss=0.2396, simple_loss=0.2861, pruned_loss=0.09657, over 13126.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3038, pruned_loss=0.1045, over 2573561.15 frames. 
], batch size: 121, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:00:56,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=201250.5, ans=0.125 2024-06-20 12:00:59,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=201250.5, ans=0.2 2024-06-20 12:01:15,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=201287.16666666666, ans=0.125 2024-06-20 12:01:30,515 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.904e+02 2.019e+02 2.207e+02 2.818e+02, threshold=4.037e+02, percent-clipped=0.0 2024-06-20 12:01:34,889 INFO [train.py:1028] (1/2) Epoch 11, batch 8650, loss[loss=0.2573, simple_loss=0.2989, pruned_loss=0.1079, over 13128.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3048, pruned_loss=0.1048, over 2576677.27 frames. ], batch size: 103, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:01:41,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=201360.5, ans=0.125 2024-06-20 12:02:07,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=201415.5, ans=0.0 2024-06-20 12:02:08,307 INFO [train.py:1028] (1/2) Epoch 11, batch 8700, loss[loss=0.278, simple_loss=0.3253, pruned_loss=0.1154, over 13186.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3053, pruned_loss=0.1053, over 2572236.02 frames. ], batch size: 59, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:02:23,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.27 vs. limit=15.0 2024-06-20 12:02:38,588 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.951e+02 2.031e+02 2.219e+02 3.073e+02, threshold=4.062e+02, percent-clipped=0.0 2024-06-20 12:02:42,674 INFO [train.py:1028] (1/2) Epoch 11, batch 8750, loss[loss=0.2532, simple_loss=0.2991, pruned_loss=0.1037, over 13134.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3053, pruned_loss=0.1056, over 2567396.12 frames. ], batch size: 121, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:03:09,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=201580.5, ans=0.2 2024-06-20 12:03:23,392 INFO [train.py:1028] (1/2) Epoch 11, batch 8800, loss[loss=0.2484, simple_loss=0.3024, pruned_loss=0.09717, over 13251.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3052, pruned_loss=0.1055, over 2573544.37 frames. ], batch size: 72, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:03:37,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=201653.83333333334, ans=0.125 2024-06-20 12:03:52,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201690.5, ans=0.1 2024-06-20 12:03:53,803 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 1.958e+02 2.115e+02 2.324e+02 2.938e+02, threshold=4.230e+02, percent-clipped=0.0 2024-06-20 12:03:57,512 INFO [train.py:1028] (1/2) Epoch 11, batch 8850, loss[loss=0.2806, simple_loss=0.3204, pruned_loss=0.1203, over 12533.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3051, pruned_loss=0.1057, over 2561478.20 frames. 
], batch size: 202, lr: 5.25e-03, grad_scale: 16.0 2024-06-20 12:03:59,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=201708.83333333334, ans=0.125 2024-06-20 12:04:02,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=201708.83333333334, ans=0.1 2024-06-20 12:04:05,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=201727.16666666666, ans=0.125 2024-06-20 12:04:30,788 INFO [train.py:1028] (1/2) Epoch 11, batch 8900, loss[loss=0.2713, simple_loss=0.3131, pruned_loss=0.1147, over 12917.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3049, pruned_loss=0.1057, over 2560742.83 frames. ], batch size: 33, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:04:46,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=201837.16666666666, ans=0.2 2024-06-20 12:04:46,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=201837.16666666666, ans=0.0 2024-06-20 12:04:55,590 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.51 vs. limit=15.0 2024-06-20 12:05:04,860 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2024-06-20 12:05:07,257 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.045e+02 2.227e+02 2.477e+02 3.224e+02, threshold=4.454e+02, percent-clipped=0.0 2024-06-20 12:05:10,790 INFO [train.py:1028] (1/2) Epoch 11, batch 8950, loss[loss=0.2688, simple_loss=0.3071, pruned_loss=0.1152, over 12500.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3046, pruned_loss=0.1054, over 2559749.50 frames. ], batch size: 202, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:05:15,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=201892.16666666666, ans=0.0 2024-06-20 12:05:16,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2024-06-20 12:05:35,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=201947.16666666666, ans=0.0 2024-06-20 12:05:37,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=201965.5, ans=0.0 2024-06-20 12:05:39,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=201965.5, ans=0.015 2024-06-20 12:05:43,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.11 vs. limit=22.5 2024-06-20 12:05:44,145 INFO [train.py:1028] (1/2) Epoch 11, batch 9000, loss[loss=0.2764, simple_loss=0.3143, pruned_loss=0.1193, over 13332.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3052, pruned_loss=0.1054, over 2564970.78 frames. 
], batch size: 46, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:05:44,145 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 12:05:52,045 INFO [train.py:1060] (1/2) Epoch 11, validation: loss=0.1959, simple_loss=0.2595, pruned_loss=0.06618, over 351949.00 frames. 2024-06-20 12:05:52,046 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 12:06:03,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=202002.16666666666, ans=0.0 2024-06-20 12:06:06,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=202020.5, ans=0.125 2024-06-20 12:06:12,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=202038.83333333334, ans=0.0 2024-06-20 12:06:17,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=202038.83333333334, ans=0.0 2024-06-20 12:06:18,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=202057.16666666666, ans=0.5 2024-06-20 12:06:21,415 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.948e+02 2.091e+02 2.280e+02 3.462e+02, threshold=4.181e+02, percent-clipped=0.0 2024-06-20 12:06:24,565 INFO [train.py:1028] (1/2) Epoch 11, batch 9050, loss[loss=0.2527, simple_loss=0.2962, pruned_loss=0.1046, over 11421.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3064, pruned_loss=0.106, over 2566035.52 frames. ], batch size: 17, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:06:27,502 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2024-06-20 12:06:43,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202130.5, ans=0.1 2024-06-20 12:06:46,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.73 vs. limit=10.0 2024-06-20 12:06:48,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=202130.5, ans=0.2 2024-06-20 12:06:49,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=202130.5, ans=0.0 2024-06-20 12:06:50,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=202148.83333333334, ans=0.0 2024-06-20 12:06:50,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=202148.83333333334, ans=0.0 2024-06-20 12:06:54,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202148.83333333334, ans=0.1 2024-06-20 12:06:57,333 INFO [train.py:1028] (1/2) Epoch 11, batch 9100, loss[loss=0.2563, simple_loss=0.3038, pruned_loss=0.1044, over 13207.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3062, pruned_loss=0.1054, over 2566884.30 frames. 
], batch size: 72, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:07:06,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=202185.5, ans=0.025 2024-06-20 12:07:08,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=202185.5, ans=0.1 2024-06-20 12:07:09,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=202203.83333333334, ans=0.125 2024-06-20 12:07:15,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.49 vs. limit=10.0 2024-06-20 12:07:17,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=202222.16666666666, ans=0.125 2024-06-20 12:07:18,282 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. limit=6.0 2024-06-20 12:07:23,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=202240.5, ans=0.0 2024-06-20 12:07:24,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0 2024-06-20 12:07:26,322 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.875e+02 2.034e+02 2.248e+02 2.897e+02, threshold=4.068e+02, percent-clipped=0.0 2024-06-20 12:07:29,525 INFO [train.py:1028] (1/2) Epoch 11, batch 9150, loss[loss=0.2835, simple_loss=0.3283, pruned_loss=0.1193, over 13166.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3061, pruned_loss=0.1052, over 2568345.68 frames. ], batch size: 77, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:07:42,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.36 vs. limit=22.5 2024-06-20 12:07:44,774 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.36 vs. limit=22.5 2024-06-20 12:07:44,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.62 vs. limit=15.0 2024-06-20 12:07:49,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=202313.83333333334, ans=0.0 2024-06-20 12:07:51,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=202313.83333333334, ans=0.1 2024-06-20 12:08:02,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.58 vs. limit=15.0 2024-06-20 12:08:03,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=202332.16666666666, ans=0.125 2024-06-20 12:08:04,253 INFO [train.py:1028] (1/2) Epoch 11, batch 9200, loss[loss=0.2552, simple_loss=0.3046, pruned_loss=0.103, over 12937.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3058, pruned_loss=0.1049, over 2572420.54 frames. 
], batch size: 36, lr: 5.24e-03, grad_scale: 32.0 2024-06-20 12:08:05,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0 2024-06-20 12:08:16,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=202387.16666666666, ans=0.2 2024-06-20 12:08:22,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=202387.16666666666, ans=0.0 2024-06-20 12:08:27,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=202405.5, ans=0.125 2024-06-20 12:08:35,622 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.874e+02 2.043e+02 2.265e+02 3.021e+02, threshold=4.086e+02, percent-clipped=0.0 2024-06-20 12:08:38,964 INFO [train.py:1028] (1/2) Epoch 11, batch 9250, loss[loss=0.2929, simple_loss=0.3426, pruned_loss=0.1217, over 13225.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3064, pruned_loss=0.1051, over 2572636.71 frames. ], batch size: 67, lr: 5.24e-03, grad_scale: 32.0 2024-06-20 12:08:57,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.95 vs. limit=10.0 2024-06-20 12:09:01,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.42 vs. limit=15.0 2024-06-20 12:09:10,969 INFO [train.py:1028] (1/2) Epoch 11, batch 9300, loss[loss=0.2526, simple_loss=0.2992, pruned_loss=0.103, over 13205.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3061, pruned_loss=0.1049, over 2570628.28 frames. ], batch size: 40, lr: 5.24e-03, grad_scale: 32.0 2024-06-20 12:09:16,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=202552.16666666666, ans=0.125 2024-06-20 12:09:29,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202588.83333333334, ans=0.1 2024-06-20 12:09:39,003 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.926e+02 2.088e+02 2.288e+02 3.539e+02, threshold=4.176e+02, percent-clipped=0.0 2024-06-20 12:09:42,174 INFO [train.py:1028] (1/2) Epoch 11, batch 9350, loss[loss=0.2537, simple_loss=0.3088, pruned_loss=0.09928, over 12439.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3058, pruned_loss=0.1048, over 2566963.79 frames. ], batch size: 22, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:09:47,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202625.5, ans=0.1 2024-06-20 12:10:09,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=202698.83333333334, ans=0.125 2024-06-20 12:10:12,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=202698.83333333334, ans=0.035 2024-06-20 12:10:13,452 INFO [train.py:1028] (1/2) Epoch 11, batch 9400, loss[loss=0.241, simple_loss=0.2932, pruned_loss=0.0944, over 13292.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3064, pruned_loss=0.1052, over 2567714.55 frames. 
], batch size: 52, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:10:18,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=202717.16666666666, ans=0.0 2024-06-20 12:10:19,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2024-06-20 12:10:21,107 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.11 vs. limit=15.0 2024-06-20 12:10:23,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=202735.5, ans=0.2 2024-06-20 12:10:24,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.88 vs. limit=15.0 2024-06-20 12:10:26,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=202753.83333333334, ans=0.125 2024-06-20 12:10:30,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202753.83333333334, ans=0.1 2024-06-20 12:10:40,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=202790.5, ans=0.0 2024-06-20 12:10:41,521 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.909e+02 2.019e+02 2.187e+02 3.143e+02, threshold=4.037e+02, percent-clipped=0.0 2024-06-20 12:10:44,729 INFO [train.py:1028] (1/2) Epoch 11, batch 9450, loss[loss=0.237, simple_loss=0.2914, pruned_loss=0.09125, over 12508.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3074, pruned_loss=0.106, over 2567444.57 frames. ], batch size: 22, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:10:47,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202808.83333333334, ans=0.1 2024-06-20 12:11:02,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=202863.83333333334, ans=0.5 2024-06-20 12:11:03,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=202863.83333333334, ans=0.0 2024-06-20 12:11:12,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=202882.16666666666, ans=0.2 2024-06-20 12:11:12,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=202882.16666666666, ans=0.125 2024-06-20 12:11:15,179 INFO [train.py:1028] (1/2) Epoch 11, batch 9500, loss[loss=0.2637, simple_loss=0.3173, pruned_loss=0.105, over 13238.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3063, pruned_loss=0.1052, over 2577100.60 frames. 
], batch size: 43, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:11:22,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=202900.5, ans=0.0 2024-06-20 12:11:23,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=202918.83333333334, ans=0.0 2024-06-20 12:11:35,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=202937.16666666666, ans=0.0 2024-06-20 12:11:38,084 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.42 vs. limit=22.5 2024-06-20 12:11:39,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=202955.5, ans=0.0 2024-06-20 12:11:48,059 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.030e+02 2.240e+02 2.625e+02 3.872e+02, threshold=4.481e+02, percent-clipped=0.0 2024-06-20 12:11:48,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=202973.83333333334, ans=0.2 2024-06-20 12:11:49,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=202973.83333333334, ans=0.95 2024-06-20 12:11:50,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=202992.16666666666, ans=0.125 2024-06-20 12:11:50,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=202992.16666666666, ans=0.125 2024-06-20 12:11:50,948 INFO [train.py:1028] (1/2) Epoch 11, batch 9550, loss[loss=0.2545, simple_loss=0.2984, pruned_loss=0.1053, over 12870.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3062, pruned_loss=0.1053, over 2574641.33 frames. ], batch size: 39, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:11:51,659 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:12:05,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=203028.83333333334, ans=0.0 2024-06-20 12:12:06,860 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.60 vs. limit=15.0 2024-06-20 12:12:09,250 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:12:15,480 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.38 vs. limit=15.0 2024-06-20 12:12:22,021 INFO [train.py:1028] (1/2) Epoch 11, batch 9600, loss[loss=0.2849, simple_loss=0.313, pruned_loss=0.1284, over 10796.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3053, pruned_loss=0.1049, over 2571775.40 frames. ], batch size: 303, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:12:25,618 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. 
limit=6.0 2024-06-20 12:12:26,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=203083.83333333334, ans=0.125 2024-06-20 12:12:35,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=203120.5, ans=0.125 2024-06-20 12:12:37,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=203120.5, ans=0.125 2024-06-20 12:12:40,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=203138.83333333334, ans=0.2 2024-06-20 12:12:49,507 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.944e+02 2.089e+02 2.222e+02 3.055e+02, threshold=4.178e+02, percent-clipped=0.0 2024-06-20 12:12:52,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=203175.5, ans=0.0 2024-06-20 12:12:52,593 INFO [train.py:1028] (1/2) Epoch 11, batch 9650, loss[loss=0.2576, simple_loss=0.2993, pruned_loss=0.1079, over 13115.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3053, pruned_loss=0.1054, over 2561708.50 frames. ], batch size: 132, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:12:52,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=203175.5, ans=0.2 2024-06-20 12:12:52,989 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2024-06-20 12:12:53,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=203175.5, ans=0.125 2024-06-20 12:12:59,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=203193.83333333334, ans=0.0 2024-06-20 12:13:01,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=203193.83333333334, ans=0.125 2024-06-20 12:13:12,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=203230.5, ans=0.0 2024-06-20 12:13:23,062 INFO [train.py:1028] (1/2) Epoch 11, batch 9700, loss[loss=0.271, simple_loss=0.3087, pruned_loss=0.1166, over 12973.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3048, pruned_loss=0.1052, over 2557304.73 frames. ], batch size: 144, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:13:27,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203267.16666666666, ans=0.1 2024-06-20 12:13:35,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.46 vs. limit=15.0 2024-06-20 12:13:42,862 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:13:51,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.75 vs. 
limit=15.0 2024-06-20 12:13:53,423 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 1.895e+02 2.037e+02 2.308e+02 3.134e+02, threshold=4.075e+02, percent-clipped=0.0 2024-06-20 12:13:55,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.94 vs. limit=22.5 2024-06-20 12:13:56,600 INFO [train.py:1028] (1/2) Epoch 11, batch 9750, loss[loss=0.2467, simple_loss=0.2926, pruned_loss=0.1004, over 13061.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3037, pruned_loss=0.1048, over 2553698.65 frames. ], batch size: 132, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:14:00,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=203358.83333333334, ans=0.0 2024-06-20 12:14:02,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=203377.16666666666, ans=0.125 2024-06-20 12:14:04,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.53 vs. limit=22.5 2024-06-20 12:14:09,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=203395.5, ans=0.125 2024-06-20 12:14:18,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=203413.83333333334, ans=0.0 2024-06-20 12:14:25,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=203432.16666666666, ans=0.2 2024-06-20 12:14:27,358 INFO [train.py:1028] (1/2) Epoch 11, batch 9800, loss[loss=0.2459, simple_loss=0.2956, pruned_loss=0.09809, over 12966.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3027, pruned_loss=0.104, over 2546169.45 frames. ], batch size: 39, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:14:34,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=203468.83333333334, ans=0.2 2024-06-20 12:14:35,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=203468.83333333334, ans=0.125 2024-06-20 12:14:42,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=203487.16666666666, ans=0.125 2024-06-20 12:14:52,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.15 vs. limit=22.5 2024-06-20 12:14:54,683 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.981e+02 2.148e+02 2.391e+02 3.744e+02, threshold=4.296e+02, percent-clipped=0.0 2024-06-20 12:14:56,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=203523.83333333334, ans=0.125 2024-06-20 12:14:57,736 INFO [train.py:1028] (1/2) Epoch 11, batch 9850, loss[loss=0.2625, simple_loss=0.3066, pruned_loss=0.1092, over 13099.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.302, pruned_loss=0.1035, over 2538157.12 frames. 
], batch size: 102, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:15:00,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=203542.16666666666, ans=0.025 2024-06-20 12:15:06,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=203560.5, ans=0.125 2024-06-20 12:15:09,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=203560.5, ans=0.2 2024-06-20 12:15:22,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203597.16666666666, ans=0.1 2024-06-20 12:15:30,265 INFO [train.py:1028] (1/2) Epoch 11, batch 9900, loss[loss=0.2501, simple_loss=0.2994, pruned_loss=0.1003, over 12954.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3015, pruned_loss=0.1034, over 2530022.69 frames. ], batch size: 39, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:15:35,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=203633.83333333334, ans=0.125 2024-06-20 12:15:44,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=203670.5, ans=0.2 2024-06-20 12:15:56,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=203707.16666666666, ans=0.1 2024-06-20 12:15:57,792 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 1.910e+02 2.039e+02 2.182e+02 2.743e+02, threshold=4.078e+02, percent-clipped=0.0 2024-06-20 12:16:00,955 INFO [train.py:1028] (1/2) Epoch 11, batch 9950, loss[loss=0.2489, simple_loss=0.2966, pruned_loss=0.1006, over 12509.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3001, pruned_loss=0.1035, over 2522573.64 frames. ], batch size: 29, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:16:17,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=203762.16666666666, ans=0.125 2024-06-20 12:16:22,042 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.20 vs. limit=6.0 2024-06-20 12:16:29,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203798.83333333334, ans=0.1 2024-06-20 12:16:34,048 INFO [train.py:1028] (1/2) Epoch 11, batch 10000, loss[loss=0.231, simple_loss=0.2834, pruned_loss=0.08929, over 12744.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3011, pruned_loss=0.1043, over 2485626.32 frames. ], batch size: 22, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:16:37,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=203817.16666666666, ans=0.2 2024-06-20 12:16:44,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=203835.5, ans=0.0 2024-06-20 12:16:51,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.04 vs. limit=10.0 2024-06-20 12:16:51,881 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.49 vs. 
limit=10.0 2024-06-20 12:16:51,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.94 vs. limit=10.0 2024-06-20 12:16:59,715 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=15.0 2024-06-20 12:17:02,296 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 1.988e+02 2.166e+02 2.593e+02 4.039e+02, threshold=4.331e+02, percent-clipped=0.0 2024-06-20 12:17:04,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=203890.5, ans=0.125 2024-06-20 12:17:05,142 INFO [train.py:1028] (1/2) Epoch 11, batch 10050, loss[loss=0.2425, simple_loss=0.2906, pruned_loss=0.09723, over 12480.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3021, pruned_loss=0.1057, over 2443809.24 frames. ], batch size: 22, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:17:06,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=203908.83333333334, ans=0.125 2024-06-20 12:17:18,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=203945.5, ans=0.125 2024-06-20 12:17:24,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=203963.83333333334, ans=0.0 2024-06-20 12:17:34,493 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.71 vs. limit=15.0 2024-06-20 12:17:34,778 INFO [train.py:1028] (1/2) Epoch 11, batch 10100, loss[loss=0.2194, simple_loss=0.2676, pruned_loss=0.08562, over 11108.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3014, pruned_loss=0.105, over 2428122.72 frames. ], batch size: 16, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:17:42,906 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:19:51,007 INFO [train.py:1028] (1/2) Epoch 12, batch 0, loss[loss=0.2273, simple_loss=0.2712, pruned_loss=0.09172, over 12941.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2712, pruned_loss=0.09172, over 12941.00 frames. ], batch size: 36, lr: 5.00e-03, grad_scale: 32.0 2024-06-20 12:19:51,007 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 12:19:55,858 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2687, 1.5224, 1.8533, 1.7389], device='cuda:1') 2024-06-20 12:19:58,166 INFO [train.py:1060] (1/2) Epoch 12, validation: loss=0.1957, simple_loss=0.2607, pruned_loss=0.06538, over 351949.00 frames. 
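The "Computing validation loss" entries above report a frame-weighted average over the entire dev set, which is why every validation line in this run shows the same 351949.00 frames. A minimal sketch of that accumulation follows, assuming a hypothetical compute_validation_loss helper and lhotse-style batch fields; the actual icefall routine differs in detail.

import torch

def compute_validation_loss(model, valid_loader, device):
    """Frame-weighted average loss over the dev set (illustrative only)."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            feats = batch["inputs"].to(device)  # (N, T, 80) fbank features
            feat_lens = batch["supervisions"]["num_frames"].to(device)
            # Hypothetical model call returning the summed loss for the batch.
            loss = model(feats, feat_lens, batch["supervisions"]["text"])
            tot_loss += loss.item()
            tot_frames += feat_lens.sum().item()
    model.train()
    # The log reports this ratio together with tot_frames ("over N frames").
    return tot_loss / tot_frames, tot_frames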
2024-06-20 12:19:58,167 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 12:20:12,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=204068.33333333334, ans=0.125 2024-06-20 12:20:17,169 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.937e+02 2.210e+02 2.561e+02 3.967e+02, threshold=4.421e+02, percent-clipped=0.0 2024-06-20 12:20:22,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=204086.66666666666, ans=0.1 2024-06-20 12:20:34,438 INFO [train.py:1028] (1/2) Epoch 12, batch 50, loss[loss=0.2222, simple_loss=0.2671, pruned_loss=0.08863, over 12605.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.2807, pruned_loss=0.09449, over 575292.43 frames. ], batch size: 29, lr: 5.00e-03, grad_scale: 32.0 2024-06-20 12:20:38,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=204123.33333333334, ans=0.125 2024-06-20 12:20:39,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=204123.33333333334, ans=0.2 2024-06-20 12:20:41,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=204141.66666666666, ans=0.125 2024-06-20 12:20:56,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=204160.0, ans=0.125 2024-06-20 12:21:05,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=204196.66666666666, ans=0.1 2024-06-20 12:21:07,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=204196.66666666666, ans=0.125 2024-06-20 12:21:10,180 INFO [train.py:1028] (1/2) Epoch 12, batch 100, loss[loss=0.2321, simple_loss=0.2812, pruned_loss=0.09154, over 13288.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.2799, pruned_loss=0.09472, over 1017886.35 frames. ], batch size: 46, lr: 5.00e-03, grad_scale: 32.0 2024-06-20 12:21:28,181 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.895e+02 2.134e+02 2.386e+02 3.029e+02, threshold=4.267e+02, percent-clipped=0.0 2024-06-20 12:21:34,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=204270.0, ans=0.1 2024-06-20 12:21:36,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. limit=6.0 2024-06-20 12:21:42,050 INFO [train.py:1028] (1/2) Epoch 12, batch 150, loss[loss=0.2371, simple_loss=0.2805, pruned_loss=0.09685, over 12525.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2799, pruned_loss=0.0937, over 1365720.85 frames. ], batch size: 29, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:21:46,998 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.51 vs. limit=22.5 2024-06-20 12:21:51,093 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.24 vs. 
limit=15.0 2024-06-20 12:21:53,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.23 vs. limit=15.0 2024-06-20 12:21:58,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.27 vs. limit=15.0 2024-06-20 12:21:58,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=204343.33333333334, ans=0.0 2024-06-20 12:22:01,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204361.66666666666, ans=0.1 2024-06-20 12:22:07,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=204380.0, ans=0.07 2024-06-20 12:22:13,355 INFO [train.py:1028] (1/2) Epoch 12, batch 200, loss[loss=0.2711, simple_loss=0.3053, pruned_loss=0.1185, over 12465.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2801, pruned_loss=0.09418, over 1635494.80 frames. ], batch size: 202, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:22:16,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=204398.33333333334, ans=0.0 2024-06-20 12:22:33,919 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.811e+02 1.934e+02 2.061e+02 2.880e+02, threshold=3.868e+02, percent-clipped=0.0 2024-06-20 12:22:43,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=204471.66666666666, ans=0.0 2024-06-20 12:22:48,046 INFO [train.py:1028] (1/2) Epoch 12, batch 250, loss[loss=0.234, simple_loss=0.2734, pruned_loss=0.09732, over 13024.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.2801, pruned_loss=0.09385, over 1846625.00 frames. ], batch size: 144, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:23:06,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=204526.66666666666, ans=0.0 2024-06-20 12:23:06,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=204526.66666666666, ans=0.0 2024-06-20 12:23:11,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=204545.0, ans=0.125 2024-06-20 12:23:11,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=204545.0, ans=0.0 2024-06-20 12:23:11,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=204545.0, ans=0.125 2024-06-20 12:23:18,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.80 vs. limit=22.5 2024-06-20 12:23:24,503 INFO [train.py:1028] (1/2) Epoch 12, batch 300, loss[loss=0.2457, simple_loss=0.2854, pruned_loss=0.103, over 13164.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.2805, pruned_loss=0.09412, over 2009832.50 frames. 
], batch size: 112, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:23:42,753 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.844e+02 2.036e+02 2.184e+02 2.983e+02, threshold=4.072e+02, percent-clipped=0.0 2024-06-20 12:23:45,246 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:23:56,643 INFO [train.py:1028] (1/2) Epoch 12, batch 350, loss[loss=0.2302, simple_loss=0.2844, pruned_loss=0.08799, over 12911.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.2806, pruned_loss=0.09408, over 2138745.70 frames. ], batch size: 33, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:24:19,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=204728.33333333334, ans=0.125 2024-06-20 12:24:20,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=204728.33333333334, ans=0.07 2024-06-20 12:24:20,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.70 vs. limit=12.0 2024-06-20 12:24:20,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=204728.33333333334, ans=0.2 2024-06-20 12:24:21,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=204728.33333333334, ans=0.5 2024-06-20 12:24:32,886 INFO [train.py:1028] (1/2) Epoch 12, batch 400, loss[loss=0.2388, simple_loss=0.2845, pruned_loss=0.09661, over 13243.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.28, pruned_loss=0.09363, over 2239614.65 frames. ], batch size: 63, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:24:51,134 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.822e+02 1.928e+02 2.173e+02 3.193e+02, threshold=3.856e+02, percent-clipped=0.0 2024-06-20 12:24:51,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=204820.0, ans=0.125 2024-06-20 12:24:51,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=204820.0, ans=0.125 2024-06-20 12:25:08,435 INFO [train.py:1028] (1/2) Epoch 12, batch 450, loss[loss=0.2228, simple_loss=0.272, pruned_loss=0.08677, over 13210.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2796, pruned_loss=0.09319, over 2314206.75 frames. ], batch size: 67, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:25:27,948 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.31 vs. limit=22.5 2024-06-20 12:25:36,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=204930.0, ans=0.0 2024-06-20 12:25:36,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=12.0 2024-06-20 12:25:39,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=204930.0, ans=0.0 2024-06-20 12:25:40,491 INFO [train.py:1028] (1/2) Epoch 12, batch 500, loss[loss=0.2193, simple_loss=0.2616, pruned_loss=0.08853, over 13108.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2793, pruned_loss=0.09288, over 2376897.03 frames. 
], batch size: 121, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:25:48,079 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:25:48,358 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.59 vs. limit=22.5 2024-06-20 12:25:58,670 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.876e+02 2.023e+02 2.255e+02 2.809e+02, threshold=4.047e+02, percent-clipped=0.0 2024-06-20 12:26:02,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=205003.33333333334, ans=0.125 2024-06-20 12:26:12,802 INFO [train.py:1028] (1/2) Epoch 12, batch 550, loss[loss=0.2413, simple_loss=0.2794, pruned_loss=0.1016, over 12962.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2789, pruned_loss=0.09269, over 2421505.51 frames. ], batch size: 158, lr: 4.99e-03, grad_scale: 32.0 2024-06-20 12:26:16,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=205040.0, ans=0.125 2024-06-20 12:26:27,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=205076.66666666666, ans=0.125 2024-06-20 12:26:33,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.85 vs. limit=10.0 2024-06-20 12:26:39,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0 2024-06-20 12:26:44,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=205113.33333333334, ans=0.125 2024-06-20 12:26:45,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=205113.33333333334, ans=0.0 2024-06-20 12:26:46,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205131.66666666666, ans=0.1 2024-06-20 12:26:46,895 INFO [train.py:1028] (1/2) Epoch 12, batch 600, loss[loss=0.2415, simple_loss=0.2818, pruned_loss=0.1006, over 13013.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.2791, pruned_loss=0.09282, over 2459333.00 frames. ], batch size: 144, lr: 4.98e-03, grad_scale: 32.0 2024-06-20 12:26:55,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=205150.0, ans=0.1 2024-06-20 12:26:57,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205150.0, ans=0.1 2024-06-20 12:26:59,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=205168.33333333334, ans=0.125 2024-06-20 12:27:04,564 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.808e+02 1.886e+02 2.012e+02 2.572e+02, threshold=3.772e+02, percent-clipped=0.0 2024-06-20 12:27:06,177 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.41 vs. 
limit=15.0 2024-06-20 12:27:21,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. limit=6.0 2024-06-20 12:27:21,544 INFO [train.py:1028] (1/2) Epoch 12, batch 650, loss[loss=0.2166, simple_loss=0.2708, pruned_loss=0.08118, over 13179.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2787, pruned_loss=0.09229, over 2489696.73 frames. ], batch size: 59, lr: 4.98e-03, grad_scale: 32.0 2024-06-20 12:27:23,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=205223.33333333334, ans=0.025 2024-06-20 12:27:24,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=205223.33333333334, ans=0.125 2024-06-20 12:27:39,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=205260.0, ans=10.0 2024-06-20 12:27:40,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=205278.33333333334, ans=10.0 2024-06-20 12:27:47,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=205296.66666666666, ans=0.0 2024-06-20 12:27:48,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=205296.66666666666, ans=0.0 2024-06-20 12:27:50,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=205296.66666666666, ans=0.0 2024-06-20 12:27:52,716 INFO [train.py:1028] (1/2) Epoch 12, batch 700, loss[loss=0.218, simple_loss=0.2651, pruned_loss=0.08545, over 13325.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2798, pruned_loss=0.09308, over 2512063.99 frames. ], batch size: 46, lr: 4.98e-03, grad_scale: 32.0 2024-06-20 12:27:56,352 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.95 vs. limit=15.0 2024-06-20 12:27:57,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=205315.0, ans=0.125 2024-06-20 12:28:04,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=205333.33333333334, ans=0.07 2024-06-20 12:28:04,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=205333.33333333334, ans=0.0 2024-06-20 12:28:06,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.72 vs. limit=22.5 2024-06-20 12:28:07,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=205333.33333333334, ans=0.0 2024-06-20 12:28:12,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=205351.66666666666, ans=0.2 2024-06-20 12:28:15,566 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.824e+02 1.943e+02 2.079e+02 2.883e+02, threshold=3.886e+02, percent-clipped=0.0 2024-06-20 12:28:19,966 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. 
limit=12.0 2024-06-20 12:28:22,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=205370.0, ans=0.125 2024-06-20 12:28:24,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=205388.33333333334, ans=0.125 2024-06-20 12:28:26,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.66 vs. limit=10.0 2024-06-20 12:28:26,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=205388.33333333334, ans=0.1 2024-06-20 12:28:30,155 INFO [train.py:1028] (1/2) Epoch 12, batch 750, loss[loss=0.2241, simple_loss=0.2761, pruned_loss=0.0861, over 13260.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2792, pruned_loss=0.09261, over 2527235.83 frames. ], batch size: 63, lr: 4.98e-03, grad_scale: 64.0 2024-06-20 12:28:32,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.36 vs. limit=22.5 2024-06-20 12:28:33,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=205406.66666666666, ans=0.125 2024-06-20 12:28:51,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=205443.33333333334, ans=0.05 2024-06-20 12:28:53,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=205461.66666666666, ans=0.2 2024-06-20 12:28:59,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2024-06-20 12:29:05,258 INFO [train.py:1028] (1/2) Epoch 12, batch 800, loss[loss=0.2253, simple_loss=0.2762, pruned_loss=0.08718, over 12862.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.2792, pruned_loss=0.093, over 2540673.71 frames. ], batch size: 36, lr: 4.98e-03, grad_scale: 64.0 2024-06-20 12:29:06,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=205498.33333333334, ans=0.1 2024-06-20 12:29:08,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=205498.33333333334, ans=0.125 2024-06-20 12:29:26,128 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.823e+02 1.984e+02 2.199e+02 2.988e+02, threshold=3.968e+02, percent-clipped=0.0 2024-06-20 12:29:31,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.94 vs. limit=15.0 2024-06-20 12:29:40,484 INFO [train.py:1028] (1/2) Epoch 12, batch 850, loss[loss=0.2266, simple_loss=0.2724, pruned_loss=0.09043, over 13160.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2796, pruned_loss=0.09319, over 2552319.01 frames. 
], batch size: 95, lr: 4.98e-03, grad_scale: 64.0 2024-06-20 12:29:41,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=205590.0, ans=0.125 2024-06-20 12:29:58,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=205626.66666666666, ans=0.025 2024-06-20 12:30:03,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=205645.0, ans=0.0 2024-06-20 12:30:12,526 INFO [train.py:1028] (1/2) Epoch 12, batch 900, loss[loss=0.2596, simple_loss=0.3049, pruned_loss=0.1071, over 12861.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2787, pruned_loss=0.09281, over 2557255.45 frames. ], batch size: 36, lr: 4.98e-03, grad_scale: 64.0 2024-06-20 12:30:16,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.43 vs. limit=10.0 2024-06-20 12:30:18,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=205700.0, ans=0.125 2024-06-20 12:30:27,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=205718.33333333334, ans=0.0 2024-06-20 12:30:31,714 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.837e+02 1.957e+02 2.166e+02 3.035e+02, threshold=3.913e+02, percent-clipped=0.0 2024-06-20 12:30:46,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=205755.0, ans=0.2 2024-06-20 12:30:49,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=205773.33333333334, ans=0.1 2024-06-20 12:30:50,257 INFO [train.py:1028] (1/2) Epoch 12, batch 950, loss[loss=0.2192, simple_loss=0.274, pruned_loss=0.08221, over 12955.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2787, pruned_loss=0.0926, over 2560827.81 frames. ], batch size: 39, lr: 4.98e-03, grad_scale: 64.0 2024-06-20 12:30:51,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205773.33333333334, ans=0.1 2024-06-20 12:30:56,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.95 vs. limit=15.0 2024-06-20 12:30:59,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=205791.66666666666, ans=0.125 2024-06-20 12:31:05,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=205810.0, ans=0.125 2024-06-20 12:31:09,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.89 vs. limit=10.0 2024-06-20 12:31:24,973 INFO [train.py:1028] (1/2) Epoch 12, batch 1000, loss[loss=0.2603, simple_loss=0.3069, pruned_loss=0.1068, over 13271.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2786, pruned_loss=0.09277, over 2562755.90 frames. 
], batch size: 49, lr: 4.98e-03, grad_scale: 64.0 2024-06-20 12:31:33,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=205883.33333333334, ans=0.05 2024-06-20 12:31:34,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=205883.33333333334, ans=0.125 2024-06-20 12:31:36,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=205883.33333333334, ans=0.125 2024-06-20 12:31:38,726 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=7.650e-01 2024-06-20 12:31:42,893 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.794e+02 1.910e+02 2.131e+02 2.869e+02, threshold=3.820e+02, percent-clipped=0.0 2024-06-20 12:31:49,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=12.0 2024-06-20 12:31:54,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=205938.33333333334, ans=0.0 2024-06-20 12:31:57,184 INFO [train.py:1028] (1/2) Epoch 12, batch 1050, loss[loss=0.2287, simple_loss=0.2764, pruned_loss=0.09046, over 13123.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2791, pruned_loss=0.09292, over 2566053.75 frames. ], batch size: 77, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:32:02,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.63 vs. limit=15.0 2024-06-20 12:32:02,817 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.63 vs. limit=6.0 2024-06-20 12:32:06,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=205975.0, ans=0.2 2024-06-20 12:32:06,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=205975.0, ans=0.0 2024-06-20 12:32:08,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=205975.0, ans=0.025 2024-06-20 12:32:10,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=205993.33333333334, ans=0.0 2024-06-20 12:32:17,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=206011.66666666666, ans=0.2 2024-06-20 12:32:20,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=206011.66666666666, ans=0.125 2024-06-20 12:32:23,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=206030.0, ans=0.1 2024-06-20 12:32:29,892 INFO [train.py:1028] (1/2) Epoch 12, batch 1100, loss[loss=0.2375, simple_loss=0.2888, pruned_loss=0.09308, over 13288.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.28, pruned_loss=0.09311, over 2570970.32 frames. 
], batch size: 52, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:32:30,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=206048.33333333334, ans=0.0 2024-06-20 12:32:35,928 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.42 vs. limit=22.5 2024-06-20 12:32:50,891 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 1.907e+02 2.053e+02 2.397e+02 3.531e+02, threshold=4.107e+02, percent-clipped=0.0 2024-06-20 12:32:50,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=206085.0, ans=0.015 2024-06-20 12:32:53,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=206103.33333333334, ans=0.2 2024-06-20 12:32:55,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=206103.33333333334, ans=0.07 2024-06-20 12:32:56,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=206103.33333333334, ans=0.0 2024-06-20 12:32:57,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=206103.33333333334, ans=0.025 2024-06-20 12:32:59,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=206121.66666666666, ans=0.125 2024-06-20 12:33:04,846 INFO [train.py:1028] (1/2) Epoch 12, batch 1150, loss[loss=0.2626, simple_loss=0.3034, pruned_loss=0.1109, over 13265.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2798, pruned_loss=0.09315, over 2572155.12 frames. ], batch size: 52, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:33:06,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=206140.0, ans=0.2 2024-06-20 12:33:07,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.60 vs. 
limit=10.0 2024-06-20 12:33:09,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=206140.0, ans=0.1 2024-06-20 12:33:13,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=206158.33333333334, ans=0.125 2024-06-20 12:33:13,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=206158.33333333334, ans=0.04949747468305833 2024-06-20 12:33:20,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=206158.33333333334, ans=0.125 2024-06-20 12:33:20,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=206158.33333333334, ans=0.0 2024-06-20 12:33:22,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=206176.66666666666, ans=0.125 2024-06-20 12:33:22,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=206176.66666666666, ans=0.0 2024-06-20 12:33:36,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=206213.33333333334, ans=0.125 2024-06-20 12:33:42,188 INFO [train.py:1028] (1/2) Epoch 12, batch 1200, loss[loss=0.2199, simple_loss=0.269, pruned_loss=0.08538, over 13212.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.2794, pruned_loss=0.09322, over 2574050.90 frames. ], batch size: 77, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:34:00,184 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.802e+02 1.956e+02 2.126e+02 3.629e+02, threshold=3.913e+02, percent-clipped=0.0 2024-06-20 12:34:08,022 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2024-06-20 12:34:11,434 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=15.0 2024-06-20 12:34:14,176 INFO [train.py:1028] (1/2) Epoch 12, batch 1250, loss[loss=0.2197, simple_loss=0.2643, pruned_loss=0.08751, over 13124.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2797, pruned_loss=0.09325, over 2583062.62 frames. ], batch size: 112, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:34:14,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=206323.33333333334, ans=0.1 2024-06-20 12:34:18,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.43 vs. limit=15.0 2024-06-20 12:34:18,727 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.55 vs. limit=22.5 2024-06-20 12:34:23,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.88 vs. 
limit=6.0 2024-06-20 12:34:33,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=206378.33333333334, ans=0.125 2024-06-20 12:34:50,944 INFO [train.py:1028] (1/2) Epoch 12, batch 1300, loss[loss=0.2708, simple_loss=0.3076, pruned_loss=0.117, over 12743.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2799, pruned_loss=0.09342, over 2584356.40 frames. ], batch size: 176, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:34:54,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=206415.0, ans=0.125 2024-06-20 12:34:57,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=206433.33333333334, ans=0.125 2024-06-20 12:35:09,012 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 1.818e+02 1.968e+02 2.140e+02 3.108e+02, threshold=3.937e+02, percent-clipped=0.0 2024-06-20 12:35:10,736 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.17 vs. limit=22.5 2024-06-20 12:35:11,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=206470.0, ans=0.125 2024-06-20 12:35:12,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.22 vs. limit=15.0 2024-06-20 12:35:23,243 INFO [train.py:1028] (1/2) Epoch 12, batch 1350, loss[loss=0.2306, simple_loss=0.2814, pruned_loss=0.08997, over 13230.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2799, pruned_loss=0.09319, over 2585601.83 frames. ], batch size: 59, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:35:34,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=206525.0, ans=0.125 2024-06-20 12:35:36,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.53 vs. limit=22.5 2024-06-20 12:35:36,029 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.09 vs. limit=15.0 2024-06-20 12:35:37,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=206525.0, ans=0.2 2024-06-20 12:35:37,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=206525.0, ans=0.2 2024-06-20 12:35:39,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=206543.33333333334, ans=0.125 2024-06-20 12:35:51,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.75 vs. limit=22.5 2024-06-20 12:35:53,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=206580.0, ans=0.0 2024-06-20 12:35:59,114 INFO [train.py:1028] (1/2) Epoch 12, batch 1400, loss[loss=0.2443, simple_loss=0.277, pruned_loss=0.1059, over 12868.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.2798, pruned_loss=0.09342, over 2587872.56 frames. 
], batch size: 26, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:36:16,631 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:36:17,258 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.791e+02 1.931e+02 2.109e+02 3.186e+02, threshold=3.861e+02, percent-clipped=0.0 2024-06-20 12:36:31,653 INFO [train.py:1028] (1/2) Epoch 12, batch 1450, loss[loss=0.2122, simple_loss=0.259, pruned_loss=0.08269, over 13093.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2793, pruned_loss=0.09337, over 2588055.12 frames. ], batch size: 121, lr: 4.97e-03, grad_scale: 64.0 2024-06-20 12:36:52,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=206726.66666666666, ans=0.0 2024-06-20 12:36:56,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=206745.0, ans=0.125 2024-06-20 12:37:07,048 INFO [train.py:1028] (1/2) Epoch 12, batch 1500, loss[loss=0.2389, simple_loss=0.2806, pruned_loss=0.09856, over 13189.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2793, pruned_loss=0.09358, over 2590165.20 frames. ], batch size: 83, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:37:07,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=206781.66666666666, ans=0.025 2024-06-20 12:37:14,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=206800.0, ans=0.035 2024-06-20 12:37:14,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=206800.0, ans=0.0 2024-06-20 12:37:15,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=206800.0, ans=0.125 2024-06-20 12:37:29,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=206818.33333333334, ans=0.125 2024-06-20 12:37:29,489 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.841e+02 1.951e+02 2.228e+02 3.004e+02, threshold=3.901e+02, percent-clipped=0.0 2024-06-20 12:37:32,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=206836.66666666666, ans=0.0 2024-06-20 12:37:39,820 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:37:44,053 INFO [train.py:1028] (1/2) Epoch 12, batch 1550, loss[loss=0.2429, simple_loss=0.2889, pruned_loss=0.09846, over 13042.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2798, pruned_loss=0.09383, over 2584496.85 frames. 
], batch size: 102, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:37:53,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=206891.66666666666, ans=0.1 2024-06-20 12:37:56,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=206891.66666666666, ans=0.125 2024-06-20 12:37:57,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=206910.0, ans=0.125 2024-06-20 12:38:07,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=206928.33333333334, ans=0.0 2024-06-20 12:38:15,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=206946.66666666666, ans=0.0 2024-06-20 12:38:16,814 INFO [train.py:1028] (1/2) Epoch 12, batch 1600, loss[loss=0.2387, simple_loss=0.289, pruned_loss=0.09423, over 13175.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2806, pruned_loss=0.09399, over 2581001.34 frames. ], batch size: 77, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:38:26,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.51 vs. limit=15.0 2024-06-20 12:38:27,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=206983.33333333334, ans=0.125 2024-06-20 12:38:33,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=207001.66666666666, ans=0.125 2024-06-20 12:38:34,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.58 vs. limit=10.0 2024-06-20 12:38:34,702 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.906e+02 2.032e+02 2.208e+02 3.637e+02, threshold=4.063e+02, percent-clipped=0.0 2024-06-20 12:38:34,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=207001.66666666666, ans=0.1 2024-06-20 12:38:51,568 INFO [train.py:1028] (1/2) Epoch 12, batch 1650, loss[loss=0.2482, simple_loss=0.2902, pruned_loss=0.1031, over 13103.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2802, pruned_loss=0.09403, over 2576541.43 frames. ], batch size: 95, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:39:03,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=207075.0, ans=0.0 2024-06-20 12:39:07,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=207093.33333333334, ans=0.125 2024-06-20 12:39:10,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.84 vs. limit=15.0 2024-06-20 12:39:13,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=207111.66666666666, ans=0.125 2024-06-20 12:39:26,880 INFO [train.py:1028] (1/2) Epoch 12, batch 1700, loss[loss=0.2368, simple_loss=0.2915, pruned_loss=0.09109, over 12411.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2797, pruned_loss=0.09364, over 2580837.39 frames. 
], batch size: 25, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:39:32,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=207148.33333333334, ans=0.125 2024-06-20 12:39:34,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=207166.66666666666, ans=0.0 2024-06-20 12:39:44,674 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.802e+02 1.940e+02 2.070e+02 2.513e+02, threshold=3.881e+02, percent-clipped=0.0 2024-06-20 12:39:46,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=207203.33333333334, ans=0.125 2024-06-20 12:39:50,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=207203.33333333334, ans=0.0 2024-06-20 12:39:55,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=207221.66666666666, ans=0.2 2024-06-20 12:39:58,948 INFO [train.py:1028] (1/2) Epoch 12, batch 1750, loss[loss=0.2452, simple_loss=0.2946, pruned_loss=0.0979, over 12717.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2798, pruned_loss=0.09373, over 2581599.07 frames. ], batch size: 22, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:40:21,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=207295.0, ans=0.125 2024-06-20 12:40:21,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.91 vs. limit=15.0 2024-06-20 12:40:23,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=207295.0, ans=0.125 2024-06-20 12:40:31,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=207331.66666666666, ans=0.125 2024-06-20 12:40:31,752 INFO [train.py:1028] (1/2) Epoch 12, batch 1800, loss[loss=0.236, simple_loss=0.288, pruned_loss=0.09199, over 13205.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.2799, pruned_loss=0.09395, over 2581381.79 frames. 
], batch size: 67, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:40:32,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=207331.66666666666, ans=0.015 2024-06-20 12:40:36,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=207331.66666666666, ans=0.125 2024-06-20 12:40:46,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=207350.0, ans=0.1 2024-06-20 12:40:51,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=207368.33333333334, ans=0.0 2024-06-20 12:40:54,193 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.874e+02 2.062e+02 2.319e+02 3.373e+02, threshold=4.123e+02, percent-clipped=0.0 2024-06-20 12:40:59,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=207386.66666666666, ans=0.035 2024-06-20 12:41:00,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=207386.66666666666, ans=0.5 2024-06-20 12:41:08,323 INFO [train.py:1028] (1/2) Epoch 12, batch 1850, loss[loss=0.2131, simple_loss=0.2573, pruned_loss=0.08439, over 13208.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.2802, pruned_loss=0.09382, over 2582726.53 frames. ], batch size: 83, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:41:14,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=207441.66666666666, ans=0.125 2024-06-20 12:41:20,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=207441.66666666666, ans=0.1 2024-06-20 12:41:21,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0 2024-06-20 12:41:22,835 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.91 vs. limit=15.0 2024-06-20 12:41:24,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=207460.0, ans=0.125 2024-06-20 12:41:36,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=207496.66666666666, ans=0.1 2024-06-20 12:41:43,706 INFO [train.py:1028] (1/2) Epoch 12, batch 1900, loss[loss=0.2374, simple_loss=0.2745, pruned_loss=0.1001, over 13199.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2796, pruned_loss=0.09372, over 2586005.37 frames. 
], batch size: 95, lr: 4.96e-03, grad_scale: 64.0 2024-06-20 12:41:43,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=207515.0, ans=0.035 2024-06-20 12:41:45,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=207515.0, ans=0.1 2024-06-20 12:41:54,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=207533.33333333334, ans=0.2 2024-06-20 12:41:55,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=207533.33333333334, ans=0.0 2024-06-20 12:42:00,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=207551.66666666666, ans=0.125 2024-06-20 12:42:02,262 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.823e+02 1.954e+02 2.118e+02 2.596e+02, threshold=3.907e+02, percent-clipped=0.0 2024-06-20 12:42:12,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=207588.33333333334, ans=0.0 2024-06-20 12:42:15,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=207606.66666666666, ans=0.2 2024-06-20 12:42:16,023 INFO [train.py:1028] (1/2) Epoch 12, batch 1950, loss[loss=0.2136, simple_loss=0.2627, pruned_loss=0.08221, over 13280.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2788, pruned_loss=0.09383, over 2592057.63 frames. ], batch size: 52, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:42:17,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=207606.66666666666, ans=0.0 2024-06-20 12:42:18,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=207606.66666666666, ans=0.0 2024-06-20 12:42:28,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.03 vs. limit=22.5 2024-06-20 12:42:50,898 INFO [train.py:1028] (1/2) Epoch 12, batch 2000, loss[loss=0.2521, simple_loss=0.3031, pruned_loss=0.1005, over 12810.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.2783, pruned_loss=0.09342, over 2588105.78 frames. ], batch size: 22, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:43:09,010 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.796e+02 1.950e+02 2.178e+02 3.040e+02, threshold=3.900e+02, percent-clipped=0.0 2024-06-20 12:43:14,292 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.51 vs. limit=22.5 2024-06-20 12:43:15,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=207753.33333333334, ans=0.0 2024-06-20 12:43:15,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=207753.33333333334, ans=0.125 2024-06-20 12:43:23,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=207771.66666666666, ans=0.035 2024-06-20 12:43:26,557 INFO [train.py:1028] (1/2) Epoch 12, batch 2050, loss[loss=0.2331, simple_loss=0.2844, pruned_loss=0.09094, over 12603.00 frames. 
], tot_loss[loss=0.2339, simple_loss=0.2797, pruned_loss=0.09399, over 2584207.29 frames. ], batch size: 29, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:43:27,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=15.0 2024-06-20 12:43:35,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=207808.33333333334, ans=0.125 2024-06-20 12:43:45,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=207845.0, ans=0.0 2024-06-20 12:43:51,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=207845.0, ans=0.1 2024-06-20 12:43:53,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=207863.33333333334, ans=0.0 2024-06-20 12:43:59,353 INFO [train.py:1028] (1/2) Epoch 12, batch 2100, loss[loss=0.2397, simple_loss=0.2884, pruned_loss=0.09549, over 13144.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2799, pruned_loss=0.09353, over 2586777.22 frames. ], batch size: 59, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:43:59,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=207881.66666666666, ans=0.125 2024-06-20 12:44:17,990 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.813e+02 1.919e+02 2.058e+02 3.055e+02, threshold=3.838e+02, percent-clipped=0.0 2024-06-20 12:44:18,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=207918.33333333334, ans=0.125 2024-06-20 12:44:25,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=207955.0, ans=0.025 2024-06-20 12:44:32,279 INFO [train.py:1028] (1/2) Epoch 12, batch 2150, loss[loss=0.211, simple_loss=0.2674, pruned_loss=0.07736, over 13310.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.28, pruned_loss=0.09338, over 2588225.46 frames. ], batch size: 52, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:44:33,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=207973.33333333334, ans=0.07 2024-06-20 12:44:33,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=207973.33333333334, ans=0.125 2024-06-20 12:44:49,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.21 vs. 
limit=10.0 2024-06-20 12:44:55,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=208028.33333333334, ans=0.125 2024-06-20 12:44:56,205 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.456e+01 2024-06-20 12:45:01,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=208046.66666666666, ans=0.2 2024-06-20 12:45:07,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=208046.66666666666, ans=0.125 2024-06-20 12:45:08,301 INFO [train.py:1028] (1/2) Epoch 12, batch 2200, loss[loss=0.2336, simple_loss=0.2788, pruned_loss=0.09423, over 13222.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.2803, pruned_loss=0.09362, over 2588662.51 frames. ], batch size: 83, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:45:09,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=208065.0, ans=0.125 2024-06-20 12:45:11,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=208065.0, ans=0.125 2024-06-20 12:45:16,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=208083.33333333334, ans=0.125 2024-06-20 12:45:19,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=208083.33333333334, ans=0.0 2024-06-20 12:45:20,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=208083.33333333334, ans=0.2 2024-06-20 12:45:21,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=208101.66666666666, ans=0.2 2024-06-20 12:45:21,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=208101.66666666666, ans=0.1 2024-06-20 12:45:23,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=208101.66666666666, ans=0.125 2024-06-20 12:45:26,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=208101.66666666666, ans=0.125 2024-06-20 12:45:26,772 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.858e+02 2.045e+02 2.280e+02 3.132e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 12:45:31,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=208120.0, ans=0.125 2024-06-20 12:45:35,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=208120.0, ans=0.1 2024-06-20 12:45:40,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=208138.33333333334, ans=0.025 2024-06-20 12:45:44,097 INFO [train.py:1028] (1/2) Epoch 12, batch 2250, loss[loss=0.2204, simple_loss=0.2786, pruned_loss=0.08112, over 13276.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2801, pruned_loss=0.09337, over 2587692.16 frames. 
], batch size: 63, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:45:47,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=208156.66666666666, ans=0.2 2024-06-20 12:45:47,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.49 vs. limit=10.0 2024-06-20 12:45:54,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.49 vs. limit=15.0 2024-06-20 12:45:55,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=208175.0, ans=0.0 2024-06-20 12:46:05,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=208211.66666666666, ans=0.125 2024-06-20 12:46:13,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=208230.0, ans=0.95 2024-06-20 12:46:13,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=208230.0, ans=0.125 2024-06-20 12:46:16,519 INFO [train.py:1028] (1/2) Epoch 12, batch 2300, loss[loss=0.2378, simple_loss=0.2818, pruned_loss=0.09683, over 12880.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2806, pruned_loss=0.09382, over 2581856.45 frames. ], batch size: 33, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:46:26,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=208266.66666666666, ans=0.125 2024-06-20 12:46:35,399 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.851e+02 2.035e+02 2.284e+02 3.226e+02, threshold=4.070e+02, percent-clipped=0.0 2024-06-20 12:46:36,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.09 vs. limit=6.0 2024-06-20 12:46:40,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=208303.33333333334, ans=0.1 2024-06-20 12:46:50,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=208321.66666666666, ans=0.125 2024-06-20 12:46:54,431 INFO [train.py:1028] (1/2) Epoch 12, batch 2350, loss[loss=0.2482, simple_loss=0.2944, pruned_loss=0.101, over 13219.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.28, pruned_loss=0.09371, over 2585881.07 frames. ], batch size: 67, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:47:07,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=208376.66666666666, ans=0.125 2024-06-20 12:47:11,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=208376.66666666666, ans=0.2 2024-06-20 12:47:14,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=208395.0, ans=0.0 2024-06-20 12:47:18,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.58 vs. 
limit=22.5 2024-06-20 12:47:30,272 INFO [train.py:1028] (1/2) Epoch 12, batch 2400, loss[loss=0.2286, simple_loss=0.2803, pruned_loss=0.0885, over 13324.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2795, pruned_loss=0.09338, over 2587748.22 frames. ], batch size: 46, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:47:38,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=208450.0, ans=0.0 2024-06-20 12:47:43,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=208468.33333333334, ans=0.2 2024-06-20 12:47:48,192 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.806e+02 1.932e+02 2.070e+02 2.777e+02, threshold=3.864e+02, percent-clipped=0.0 2024-06-20 12:47:50,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=208486.66666666666, ans=0.07 2024-06-20 12:47:51,469 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2024-06-20 12:47:55,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=208486.66666666666, ans=0.125 2024-06-20 12:48:02,686 INFO [train.py:1028] (1/2) Epoch 12, batch 2450, loss[loss=0.226, simple_loss=0.2714, pruned_loss=0.09027, over 13320.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2784, pruned_loss=0.09325, over 2584540.24 frames. ], batch size: 63, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:48:04,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=208523.33333333334, ans=0.2 2024-06-20 12:48:07,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=208523.33333333334, ans=0.0 2024-06-20 12:48:07,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208523.33333333334, ans=0.1 2024-06-20 12:48:08,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208541.66666666666, ans=0.1 2024-06-20 12:48:24,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.92 vs. limit=10.0 2024-06-20 12:48:25,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=208578.33333333334, ans=0.2 2024-06-20 12:48:38,345 INFO [train.py:1028] (1/2) Epoch 12, batch 2500, loss[loss=0.1967, simple_loss=0.2465, pruned_loss=0.07343, over 13186.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2771, pruned_loss=0.0925, over 2587986.95 frames. ], batch size: 83, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:48:39,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=208615.0, ans=0.2 2024-06-20 12:48:41,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=208615.0, ans=0.0 2024-06-20 12:48:49,755 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.99 vs. 
limit=22.5 2024-06-20 12:48:56,296 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.878e+02 2.066e+02 2.249e+02 3.091e+02, threshold=4.132e+02, percent-clipped=0.0 2024-06-20 12:49:01,386 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.19 vs. limit=10.0 2024-06-20 12:49:09,414 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.18 vs. limit=22.5 2024-06-20 12:49:10,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.17 vs. limit=15.0 2024-06-20 12:49:10,905 INFO [train.py:1028] (1/2) Epoch 12, batch 2550, loss[loss=0.2542, simple_loss=0.2991, pruned_loss=0.1046, over 12434.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2765, pruned_loss=0.09229, over 2587035.60 frames. ], batch size: 22, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:49:33,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=208743.33333333334, ans=0.125 2024-06-20 12:49:38,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.17 vs. limit=15.0 2024-06-20 12:49:38,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=208761.66666666666, ans=0.2 2024-06-20 12:49:45,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=208780.0, ans=0.125 2024-06-20 12:49:47,763 INFO [train.py:1028] (1/2) Epoch 12, batch 2600, loss[loss=0.1902, simple_loss=0.2396, pruned_loss=0.07041, over 13283.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2755, pruned_loss=0.09239, over 2586888.69 frames. ], batch size: 52, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:49:54,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=208816.66666666666, ans=0.125 2024-06-20 12:49:57,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=208816.66666666666, ans=0.125 2024-06-20 12:50:00,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=208835.0, ans=15.0 2024-06-20 12:50:06,182 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.777e+02 1.877e+02 2.056e+02 3.218e+02, threshold=3.755e+02, percent-clipped=0.0 2024-06-20 12:50:14,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=208871.66666666666, ans=0.0 2024-06-20 12:50:20,386 INFO [train.py:1028] (1/2) Epoch 12, batch 2650, loss[loss=0.2184, simple_loss=0.2522, pruned_loss=0.09237, over 13024.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2745, pruned_loss=0.09214, over 2587777.53 frames. 
], batch size: 144, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:50:26,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=208890.0, ans=0.125 2024-06-20 12:50:29,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=208908.33333333334, ans=0.125 2024-06-20 12:50:31,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=208908.33333333334, ans=10.0 2024-06-20 12:50:37,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.21 vs. limit=12.0 2024-06-20 12:50:38,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.19 vs. limit=12.0 2024-06-20 12:50:41,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=208926.66666666666, ans=0.0 2024-06-20 12:50:55,891 INFO [train.py:1028] (1/2) Epoch 12, batch 2700, loss[loss=0.1997, simple_loss=0.2421, pruned_loss=0.07859, over 13229.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2726, pruned_loss=0.09166, over 2585279.91 frames. ], batch size: 89, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:50:59,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=208981.66666666666, ans=0.0 2024-06-20 12:51:03,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209000.0, ans=0.1 2024-06-20 12:51:06,314 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.52 vs. limit=15.0 2024-06-20 12:51:08,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=209000.0, ans=0.07 2024-06-20 12:51:17,210 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.768e+02 1.872e+02 2.053e+02 2.762e+02, threshold=3.744e+02, percent-clipped=0.0 2024-06-20 12:51:31,461 INFO [train.py:1028] (1/2) Epoch 12, batch 2750, loss[loss=0.215, simple_loss=0.2596, pruned_loss=0.08516, over 13246.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2715, pruned_loss=0.09075, over 2580864.00 frames. ], batch size: 43, lr: 4.94e-03, grad_scale: 128.0 2024-06-20 12:51:36,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=209073.33333333334, ans=0.125 2024-06-20 12:51:36,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.40 vs. 
limit=15.0 2024-06-20 12:51:42,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=209091.66666666666, ans=0.0 2024-06-20 12:51:44,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=209110.0, ans=0.125 2024-06-20 12:51:45,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=209110.0, ans=15.0 2024-06-20 12:51:49,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209110.0, ans=0.1 2024-06-20 12:52:04,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.75 vs. limit=10.0 2024-06-20 12:52:04,990 INFO [train.py:1028] (1/2) Epoch 12, batch 2800, loss[loss=0.241, simple_loss=0.2735, pruned_loss=0.1042, over 10746.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2716, pruned_loss=0.09096, over 2577981.18 frames. ], batch size: 304, lr: 4.94e-03, grad_scale: 128.0 2024-06-20 12:52:25,407 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.899e+02 2.043e+02 2.244e+02 3.298e+02, threshold=4.085e+02, percent-clipped=0.0 2024-06-20 12:52:30,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=209220.0, ans=0.2 2024-06-20 12:52:31,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=209220.0, ans=0.0 2024-06-20 12:52:39,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.51 vs. limit=5.0 2024-06-20 12:52:39,595 INFO [train.py:1028] (1/2) Epoch 12, batch 2850, loss[loss=0.2277, simple_loss=0.2745, pruned_loss=0.09039, over 13355.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2711, pruned_loss=0.09092, over 2576797.07 frames. ], batch size: 49, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:52:40,277 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:52:40,945 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:53:01,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=209311.66666666666, ans=0.125 2024-06-20 12:53:12,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=209330.0, ans=0.0 2024-06-20 12:53:14,749 INFO [train.py:1028] (1/2) Epoch 12, batch 2900, loss[loss=0.1999, simple_loss=0.2498, pruned_loss=0.07499, over 13144.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2686, pruned_loss=0.08964, over 2584640.87 frames. ], batch size: 55, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:53:22,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. 
limit=15.0 2024-06-20 12:53:33,679 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.796e+02 1.937e+02 2.090e+02 2.797e+02, threshold=3.874e+02, percent-clipped=0.0 2024-06-20 12:53:37,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=209403.33333333334, ans=0.2 2024-06-20 12:53:42,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=209421.66666666666, ans=0.125 2024-06-20 12:53:42,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=209421.66666666666, ans=0.125 2024-06-20 12:53:47,947 INFO [train.py:1028] (1/2) Epoch 12, batch 2950, loss[loss=0.2132, simple_loss=0.2597, pruned_loss=0.08333, over 13345.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2678, pruned_loss=0.08932, over 2579572.66 frames. ], batch size: 43, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:54:23,738 INFO [train.py:1028] (1/2) Epoch 12, batch 3000, loss[loss=0.1995, simple_loss=0.2481, pruned_loss=0.07549, over 13255.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.267, pruned_loss=0.08904, over 2578859.08 frames. ], batch size: 59, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:54:23,738 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 12:54:31,691 INFO [train.py:1060] (1/2) Epoch 12, validation: loss=0.1935, simple_loss=0.2575, pruned_loss=0.06476, over 351949.00 frames. 2024-06-20 12:54:31,692 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB 2024-06-20 12:54:31,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=209531.66666666666, ans=0.125 2024-06-20 12:54:39,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=209550.0, ans=0.025 2024-06-20 12:54:50,420 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.804e+02 1.908e+02 2.121e+02 3.050e+02, threshold=3.816e+02, percent-clipped=0.0 2024-06-20 12:54:59,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=209586.66666666666, ans=0.0 2024-06-20 12:55:08,028 INFO [train.py:1028] (1/2) Epoch 12, batch 3050, loss[loss=0.2416, simple_loss=0.2862, pruned_loss=0.0985, over 13298.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2673, pruned_loss=0.08985, over 2578565.48 frames. ], batch size: 46, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:55:23,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=209660.0, ans=0.125 2024-06-20 12:55:29,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=209678.33333333334, ans=0.2 2024-06-20 12:55:30,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=209678.33333333334, ans=0.125 2024-06-20 12:55:40,280 INFO [train.py:1028] (1/2) Epoch 12, batch 3100, loss[loss=0.2229, simple_loss=0.2629, pruned_loss=0.0914, over 13017.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2661, pruned_loss=0.08893, over 2580272.96 frames. 
], batch size: 144, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:55:41,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=209715.0, ans=0.0 2024-06-20 12:55:54,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209751.66666666666, ans=0.1 2024-06-20 12:55:55,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=209751.66666666666, ans=0.125 2024-06-20 12:55:58,621 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.801e+02 1.919e+02 2.117e+02 2.638e+02, threshold=3.838e+02, percent-clipped=0.0 2024-06-20 12:56:05,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=209788.33333333334, ans=0.125 2024-06-20 12:56:15,498 INFO [train.py:1028] (1/2) Epoch 12, batch 3150, loss[loss=0.214, simple_loss=0.2502, pruned_loss=0.08888, over 12955.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2645, pruned_loss=0.08822, over 2581047.92 frames. ], batch size: 158, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:56:20,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.77 vs. limit=22.5 2024-06-20 12:56:31,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=209843.33333333334, ans=0.125 2024-06-20 12:56:34,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=209843.33333333334, ans=0.125 2024-06-20 12:56:35,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=209861.66666666666, ans=0.125 2024-06-20 12:56:39,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=209861.66666666666, ans=0.0 2024-06-20 12:56:48,093 INFO [train.py:1028] (1/2) Epoch 12, batch 3200, loss[loss=0.2173, simple_loss=0.2647, pruned_loss=0.0849, over 13140.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2642, pruned_loss=0.08801, over 2580971.87 frames. ], batch size: 55, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:56:51,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.49 vs. limit=10.0 2024-06-20 12:56:55,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209916.66666666666, ans=0.1 2024-06-20 12:56:55,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=209916.66666666666, ans=0.025 2024-06-20 12:56:59,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.48 vs. 
limit=12.0 2024-06-20 12:57:08,599 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.774e+02 1.903e+02 2.113e+02 2.650e+02, threshold=3.805e+02, percent-clipped=0.0 2024-06-20 12:57:10,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=209953.33333333334, ans=0.0 2024-06-20 12:57:10,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2024-06-20 12:57:17,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.71 vs. limit=15.0 2024-06-20 12:57:22,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2024-06-20 12:57:22,424 INFO [train.py:1028] (1/2) Epoch 12, batch 3250, loss[loss=0.2235, simple_loss=0.2674, pruned_loss=0.0898, over 13266.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2634, pruned_loss=0.08795, over 2585947.31 frames. ], batch size: 72, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:57:23,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=209990.0, ans=0.125 2024-06-20 12:57:39,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=210026.66666666666, ans=0.2 2024-06-20 12:57:41,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=210026.66666666666, ans=0.125 2024-06-20 12:57:43,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=210045.0, ans=0.125 2024-06-20 12:57:45,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=210045.0, ans=0.0 2024-06-20 12:57:49,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.08 vs. limit=12.0 2024-06-20 12:57:55,499 INFO [train.py:1028] (1/2) Epoch 12, batch 3300, loss[loss=0.2323, simple_loss=0.2728, pruned_loss=0.09592, over 12749.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2632, pruned_loss=0.08784, over 2583140.88 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:57:56,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=210081.66666666666, ans=0.04949747468305833 2024-06-20 12:57:59,741 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.61 vs. limit=15.0 2024-06-20 12:58:10,973 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. 
limit=15.0 2024-06-20 12:58:17,258 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.887e+02 2.046e+02 2.191e+02 3.145e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 12:58:20,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=210136.66666666666, ans=0.125 2024-06-20 12:58:22,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=210136.66666666666, ans=0.125 2024-06-20 12:58:22,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=210136.66666666666, ans=0.125 2024-06-20 12:58:32,079 INFO [train.py:1028] (1/2) Epoch 12, batch 3350, loss[loss=0.2338, simple_loss=0.2724, pruned_loss=0.09758, over 12876.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2627, pruned_loss=0.08783, over 2576958.88 frames. ], batch size: 158, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 12:58:34,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=210173.33333333334, ans=0.125 2024-06-20 12:58:43,378 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.88 vs. limit=10.0 2024-06-20 12:58:51,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=210210.0, ans=0.125 2024-06-20 12:58:53,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=210210.0, ans=0.0 2024-06-20 12:58:58,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.47 vs. limit=15.0 2024-06-20 12:59:04,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210246.66666666666, ans=0.1 2024-06-20 12:59:07,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210265.0, ans=0.1 2024-06-20 12:59:07,936 INFO [train.py:1028] (1/2) Epoch 12, batch 3400, loss[loss=0.2625, simple_loss=0.3069, pruned_loss=0.109, over 12557.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2618, pruned_loss=0.08751, over 2575365.71 frames. ], batch size: 22, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 12:59:09,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=210265.0, ans=0.2 2024-06-20 12:59:09,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=210265.0, ans=0.125 2024-06-20 12:59:13,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2024-06-20 12:59:19,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=210283.33333333334, ans=0.2 2024-06-20 12:59:23,263 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.99 vs. 
limit=15.0 2024-06-20 12:59:26,228 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.797e+02 1.956e+02 2.193e+02 3.184e+02, threshold=3.911e+02, percent-clipped=0.0 2024-06-20 12:59:31,252 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.88 vs. limit=22.5 2024-06-20 12:59:40,767 INFO [train.py:1028] (1/2) Epoch 12, batch 3450, loss[loss=0.2323, simple_loss=0.2656, pruned_loss=0.09956, over 12704.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2613, pruned_loss=0.0873, over 2576640.21 frames. ], batch size: 176, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 12:59:46,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210375.0, ans=0.1 2024-06-20 12:59:46,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.72 vs. limit=22.5 2024-06-20 12:59:58,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=210393.33333333334, ans=0.125 2024-06-20 12:59:58,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=210393.33333333334, ans=0.125 2024-06-20 13:00:10,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.91 vs. limit=15.0 2024-06-20 13:00:11,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.54 vs. limit=15.0 2024-06-20 13:00:14,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=210430.0, ans=0.125 2024-06-20 13:00:16,544 INFO [train.py:1028] (1/2) Epoch 12, batch 3500, loss[loss=0.2098, simple_loss=0.2579, pruned_loss=0.08086, over 12895.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.261, pruned_loss=0.08713, over 2575850.16 frames. ], batch size: 33, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:00:21,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.24 vs. limit=22.5 2024-06-20 13:00:22,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=210448.33333333334, ans=0.025 2024-06-20 13:00:32,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=210485.0, ans=0.125 2024-06-20 13:00:36,350 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.745e+02 1.849e+02 2.017e+02 2.416e+02, threshold=3.699e+02, percent-clipped=0.0 2024-06-20 13:00:42,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=210503.33333333334, ans=0.0 2024-06-20 13:00:42,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.93 vs. 
limit=10.0 2024-06-20 13:00:42,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=210503.33333333334, ans=0.0 2024-06-20 13:00:54,372 INFO [train.py:1028] (1/2) Epoch 12, batch 3550, loss[loss=0.2073, simple_loss=0.2415, pruned_loss=0.08651, over 13107.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2603, pruned_loss=0.08663, over 2576655.47 frames. ], batch size: 95, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:01:05,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=210558.33333333334, ans=0.125 2024-06-20 13:01:05,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.96 vs. limit=22.5 2024-06-20 13:01:11,380 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.64 vs. limit=15.0 2024-06-20 13:01:27,891 INFO [train.py:1028] (1/2) Epoch 12, batch 3600, loss[loss=0.2154, simple_loss=0.2627, pruned_loss=0.08411, over 13238.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2599, pruned_loss=0.08649, over 2580290.60 frames. ], batch size: 49, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:01:28,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210631.66666666666, ans=0.1 2024-06-20 13:01:47,089 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.767e+02 1.919e+02 2.148e+02 3.157e+02, threshold=3.839e+02, percent-clipped=0.0 2024-06-20 13:01:52,956 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.33 vs. limit=22.5 2024-06-20 13:01:55,428 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.50 vs. limit=15.0 2024-06-20 13:01:56,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=210705.0, ans=0.0 2024-06-20 13:01:59,146 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.76 vs. limit=22.5 2024-06-20 13:01:59,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2024-06-20 13:02:01,923 INFO [train.py:1028] (1/2) Epoch 12, batch 3650, loss[loss=0.1989, simple_loss=0.2481, pruned_loss=0.0748, over 13066.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2599, pruned_loss=0.08653, over 2578033.34 frames. 
], batch size: 102, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:02:14,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=210741.66666666666, ans=0.125 2024-06-20 13:02:18,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210760.0, ans=0.1 2024-06-20 13:02:28,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=210778.33333333334, ans=0.125 2024-06-20 13:02:37,229 INFO [train.py:1028] (1/2) Epoch 12, batch 3700, loss[loss=0.1902, simple_loss=0.247, pruned_loss=0.06667, over 13253.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2587, pruned_loss=0.08592, over 2583077.49 frames. ], batch size: 72, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:02:40,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=210815.0, ans=0.125 2024-06-20 13:02:41,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.94 vs. limit=15.0 2024-06-20 13:02:50,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=210851.66666666666, ans=0.05 2024-06-20 13:02:58,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=210851.66666666666, ans=0.125 2024-06-20 13:02:58,854 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.735e+02 1.878e+02 2.094e+02 2.953e+02, threshold=3.756e+02, percent-clipped=0.0 2024-06-20 13:02:59,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210870.0, ans=0.1 2024-06-20 13:03:13,239 INFO [train.py:1028] (1/2) Epoch 12, batch 3750, loss[loss=0.2178, simple_loss=0.259, pruned_loss=0.08833, over 12684.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2586, pruned_loss=0.0858, over 2586183.11 frames. ], batch size: 22, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:03:15,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.88 vs. limit=22.5 2024-06-20 13:03:18,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=210925.0, ans=0.0 2024-06-20 13:03:19,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=210925.0, ans=0.0 2024-06-20 13:03:22,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=210925.0, ans=0.0 2024-06-20 13:03:23,736 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2024-06-20 13:03:25,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.81 vs. limit=15.0 2024-06-20 13:03:31,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.77 vs. 
limit=22.5
2024-06-20 13:03:44,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=210998.33333333334, ans=0.125
2024-06-20 13:03:44,910 INFO [train.py:1028] (1/2) Epoch 12, batch 3800, loss[loss=0.2003, simple_loss=0.2434, pruned_loss=0.07858, over 13217.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2586, pruned_loss=0.08555, over 2583866.89 frames. ], batch size: 83, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:03:47,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=210998.33333333334, ans=0.2
2024-06-20 13:03:59,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.12 vs. limit=10.0
2024-06-20 13:04:03,389 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.765e+02 1.955e+02 2.089e+02 3.004e+02, threshold=3.909e+02, percent-clipped=0.0
2024-06-20 13:04:18,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=211071.66666666666, ans=0.0
2024-06-20 13:04:19,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=211071.66666666666, ans=0.125
2024-06-20 13:04:21,559 INFO [train.py:1028] (1/2) Epoch 12, batch 3850, loss[loss=0.2199, simple_loss=0.2597, pruned_loss=0.09008, over 13005.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2584, pruned_loss=0.08539, over 2584040.94 frames. ], batch size: 144, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:04:34,342 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0
2024-06-20 13:04:35,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=211126.66666666666, ans=0.07
2024-06-20 13:04:37,202 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:04:39,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=211126.66666666666, ans=0.2
2024-06-20 13:04:40,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211126.66666666666, ans=0.1
2024-06-20 13:04:44,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=211145.0, ans=0.0
2024-06-20 13:04:44,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=211145.0, ans=0.2
2024-06-20 13:04:49,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=211163.33333333334, ans=0.125
2024-06-20 13:04:53,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=211163.33333333334, ans=0.125
2024-06-20 13:04:54,267 INFO [train.py:1028] (1/2) Epoch 12, batch 3900, loss[loss=0.1985, simple_loss=0.2418, pruned_loss=0.0776, over 13226.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2581, pruned_loss=0.08524, over 2587406.57 frames. ], batch size: 83, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:04:55,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=211181.66666666666, ans=0.2
2024-06-20 13:05:04,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=211200.0, ans=0.0
2024-06-20 13:05:11,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=211218.33333333334, ans=0.125
2024-06-20 13:05:15,711 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.792e+02 1.896e+02 2.111e+02 3.584e+02, threshold=3.792e+02, percent-clipped=0.0
2024-06-20 13:05:18,144 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.48 vs. limit=22.5
2024-06-20 13:05:20,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=211236.66666666666, ans=0.035
2024-06-20 13:05:24,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=211255.0, ans=0.0
2024-06-20 13:05:26,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.45 vs. limit=22.5
2024-06-20 13:05:27,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.41 vs. limit=15.0
2024-06-20 13:05:27,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=211255.0, ans=0.0
2024-06-20 13:05:28,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=15.0
2024-06-20 13:05:30,480 INFO [train.py:1028] (1/2) Epoch 12, batch 3950, loss[loss=0.1975, simple_loss=0.239, pruned_loss=0.07804, over 13110.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2573, pruned_loss=0.08457, over 2590535.61 frames. ], batch size: 132, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:05:33,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.88 vs. limit=8.0
2024-06-20 13:05:35,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=211273.33333333334, ans=0.07
2024-06-20 13:05:44,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=211310.0, ans=0.2
2024-06-20 13:05:54,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=211328.33333333334, ans=0.125
2024-06-20 13:05:56,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=211346.66666666666, ans=0.025
2024-06-20 13:05:58,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=211346.66666666666, ans=0.2
2024-06-20 13:06:03,371 INFO [train.py:1028] (1/2) Epoch 12, batch 4000, loss[loss=0.2154, simple_loss=0.2596, pruned_loss=0.08557, over 12853.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.257, pruned_loss=0.08485, over 2585134.48 frames. ], batch size: 39, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:06:07,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=211365.0, ans=0.0
2024-06-20 13:06:21,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=211401.66666666666, ans=0.125
2024-06-20 13:06:24,887 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.825e+02 1.977e+02 2.201e+02 3.215e+02, threshold=3.954e+02, percent-clipped=0.0
2024-06-20 13:06:30,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=211420.0, ans=10.0
2024-06-20 13:06:39,736 INFO [train.py:1028] (1/2) Epoch 12, batch 4050, loss[loss=0.2318, simple_loss=0.2631, pruned_loss=0.1003, over 10982.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2571, pruned_loss=0.08495, over 2582283.31 frames. ], batch size: 304, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:06:45,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211475.0, ans=0.1
2024-06-20 13:06:46,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=211475.0, ans=0.125
2024-06-20 13:06:52,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.51 vs. limit=6.0
2024-06-20 13:06:59,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=211511.66666666666, ans=0.09899494936611666
2024-06-20 13:07:03,815 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.45 vs. limit=22.5
2024-06-20 13:07:16,183 INFO [train.py:1028] (1/2) Epoch 12, batch 4100, loss[loss=0.2314, simple_loss=0.2631, pruned_loss=0.0999, over 13047.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2573, pruned_loss=0.08554, over 2578181.45 frames. ], batch size: 102, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:07:16,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=211548.33333333334, ans=0.0
2024-06-20 13:07:23,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.15 vs. limit=22.5
2024-06-20 13:07:25,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=211566.66666666666, ans=0.125
2024-06-20 13:07:26,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=211566.66666666666, ans=0.125
2024-06-20 13:07:32,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=211585.0, ans=0.0
2024-06-20 13:07:34,456 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.798e+02 1.927e+02 2.096e+02 2.654e+02, threshold=3.854e+02, percent-clipped=0.0
2024-06-20 13:07:42,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=211621.66666666666, ans=0.0
2024-06-20 13:07:46,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=211621.66666666666, ans=0.0
2024-06-20 13:07:48,894 INFO [train.py:1028] (1/2) Epoch 12, batch 4150, loss[loss=0.2166, simple_loss=0.2621, pruned_loss=0.08554, over 13138.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.257, pruned_loss=0.08527, over 2575531.03 frames. ], batch size: 55, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:07:50,652 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.92 vs. limit=10.0
2024-06-20 13:07:52,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=211640.0, ans=0.125
2024-06-20 13:08:21,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=211713.33333333334, ans=0.125
2024-06-20 13:08:25,144 INFO [train.py:1028] (1/2) Epoch 12, batch 4200, loss[loss=0.2011, simple_loss=0.2468, pruned_loss=0.07766, over 13142.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2558, pruned_loss=0.08451, over 2578061.43 frames. ], batch size: 103, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:08:30,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=211731.66666666666, ans=0.0
2024-06-20 13:08:32,560 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.23 vs. limit=15.0
2024-06-20 13:08:36,412 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.340e+00
2024-06-20 13:08:43,399 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.761e+02 1.905e+02 2.140e+02 2.853e+02, threshold=3.810e+02, percent-clipped=0.0
2024-06-20 13:08:57,935 INFO [train.py:1028] (1/2) Epoch 12, batch 4250, loss[loss=0.1904, simple_loss=0.2345, pruned_loss=0.0732, over 13304.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2557, pruned_loss=0.08463, over 2581276.09 frames. ], batch size: 46, lr: 4.91e-03, grad_scale: 128.0
2024-06-20 13:08:57,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=211823.33333333334, ans=0.125
2024-06-20 13:09:01,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=211823.33333333334, ans=0.125
2024-06-20 13:09:04,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=211841.66666666666, ans=0.0
2024-06-20 13:09:23,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=211878.33333333334, ans=0.0
2024-06-20 13:09:26,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211896.66666666666, ans=0.125
2024-06-20 13:09:33,356 INFO [train.py:1028] (1/2) Epoch 12, batch 4300, loss[loss=0.2119, simple_loss=0.2588, pruned_loss=0.08251, over 13189.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2558, pruned_loss=0.08476, over 2581994.50 frames. ], batch size: 59, lr: 4.90e-03, grad_scale: 128.0
2024-06-20 13:09:34,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=211915.0, ans=0.1
2024-06-20 13:09:44,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.87 vs. limit=22.5
2024-06-20 13:09:46,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.32 vs. limit=22.5
2024-06-20 13:09:51,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=211951.66666666666, ans=0.125
2024-06-20 13:09:51,763 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.715e+02 1.873e+02 2.005e+02 2.786e+02, threshold=3.746e+02, percent-clipped=0.0
2024-06-20 13:09:51,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=211951.66666666666, ans=0.125
2024-06-20 13:09:53,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=211970.0, ans=0.125
2024-06-20 13:09:55,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=211970.0, ans=0.025
2024-06-20 13:09:57,748 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.98 vs. limit=22.5
2024-06-20 13:09:59,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.18 vs. limit=22.5
2024-06-20 13:10:05,748 INFO [train.py:1028] (1/2) Epoch 12, batch 4350, loss[loss=0.2893, simple_loss=0.3199, pruned_loss=0.1293, over 13192.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2556, pruned_loss=0.08483, over 2585933.49 frames. ], batch size: 59, lr: 4.90e-03, grad_scale: 128.0
2024-06-20 13:10:14,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212006.66666666666, ans=0.1
2024-06-20 13:10:18,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=212025.0, ans=0.125
2024-06-20 13:10:21,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=212043.33333333334, ans=0.125
2024-06-20 13:10:31,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=212061.66666666666, ans=0.0
2024-06-20 13:10:41,516 INFO [train.py:1028] (1/2) Epoch 12, batch 4400, loss[loss=0.2118, simple_loss=0.251, pruned_loss=0.08624, over 13196.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2552, pruned_loss=0.08467, over 2585158.63 frames. ], batch size: 83, lr: 4.90e-03, grad_scale: 128.0
2024-06-20 13:10:44,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=212098.33333333334, ans=0.125
2024-06-20 13:10:47,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=212116.66666666666, ans=0.125
2024-06-20 13:10:52,891 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.46 vs. limit=22.5
2024-06-20 13:10:53,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=212135.0, ans=0.05
2024-06-20 13:10:55,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=212135.0, ans=0.125
2024-06-20 13:11:00,159 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.722e+02 1.833e+02 1.981e+02 2.957e+02, threshold=3.665e+02, percent-clipped=0.0
2024-06-20 13:11:08,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=212153.33333333334, ans=0.07
2024-06-20 13:11:17,499 INFO [train.py:1028] (1/2) Epoch 12, batch 4450, loss[loss=0.2238, simple_loss=0.2693, pruned_loss=0.08918, over 12830.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2556, pruned_loss=0.08477, over 2580669.22 frames. ], batch size: 33, lr: 4.90e-03, grad_scale: 64.0
2024-06-20 13:11:20,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=212190.0, ans=0.1
2024-06-20 13:11:24,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.55 vs. limit=15.0
2024-06-20 13:11:27,243 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:11:36,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=212245.0, ans=0.125
2024-06-20 13:11:49,346 INFO [train.py:1028] (1/2) Epoch 12, batch 4500, loss[loss=0.212, simple_loss=0.2529, pruned_loss=0.0856, over 13261.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2544, pruned_loss=0.08434, over 2585690.74 frames. ], batch size: 89, lr: 4.90e-03, grad_scale: 64.0
2024-06-20 13:11:50,358 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=12.0
2024-06-20 13:11:50,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=212281.66666666666, ans=0.025
2024-06-20 13:11:55,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=212300.0, ans=0.2
2024-06-20 13:12:00,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.25 vs. limit=15.0
2024-06-20 13:12:06,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=212318.33333333334, ans=0.0
2024-06-20 13:12:08,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=212318.33333333334, ans=0.125
2024-06-20 13:12:09,128 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.728e+02 1.899e+02 2.090e+02 2.847e+02, threshold=3.798e+02, percent-clipped=0.0
2024-06-20 13:12:09,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0
2024-06-20 13:12:20,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=212355.0, ans=0.125
2024-06-20 13:12:26,232 INFO [train.py:1028] (1/2) Epoch 12, batch 4550, loss[loss=0.1928, simple_loss=0.242, pruned_loss=0.07176, over 13266.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2542, pruned_loss=0.08408, over 2589584.57 frames. ], batch size: 52, lr: 4.90e-03, grad_scale: 64.0
2024-06-20 13:12:27,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=212373.33333333334, ans=15.0
2024-06-20 13:12:35,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.21 vs. limit=15.0
2024-06-20 13:12:44,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=212410.0, ans=0.125
2024-06-20 13:12:57,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212446.66666666666, ans=0.1
2024-06-20 13:12:59,355 INFO [train.py:1028] (1/2) Epoch 12, batch 4600, loss[loss=0.2226, simple_loss=0.259, pruned_loss=0.09304, over 12606.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2545, pruned_loss=0.08407, over 2585285.31 frames. ], batch size: 202, lr: 4.90e-03, grad_scale: 64.0
2024-06-20 13:13:00,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=212465.0, ans=0.0
2024-06-20 13:13:18,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=212501.66666666666, ans=0.95
2024-06-20 13:13:21,758 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.805e+02 1.974e+02 2.252e+02 3.445e+02, threshold=3.949e+02, percent-clipped=0.0
2024-06-20 13:13:21,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=212520.0, ans=0.0
2024-06-20 13:13:33,058 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=3.220e-01
2024-06-20 13:13:33,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=212538.33333333334, ans=0.0
2024-06-20 13:13:35,372 INFO [train.py:1028] (1/2) Epoch 12, batch 4650, loss[loss=0.208, simple_loss=0.2477, pruned_loss=0.08422, over 13103.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2541, pruned_loss=0.08411, over 2589772.73 frames. ], batch size: 132, lr: 4.90e-03, grad_scale: 64.0
2024-06-20 13:13:40,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=212556.66666666666, ans=0.0
2024-06-20 13:13:49,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=212593.33333333334, ans=0.04949747468305833
2024-06-20 13:13:49,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.33 vs. limit=15.0
2024-06-20 13:13:54,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.40 vs. limit=10.0
2024-06-20 13:14:00,593 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:14:06,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212630.0, ans=0.1
2024-06-20 13:14:08,457 INFO [train.py:1028] (1/2) Epoch 12, batch 4700, loss[loss=0.216, simple_loss=0.2657, pruned_loss=0.08312, over 12641.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2544, pruned_loss=0.08424, over 2584454.74 frames. ], batch size: 25, lr: 4.90e-03, grad_scale: 64.0
2024-06-20 13:14:24,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=212666.66666666666, ans=0.0
2024-06-20 13:14:25,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.70 vs. limit=22.5
2024-06-20 13:14:31,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=212685.0, ans=0.0
2024-06-20 13:14:35,817 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.771e+02 1.904e+02 2.043e+02 2.707e+02, threshold=3.809e+02, percent-clipped=0.0
2024-06-20 13:14:37,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=212703.33333333334, ans=0.125
2024-06-20 13:14:41,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=212703.33333333334, ans=0.0
2024-06-20 13:14:45,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=212721.66666666666, ans=0.1
2024-06-20 13:14:47,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=212721.66666666666, ans=0.0
2024-06-20 13:14:49,718 INFO [train.py:1028] (1/2) Epoch 12, batch 4750, loss[loss=0.2333, simple_loss=0.2636, pruned_loss=0.1015, over 12540.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.254, pruned_loss=0.08441, over 2580210.40 frames. ], batch size: 202, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:14:52,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=212740.0, ans=0.1
2024-06-20 13:14:54,055 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0
2024-06-20 13:14:56,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=212758.33333333334, ans=0.125
2024-06-20 13:14:57,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=212758.33333333334, ans=0.05
2024-06-20 13:15:00,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=212758.33333333334, ans=0.1
2024-06-20 13:15:03,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=212776.66666666666, ans=0.2
2024-06-20 13:15:18,649 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:15:18,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=212795.0, ans=0.025
2024-06-20 13:15:27,736 INFO [train.py:1028] (1/2) Epoch 12, batch 4800, loss[loss=0.1932, simple_loss=0.234, pruned_loss=0.07623, over 13237.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2537, pruned_loss=0.08422, over 2576424.94 frames. ], batch size: 63, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:15:28,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=212831.66666666666, ans=0.2
2024-06-20 13:15:30,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.33 vs. limit=15.0
2024-06-20 13:15:32,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=212831.66666666666, ans=0.125
2024-06-20 13:15:33,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212850.0, ans=0.1
2024-06-20 13:15:35,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212850.0, ans=0.1
2024-06-20 13:15:38,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=212850.0, ans=0.125
2024-06-20 13:15:39,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.57 vs. limit=10.0
2024-06-20 13:15:46,890 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.765e+02 1.878e+02 2.043e+02 2.740e+02, threshold=3.756e+02, percent-clipped=0.0
2024-06-20 13:15:48,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.78 vs. limit=15.0
2024-06-20 13:15:50,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=212886.66666666666, ans=0.125
2024-06-20 13:16:00,563 INFO [train.py:1028] (1/2) Epoch 12, batch 4850, loss[loss=0.2204, simple_loss=0.2602, pruned_loss=0.09027, over 13246.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2536, pruned_loss=0.084, over 2573773.86 frames. ], batch size: 89, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:16:00,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=212923.33333333334, ans=0.0
2024-06-20 13:16:00,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.83 vs. limit=22.5
2024-06-20 13:16:02,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=212923.33333333334, ans=0.125
2024-06-20 13:16:03,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212923.33333333334, ans=0.1
2024-06-20 13:16:10,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=212941.66666666666, ans=0.0
2024-06-20 13:16:18,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=212960.0, ans=0.0
2024-06-20 13:16:27,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=212978.33333333334, ans=0.02
2024-06-20 13:16:37,778 INFO [train.py:1028] (1/2) Epoch 12, batch 4900, loss[loss=0.1765, simple_loss=0.2299, pruned_loss=0.06157, over 13237.00 frames. ], tot_loss[loss=0.21, simple_loss=0.253, pruned_loss=0.08355, over 2575209.65 frames. ], batch size: 59, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:16:39,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=213015.0, ans=0.09899494936611666
2024-06-20 13:16:52,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=213051.66666666666, ans=0.0
2024-06-20 13:16:55,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=213051.66666666666, ans=0.0
2024-06-20 13:16:57,179 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.785e+02 1.993e+02 2.153e+02 2.876e+02, threshold=3.985e+02, percent-clipped=0.0
2024-06-20 13:17:03,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=213070.0, ans=0.1
2024-06-20 13:17:10,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=213088.33333333334, ans=0.2
2024-06-20 13:17:14,159 INFO [train.py:1028] (1/2) Epoch 12, batch 4950, loss[loss=0.2261, simple_loss=0.2525, pruned_loss=0.09982, over 10945.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2538, pruned_loss=0.08434, over 2568434.97 frames. ], batch size: 303, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:17:19,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=213106.66666666666, ans=10.0
2024-06-20 13:17:38,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=213161.66666666666, ans=0.125
2024-06-20 13:17:44,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=213180.0, ans=0.125
2024-06-20 13:17:47,498 INFO [train.py:1028] (1/2) Epoch 12, batch 5000, loss[loss=0.2048, simple_loss=0.2443, pruned_loss=0.08264, over 13133.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2528, pruned_loss=0.08333, over 2572984.63 frames. ], batch size: 95, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:18:03,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=213235.0, ans=0.5
2024-06-20 13:18:04,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=213235.0, ans=0.125
2024-06-20 13:18:07,086 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.729e+02 1.823e+02 2.050e+02 2.953e+02, threshold=3.646e+02, percent-clipped=0.0
2024-06-20 13:18:20,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=213271.66666666666, ans=0.0
2024-06-20 13:18:24,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=213290.0, ans=0.1
2024-06-20 13:18:24,874 INFO [train.py:1028] (1/2) Epoch 12, batch 5050, loss[loss=0.2101, simple_loss=0.256, pruned_loss=0.0821, over 12878.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2534, pruned_loss=0.08351, over 2573019.91 frames. ], batch size: 36, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:18:37,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=213326.66666666666, ans=0.0
2024-06-20 13:18:46,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=213345.0, ans=0.04949747468305833
2024-06-20 13:18:47,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.44 vs. limit=6.0
2024-06-20 13:18:47,103 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0
2024-06-20 13:18:49,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=213345.0, ans=0.0
2024-06-20 13:18:59,247 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5
2024-06-20 13:18:59,613 INFO [train.py:1028] (1/2) Epoch 12, batch 5100, loss[loss=0.2133, simple_loss=0.2579, pruned_loss=0.08428, over 13015.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.254, pruned_loss=0.08456, over 2568935.04 frames. ], batch size: 39, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:19:02,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=213381.66666666666, ans=0.2
2024-06-20 13:19:12,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=213400.0, ans=0.2
2024-06-20 13:19:15,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=213418.33333333334, ans=0.1
2024-06-20 13:19:17,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.89 vs. limit=22.5
2024-06-20 13:19:19,327 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.08 vs. limit=15.0
2024-06-20 13:19:22,848 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.784e+02 1.973e+02 2.199e+02 2.882e+02, threshold=3.946e+02, percent-clipped=0.0
2024-06-20 13:19:31,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=213455.0, ans=0.025
2024-06-20 13:19:37,071 INFO [train.py:1028] (1/2) Epoch 12, batch 5150, loss[loss=0.2075, simple_loss=0.2464, pruned_loss=0.08427, over 13038.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.254, pruned_loss=0.0848, over 2571511.32 frames. ], batch size: 132, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:19:46,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213491.66666666666, ans=0.1
2024-06-20 13:20:02,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=213528.33333333334, ans=0.125
2024-06-20 13:20:07,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=213546.66666666666, ans=0.0
2024-06-20 13:20:13,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=213546.66666666666, ans=0.125
2024-06-20 13:20:14,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=213565.0, ans=0.125
2024-06-20 13:20:15,455 INFO [train.py:1028] (1/2) Epoch 12, batch 5200, loss[loss=0.2021, simple_loss=0.2347, pruned_loss=0.08471, over 13143.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2538, pruned_loss=0.08441, over 2574771.85 frames. ], batch size: 95, lr: 4.89e-03, grad_scale: 64.0
2024-06-20 13:20:23,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=213583.33333333334, ans=0.0
2024-06-20 13:20:27,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=213583.33333333334, ans=0.125
2024-06-20 13:20:31,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=213601.66666666666, ans=0.0
2024-06-20 13:20:35,058 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.758e+02 1.857e+02 2.118e+02 2.911e+02, threshold=3.715e+02, percent-clipped=0.0
2024-06-20 13:20:47,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.61 vs. limit=22.5
2024-06-20 13:20:48,763 INFO [train.py:1028] (1/2) Epoch 12, batch 5250, loss[loss=0.218, simple_loss=0.2621, pruned_loss=0.08695, over 13267.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2536, pruned_loss=0.08443, over 2571148.60 frames. ], batch size: 52, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:20:50,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=213656.66666666666, ans=0.025
2024-06-20 13:20:59,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=213675.0, ans=0.125
2024-06-20 13:21:04,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=213693.33333333334, ans=0.0
2024-06-20 13:21:09,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=213711.66666666666, ans=0.1
2024-06-20 13:21:09,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=213711.66666666666, ans=0.125
2024-06-20 13:21:18,309 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.53 vs. limit=15.0
2024-06-20 13:21:21,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=213730.0, ans=0.125
2024-06-20 13:21:23,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=213730.0, ans=0.125
2024-06-20 13:21:25,070 INFO [train.py:1028] (1/2) Epoch 12, batch 5300, loss[loss=0.2096, simple_loss=0.2483, pruned_loss=0.08544, over 13046.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.253, pruned_loss=0.0842, over 2566594.21 frames. ], batch size: 144, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:21:25,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=213748.33333333334, ans=0.125
2024-06-20 13:21:36,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=213766.66666666666, ans=0.2
2024-06-20 13:21:38,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=213785.0, ans=0.125
2024-06-20 13:21:39,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=213785.0, ans=0.0
2024-06-20 13:21:44,805 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.727e+02 1.834e+02 2.014e+02 2.606e+02, threshold=3.667e+02, percent-clipped=0.0
2024-06-20 13:21:55,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=213821.66666666666, ans=0.0
2024-06-20 13:21:59,126 INFO [train.py:1028] (1/2) Epoch 12, batch 5350, loss[loss=0.2053, simple_loss=0.2531, pruned_loss=0.07873, over 11057.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2528, pruned_loss=0.08388, over 2573136.93 frames. ], batch size: 16, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:22:04,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=213840.0, ans=0.0
2024-06-20 13:22:07,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=213858.33333333334, ans=0.0
2024-06-20 13:22:08,170 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.31 vs. limit=15.0
2024-06-20 13:22:13,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=213858.33333333334, ans=0.125
2024-06-20 13:22:19,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=213876.66666666666, ans=0.2
2024-06-20 13:22:20,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=213876.66666666666, ans=0.125
2024-06-20 13:22:21,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.07 vs. limit=15.0
2024-06-20 13:22:34,747 INFO [train.py:1028] (1/2) Epoch 12, batch 5400, loss[loss=0.2359, simple_loss=0.2629, pruned_loss=0.1044, over 12274.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2534, pruned_loss=0.08459, over 2567124.18 frames. ], batch size: 241, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:22:41,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=213950.0, ans=0.125
2024-06-20 13:22:54,031 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.01 vs. limit=15.0
2024-06-20 13:22:54,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.791e+02 1.911e+02 2.068e+02 2.817e+02, threshold=3.822e+02, percent-clipped=0.0
2024-06-20 13:23:11,949 INFO [train.py:1028] (1/2) Epoch 12, batch 5450, loss[loss=0.228, simple_loss=0.2734, pruned_loss=0.0913, over 12635.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2526, pruned_loss=0.08385, over 2572837.12 frames. ], batch size: 26, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:23:27,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=214060.0, ans=0.1
2024-06-20 13:23:27,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=214060.0, ans=0.2
2024-06-20 13:23:30,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=214060.0, ans=0.0
2024-06-20 13:23:31,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.98 vs. limit=22.5
2024-06-20 13:23:42,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.08 vs. limit=15.0
2024-06-20 13:23:43,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=214096.66666666666, ans=0.09899494936611666
2024-06-20 13:23:45,357 INFO [train.py:1028] (1/2) Epoch 12, batch 5500, loss[loss=0.2354, simple_loss=0.2692, pruned_loss=0.1008, over 12259.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2525, pruned_loss=0.08378, over 2566128.01 frames. ], batch size: 241, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:23:48,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214115.0, ans=0.1
2024-06-20 13:23:50,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=214115.0, ans=0.2
2024-06-20 13:23:50,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=214115.0, ans=0.125
2024-06-20 13:23:55,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=214133.33333333334, ans=0.125
2024-06-20 13:24:04,950 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.689e+02 1.784e+02 1.952e+02 2.382e+02, threshold=3.568e+02, percent-clipped=0.0
2024-06-20 13:24:12,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214170.0, ans=0.1
2024-06-20 13:24:20,887 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:24:22,647 INFO [train.py:1028] (1/2) Epoch 12, batch 5550, loss[loss=0.2116, simple_loss=0.2585, pruned_loss=0.08231, over 13285.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2529, pruned_loss=0.08355, over 2568981.96 frames. ], batch size: 43, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:24:22,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=214206.66666666666, ans=0.0
2024-06-20 13:24:24,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=214206.66666666666, ans=0.2
2024-06-20 13:24:24,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=214206.66666666666, ans=0.125
2024-06-20 13:24:49,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=214261.66666666666, ans=0.125
2024-06-20 13:24:55,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214280.0, ans=0.1
2024-06-20 13:24:58,987 INFO [train.py:1028] (1/2) Epoch 12, batch 5600, loss[loss=0.2285, simple_loss=0.2622, pruned_loss=0.09736, over 13259.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2522, pruned_loss=0.0832, over 2570607.00 frames. ], batch size: 89, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:24:59,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=214298.33333333334, ans=0.125
2024-06-20 13:24:59,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=214298.33333333334, ans=0.125
2024-06-20 13:25:04,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=214298.33333333334, ans=0.125
2024-06-20 13:25:04,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=214298.33333333334, ans=0.125
2024-06-20 13:25:22,154 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.701e+02 1.840e+02 1.995e+02 3.174e+02, threshold=3.680e+02, percent-clipped=0.0
2024-06-20 13:25:25,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=214353.33333333334, ans=0.125
2024-06-20 13:25:35,972 INFO [train.py:1028] (1/2) Epoch 12, batch 5650, loss[loss=0.2407, simple_loss=0.2783, pruned_loss=0.1015, over 12493.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2529, pruned_loss=0.08334, over 2576344.93 frames. ], batch size: 202, lr: 4.88e-03, grad_scale: 64.0
2024-06-20 13:25:50,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=214426.66666666666, ans=0.125
2024-06-20 13:25:51,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=214426.66666666666, ans=0.2
2024-06-20 13:25:51,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=214426.66666666666, ans=0.0
2024-06-20 13:25:52,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0
2024-06-20 13:25:53,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=214426.66666666666, ans=0.0
2024-06-20 13:25:54,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214426.66666666666, ans=0.1
2024-06-20 13:26:08,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=214481.66666666666, ans=0.125
2024-06-20 13:26:09,230 INFO [train.py:1028] (1/2) Epoch 12, batch 5700, loss[loss=0.2233, simple_loss=0.2693, pruned_loss=0.08868, over 13235.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2529, pruned_loss=0.08348, over 2580097.86 frames. ], batch size: 63, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:26:20,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=214500.0, ans=0.125
2024-06-20 13:26:23,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.30 vs. limit=15.0
2024-06-20 13:26:30,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=12.0
2024-06-20 13:26:31,649 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.800e+02 2.007e+02 2.275e+02 3.250e+02, threshold=4.014e+02, percent-clipped=0.0
2024-06-20 13:26:35,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=214536.66666666666, ans=0.025
2024-06-20 13:26:39,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=214555.0, ans=0.0
2024-06-20 13:26:43,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=214555.0, ans=0.125
2024-06-20 13:26:45,193 INFO [train.py:1028] (1/2) Epoch 12, batch 5750, loss[loss=0.2257, simple_loss=0.26, pruned_loss=0.09572, over 12778.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.253, pruned_loss=0.08363, over 2580608.60 frames. ], batch size: 176, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:26:49,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214573.33333333334, ans=0.1
2024-06-20 13:26:51,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=214591.66666666666, ans=0.2
2024-06-20 13:27:14,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=214646.66666666666, ans=0.1
2024-06-20 13:27:15,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=214646.66666666666, ans=0.125
2024-06-20 13:27:21,523 INFO [train.py:1028] (1/2) Epoch 12, batch 5800, loss[loss=0.2193, simple_loss=0.2564, pruned_loss=0.09107, over 12815.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.254, pruned_loss=0.08448, over 2579623.60 frames. ], batch size: 176, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:27:38,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214701.66666666666, ans=0.1
2024-06-20 13:27:38,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=214701.66666666666, ans=0.0
2024-06-20 13:27:40,284 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.780e+02 1.908e+02 2.158e+02 3.257e+02, threshold=3.817e+02, percent-clipped=0.0
2024-06-20 13:27:48,420 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.35 vs. limit=6.0
2024-06-20 13:27:53,911 INFO [train.py:1028] (1/2) Epoch 12, batch 5850, loss[loss=0.246, simple_loss=0.2804, pruned_loss=0.1058, over 12577.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2564, pruned_loss=0.08551, over 2579370.55 frames. ], batch size: 202, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:27:55,011 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=28.56 vs. limit=22.5
2024-06-20 13:27:55,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=214756.66666666666, ans=0.125
2024-06-20 13:27:56,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214756.66666666666, ans=0.1
2024-06-20 13:27:58,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=214756.66666666666, ans=0.125
2024-06-20 13:28:14,059 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.26 vs. limit=22.5
2024-06-20 13:28:17,647 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:28:22,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=214811.66666666666, ans=0.2
2024-06-20 13:28:30,420 INFO [train.py:1028] (1/2) Epoch 12, batch 5900, loss[loss=0.2155, simple_loss=0.2506, pruned_loss=0.09023, over 13087.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2579, pruned_loss=0.08602, over 2578440.88 frames. ], batch size: 121, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:28:46,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=6.0
2024-06-20 13:28:49,595 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.759e+02 1.871e+02 2.066e+02 2.811e+02, threshold=3.742e+02, percent-clipped=0.0
2024-06-20 13:28:49,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=214903.33333333334, ans=0.0
2024-06-20 13:28:50,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=214903.33333333334, ans=0.125
2024-06-20 13:28:51,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=214903.33333333334, ans=0.025
2024-06-20 13:28:58,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=214921.66666666666, ans=0.0
2024-06-20 13:29:02,304 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=12.0
2024-06-20 13:29:02,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=214940.0, ans=0.125
2024-06-20 13:29:03,393 INFO [train.py:1028] (1/2) Epoch 12, batch 5950, loss[loss=0.2237, simple_loss=0.264, pruned_loss=0.09174, over 13098.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2599, pruned_loss=0.08684, over 2582800.02 frames. ], batch size: 121, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:29:04,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214940.0, ans=0.1
2024-06-20 13:29:15,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=214958.33333333334, ans=0.0
2024-06-20 13:29:40,181 INFO [train.py:1028] (1/2) Epoch 12, batch 6000, loss[loss=0.2672, simple_loss=0.2964, pruned_loss=0.119, over 12160.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2609, pruned_loss=0.08726, over 2575131.19 frames. ], batch size: 240, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:29:40,182 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 13:29:46,582 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.5655, 2.9369, 4.2263, 2.0078], device='cuda:1')
2024-06-20 13:29:48,352 INFO [train.py:1060] (1/2) Epoch 12, validation: loss=0.1938, simple_loss=0.2582, pruned_loss=0.0647, over 351949.00 frames.
2024-06-20 13:29:48,353 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB
2024-06-20 13:29:59,147 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:30:00,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215050.0, ans=0.1
2024-06-20 13:30:07,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.81 vs. limit=10.0
2024-06-20 13:30:08,760 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.849e+02 2.009e+02 2.193e+02 2.852e+02, threshold=4.017e+02, percent-clipped=0.0
2024-06-20 13:30:22,488 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0
2024-06-20 13:30:26,897 INFO [train.py:1028] (1/2) Epoch 12, batch 6050, loss[loss=0.2378, simple_loss=0.2788, pruned_loss=0.09843, over 12973.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2624, pruned_loss=0.0878, over 2577520.51 frames. ], batch size: 39, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:30:32,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=215123.33333333334, ans=0.125
2024-06-20 13:30:34,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=215141.66666666666, ans=0.125
2024-06-20 13:30:38,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=215141.66666666666, ans=0.1
2024-06-20 13:30:41,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=215160.0, ans=0.05
2024-06-20 13:30:44,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=215160.0, ans=0.2
2024-06-20 13:30:53,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.85 vs. limit=10.0
2024-06-20 13:31:00,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=215215.0, ans=0.125
2024-06-20 13:31:01,035 INFO [train.py:1028] (1/2) Epoch 12, batch 6100, loss[loss=0.2112, simple_loss=0.2528, pruned_loss=0.08475, over 13094.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2632, pruned_loss=0.088, over 2579124.17 frames. ], batch size: 121, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:31:18,612 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.84 vs. limit=10.0
2024-06-20 13:31:26,688 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.840e+02 1.949e+02 2.179e+02 2.779e+02, threshold=3.897e+02, percent-clipped=0.0
2024-06-20 13:31:27,690 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:31:33,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=215288.33333333334, ans=0.1
2024-06-20 13:31:35,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=215288.33333333334, ans=0.125
2024-06-20 13:31:40,962 INFO [train.py:1028] (1/2) Epoch 12, batch 6150, loss[loss=0.2464, simple_loss=0.2728, pruned_loss=0.11, over 10775.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2648, pruned_loss=0.08862, over 2577452.64 frames. ], batch size: 303, lr: 4.87e-03, grad_scale: 64.0
2024-06-20 13:31:42,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=215306.66666666666, ans=0.2
2024-06-20 13:31:49,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=215325.0, ans=0.0
2024-06-20 13:31:49,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=215325.0, ans=0.2
2024-06-20 13:31:59,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=215343.33333333334, ans=0.0
2024-06-20 13:32:12,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=215380.0, ans=0.0
2024-06-20 13:32:12,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=215380.0, ans=0.0
2024-06-20 13:32:16,091 INFO [train.py:1028] (1/2) Epoch 12, batch 6200, loss[loss=0.26, simple_loss=0.308, pruned_loss=0.106, over 13232.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2661, pruned_loss=0.08907, over 2576192.12 frames. ], batch size: 89, lr: 4.86e-03, grad_scale: 32.0
2024-06-20 13:32:21,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=215398.33333333334, ans=0.2
2024-06-20 13:32:40,048 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.875e+02 2.035e+02 2.276e+02 3.398e+02, threshold=4.070e+02, percent-clipped=0.0
2024-06-20 13:32:41,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=215453.33333333334, ans=0.125
2024-06-20 13:32:44,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=215453.33333333334, ans=0.0
2024-06-20 13:32:45,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.05 vs. limit=6.0
2024-06-20 13:32:54,006 INFO [train.py:1028] (1/2) Epoch 12, batch 6250, loss[loss=0.2283, simple_loss=0.2739, pruned_loss=0.09129, over 13206.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2674, pruned_loss=0.08984, over 2568983.04 frames. ], batch size: 83, lr: 4.86e-03, grad_scale: 32.0
2024-06-20 13:32:56,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=215490.0, ans=0.2
2024-06-20 13:32:57,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=215490.0, ans=0.0
2024-06-20 13:33:08,502 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0
2024-06-20 13:33:14,253 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.65 vs. limit=15.0
2024-06-20 13:33:21,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215563.33333333334, ans=0.125
2024-06-20 13:33:31,561 INFO [train.py:1028] (1/2) Epoch 12, batch 6300, loss[loss=0.2229, simple_loss=0.2735, pruned_loss=0.0861, over 11311.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2696, pruned_loss=0.09093, over 2563982.62 frames. ], batch size: 16, lr: 4.86e-03, grad_scale: 32.0
2024-06-20 13:33:39,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=215600.0, ans=0.04949747468305833
2024-06-20 13:33:44,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0
2024-06-20 13:33:45,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=215618.33333333334, ans=0.0
2024-06-20 13:33:45,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=215618.33333333334, ans=0.0
2024-06-20 13:33:46,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=215618.33333333334, ans=0.04949747468305833
2024-06-20 13:33:52,169 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.966e+02 2.217e+02 2.508e+02 3.661e+02, threshold=4.434e+02, percent-clipped=0.0
2024-06-20 13:33:56,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=215636.66666666666, ans=0.2
2024-06-20 13:33:58,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=215636.66666666666, ans=0.1
2024-06-20 13:34:01,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=215655.0, ans=0.025
2024-06-20 13:34:02,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=215655.0, ans=0.125
2024-06-20 13:34:06,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=215655.0, ans=0.125
2024-06-20 13:34:07,363 INFO [train.py:1028] (1/2) Epoch 12, batch 6350, loss[loss=0.2789, simple_loss=0.3118, pruned_loss=0.123, over 12524.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2716, pruned_loss=0.09137, over 2572581.59 frames. ], batch size: 202, lr: 4.86e-03, grad_scale: 32.0
2024-06-20 13:34:13,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=215691.66666666666, ans=0.0
2024-06-20 13:34:23,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=215710.0, ans=0.04949747468305833
2024-06-20 13:34:25,493 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=4.485e-01
2024-06-20 13:34:37,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=215728.33333333334, ans=0.2
2024-06-20 13:34:48,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.59 vs. limit=15.0
2024-06-20 13:34:49,072 INFO [train.py:1028] (1/2) Epoch 12, batch 6400, loss[loss=0.2328, simple_loss=0.278, pruned_loss=0.09378, over 13235.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2737, pruned_loss=0.09233, over 2573737.11 frames.
], batch size: 67, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:35:01,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=215783.33333333334, ans=0.07 2024-06-20 13:35:12,378 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.866e+02 2.009e+02 2.274e+02 3.021e+02, threshold=4.019e+02, percent-clipped=0.0 2024-06-20 13:35:27,481 INFO [train.py:1028] (1/2) Epoch 12, batch 6450, loss[loss=0.2486, simple_loss=0.2867, pruned_loss=0.1053, over 12512.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.2748, pruned_loss=0.09268, over 2579538.58 frames. ], batch size: 202, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:35:37,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2024-06-20 13:35:51,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=215893.33333333334, ans=0.0 2024-06-20 13:35:53,855 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=12.0 2024-06-20 13:36:05,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=215930.0, ans=0.025 2024-06-20 13:36:09,587 INFO [train.py:1028] (1/2) Epoch 12, batch 6500, loss[loss=0.248, simple_loss=0.2828, pruned_loss=0.1066, over 10869.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2764, pruned_loss=0.09289, over 2582494.23 frames. ], batch size: 303, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:36:18,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=215966.66666666666, ans=0.125 2024-06-20 13:36:22,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=215966.66666666666, ans=0.1 2024-06-20 13:36:26,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=215985.0, ans=0.2 2024-06-20 13:36:29,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=215985.0, ans=0.125 2024-06-20 13:36:32,307 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.861e+02 2.032e+02 2.229e+02 3.062e+02, threshold=4.065e+02, percent-clipped=0.0 2024-06-20 13:36:34,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=216003.33333333334, ans=0.0 2024-06-20 13:36:38,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=216003.33333333334, ans=0.0 2024-06-20 13:36:43,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=216021.66666666666, ans=0.015 2024-06-20 13:36:44,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=216021.66666666666, ans=0.0 2024-06-20 13:36:45,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=216021.66666666666, ans=0.0 2024-06-20 13:36:47,165 INFO [train.py:1028] (1/2) Epoch 12, batch 6550, loss[loss=0.1946, simple_loss=0.253, pruned_loss=0.06812, over 12470.00 frames. 
], tot_loss[loss=0.2313, simple_loss=0.277, pruned_loss=0.0928, over 2586648.76 frames. ], batch size: 22, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:37:10,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=216076.66666666666, ans=0.0 2024-06-20 13:37:23,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216113.33333333334, ans=0.1 2024-06-20 13:37:28,710 INFO [train.py:1028] (1/2) Epoch 12, batch 6600, loss[loss=0.2327, simple_loss=0.2879, pruned_loss=0.08877, over 13217.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2777, pruned_loss=0.09288, over 2589321.85 frames. ], batch size: 72, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:37:34,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=216131.66666666666, ans=0.035 2024-06-20 13:37:37,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=216150.0, ans=0.125 2024-06-20 13:37:41,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=216150.0, ans=0.125 2024-06-20 13:37:52,256 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.934e+02 2.070e+02 2.267e+02 2.911e+02, threshold=4.140e+02, percent-clipped=0.0 2024-06-20 13:37:55,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=216186.66666666666, ans=0.0 2024-06-20 13:37:59,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216205.0, ans=0.1 2024-06-20 13:38:03,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216205.0, ans=0.1 2024-06-20 13:38:04,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=216205.0, ans=0.0 2024-06-20 13:38:07,488 INFO [train.py:1028] (1/2) Epoch 12, batch 6650, loss[loss=0.2393, simple_loss=0.2845, pruned_loss=0.09709, over 12897.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2801, pruned_loss=0.09406, over 2584048.98 frames. ], batch size: 158, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:38:35,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=216278.33333333334, ans=0.0 2024-06-20 13:38:46,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=216296.66666666666, ans=0.125 2024-06-20 13:38:47,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=216296.66666666666, ans=0.0 2024-06-20 13:38:51,131 INFO [train.py:1028] (1/2) Epoch 12, batch 6700, loss[loss=0.2593, simple_loss=0.3042, pruned_loss=0.1072, over 12842.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.281, pruned_loss=0.09462, over 2583754.83 frames. ], batch size: 177, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:39:09,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.50 vs. 
limit=15.0 2024-06-20 13:39:13,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=216370.0, ans=0.1 2024-06-20 13:39:14,091 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 1.928e+02 2.141e+02 2.350e+02 4.353e+02, threshold=4.282e+02, percent-clipped=1.0 2024-06-20 13:39:16,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=216370.0, ans=0.0 2024-06-20 13:39:16,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=216370.0, ans=0.07 2024-06-20 13:39:33,174 INFO [train.py:1028] (1/2) Epoch 12, batch 6750, loss[loss=0.2993, simple_loss=0.3304, pruned_loss=0.1341, over 12207.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.2821, pruned_loss=0.09537, over 2578151.30 frames. ], batch size: 241, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:39:43,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=216425.0, ans=0.025 2024-06-20 13:39:48,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=216443.33333333334, ans=0.0 2024-06-20 13:39:48,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=216443.33333333334, ans=0.0 2024-06-20 13:39:49,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=216443.33333333334, ans=0.125 2024-06-20 13:39:58,342 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.16 vs. limit=15.0 2024-06-20 13:40:11,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=216498.33333333334, ans=0.0 2024-06-20 13:40:11,800 INFO [train.py:1028] (1/2) Epoch 12, batch 6800, loss[loss=0.2225, simple_loss=0.2708, pruned_loss=0.08709, over 13207.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2837, pruned_loss=0.09593, over 2580094.28 frames. ], batch size: 67, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:40:14,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=216498.33333333334, ans=0.125 2024-06-20 13:40:15,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216498.33333333334, ans=0.0 2024-06-20 13:40:15,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=216498.33333333334, ans=0.125 2024-06-20 13:40:16,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=216498.33333333334, ans=0.1 2024-06-20 13:40:17,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=216498.33333333334, ans=0.2 2024-06-20 13:40:26,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.56 vs. 
limit=22.5 2024-06-20 13:40:27,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=216535.0, ans=0.0 2024-06-20 13:40:30,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=216535.0, ans=0.125 2024-06-20 13:40:33,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=216553.33333333334, ans=0.025 2024-06-20 13:40:38,255 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.917e+02 2.060e+02 2.278e+02 3.379e+02, threshold=4.121e+02, percent-clipped=0.0 2024-06-20 13:40:41,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2024-06-20 13:40:52,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=216571.66666666666, ans=0.0 2024-06-20 13:40:53,801 INFO [train.py:1028] (1/2) Epoch 12, batch 6850, loss[loss=0.237, simple_loss=0.2964, pruned_loss=0.08878, over 13278.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2839, pruned_loss=0.09578, over 2583372.58 frames. ], batch size: 63, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:40:54,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=216590.0, ans=0.0 2024-06-20 13:41:01,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.24 vs. limit=22.5 2024-06-20 13:41:14,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216626.66666666666, ans=0.1 2024-06-20 13:41:26,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=216663.33333333334, ans=10.0 2024-06-20 13:41:29,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=216663.33333333334, ans=0.125 2024-06-20 13:41:32,957 INFO [train.py:1028] (1/2) Epoch 12, batch 6900, loss[loss=0.2601, simple_loss=0.3013, pruned_loss=0.1095, over 13252.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.2849, pruned_loss=0.09599, over 2585825.18 frames. ], batch size: 49, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:41:34,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=216681.66666666666, ans=0.125 2024-06-20 13:41:43,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.88 vs. 
limit=15.0 2024-06-20 13:41:51,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=216700.0, ans=0.2 2024-06-20 13:42:00,298 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.940e+02 2.227e+02 2.484e+02 3.852e+02, threshold=4.454e+02, percent-clipped=0.0 2024-06-20 13:42:06,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=216736.66666666666, ans=0.2 2024-06-20 13:42:08,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=216755.0, ans=0.0 2024-06-20 13:42:16,121 INFO [train.py:1028] (1/2) Epoch 12, batch 6950, loss[loss=0.2376, simple_loss=0.2858, pruned_loss=0.09475, over 11380.00 frames. ], tot_loss[loss=0.238, simple_loss=0.2849, pruned_loss=0.09555, over 2579908.58 frames. ], batch size: 16, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:42:21,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=216773.33333333334, ans=0.0 2024-06-20 13:42:21,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=216773.33333333334, ans=0.0 2024-06-20 13:42:43,583 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2024-06-20 13:42:47,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216846.66666666666, ans=0.0 2024-06-20 13:42:50,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=216846.66666666666, ans=0.125 2024-06-20 13:42:54,995 INFO [train.py:1028] (1/2) Epoch 12, batch 7000, loss[loss=0.2388, simple_loss=0.288, pruned_loss=0.0948, over 12908.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.2855, pruned_loss=0.09579, over 2575221.64 frames. ], batch size: 158, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:43:00,618 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.10 vs. limit=10.0 2024-06-20 13:43:19,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=216901.66666666666, ans=0.025 2024-06-20 13:43:22,843 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.899e+02 2.008e+02 2.202e+02 3.785e+02, threshold=4.015e+02, percent-clipped=0.0 2024-06-20 13:43:33,164 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.26 vs. limit=22.5 2024-06-20 13:43:38,983 INFO [train.py:1028] (1/2) Epoch 12, batch 7050, loss[loss=0.2516, simple_loss=0.2962, pruned_loss=0.1035, over 12771.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2864, pruned_loss=0.09619, over 2582531.60 frames. 
], batch size: 176, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:43:39,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=216956.66666666666, ans=0.125 2024-06-20 13:43:45,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=216975.0, ans=0.07 2024-06-20 13:43:46,247 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.86 vs. limit=15.0 2024-06-20 13:43:54,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.96 vs. limit=12.0 2024-06-20 13:44:06,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=217011.66666666666, ans=0.125 2024-06-20 13:44:07,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=217011.66666666666, ans=0.07 2024-06-20 13:44:08,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=217030.0, ans=0.025 2024-06-20 13:44:09,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=217030.0, ans=0.0 2024-06-20 13:44:15,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=6.0 2024-06-20 13:44:21,378 INFO [train.py:1028] (1/2) Epoch 12, batch 7100, loss[loss=0.2758, simple_loss=0.3219, pruned_loss=0.1149, over 13225.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.2871, pruned_loss=0.09687, over 2575512.48 frames. ], batch size: 112, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:44:25,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=217048.33333333334, ans=0.2 2024-06-20 13:44:38,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=217085.0, ans=0.2 2024-06-20 13:44:44,804 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 1.931e+02 2.085e+02 2.284e+02 3.010e+02, threshold=4.170e+02, percent-clipped=0.0 2024-06-20 13:44:48,928 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.77 vs. limit=12.0 2024-06-20 13:44:54,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=217121.66666666666, ans=0.125 2024-06-20 13:45:00,127 INFO [train.py:1028] (1/2) Epoch 12, batch 7150, loss[loss=0.284, simple_loss=0.3264, pruned_loss=0.1208, over 12522.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2872, pruned_loss=0.09672, over 2572851.63 frames. ], batch size: 202, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:45:00,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. 
limit=15.0 2024-06-20 13:45:01,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=217140.0, ans=0.1 2024-06-20 13:45:02,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=217140.0, ans=0.5 2024-06-20 13:45:11,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=217158.33333333334, ans=0.125 2024-06-20 13:45:25,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=217195.0, ans=0.025 2024-06-20 13:45:26,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=217195.0, ans=0.05 2024-06-20 13:45:42,254 INFO [train.py:1028] (1/2) Epoch 12, batch 7200, loss[loss=0.2398, simple_loss=0.2941, pruned_loss=0.09278, over 13164.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2881, pruned_loss=0.09698, over 2578389.72 frames. ], batch size: 112, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:45:51,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=217250.0, ans=0.125 2024-06-20 13:45:54,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0 2024-06-20 13:45:55,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=217250.0, ans=0.0 2024-06-20 13:46:04,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=217268.33333333334, ans=0.2 2024-06-20 13:46:05,538 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.02 vs. limit=15.0 2024-06-20 13:46:06,005 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 1.915e+02 2.070e+02 2.223e+02 2.968e+02, threshold=4.139e+02, percent-clipped=0.0 2024-06-20 13:46:12,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=217286.66666666666, ans=0.0 2024-06-20 13:46:12,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.89 vs. limit=5.0 2024-06-20 13:46:13,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=217305.0, ans=0.025 2024-06-20 13:46:22,570 INFO [train.py:1028] (1/2) Epoch 12, batch 7250, loss[loss=0.2134, simple_loss=0.2708, pruned_loss=0.07803, over 12927.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.289, pruned_loss=0.09695, over 2578814.21 frames. 
], batch size: 36, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:46:37,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=217341.66666666666, ans=0.125 2024-06-20 13:46:41,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=217360.0, ans=0.1 2024-06-20 13:46:43,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=217360.0, ans=0.125 2024-06-20 13:46:46,594 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.40 vs. limit=22.5 2024-06-20 13:46:47,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=217360.0, ans=0.125 2024-06-20 13:46:51,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=217378.33333333334, ans=0.125 2024-06-20 13:46:57,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=217396.66666666666, ans=0.025 2024-06-20 13:47:04,796 INFO [train.py:1028] (1/2) Epoch 12, batch 7300, loss[loss=0.2529, simple_loss=0.2953, pruned_loss=0.1052, over 12814.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.29, pruned_loss=0.09752, over 2577822.72 frames. ], batch size: 36, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:47:16,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=217433.33333333334, ans=0.015 2024-06-20 13:47:27,850 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.910e+02 2.022e+02 2.185e+02 3.114e+02, threshold=4.044e+02, percent-clipped=0.0 2024-06-20 13:47:29,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=217470.0, ans=0.1 2024-06-20 13:47:39,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=217488.33333333334, ans=0.125 2024-06-20 13:47:42,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=217506.66666666666, ans=0.0 2024-06-20 13:47:42,927 INFO [train.py:1028] (1/2) Epoch 12, batch 7350, loss[loss=0.2661, simple_loss=0.3123, pruned_loss=0.11, over 13344.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.2911, pruned_loss=0.09785, over 2579402.70 frames. 
], batch size: 46, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:47:48,238 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:47:50,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=217525.0, ans=0.0 2024-06-20 13:47:51,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=217525.0, ans=0.025 2024-06-20 13:47:51,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=217525.0, ans=0.0 2024-06-20 13:47:56,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=217525.0, ans=0.1 2024-06-20 13:47:57,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=15.0 2024-06-20 13:48:02,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=217543.33333333334, ans=0.125 2024-06-20 13:48:13,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=217561.66666666666, ans=0.025 2024-06-20 13:48:21,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=217580.0, ans=0.125 2024-06-20 13:48:25,865 INFO [train.py:1028] (1/2) Epoch 12, batch 7400, loss[loss=0.2596, simple_loss=0.3086, pruned_loss=0.1053, over 13274.00 frames. ], tot_loss[loss=0.243, simple_loss=0.2909, pruned_loss=0.09755, over 2584846.32 frames. ], batch size: 63, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:48:46,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=217635.0, ans=0.0 2024-06-20 13:48:49,627 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 1.978e+02 2.089e+02 2.332e+02 3.208e+02, threshold=4.179e+02, percent-clipped=0.0 2024-06-20 13:48:52,498 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.80 vs. limit=15.0 2024-06-20 13:48:55,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=217653.33333333334, ans=0.2 2024-06-20 13:49:09,493 INFO [train.py:1028] (1/2) Epoch 12, batch 7450, loss[loss=0.2198, simple_loss=0.2783, pruned_loss=0.08071, over 12700.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.2906, pruned_loss=0.09735, over 2578776.97 frames. ], batch size: 29, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:49:17,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=217708.33333333334, ans=0.1 2024-06-20 13:49:18,820 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.36 vs. 
limit=15.0 2024-06-20 13:49:20,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=217708.33333333334, ans=0.0 2024-06-20 13:49:33,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2024-06-20 13:49:35,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=217745.0, ans=0.0 2024-06-20 13:49:35,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=217745.0, ans=0.04949747468305833 2024-06-20 13:49:38,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=217745.0, ans=0.0 2024-06-20 13:49:40,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217763.33333333334, ans=0.1 2024-06-20 13:49:42,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.86 vs. limit=12.0 2024-06-20 13:49:49,545 INFO [train.py:1028] (1/2) Epoch 12, batch 7500, loss[loss=0.2583, simple_loss=0.2943, pruned_loss=0.1111, over 10681.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.2914, pruned_loss=0.09789, over 2576269.59 frames. ], batch size: 303, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:49:54,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217781.66666666666, ans=0.1 2024-06-20 13:50:02,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=217800.0, ans=0.0 2024-06-20 13:50:04,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=217818.33333333334, ans=0.125 2024-06-20 13:50:08,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=217818.33333333334, ans=0.0 2024-06-20 13:50:13,174 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.875e+02 2.048e+02 2.243e+02 2.837e+02, threshold=4.096e+02, percent-clipped=0.0 2024-06-20 13:50:32,725 INFO [train.py:1028] (1/2) Epoch 12, batch 7550, loss[loss=0.262, simple_loss=0.3071, pruned_loss=0.1084, over 12930.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.293, pruned_loss=0.09917, over 2576508.70 frames. ], batch size: 158, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:50:36,820 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.26 vs. 
limit=22.5 2024-06-20 13:50:42,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=217891.66666666666, ans=0.125 2024-06-20 13:50:49,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217910.0, ans=0.1 2024-06-20 13:50:58,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=217928.33333333334, ans=0.0 2024-06-20 13:51:00,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=217928.33333333334, ans=0.125 2024-06-20 13:51:06,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=217946.66666666666, ans=0.125 2024-06-20 13:51:07,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=15.0 2024-06-20 13:51:11,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=217965.0, ans=0.125 2024-06-20 13:51:12,540 INFO [train.py:1028] (1/2) Epoch 12, batch 7600, loss[loss=0.2445, simple_loss=0.2904, pruned_loss=0.09935, over 13227.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.2927, pruned_loss=0.09916, over 2575574.14 frames. ], batch size: 83, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:51:29,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=217983.33333333334, ans=0.0 2024-06-20 13:51:39,778 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 1.955e+02 2.083e+02 2.239e+02 3.004e+02, threshold=4.166e+02, percent-clipped=0.0 2024-06-20 13:51:41,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.33 vs. limit=15.0 2024-06-20 13:51:52,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.32 vs. limit=22.5 2024-06-20 13:51:55,569 INFO [train.py:1028] (1/2) Epoch 12, batch 7650, loss[loss=0.2315, simple_loss=0.2851, pruned_loss=0.0889, over 12880.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.2931, pruned_loss=0.09888, over 2569881.06 frames. 
], batch size: 33, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:52:03,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=218075.0, ans=0.125 2024-06-20 13:52:04,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=218075.0, ans=0.1 2024-06-20 13:52:08,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=218075.0, ans=0.0 2024-06-20 13:52:16,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=218093.33333333334, ans=0.125 2024-06-20 13:52:17,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=218093.33333333334, ans=0.2 2024-06-20 13:52:25,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.09 vs. limit=15.0 2024-06-20 13:52:28,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=218130.0, ans=0.125 2024-06-20 13:52:28,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=218130.0, ans=0.0 2024-06-20 13:52:29,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=218130.0, ans=0.025 2024-06-20 13:52:34,081 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:52:35,380 INFO [train.py:1028] (1/2) Epoch 12, batch 7700, loss[loss=0.2406, simple_loss=0.3018, pruned_loss=0.08967, over 13240.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.2941, pruned_loss=0.09927, over 2567006.12 frames. ], batch size: 63, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:52:48,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=218166.66666666666, ans=0.125 2024-06-20 13:53:02,353 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 1.944e+02 2.075e+02 2.290e+02 3.443e+02, threshold=4.150e+02, percent-clipped=0.0 2024-06-20 13:53:06,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=218203.33333333334, ans=0.0 2024-06-20 13:53:18,050 INFO [train.py:1028] (1/2) Epoch 12, batch 7750, loss[loss=0.2285, simple_loss=0.2843, pruned_loss=0.08636, over 13234.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.2943, pruned_loss=0.09957, over 2572084.17 frames. ], batch size: 72, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:53:18,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=218240.0, ans=0.125 2024-06-20 13:53:23,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=218240.0, ans=0.125 2024-06-20 13:53:31,418 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.81 vs. 
limit=15.0 2024-06-20 13:53:33,287 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:53:33,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.33 vs. limit=15.0 2024-06-20 13:53:39,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=218276.66666666666, ans=0.0 2024-06-20 13:53:48,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=218295.0, ans=0.125 2024-06-20 13:53:52,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.13 vs. limit=22.5 2024-06-20 13:54:01,226 INFO [train.py:1028] (1/2) Epoch 12, batch 7800, loss[loss=0.2803, simple_loss=0.3189, pruned_loss=0.1208, over 13136.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2948, pruned_loss=0.09938, over 2576601.82 frames. ], batch size: 95, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:54:05,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=218331.66666666666, ans=0.125 2024-06-20 13:54:10,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=218350.0, ans=0.1 2024-06-20 13:54:11,847 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.98 vs. limit=15.0 2024-06-20 13:54:25,305 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.919e+02 2.129e+02 2.375e+02 3.233e+02, threshold=4.258e+02, percent-clipped=0.0 2024-06-20 13:54:33,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=218405.0, ans=0.2 2024-06-20 13:54:40,941 INFO [train.py:1028] (1/2) Epoch 12, batch 7850, loss[loss=0.2274, simple_loss=0.2765, pruned_loss=0.08914, over 10628.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2955, pruned_loss=0.1, over 2570032.68 frames. ], batch size: 16, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:54:47,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=218423.33333333334, ans=0.125 2024-06-20 13:54:49,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=218441.66666666666, ans=0.0 2024-06-20 13:55:16,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=218496.66666666666, ans=0.2 2024-06-20 13:55:23,836 INFO [train.py:1028] (1/2) Epoch 12, batch 7900, loss[loss=0.2346, simple_loss=0.2866, pruned_loss=0.09127, over 13149.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2958, pruned_loss=0.1005, over 2570756.69 frames. ], batch size: 77, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:55:32,481 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. 
limit=6.0 2024-06-20 13:55:36,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=218533.33333333334, ans=0.07 2024-06-20 13:55:45,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.56 vs. limit=5.0 2024-06-20 13:55:47,994 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.904e+02 1.975e+02 2.123e+02 3.024e+02, threshold=3.950e+02, percent-clipped=0.0 2024-06-20 13:55:56,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=218588.33333333334, ans=0.0 2024-06-20 13:56:07,861 INFO [train.py:1028] (1/2) Epoch 12, batch 7950, loss[loss=0.2317, simple_loss=0.2729, pruned_loss=0.09523, over 10606.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.296, pruned_loss=0.1007, over 2573924.90 frames. ], batch size: 303, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:56:09,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=218606.66666666666, ans=0.05 2024-06-20 13:56:19,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=218625.0, ans=0.125 2024-06-20 13:56:26,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=218643.33333333334, ans=0.0 2024-06-20 13:56:27,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=218643.33333333334, ans=0.125 2024-06-20 13:56:29,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=218643.33333333334, ans=0.1 2024-06-20 13:56:34,158 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2024-06-20 13:56:38,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=218661.66666666666, ans=0.125 2024-06-20 13:56:48,063 INFO [train.py:1028] (1/2) Epoch 12, batch 8000, loss[loss=0.2372, simple_loss=0.2902, pruned_loss=0.09208, over 12498.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.2968, pruned_loss=0.1007, over 2571190.51 frames. 
], batch size: 29, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:56:50,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=218698.33333333334, ans=0.125 2024-06-20 13:57:11,155 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 1.996e+02 2.156e+02 2.497e+02 4.158e+02, threshold=4.311e+02, percent-clipped=1.0 2024-06-20 13:57:19,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=218771.66666666666, ans=0.125 2024-06-20 13:57:29,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=218771.66666666666, ans=0.125 2024-06-20 13:57:29,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=218771.66666666666, ans=0.125 2024-06-20 13:57:31,180 INFO [train.py:1028] (1/2) Epoch 12, batch 8050, loss[loss=0.2472, simple_loss=0.299, pruned_loss=0.09769, over 13241.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.2961, pruned_loss=0.1, over 2571335.56 frames. ], batch size: 83, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:57:34,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=218790.0, ans=0.0 2024-06-20 13:57:42,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=218808.33333333334, ans=0.125 2024-06-20 13:57:47,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=218826.66666666666, ans=0.125 2024-06-20 13:58:09,428 INFO [train.py:1028] (1/2) Epoch 12, batch 8100, loss[loss=0.3191, simple_loss=0.3319, pruned_loss=0.1532, over 13128.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.2976, pruned_loss=0.1009, over 2574124.45 frames. ], batch size: 112, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:58:16,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=218900.0, ans=0.1 2024-06-20 13:58:24,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=218918.33333333334, ans=0.125 2024-06-20 13:58:24,594 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.74 vs. limit=15.0 2024-06-20 13:58:36,512 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 1.927e+02 2.078e+02 2.240e+02 3.380e+02, threshold=4.156e+02, percent-clipped=0.0 2024-06-20 13:58:43,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=218955.0, ans=0.125 2024-06-20 13:58:52,375 INFO [train.py:1028] (1/2) Epoch 12, batch 8150, loss[loss=0.2315, simple_loss=0.2774, pruned_loss=0.0928, over 13108.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.2974, pruned_loss=0.09997, over 2576764.08 frames. 
], batch size: 121, lr: 4.82e-03, grad_scale: 32.0
2024-06-20 13:58:58,083 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 13:59:06,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0
2024-06-20 13:59:07,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=219010.0, ans=0.125
2024-06-20 13:59:26,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=219046.66666666666, ans=0.125
2024-06-20 13:59:30,880 INFO [train.py:1028] (1/2) Epoch 12, batch 8200, loss[loss=0.2568, simple_loss=0.3077, pruned_loss=0.1029, over 13165.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.2982, pruned_loss=0.1002, over 2580889.97 frames. ], batch size: 112, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 13:59:34,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.61 vs. limit=12.0
2024-06-20 13:59:35,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=219065.0, ans=0.0
2024-06-20 13:59:59,234 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.988e+02 2.167e+02 2.477e+02 3.130e+02, threshold=4.334e+02, percent-clipped=0.0
2024-06-20 14:00:03,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=219120.0, ans=0.2
2024-06-20 14:00:10,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0
2024-06-20 14:00:10,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=219138.33333333334, ans=0.125
2024-06-20 14:00:15,328 INFO [train.py:1028] (1/2) Epoch 12, batch 8250, loss[loss=0.2443, simple_loss=0.3047, pruned_loss=0.09193, over 13218.00 frames. ], tot_loss[loss=0.25, simple_loss=0.2989, pruned_loss=0.1005, over 2581620.56 frames. ], batch size: 52, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 14:00:16,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=219156.66666666666, ans=0.05
2024-06-20 14:00:24,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=219175.0, ans=0.0
2024-06-20 14:00:29,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=219175.0, ans=0.5
2024-06-20 14:00:34,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=219193.33333333334, ans=0.2
2024-06-20 14:00:38,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=219211.66666666666, ans=0.0
2024-06-20 14:00:57,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219230.0, ans=0.1
2024-06-20 14:00:57,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=219230.0, ans=0.125
2024-06-20 14:00:59,415 INFO [train.py:1028] (1/2) Epoch 12, batch 8300, loss[loss=0.255, simple_loss=0.3035, pruned_loss=0.1033, over 13148.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.2987, pruned_loss=0.1005, over 2579814.52 frames. ], batch size: 103, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 14:01:04,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=219248.33333333334, ans=0.125
2024-06-20 14:01:12,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=219266.66666666666, ans=0.125
2024-06-20 14:01:22,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=219303.33333333334, ans=0.0
2024-06-20 14:01:23,290 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 1.972e+02 2.100e+02 2.238e+02 2.935e+02, threshold=4.200e+02, percent-clipped=0.0
2024-06-20 14:01:30,779 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.38 vs. limit=10.0
2024-06-20 14:01:32,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=219321.66666666666, ans=0.0
2024-06-20 14:01:32,937 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0
2024-06-20 14:01:33,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=219321.66666666666, ans=0.04949747468305833
2024-06-20 14:01:37,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=219321.66666666666, ans=0.125
2024-06-20 14:01:40,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0
2024-06-20 14:01:40,347 INFO [train.py:1028] (1/2) Epoch 12, batch 8350, loss[loss=0.237, simple_loss=0.2838, pruned_loss=0.09513, over 13238.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.2991, pruned_loss=0.1005, over 2581204.38 frames. ], batch size: 112, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 14:01:44,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=219340.0, ans=0.125
2024-06-20 14:01:44,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=219340.0, ans=0.125
2024-06-20 14:01:50,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=219358.33333333334, ans=0.0
2024-06-20 14:01:51,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=219358.33333333334, ans=0.0
2024-06-20 14:02:04,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=219395.0, ans=0.0
2024-06-20 14:02:06,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=219395.0, ans=0.2
2024-06-20 14:02:13,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=219413.33333333334, ans=0.125
2024-06-20 14:02:23,921 INFO [train.py:1028] (1/2) Epoch 12, batch 8400, loss[loss=0.2386, simple_loss=0.2927, pruned_loss=0.0923, over 12953.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.2994, pruned_loss=0.101, over 2577493.84 frames. ], batch size: 39, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 14:02:26,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0
2024-06-20 14:02:28,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=219431.66666666666, ans=0.035
2024-06-20 14:02:35,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219450.0, ans=0.1
2024-06-20 14:02:39,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=219468.33333333334, ans=0.125
2024-06-20 14:02:45,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=219468.33333333334, ans=0.1
2024-06-20 14:02:47,586 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.004e+02 2.187e+02 2.537e+02 3.691e+02, threshold=4.375e+02, percent-clipped=0.0
2024-06-20 14:03:03,330 INFO [train.py:1028] (1/2) Epoch 12, batch 8450, loss[loss=0.2655, simple_loss=0.3087, pruned_loss=0.1111, over 13175.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.2997, pruned_loss=0.101, over 2578851.86 frames. ], batch size: 112, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 14:03:08,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219523.33333333334, ans=0.1
2024-06-20 14:03:19,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=219541.66666666666, ans=0.2
2024-06-20 14:03:47,030 INFO [train.py:1028] (1/2) Epoch 12, batch 8500, loss[loss=0.2532, simple_loss=0.3045, pruned_loss=0.101, over 12691.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3007, pruned_loss=0.1016, over 2577084.12 frames. ], batch size: 29, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 14:03:58,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.05 vs. limit=15.0
2024-06-20 14:04:07,876 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.73 vs. limit=15.0
2024-06-20 14:04:11,227 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.986e+02 2.151e+02 2.409e+02 3.291e+02, threshold=4.302e+02, percent-clipped=0.0
2024-06-20 14:04:12,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=219670.0, ans=0.035
2024-06-20 14:04:13,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=219670.0, ans=0.0
2024-06-20 14:04:13,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=219670.0, ans=0.125
2024-06-20 14:04:21,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=219688.33333333334, ans=0.0
2024-06-20 14:04:21,960 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:04:27,322 INFO [train.py:1028] (1/2) Epoch 12, batch 8550, loss[loss=0.2263, simple_loss=0.2882, pruned_loss=0.08224, over 12497.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.2999, pruned_loss=0.1012, over 2575412.51 frames. ], batch size: 22, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 14:04:54,070 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.94 vs. limit=22.5
2024-06-20 14:05:11,136 INFO [train.py:1028] (1/2) Epoch 12, batch 8600, loss[loss=0.2579, simple_loss=0.2921, pruned_loss=0.1119, over 13129.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.2999, pruned_loss=0.1012, over 2571800.22 frames. ], batch size: 121, lr: 4.82e-03, grad_scale: 64.0
2024-06-20 14:05:15,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=219798.33333333334, ans=0.125
2024-06-20 14:05:33,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=219835.0, ans=0.2
2024-06-20 14:05:35,717 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.913e+02 2.023e+02 2.263e+02 2.929e+02, threshold=4.046e+02, percent-clipped=0.0
2024-06-20 14:05:39,342 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0
2024-06-20 14:05:39,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=219853.33333333334, ans=0.0
2024-06-20 14:05:40,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.00 vs. limit=15.0
2024-06-20 14:05:47,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0
2024-06-20 14:05:54,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=219890.0, ans=0.2
2024-06-20 14:05:55,522 INFO [train.py:1028] (1/2) Epoch 12, batch 8650, loss[loss=0.2497, simple_loss=0.2997, pruned_loss=0.09987, over 13149.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.2998, pruned_loss=0.1005, over 2574775.46 frames. ], batch size: 103, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:05:56,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=219890.0, ans=0.0
2024-06-20 14:06:17,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=219926.66666666666, ans=0.0
2024-06-20 14:06:29,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=219963.33333333334, ans=0.125
2024-06-20 14:06:30,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=219963.33333333334, ans=0.0
2024-06-20 14:06:34,567 INFO [train.py:1028] (1/2) Epoch 12, batch 8700, loss[loss=0.2536, simple_loss=0.3093, pruned_loss=0.09897, over 13182.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3003, pruned_loss=0.1011, over 2572533.76 frames. ], batch size: 59, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:06:39,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219981.66666666666, ans=0.1
2024-06-20 14:06:49,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=220000.0, ans=0.025
2024-06-20 14:06:53,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220000.0, ans=0.1
2024-06-20 14:06:57,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220018.33333333334, ans=0.1
2024-06-20 14:06:59,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=220018.33333333334, ans=0.1
2024-06-20 14:07:01,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=220018.33333333334, ans=0.125
2024-06-20 14:07:04,076 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 1.921e+02 2.080e+02 2.261e+02 4.273e+02, threshold=4.159e+02, percent-clipped=1.0
2024-06-20 14:07:12,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=220036.66666666666, ans=0.2
2024-06-20 14:07:20,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=220055.0, ans=0.125
2024-06-20 14:07:21,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=220055.0, ans=0.0
2024-06-20 14:07:24,012 INFO [train.py:1028] (1/2) Epoch 12, batch 8750, loss[loss=0.2453, simple_loss=0.2883, pruned_loss=0.1012, over 13070.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3009, pruned_loss=0.1018, over 2567290.72 frames. ], batch size: 121, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:07:27,173 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.86 vs. limit=15.0
2024-06-20 14:07:38,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=220091.66666666666, ans=0.125
2024-06-20 14:07:38,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=220091.66666666666, ans=0.2
2024-06-20 14:07:47,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=220128.33333333334, ans=0.125
2024-06-20 14:07:53,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=220128.33333333334, ans=0.125
2024-06-20 14:07:57,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=220146.66666666666, ans=0.125
2024-06-20 14:08:04,912 INFO [train.py:1028] (1/2) Epoch 12, batch 8800, loss[loss=0.2438, simple_loss=0.297, pruned_loss=0.09527, over 13231.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3013, pruned_loss=0.102, over 2573305.24 frames. ], batch size: 72, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:08:05,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=220165.0, ans=0.125
2024-06-20 14:08:25,532 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:08:26,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=220201.66666666666, ans=0.1
2024-06-20 14:08:33,117 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 1.949e+02 2.099e+02 2.318e+02 3.000e+02, threshold=4.197e+02, percent-clipped=0.0
2024-06-20 14:08:35,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=220220.0, ans=0.0
2024-06-20 14:08:43,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=220238.33333333334, ans=0.0
2024-06-20 14:08:43,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=220238.33333333334, ans=0.09899494936611666
2024-06-20 14:08:44,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=220238.33333333334, ans=0.025
2024-06-20 14:08:49,307 INFO [train.py:1028] (1/2) Epoch 12, batch 8850, loss[loss=0.2893, simple_loss=0.3325, pruned_loss=0.123, over 12494.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3009, pruned_loss=0.1022, over 2561811.88 frames. ], batch size: 202, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:08:49,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=220256.66666666666, ans=0.125
2024-06-20 14:08:56,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=220256.66666666666, ans=0.125
2024-06-20 14:09:08,263 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.01 vs. limit=6.0
2024-06-20 14:09:09,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.31 vs. limit=15.0
2024-06-20 14:09:12,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=220311.66666666666, ans=0.0
2024-06-20 14:09:12,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=220311.66666666666, ans=0.125
2024-06-20 14:09:14,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=220311.66666666666, ans=0.125
2024-06-20 14:09:15,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=220311.66666666666, ans=0.0
2024-06-20 14:09:16,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=220311.66666666666, ans=0.0
2024-06-20 14:09:17,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=220311.66666666666, ans=0.0
2024-06-20 14:09:31,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220348.33333333334, ans=0.1
2024-06-20 14:09:32,591 INFO [train.py:1028] (1/2) Epoch 12, batch 8900, loss[loss=0.2288, simple_loss=0.2823, pruned_loss=0.08766, over 12906.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.301, pruned_loss=0.1023, over 2559565.57 frames. ], batch size: 33, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:09:46,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=220366.66666666666, ans=0.2
2024-06-20 14:09:56,200 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.028e+02 2.239e+02 2.518e+02 3.270e+02, threshold=4.478e+02, percent-clipped=0.0
2024-06-20 14:10:01,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=220403.33333333334, ans=0.125
2024-06-20 14:10:10,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220421.66666666666, ans=0.1
2024-06-20 14:10:12,119 INFO [train.py:1028] (1/2) Epoch 12, batch 8950, loss[loss=0.2592, simple_loss=0.3039, pruned_loss=0.1072, over 12578.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3012, pruned_loss=0.1021, over 2560161.51 frames. ], batch size: 202, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:10:13,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220440.0, ans=0.1
2024-06-20 14:10:55,827 INFO [train.py:1028] (1/2) Epoch 12, batch 9000, loss[loss=0.2681, simple_loss=0.3177, pruned_loss=0.1093, over 13306.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.301, pruned_loss=0.1017, over 2566543.09 frames. ], batch size: 46, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:10:55,827 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 14:11:04,507 INFO [train.py:1060] (1/2) Epoch 12, validation: loss=0.1927, simple_loss=0.257, pruned_loss=0.06422, over 351949.00 frames.
2024-06-20 14:11:04,508 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17458MB
2024-06-20 14:11:05,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=220531.66666666666, ans=0.125
2024-06-20 14:11:07,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=220531.66666666666, ans=0.125
2024-06-20 14:11:10,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=220531.66666666666, ans=0.125
2024-06-20 14:11:11,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=220550.0, ans=0.125
2024-06-20 14:11:22,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=220568.33333333334, ans=0.2
2024-06-20 14:11:22,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.67 vs. limit=12.0
2024-06-20 14:11:27,954 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.928e+02 2.043e+02 2.203e+02 2.725e+02, threshold=4.087e+02, percent-clipped=0.0
2024-06-20 14:11:28,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220586.66666666666, ans=0.1
2024-06-20 14:11:31,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=220586.66666666666, ans=0.125
2024-06-20 14:11:43,466 INFO [train.py:1028] (1/2) Epoch 12, batch 9050, loss[loss=0.2387, simple_loss=0.2898, pruned_loss=0.0938, over 11495.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3019, pruned_loss=0.1022, over 2567455.26 frames. ], batch size: 17, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:11:45,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=220623.33333333334, ans=0.0
2024-06-20 14:11:51,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=220641.66666666666, ans=0.125
2024-06-20 14:11:56,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=220641.66666666666, ans=0.0
2024-06-20 14:12:09,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0
2024-06-20 14:12:12,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=220678.33333333334, ans=0.0
2024-06-20 14:12:17,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=220696.66666666666, ans=0.125
2024-06-20 14:12:21,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=220696.66666666666, ans=0.2
2024-06-20 14:12:22,876 INFO [train.py:1028] (1/2) Epoch 12, batch 9100, loss[loss=0.2531, simple_loss=0.3136, pruned_loss=0.09625, over 13084.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3007, pruned_loss=0.1013, over 2569543.05 frames. ], batch size: 71, lr: 4.81e-03, grad_scale: 64.0
2024-06-20 14:12:23,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=220715.0, ans=0.0
2024-06-20 14:12:30,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=220733.33333333334, ans=0.0
2024-06-20 14:12:31,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.30 vs. limit=22.5
2024-06-20 14:12:34,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.26 vs. limit=15.0
2024-06-20 14:12:38,090 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=12.0
2024-06-20 14:12:49,238 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 1.982e+02 2.172e+02 2.488e+02 3.363e+02, threshold=4.344e+02, percent-clipped=0.0
2024-06-20 14:12:51,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=220770.0, ans=0.2
2024-06-20 14:12:56,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=220788.33333333334, ans=0.125
2024-06-20 14:12:59,205 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:13:04,588 INFO [train.py:1028] (1/2) Epoch 12, batch 9150, loss[loss=0.2384, simple_loss=0.2946, pruned_loss=0.0911, over 13187.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3009, pruned_loss=0.1014, over 2569226.32 frames. ], batch size: 77, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:13:04,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=220806.66666666666, ans=0.0
2024-06-20 14:13:05,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=220806.66666666666, ans=0.125
2024-06-20 14:13:17,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=220825.0, ans=0.0
2024-06-20 14:13:22,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=220843.33333333334, ans=0.125
2024-06-20 14:13:33,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=220861.66666666666, ans=0.2
2024-06-20 14:13:42,529 INFO [train.py:1028] (1/2) Epoch 12, batch 9200, loss[loss=0.2429, simple_loss=0.3003, pruned_loss=0.09276, over 12949.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3007, pruned_loss=0.1006, over 2572513.26 frames. ], batch size: 36, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:13:46,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=220898.33333333334, ans=0.025
2024-06-20 14:13:46,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.17 vs. limit=10.0
2024-06-20 14:13:58,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=220935.0, ans=0.0
2024-06-20 14:14:00,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=220935.0, ans=0.125
2024-06-20 14:14:04,430 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.916e+02 2.015e+02 2.198e+02 3.345e+02, threshold=4.031e+02, percent-clipped=0.0
2024-06-20 14:14:13,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=220971.66666666666, ans=0.0
2024-06-20 14:14:18,527 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.97 vs. limit=15.0
2024-06-20 14:14:19,614 INFO [train.py:1028] (1/2) Epoch 12, batch 9250, loss[loss=0.2569, simple_loss=0.3029, pruned_loss=0.1055, over 13263.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3002, pruned_loss=0.1002, over 2573544.27 frames. ], batch size: 67, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:14:37,916 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:14:48,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=221045.0, ans=0.125
2024-06-20 14:14:54,407 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.99 vs. limit=15.0
2024-06-20 14:15:00,829 INFO [train.py:1028] (1/2) Epoch 12, batch 9300, loss[loss=0.2315, simple_loss=0.2887, pruned_loss=0.08718, over 12903.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3, pruned_loss=0.09994, over 2571480.66 frames. ], batch size: 39, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:15:05,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=221081.66666666666, ans=12.0
2024-06-20 14:15:10,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=221100.0, ans=0.0
2024-06-20 14:15:15,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=15.0
2024-06-20 14:15:19,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=221118.33333333334, ans=0.125
2024-06-20 14:15:21,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=221118.33333333334, ans=0.125
2024-06-20 14:15:23,712 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.980e+02 2.150e+02 2.290e+02 3.532e+02, threshold=4.301e+02, percent-clipped=0.0
2024-06-20 14:15:33,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221155.0, ans=0.1
2024-06-20 14:15:34,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=221155.0, ans=0.07
2024-06-20 14:15:38,987 INFO [train.py:1028] (1/2) Epoch 12, batch 9350, loss[loss=0.2549, simple_loss=0.312, pruned_loss=0.09892, over 12602.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3004, pruned_loss=0.1001, over 2567878.83 frames. ], batch size: 22, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:15:41,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=221173.33333333334, ans=0.125
2024-06-20 14:15:43,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=221173.33333333334, ans=0.125
2024-06-20 14:15:44,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0
2024-06-20 14:15:47,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=221191.66666666666, ans=0.025
2024-06-20 14:15:51,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=221191.66666666666, ans=0.2
2024-06-20 14:16:01,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.73 vs. limit=15.0
2024-06-20 14:16:15,413 INFO [train.py:1028] (1/2) Epoch 12, batch 9400, loss[loss=0.2549, simple_loss=0.3048, pruned_loss=0.1025, over 13198.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3009, pruned_loss=0.1006, over 2566782.43 frames. ], batch size: 52, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:16:28,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=221283.33333333334, ans=0.95
2024-06-20 14:16:28,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.32 vs. limit=15.0
2024-06-20 14:16:29,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=15.0
2024-06-20 14:16:29,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221301.66666666666, ans=0.1
2024-06-20 14:16:29,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=221301.66666666666, ans=0.04949747468305833
2024-06-20 14:16:32,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=221301.66666666666, ans=0.025
2024-06-20 14:16:36,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221301.66666666666, ans=0.1
2024-06-20 14:16:37,857 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.973e+02 2.162e+02 2.358e+02 3.495e+02, threshold=4.324e+02, percent-clipped=0.0
2024-06-20 14:16:40,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=221320.0, ans=0.125
2024-06-20 14:16:45,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=221320.0, ans=0.025
2024-06-20 14:16:48,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.93 vs. limit=15.0
2024-06-20 14:16:48,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=221338.33333333334, ans=0.125
2024-06-20 14:16:55,352 INFO [train.py:1028] (1/2) Epoch 12, batch 9450, loss[loss=0.2545, simple_loss=0.3173, pruned_loss=0.09586, over 12571.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3032, pruned_loss=0.102, over 2566787.77 frames. ], batch size: 22, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:17:02,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=221375.0, ans=0.125
2024-06-20 14:17:19,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=221411.66666666666, ans=0.125
2024-06-20 14:17:31,350 INFO [train.py:1028] (1/2) Epoch 12, batch 9500, loss[loss=0.254, simple_loss=0.3092, pruned_loss=0.0994, over 13249.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3031, pruned_loss=0.1017, over 2576087.48 frames. ], batch size: 43, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:17:38,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=221466.66666666666, ans=0.125
2024-06-20 14:17:46,891 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.19 vs. limit=15.0
2024-06-20 14:17:50,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=221485.0, ans=0.125
2024-06-20 14:17:52,975 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.998e+02 2.143e+02 2.298e+02 3.211e+02, threshold=4.286e+02, percent-clipped=0.0
2024-06-20 14:17:55,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=221503.33333333334, ans=0.0
2024-06-20 14:18:04,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=221521.66666666666, ans=0.0
2024-06-20 14:18:09,764 INFO [train.py:1028] (1/2) Epoch 12, batch 9550, loss[loss=0.2391, simple_loss=0.289, pruned_loss=0.09461, over 12890.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3026, pruned_loss=0.1014, over 2571191.87 frames. ], batch size: 39, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:18:20,282 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:18:25,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=6.0
2024-06-20 14:18:29,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221576.66666666666, ans=0.1
2024-06-20 14:18:31,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=221595.0, ans=0.0
2024-06-20 14:18:38,672 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.78 vs. limit=22.5
2024-06-20 14:18:40,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221613.33333333334, ans=0.1
2024-06-20 14:18:46,627 INFO [train.py:1028] (1/2) Epoch 12, batch 9600, loss[loss=0.2571, simple_loss=0.2973, pruned_loss=0.1084, over 10645.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3017, pruned_loss=0.101, over 2571204.32 frames. ], batch size: 303, lr: 4.80e-03, grad_scale: 64.0
2024-06-20 14:18:54,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.82 vs. limit=15.0
2024-06-20 14:19:02,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=221668.33333333334, ans=0.5
2024-06-20 14:19:02,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=221668.33333333334, ans=0.125
2024-06-20 14:19:08,058 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.944e+02 2.066e+02 2.300e+02 3.119e+02, threshold=4.132e+02, percent-clipped=0.0
2024-06-20 14:19:10,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=221686.66666666666, ans=0.125
2024-06-20 14:19:13,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=221686.66666666666, ans=0.125
2024-06-20 14:19:15,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=221705.0, ans=0.125
2024-06-20 14:19:16,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=221705.0, ans=0.125
2024-06-20 14:19:23,154 INFO [train.py:1028] (1/2) Epoch 12, batch 9650, loss[loss=0.2388, simple_loss=0.2865, pruned_loss=0.09552, over 13084.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3015, pruned_loss=0.1018, over 2562814.85 frames. ], batch size: 132, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:19:24,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=221723.33333333334, ans=0.1
2024-06-20 14:19:29,263 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5
2024-06-20 14:19:30,734 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:19:32,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=221741.66666666666, ans=10.0
2024-06-20 14:19:33,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221741.66666666666, ans=0.1
2024-06-20 14:19:36,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=221741.66666666666, ans=0.125
2024-06-20 14:19:50,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=221778.33333333334, ans=0.125
2024-06-20 14:19:55,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=221796.66666666666, ans=0.125
2024-06-20 14:19:56,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=221796.66666666666, ans=0.0
2024-06-20 14:20:01,632 INFO [train.py:1028] (1/2) Epoch 12, batch 9700, loss[loss=0.2606, simple_loss=0.3053, pruned_loss=0.1079, over 12991.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3011, pruned_loss=0.1017, over 2557498.86 frames. ], batch size: 144, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:20:05,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=221815.0, ans=0.125
2024-06-20 14:20:12,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=221833.33333333334, ans=0.125
2024-06-20 14:20:23,481 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 1.968e+02 2.235e+02 2.458e+02 3.114e+02, threshold=4.471e+02, percent-clipped=0.0
2024-06-20 14:20:28,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=221870.0, ans=0.0
2024-06-20 14:20:38,136 INFO [train.py:1028] (1/2) Epoch 12, batch 9750, loss[loss=0.2523, simple_loss=0.2986, pruned_loss=0.1029, over 13109.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.2995, pruned_loss=0.101, over 2552570.87 frames. ], batch size: 132, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:20:52,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=221925.0, ans=0.125
2024-06-20 14:20:59,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=221943.33333333334, ans=0.125
2024-06-20 14:21:04,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=221961.66666666666, ans=0.07
2024-06-20 14:21:14,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=221980.0, ans=0.0
2024-06-20 14:21:16,937 INFO [train.py:1028] (1/2) Epoch 12, batch 9800, loss[loss=0.2376, simple_loss=0.2926, pruned_loss=0.09133, over 13229.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.2982, pruned_loss=0.1001, over 2546968.78 frames. ], batch size: 40, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:21:30,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=222016.66666666666, ans=0.125
2024-06-20 14:21:33,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=222035.0, ans=0.125
2024-06-20 14:21:37,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=222035.0, ans=0.125
2024-06-20 14:21:38,640 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 1.929e+02 2.070e+02 2.246e+02 2.993e+02, threshold=4.139e+02, percent-clipped=0.0
2024-06-20 14:21:50,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222071.66666666666, ans=0.1
2024-06-20 14:21:52,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=222090.0, ans=0.125
2024-06-20 14:21:53,318 INFO [train.py:1028] (1/2) Epoch 12, batch 9850, loss[loss=0.2577, simple_loss=0.306, pruned_loss=0.1047, over 13034.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.2976, pruned_loss=0.09969, over 2538948.24 frames. ], batch size: 102, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:21:54,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=222090.0, ans=0.125
2024-06-20 14:21:54,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=222090.0, ans=0.0
2024-06-20 14:22:01,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.43 vs. limit=15.0
2024-06-20 14:22:02,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.18 vs. limit=22.5
2024-06-20 14:22:05,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=222108.33333333334, ans=0.125
2024-06-20 14:22:11,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=222126.66666666666, ans=0.025
2024-06-20 14:22:17,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=222145.0, ans=0.0
2024-06-20 14:22:20,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=222145.0, ans=0.2
2024-06-20 14:22:23,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=222163.33333333334, ans=0.0
2024-06-20 14:22:24,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=222163.33333333334, ans=0.025
2024-06-20 14:22:31,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=222181.66666666666, ans=22.5
2024-06-20 14:22:31,555 INFO [train.py:1028] (1/2) Epoch 12, batch 9900, loss[loss=0.2287, simple_loss=0.2829, pruned_loss=0.08722, over 13023.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.2975, pruned_loss=0.09989, over 2531015.42 frames. ], batch size: 39, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:22:31,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=222181.66666666666, ans=0.2
2024-06-20 14:22:36,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=222181.66666666666, ans=0.125
2024-06-20 14:22:50,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=222218.33333333334, ans=0.125
2024-06-20 14:22:52,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=222218.33333333334, ans=0.2
2024-06-20 14:22:53,902 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 1.984e+02 2.170e+02 2.409e+02 3.510e+02, threshold=4.341e+02, percent-clipped=0.0
2024-06-20 14:22:57,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=222236.66666666666, ans=0.0
2024-06-20 14:23:03,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.77 vs. limit=10.0
2024-06-20 14:23:08,595 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.23 vs. limit=22.5
2024-06-20 14:23:09,004 INFO [train.py:1028] (1/2) Epoch 12, batch 9950, loss[loss=0.2414, simple_loss=0.2858, pruned_loss=0.0985, over 12670.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.2961, pruned_loss=0.09982, over 2522906.23 frames. ], batch size: 29, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:23:13,666 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:23:21,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=222291.66666666666, ans=0.0
2024-06-20 14:23:33,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=222328.33333333334, ans=0.125
2024-06-20 14:23:34,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=222328.33333333334, ans=0.125
2024-06-20 14:23:34,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.03 vs. limit=15.0
2024-06-20 14:23:34,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=222328.33333333334, ans=0.125
2024-06-20 14:23:35,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=222328.33333333334, ans=0.125
2024-06-20 14:23:47,421 INFO [train.py:1028] (1/2) Epoch 12, batch 10000, loss[loss=0.2349, simple_loss=0.2829, pruned_loss=0.09342, over 12591.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.2971, pruned_loss=0.1007, over 2486722.78 frames. ], batch size: 22, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:23:54,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=222383.33333333334, ans=0.125
2024-06-20 14:24:00,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=222383.33333333334, ans=0.025
2024-06-20 14:24:09,195 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 1.983e+02 2.142e+02 2.370e+02 3.061e+02, threshold=4.284e+02, percent-clipped=0.0
2024-06-20 14:24:11,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=222420.0, ans=0.0
2024-06-20 14:24:12,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=222420.0, ans=0.125
2024-06-20 14:24:24,912 INFO [train.py:1028] (1/2) Epoch 12, batch 10050, loss[loss=0.2239, simple_loss=0.281, pruned_loss=0.08344, over 12530.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.2977, pruned_loss=0.1017, over 2443767.94 frames. ], batch size: 22, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:24:27,630 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:24:41,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=222493.33333333334, ans=0.2
2024-06-20 14:24:59,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222548.33333333334, ans=0.1
2024-06-20 14:25:00,283 INFO [train.py:1028] (1/2) Epoch 12, batch 10100, loss[loss=0.1997, simple_loss=0.2519, pruned_loss=0.07371, over 11548.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.2967, pruned_loss=0.1007, over 2426591.77 frames. ], batch size: 17, lr: 4.79e-03, grad_scale: 64.0
2024-06-20 14:27:34,220 INFO [train.py:1028] (1/2) Epoch 13, batch 0, loss[loss=0.2332, simple_loss=0.284, pruned_loss=0.09116, over 12905.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.284, pruned_loss=0.09116, over 12905.00 frames. ], batch size: 36, lr: 4.60e-03, grad_scale: 64.0
2024-06-20 14:27:34,221 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 14:27:40,831 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([0.2855, 0.2601, 0.6437, 0.0660], device='cuda:1')
2024-06-20 14:27:42,253 INFO [train.py:1060] (1/2) Epoch 13, validation: loss=0.1944, simple_loss=0.2592, pruned_loss=0.06477, over 351949.00 frames.
2024-06-20 14:27:42,254 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB
2024-06-20 14:27:43,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=222579.5, ans=0.125
2024-06-20 14:27:45,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=222579.5, ans=0.0
2024-06-20 14:27:46,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222579.5, ans=0.1
2024-06-20 14:27:52,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=222597.83333333334, ans=0.0
2024-06-20 14:27:53,143 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.932e+02 2.063e+02 2.347e+02 3.300e+02, threshold=4.126e+02, percent-clipped=0.0
2024-06-20 14:27:57,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=222597.83333333334, ans=0.0
2024-06-20 14:28:03,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=222616.16666666666, ans=0.035
2024-06-20 14:28:03,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=222616.16666666666, ans=0.125
2024-06-20 14:28:11,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.61 vs. limit=22.5
2024-06-20 14:28:11,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=222634.5, ans=0.2
2024-06-20 14:28:23,042 INFO [train.py:1028] (1/2) Epoch 13, batch 50, loss[loss=0.2486, simple_loss=0.2986, pruned_loss=0.0993, over 12594.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.2821, pruned_loss=0.0953, over 574604.75 frames. ], batch size: 29, lr: 4.60e-03, grad_scale: 64.0
2024-06-20 14:28:32,600 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:28:36,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=222689.5, ans=0.0
2024-06-20 14:28:38,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=222707.83333333334, ans=0.07
2024-06-20 14:28:40,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=222707.83333333334, ans=0.125
2024-06-20 14:29:01,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=222744.5, ans=0.125
2024-06-20 14:29:04,042 INFO [train.py:1028] (1/2) Epoch 13, batch 100, loss[loss=0.2099, simple_loss=0.268, pruned_loss=0.07585, over 13284.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.2785, pruned_loss=0.0933, over 1017539.80 frames. ], batch size: 46, lr: 4.60e-03, grad_scale: 128.0
2024-06-20 14:29:06,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=222762.83333333334, ans=0.0
2024-06-20 14:29:07,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=222762.83333333334, ans=0.125
2024-06-20 14:29:15,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=222781.16666666666, ans=0.125
2024-06-20 14:29:17,419 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.875e+02 1.989e+02 2.211e+02 3.304e+02, threshold=3.978e+02, percent-clipped=0.0
2024-06-20 14:29:20,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=222781.16666666666, ans=0.025
2024-06-20 14:29:29,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222799.5, ans=0.1
2024-06-20 14:29:33,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222817.83333333334, ans=0.1
2024-06-20 14:29:34,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=222817.83333333334, ans=0.0
2024-06-20 14:29:42,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=222836.16666666666, ans=0.2
2024-06-20 14:29:42,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=222836.16666666666, ans=0.125
2024-06-20 14:29:44,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=222836.16666666666, ans=0.07
2024-06-20 14:29:45,815 INFO [train.py:1028] (1/2) Epoch 13, batch 150, loss[loss=0.2349, simple_loss=0.2805, pruned_loss=0.09467, over 12560.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.278, pruned_loss=0.09226, over 1364903.67 frames. ], batch size: 29, lr: 4.60e-03, grad_scale: 128.0
2024-06-20 14:29:46,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=222854.5, ans=0.125
2024-06-20 14:29:49,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=222854.5, ans=0.125
2024-06-20 14:30:09,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=222909.5, ans=0.07
2024-06-20 14:30:13,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0
2024-06-20 14:30:18,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0
2024-06-20 14:30:21,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.46 vs. limit=12.0
2024-06-20 14:30:24,607 INFO [train.py:1028] (1/2) Epoch 13, batch 200, loss[loss=0.2456, simple_loss=0.2822, pruned_loss=0.1045, over 12529.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2774, pruned_loss=0.09232, over 1633840.33 frames. ], batch size: 202, lr: 4.59e-03, grad_scale: 128.0
2024-06-20 14:30:28,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=222946.16666666666, ans=0.0
2024-06-20 14:30:30,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=222946.16666666666, ans=0.025
2024-06-20 14:30:34,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=222964.5, ans=0.05
2024-06-20 14:30:34,747 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.888e+02 2.138e+02 2.350e+02 3.201e+02, threshold=4.276e+02, percent-clipped=0.0
2024-06-20 14:30:38,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.55 vs. limit=15.0
2024-06-20 14:30:38,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=222964.5, ans=0.0
2024-06-20 14:30:41,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=222982.83333333334, ans=0.125
2024-06-20 14:31:03,646 INFO [train.py:1028] (1/2) Epoch 13, batch 250, loss[loss=0.2125, simple_loss=0.2509, pruned_loss=0.08702, over 13030.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2764, pruned_loss=0.0915, over 1846286.68 frames. ], batch size: 144, lr: 4.59e-03, grad_scale: 128.0
2024-06-20 14:31:06,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=223037.83333333334, ans=0.09899494936611666
2024-06-20 14:31:10,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223037.83333333334, ans=0.1
2024-06-20 14:31:10,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=223056.16666666666, ans=0.0
2024-06-20 14:31:12,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=15.0
2024-06-20 14:31:13,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=223056.16666666666, ans=0.125
2024-06-20 14:31:18,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.85 vs. limit=15.0
2024-06-20 14:31:24,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=223074.5, ans=0.2
2024-06-20 14:31:31,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=223092.83333333334, ans=0.1
2024-06-20 14:31:33,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=223092.83333333334, ans=0.05
2024-06-20 14:31:33,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223092.83333333334, ans=0.1
2024-06-20 14:31:41,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=223111.16666666666, ans=0.125
2024-06-20 14:31:42,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=223111.16666666666, ans=0.125
2024-06-20 14:31:49,648 INFO [train.py:1028] (1/2) Epoch 13, batch 300, loss[loss=0.2147, simple_loss=0.2595, pruned_loss=0.08495, over 13145.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2756, pruned_loss=0.09076, over 2009362.13 frames. ], batch size: 112, lr: 4.59e-03, grad_scale: 128.0
2024-06-20 14:31:54,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=223129.5, ans=0.125
2024-06-20 14:31:54,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=223129.5, ans=0.125
2024-06-20 14:31:59,855 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.804e+02 1.940e+02 2.059e+02 2.624e+02, threshold=3.879e+02, percent-clipped=0.0
2024-06-20 14:32:07,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0
2024-06-20 14:32:21,529 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.965e+00
2024-06-20 14:32:21,550 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:32:23,125 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 14:32:25,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=223202.83333333334, ans=0.025
2024-06-20 14:32:28,414 INFO [train.py:1028] (1/2) Epoch 13, batch 350, loss[loss=0.2074, simple_loss=0.2609, pruned_loss=0.07696, over 12865.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2756, pruned_loss=0.09062, over 2138654.35 frames. ], batch size: 33, lr: 4.59e-03, grad_scale: 128.0
2024-06-20 14:32:40,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=223239.5, ans=15.0
2024-06-20 14:33:07,887 INFO [train.py:1028] (1/2) Epoch 13, batch 400, loss[loss=0.2294, simple_loss=0.2813, pruned_loss=0.08875, over 13269.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2754, pruned_loss=0.09002, over 2239126.13 frames. ], batch size: 63, lr: 4.59e-03, grad_scale: 128.0
2024-06-20 14:33:11,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=223312.83333333334, ans=0.025
2024-06-20 14:33:17,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=223331.16666666666, ans=0.125
2024-06-20 14:33:18,168 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.850e+02 2.035e+02 2.216e+02 3.237e+02, threshold=4.070e+02, percent-clipped=0.0
2024-06-20 14:33:19,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=223331.16666666666, ans=10.0
2024-06-20 14:33:33,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=223367.83333333334, ans=0.125
2024-06-20 14:33:38,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=223386.16666666666, ans=0.1
2024-06-20 14:33:48,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=223386.16666666666, ans=0.125
2024-06-20 14:33:48,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=12.0
2024-06-20 14:33:48,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=223404.5, ans=0.0
2024-06-20 14:33:49,658 INFO [train.py:1028] (1/2) Epoch 13, batch 450, loss[loss=0.2265, simple_loss=0.2727, pruned_loss=0.09016, over 13275.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2759, pruned_loss=0.09046, over 2313068.51 frames. ], batch size: 67, lr: 4.59e-03, grad_scale: 128.0
2024-06-20 14:33:55,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.32 vs.
limit=15.0 2024-06-20 14:34:02,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.18 vs. limit=10.0 2024-06-20 14:34:11,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=223422.83333333334, ans=0.125 2024-06-20 14:34:13,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=223441.16666666666, ans=0.1 2024-06-20 14:34:15,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.69 vs. limit=6.0 2024-06-20 14:34:20,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=223441.16666666666, ans=0.125 2024-06-20 14:34:42,007 INFO [train.py:1028] (1/2) Epoch 13, batch 500, loss[loss=0.2062, simple_loss=0.2546, pruned_loss=0.07896, over 13080.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2757, pruned_loss=0.08971, over 2375457.68 frames. ], batch size: 121, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:34:49,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=223514.5, ans=0.1 2024-06-20 14:34:53,613 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.783e+02 1.899e+02 2.002e+02 2.418e+02, threshold=3.798e+02, percent-clipped=0.0 2024-06-20 14:35:05,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=223532.83333333334, ans=0.2 2024-06-20 14:35:17,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=223551.16666666666, ans=15.0 2024-06-20 14:35:24,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=223569.5, ans=0.125 2024-06-20 14:35:27,409 INFO [train.py:1028] (1/2) Epoch 13, batch 550, loss[loss=0.2198, simple_loss=0.2552, pruned_loss=0.09224, over 12916.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2757, pruned_loss=0.08983, over 2420246.33 frames. ], batch size: 158, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:35:46,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.92 vs. limit=6.0 2024-06-20 14:35:46,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.52 vs. limit=15.0 2024-06-20 14:35:51,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=223642.83333333334, ans=0.125 2024-06-20 14:35:54,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=223642.83333333334, ans=0.0 2024-06-20 14:36:03,294 INFO [train.py:1028] (1/2) Epoch 13, batch 600, loss[loss=0.2161, simple_loss=0.2559, pruned_loss=0.08816, over 13064.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2751, pruned_loss=0.08948, over 2458861.10 frames. 
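In each `Epoch 13, batch N` summary, `loss[...]` is the current batch and `tot_loss[...]` is a smoothed statistic: its frame count grows by less than one interval's worth of new frames each time and levels off near 2.59M, which is consistent with an exponentially decayed frame-weighted sum using a factor of 1 - 1/200 (plausibly 1 - 1/reset_interval), reported as the weighted loss sum divided by the decayed frame count. A sketch of that bookkeeping, with class and field names assumed rather than taken from train.py:

```python
class DecayedLossTracker:
    """Decayed frame-weighted average consistent with the tot_loss records:
    the frame count saturates near 200 * frames_per_batch (~2.59e6 here)."""
    def __init__(self, reset_interval: int = 200):
        self.decay = 1.0 - 1.0 / reset_interval
        self.frames = 0.0
        self.loss_sum = 0.0  # frame-weighted loss

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.frames = self.frames * self.decay + batch_frames
        self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
        return self.loss_sum / self.frames  # the reported tot_loss value

tracker = DecayedLossTracker()
for _ in range(2600):
    tracker.update(0.23, 12900.0)
print(round(tracker.frames, 2))  # ~2.58e6, like the 'over ... frames' field
```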
], batch size: 144, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:36:07,652 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0 2024-06-20 14:36:13,168 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.863e+02 1.978e+02 2.174e+02 3.321e+02, threshold=3.955e+02, percent-clipped=0.0 2024-06-20 14:36:13,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=223697.83333333334, ans=0.125 2024-06-20 14:36:13,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=223697.83333333334, ans=0.125 2024-06-20 14:36:14,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=223697.83333333334, ans=0.0 2024-06-20 14:36:19,576 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:36:21,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=223716.16666666666, ans=0.0 2024-06-20 14:36:28,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=223734.5, ans=0.125 2024-06-20 14:36:34,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=223734.5, ans=0.0 2024-06-20 14:36:37,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=223752.83333333334, ans=0.125 2024-06-20 14:36:45,697 INFO [train.py:1028] (1/2) Epoch 13, batch 650, loss[loss=0.2527, simple_loss=0.3027, pruned_loss=0.1013, over 13193.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2759, pruned_loss=0.08966, over 2489988.42 frames. ], batch size: 59, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:37:13,884 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. limit=6.0 2024-06-20 14:37:21,453 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.10 vs. limit=10.0 2024-06-20 14:37:27,933 INFO [train.py:1028] (1/2) Epoch 13, batch 700, loss[loss=0.2224, simple_loss=0.2797, pruned_loss=0.08255, over 13272.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2759, pruned_loss=0.09001, over 2512246.36 frames. ], batch size: 46, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:37:36,594 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.63 vs. 
limit=10.0 2024-06-20 14:37:37,576 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.888e+02 2.069e+02 2.370e+02 3.607e+02, threshold=4.138e+02, percent-clipped=0.0 2024-06-20 14:37:38,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=223881.16666666666, ans=0.2 2024-06-20 14:37:46,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=223899.5, ans=0.125 2024-06-20 14:37:48,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=223899.5, ans=10.0 2024-06-20 14:38:04,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.69 vs. limit=6.0 2024-06-20 14:38:06,742 INFO [train.py:1028] (1/2) Epoch 13, batch 750, loss[loss=0.2331, simple_loss=0.2844, pruned_loss=0.09087, over 13275.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2764, pruned_loss=0.09006, over 2528043.60 frames. ], batch size: 63, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:38:09,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=223954.5, ans=0.125 2024-06-20 14:38:17,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=223972.83333333334, ans=0.125 2024-06-20 14:38:28,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=223991.16666666666, ans=0.0 2024-06-20 14:38:37,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=224027.83333333334, ans=0.0 2024-06-20 14:38:45,485 INFO [train.py:1028] (1/2) Epoch 13, batch 800, loss[loss=0.2103, simple_loss=0.2668, pruned_loss=0.07691, over 13006.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2758, pruned_loss=0.08972, over 2540625.18 frames. ], batch size: 36, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:38:55,542 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.793e+02 1.891e+02 2.058e+02 2.751e+02, threshold=3.782e+02, percent-clipped=0.0 2024-06-20 14:38:55,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0 2024-06-20 14:39:11,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=224101.16666666666, ans=0.125 2024-06-20 14:39:20,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.37 vs. limit=15.0 2024-06-20 14:39:26,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224119.5, ans=0.1 2024-06-20 14:39:30,581 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=15.0 2024-06-20 14:39:31,764 INFO [train.py:1028] (1/2) Epoch 13, batch 850, loss[loss=0.2319, simple_loss=0.2802, pruned_loss=0.09176, over 13087.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2757, pruned_loss=0.08946, over 2551200.85 frames. 
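The Whitening lines compare a whiteness statistic of a module's channel covariance against a limit, and the limit itself is scheduled (see the *.whitening_limit ScheduledFloat entries elsewhere in this log). A standard statistic of this kind, assumed here as an illustration of what `metric` plausibly measures rather than a transcription of scaling.py, is mean(eig(C)^2) / mean(eig(C))^2: exactly 1.0 for an isotropic (white) covariance and growing as the eigenvalue spread grows.

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """mean(eig(C)^2) / mean(eig(C))^2 for the channel covariance C of
    x with shape (..., num_channels); 1.0 when white, larger otherwise."""
    x = x.reshape(-1, x.shape[-1]).float()
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]
    d = cov.shape[0]
    # For symmetric C, (C * C).sum() == trace(C^2) == sum of squared eigenvalues.
    return (cov * cov).sum() * d / cov.diag().sum() ** 2

white = torch.randn(10000, 192)
print(whitening_metric(white))                     # ~1.0
print(whitening_metric(white * torch.rand(192)))   # > 1.0: unequal channel variances
```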
], batch size: 95, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:39:37,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=224137.83333333334, ans=0.125 2024-06-20 14:39:38,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=224137.83333333334, ans=0.0 2024-06-20 14:39:41,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=224156.16666666666, ans=0.0 2024-06-20 14:39:42,135 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.83 vs. limit=12.0 2024-06-20 14:39:45,727 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.40 vs. limit=15.0 2024-06-20 14:39:46,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=224174.5, ans=0.125 2024-06-20 14:39:55,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=224192.83333333334, ans=0.125 2024-06-20 14:39:59,016 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.735e+00 2024-06-20 14:40:08,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=224211.16666666666, ans=0.0 2024-06-20 14:40:08,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=224211.16666666666, ans=0.125 2024-06-20 14:40:09,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=224211.16666666666, ans=0.2 2024-06-20 14:40:10,790 INFO [train.py:1028] (1/2) Epoch 13, batch 900, loss[loss=0.2298, simple_loss=0.2804, pruned_loss=0.08963, over 12929.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2756, pruned_loss=0.08962, over 2556133.73 frames. ], batch size: 36, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:40:21,215 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.843e+02 1.973e+02 2.142e+02 2.751e+02, threshold=3.946e+02, percent-clipped=0.0 2024-06-20 14:40:29,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.95 vs. limit=6.0 2024-06-20 14:40:30,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=224266.16666666666, ans=0.125 2024-06-20 14:40:49,644 INFO [train.py:1028] (1/2) Epoch 13, batch 950, loss[loss=0.2039, simple_loss=0.2584, pruned_loss=0.07474, over 12896.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2754, pruned_loss=0.08942, over 2558662.73 frames. ], batch size: 39, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:40:53,247 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.54 vs. 
limit=15.0 2024-06-20 14:40:57,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224339.5, ans=0.1 2024-06-20 14:41:00,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.22 vs. limit=15.0 2024-06-20 14:41:04,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224357.83333333334, ans=0.1 2024-06-20 14:41:08,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.66 vs. limit=6.0 2024-06-20 14:41:10,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=224357.83333333334, ans=0.125 2024-06-20 14:41:25,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=224394.5, ans=0.125 2024-06-20 14:41:31,789 INFO [train.py:1028] (1/2) Epoch 13, batch 1000, loss[loss=0.248, simple_loss=0.2917, pruned_loss=0.1021, over 13301.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2754, pruned_loss=0.08985, over 2560933.25 frames. ], batch size: 49, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:41:45,026 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.823e+02 1.926e+02 2.080e+02 3.009e+02, threshold=3.852e+02, percent-clipped=0.0 2024-06-20 14:41:45,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=224431.16666666666, ans=0.125 2024-06-20 14:41:55,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224449.5, ans=0.1 2024-06-20 14:42:02,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.88 vs. limit=15.0 2024-06-20 14:42:07,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=224486.16666666666, ans=0.0 2024-06-20 14:42:14,182 INFO [train.py:1028] (1/2) Epoch 13, batch 1050, loss[loss=0.2053, simple_loss=0.2524, pruned_loss=0.07907, over 13228.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2755, pruned_loss=0.08984, over 2564509.51 frames. ], batch size: 77, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:42:14,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224504.5, ans=0.1 2024-06-20 14:42:19,726 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.79 vs. 
limit=22.5 2024-06-20 14:42:27,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=224522.83333333334, ans=0.05 2024-06-20 14:42:35,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=224541.16666666666, ans=0.125 2024-06-20 14:42:36,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=224559.5, ans=0.125 2024-06-20 14:42:36,997 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.95 vs. limit=22.5 2024-06-20 14:42:46,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224577.83333333334, ans=0.1 2024-06-20 14:42:53,686 INFO [train.py:1028] (1/2) Epoch 13, batch 1100, loss[loss=0.2204, simple_loss=0.2697, pruned_loss=0.0856, over 13266.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2762, pruned_loss=0.08989, over 2569794.94 frames. ], batch size: 52, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:43:02,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=224614.5, ans=0.125 2024-06-20 14:43:03,541 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.862e+02 2.001e+02 2.185e+02 2.725e+02, threshold=4.002e+02, percent-clipped=0.0 2024-06-20 14:43:15,110 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:43:19,807 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:43:21,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=224651.16666666666, ans=0.0 2024-06-20 14:43:26,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224669.5, ans=0.1 2024-06-20 14:43:27,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=224669.5, ans=0.125 2024-06-20 14:43:27,805 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.28 vs. limit=15.0 2024-06-20 14:43:30,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=224669.5, ans=0.2 2024-06-20 14:43:32,501 INFO [train.py:1028] (1/2) Epoch 13, batch 1150, loss[loss=0.2264, simple_loss=0.2747, pruned_loss=0.08902, over 13229.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2768, pruned_loss=0.09056, over 2570873.42 frames. ], batch size: 52, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:43:42,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.13 vs. 
limit=15.0 2024-06-20 14:43:44,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=224706.16666666666, ans=0.0 2024-06-20 14:43:49,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=224706.16666666666, ans=0.125 2024-06-20 14:43:55,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=224724.5, ans=0.2 2024-06-20 14:43:57,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=224724.5, ans=0.1 2024-06-20 14:44:06,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=224742.83333333334, ans=0.125 2024-06-20 14:44:07,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=224742.83333333334, ans=0.125 2024-06-20 14:44:14,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=224761.16666666666, ans=0.0 2024-06-20 14:44:16,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=224761.16666666666, ans=10.0 2024-06-20 14:44:18,078 INFO [train.py:1028] (1/2) Epoch 13, batch 1200, loss[loss=0.2219, simple_loss=0.2784, pruned_loss=0.08271, over 13107.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2766, pruned_loss=0.0907, over 2573731.37 frames. ], batch size: 77, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:44:27,770 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.855e+02 2.025e+02 2.211e+02 2.856e+02, threshold=4.050e+02, percent-clipped=0.0 2024-06-20 14:44:32,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=224797.83333333334, ans=0.125 2024-06-20 14:44:34,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.88 vs. limit=12.0 2024-06-20 14:44:47,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=224834.5, ans=0.125 2024-06-20 14:44:50,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=224852.83333333334, ans=0.0 2024-06-20 14:44:51,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=224852.83333333334, ans=0.5 2024-06-20 14:44:56,126 INFO [train.py:1028] (1/2) Epoch 13, batch 1250, loss[loss=0.2148, simple_loss=0.2658, pruned_loss=0.08189, over 13208.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2756, pruned_loss=0.08994, over 2583265.49 frames. 
], batch size: 112, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:45:04,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=224889.5, ans=0.125 2024-06-20 14:45:12,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=224907.83333333334, ans=0.125 2024-06-20 14:45:20,558 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-20 14:45:24,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=224926.16666666666, ans=0.0 2024-06-20 14:45:28,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=224944.5, ans=0.0 2024-06-20 14:45:35,665 INFO [train.py:1028] (1/2) Epoch 13, batch 1300, loss[loss=0.239, simple_loss=0.2856, pruned_loss=0.09626, over 12717.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2757, pruned_loss=0.08979, over 2582773.15 frames. ], batch size: 176, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:45:41,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=224962.83333333334, ans=0.2 2024-06-20 14:45:45,118 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.819e+02 1.917e+02 2.112e+02 2.858e+02, threshold=3.835e+02, percent-clipped=0.0 2024-06-20 14:45:50,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=224999.5, ans=0.2 2024-06-20 14:46:00,247 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.14 vs. limit=15.0 2024-06-20 14:46:10,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=225036.16666666666, ans=0.125 2024-06-20 14:46:12,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.88 vs. limit=22.5 2024-06-20 14:46:12,960 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.86 vs. limit=22.5 2024-06-20 14:46:17,955 INFO [train.py:1028] (1/2) Epoch 13, batch 1350, loss[loss=0.2323, simple_loss=0.2835, pruned_loss=0.09048, over 13188.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2767, pruned_loss=0.09016, over 2585179.48 frames. ], batch size: 59, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:46:20,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.53 vs. limit=22.5 2024-06-20 14:46:22,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=225054.5, ans=0.04949747468305833 2024-06-20 14:46:28,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=225072.83333333334, ans=0.1 2024-06-20 14:46:47,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.81 vs. 
limit=15.0 2024-06-20 14:46:48,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=225109.5, ans=0.125 2024-06-20 14:47:00,957 INFO [train.py:1028] (1/2) Epoch 13, batch 1400, loss[loss=0.2299, simple_loss=0.2864, pruned_loss=0.08669, over 12481.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2764, pruned_loss=0.0902, over 2586647.73 frames. ], batch size: 25, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:47:06,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0 2024-06-20 14:47:10,779 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.858e+02 1.963e+02 2.070e+02 2.904e+02, threshold=3.927e+02, percent-clipped=0.0 2024-06-20 14:47:13,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.66 vs. limit=15.0 2024-06-20 14:47:22,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=225182.83333333334, ans=0.125 2024-06-20 14:47:24,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.90 vs. limit=15.0 2024-06-20 14:47:39,401 INFO [train.py:1028] (1/2) Epoch 13, batch 1450, loss[loss=0.2201, simple_loss=0.2647, pruned_loss=0.0877, over 13149.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2763, pruned_loss=0.09025, over 2586858.61 frames. ], batch size: 121, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:47:41,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=225237.83333333334, ans=0.125 2024-06-20 14:47:47,892 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0 2024-06-20 14:48:00,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=225274.5, ans=0.125 2024-06-20 14:48:04,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=225292.83333333334, ans=0.0 2024-06-20 14:48:05,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.21 vs. limit=15.0 2024-06-20 14:48:06,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2024-06-20 14:48:18,194 INFO [train.py:1028] (1/2) Epoch 13, batch 1500, loss[loss=0.2119, simple_loss=0.259, pruned_loss=0.08241, over 13183.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2758, pruned_loss=0.09018, over 2589588.51 frames. 
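The `batch size` field in these summaries swings from 22 to 202 because batches are packed to a total-duration budget rather than a fixed sentence count: buckets of long utterances yield few cuts per batch, buckets of short utterances many. A hypothetical sketch using lhotse's DynamicBucketingSampler; the manifest path and parameter values below are assumptions for illustration, not read from this run.

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # assumed path
sampler = DynamicBucketingSampler(
    cuts, max_duration=550.0, num_buckets=30, shuffle=True, drop_last=True
)
for batch in sampler:
    # Cut count varies per bucket; total duration stays near the budget.
    print(len(batch), sum(c.duration for c in batch))
    break
```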
], batch size: 83, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:48:31,443 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.867e+02 2.008e+02 2.156e+02 2.806e+02, threshold=4.017e+02, percent-clipped=0.0 2024-06-20 14:48:45,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=225384.5, ans=0.125 2024-06-20 14:48:45,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.05 vs. limit=15.0 2024-06-20 14:49:03,107 INFO [train.py:1028] (1/2) Epoch 13, batch 1550, loss[loss=0.224, simple_loss=0.2651, pruned_loss=0.09146, over 13071.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2764, pruned_loss=0.09086, over 2584651.90 frames. ], batch size: 102, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:49:03,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=225421.16666666666, ans=0.125 2024-06-20 14:49:08,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=225421.16666666666, ans=0.125 2024-06-20 14:49:17,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=225439.5, ans=0.0 2024-06-20 14:49:25,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=225476.16666666666, ans=0.125 2024-06-20 14:49:32,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=225476.16666666666, ans=0.125 2024-06-20 14:49:42,175 INFO [train.py:1028] (1/2) Epoch 13, batch 1600, loss[loss=0.2137, simple_loss=0.2566, pruned_loss=0.08534, over 13180.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2765, pruned_loss=0.09082, over 2579921.06 frames. ], batch size: 77, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:49:48,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=225512.83333333334, ans=0.2 2024-06-20 14:49:51,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.95 vs. limit=12.0 2024-06-20 14:49:52,033 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.814e+02 1.897e+02 2.102e+02 2.746e+02, threshold=3.794e+02, percent-clipped=0.0 2024-06-20 14:49:55,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=225531.16666666666, ans=0.125 2024-06-20 14:50:03,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.12 vs. limit=15.0 2024-06-20 14:50:06,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=225567.83333333334, ans=0.125 2024-06-20 14:50:12,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=225586.16666666666, ans=0.1 2024-06-20 14:50:20,231 INFO [train.py:1028] (1/2) Epoch 13, batch 1650, loss[loss=0.2164, simple_loss=0.2611, pruned_loss=0.08583, over 13196.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2759, pruned_loss=0.09059, over 2576729.20 frames. 
], batch size: 95, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:50:32,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=225622.83333333334, ans=0.2 2024-06-20 14:50:34,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=12.0 2024-06-20 14:50:35,783 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.01 vs. limit=10.0 2024-06-20 14:50:45,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=225659.5, ans=0.125 2024-06-20 14:50:47,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=225659.5, ans=0.2 2024-06-20 14:50:49,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.99 vs. limit=15.0 2024-06-20 14:51:03,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.81 vs. limit=15.0 2024-06-20 14:51:04,153 INFO [train.py:1028] (1/2) Epoch 13, batch 1700, loss[loss=0.2656, simple_loss=0.3147, pruned_loss=0.1083, over 12911.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2763, pruned_loss=0.09053, over 2581461.18 frames. ], batch size: 26, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:51:14,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=225714.5, ans=0.125 2024-06-20 14:51:17,409 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.830e+02 1.938e+02 2.073e+02 2.988e+02, threshold=3.876e+02, percent-clipped=0.0 2024-06-20 14:51:17,956 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.94 vs. limit=6.0 2024-06-20 14:51:28,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=225732.83333333334, ans=0.125 2024-06-20 14:51:29,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=225751.16666666666, ans=0.125 2024-06-20 14:51:30,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=225751.16666666666, ans=0.025 2024-06-20 14:51:36,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=225751.16666666666, ans=0.025 2024-06-20 14:51:36,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=225751.16666666666, ans=0.05 2024-06-20 14:51:45,707 INFO [train.py:1028] (1/2) Epoch 13, batch 1750, loss[loss=0.2124, simple_loss=0.2671, pruned_loss=0.07884, over 12469.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2767, pruned_loss=0.09054, over 2582541.08 frames. ], batch size: 22, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:52:04,727 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.92 vs. 
limit=10.0 2024-06-20 14:52:06,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=225824.5, ans=0.125 2024-06-20 14:52:11,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.34 vs. limit=15.0 2024-06-20 14:52:25,306 INFO [train.py:1028] (1/2) Epoch 13, batch 1800, loss[loss=0.2229, simple_loss=0.2739, pruned_loss=0.08594, over 13196.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.2776, pruned_loss=0.09128, over 2582602.03 frames. ], batch size: 67, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:52:30,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=225879.5, ans=0.1 2024-06-20 14:52:31,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=15.0 2024-06-20 14:52:35,049 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.839e+02 1.985e+02 2.144e+02 2.950e+02, threshold=3.970e+02, percent-clipped=0.0 2024-06-20 14:52:35,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=225897.83333333334, ans=0.2 2024-06-20 14:52:39,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=225897.83333333334, ans=0.05 2024-06-20 14:52:49,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=225934.5, ans=0.04949747468305833 2024-06-20 14:53:00,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=225952.83333333334, ans=0.2 2024-06-20 14:53:04,269 INFO [train.py:1028] (1/2) Epoch 13, batch 1850, loss[loss=0.2244, simple_loss=0.2714, pruned_loss=0.08868, over 13247.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2781, pruned_loss=0.09133, over 2583880.33 frames. ], batch size: 83, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:53:21,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=226007.83333333334, ans=0.1 2024-06-20 14:53:24,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=226007.83333333334, ans=0.0 2024-06-20 14:53:24,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2024-06-20 14:53:33,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=226026.16666666666, ans=0.2 2024-06-20 14:53:43,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=226044.5, ans=0.1 2024-06-20 14:53:49,567 INFO [train.py:1028] (1/2) Epoch 13, batch 1900, loss[loss=0.2434, simple_loss=0.2881, pruned_loss=0.09934, over 13197.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2773, pruned_loss=0.0911, over 2586185.07 frames. ], batch size: 95, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:53:54,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.02 vs. 
limit=15.0 2024-06-20 14:53:59,773 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.879e+02 2.021e+02 2.157e+02 2.869e+02, threshold=4.043e+02, percent-clipped=0.0 2024-06-20 14:54:10,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=226099.5, ans=0.0 2024-06-20 14:54:17,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226117.83333333334, ans=0.1 2024-06-20 14:54:19,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=226117.83333333334, ans=0.0 2024-06-20 14:54:20,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=226136.16666666666, ans=0.125 2024-06-20 14:54:22,312 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.22 vs. limit=15.0 2024-06-20 14:54:28,613 INFO [train.py:1028] (1/2) Epoch 13, batch 1950, loss[loss=0.2298, simple_loss=0.2799, pruned_loss=0.08986, over 13319.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2771, pruned_loss=0.09118, over 2591807.77 frames. ], batch size: 52, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:54:30,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=226154.5, ans=0.0 2024-06-20 14:54:35,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=226154.5, ans=0.2 2024-06-20 14:54:36,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=226172.83333333334, ans=0.125 2024-06-20 14:54:46,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=226191.16666666666, ans=0.125 2024-06-20 14:54:58,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=226209.5, ans=0.0 2024-06-20 14:55:04,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.46 vs. limit=22.5 2024-06-20 14:55:04,900 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.05 vs. limit=15.0 2024-06-20 14:55:06,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=226227.83333333334, ans=0.0 2024-06-20 14:55:07,600 INFO [train.py:1028] (1/2) Epoch 13, batch 2000, loss[loss=0.225, simple_loss=0.2822, pruned_loss=0.08389, over 12458.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2767, pruned_loss=0.09105, over 2587850.90 frames. 
], batch size: 22, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:55:15,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=226264.5, ans=0.0 2024-06-20 14:55:17,570 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.838e+02 1.965e+02 2.115e+02 2.659e+02, threshold=3.930e+02, percent-clipped=0.0 2024-06-20 14:55:32,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=226301.16666666666, ans=0.125 2024-06-20 14:55:36,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=226301.16666666666, ans=0.125 2024-06-20 14:55:53,115 INFO [train.py:1028] (1/2) Epoch 13, batch 2050, loss[loss=0.2052, simple_loss=0.2594, pruned_loss=0.07543, over 12908.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2771, pruned_loss=0.09132, over 2584378.32 frames. ], batch size: 30, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:55:53,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226337.83333333334, ans=0.1 2024-06-20 14:56:04,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=226356.16666666666, ans=0.125 2024-06-20 14:56:13,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=226374.5, ans=0.0 2024-06-20 14:56:13,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=226374.5, ans=0.125 2024-06-20 14:56:14,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226374.5, ans=0.1 2024-06-20 14:56:28,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.93 vs. limit=15.0 2024-06-20 14:56:32,447 INFO [train.py:1028] (1/2) Epoch 13, batch 2100, loss[loss=0.2121, simple_loss=0.2569, pruned_loss=0.08365, over 13236.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.2775, pruned_loss=0.0914, over 2586411.22 frames. ], batch size: 59, lr: 4.56e-03, grad_scale: 256.0 2024-06-20 14:56:33,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.78 vs. limit=22.5 2024-06-20 14:56:36,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=226429.5, ans=0.07 2024-06-20 14:56:42,588 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.875e+02 2.043e+02 2.266e+02 2.728e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 14:56:49,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.65 vs. 
limit=22.5 2024-06-20 14:56:49,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=226466.16666666666, ans=0.125 2024-06-20 14:56:51,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=226466.16666666666, ans=0.0 2024-06-20 14:57:01,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=226484.5, ans=0.125 2024-06-20 14:57:11,455 INFO [train.py:1028] (1/2) Epoch 13, batch 2150, loss[loss=0.228, simple_loss=0.283, pruned_loss=0.08651, over 13236.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2767, pruned_loss=0.09041, over 2588948.95 frames. ], batch size: 52, lr: 4.56e-03, grad_scale: 256.0 2024-06-20 14:57:19,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=226539.5, ans=15.0 2024-06-20 14:57:25,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=226539.5, ans=0.125 2024-06-20 14:57:34,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=226576.16666666666, ans=0.0 2024-06-20 14:57:34,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=226576.16666666666, ans=0.125 2024-06-20 14:57:38,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=226576.16666666666, ans=0.125 2024-06-20 14:57:46,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=226594.5, ans=0.0 2024-06-20 14:57:51,369 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.64 vs. limit=22.5 2024-06-20 14:57:51,671 INFO [train.py:1028] (1/2) Epoch 13, batch 2200, loss[loss=0.2002, simple_loss=0.2455, pruned_loss=0.07742, over 13246.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.277, pruned_loss=0.0906, over 2589456.34 frames. ], batch size: 83, lr: 4.56e-03, grad_scale: 256.0 2024-06-20 14:58:02,074 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.836e+02 1.991e+02 2.160e+02 3.169e+02, threshold=3.983e+02, percent-clipped=0.0 2024-06-20 14:58:05,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=226631.16666666666, ans=0.125 2024-06-20 14:58:06,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=226631.16666666666, ans=0.0 2024-06-20 14:58:08,160 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.00 vs. limit=6.0 2024-06-20 14:58:08,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.91 vs. 
limit=15.0 2024-06-20 14:58:21,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=226667.83333333334, ans=0.0 2024-06-20 14:58:31,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=226686.16666666666, ans=0.0 2024-06-20 14:58:38,470 INFO [train.py:1028] (1/2) Epoch 13, batch 2250, loss[loss=0.2026, simple_loss=0.2545, pruned_loss=0.07532, over 13253.00 frames. ], tot_loss[loss=0.229, simple_loss=0.277, pruned_loss=0.09048, over 2587571.48 frames. ], batch size: 63, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:58:48,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=226722.83333333334, ans=0.0 2024-06-20 14:59:02,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2024-06-20 14:59:12,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=226777.83333333334, ans=0.125 2024-06-20 14:59:14,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=226777.83333333334, ans=0.2 2024-06-20 14:59:17,255 INFO [train.py:1028] (1/2) Epoch 13, batch 2300, loss[loss=0.2243, simple_loss=0.2778, pruned_loss=0.08542, over 12965.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2771, pruned_loss=0.09044, over 2581236.25 frames. ], batch size: 33, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:59:17,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=226796.16666666666, ans=0.125 2024-06-20 14:59:28,460 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.856e+02 2.002e+02 2.238e+02 2.871e+02, threshold=4.004e+02, percent-clipped=0.0 2024-06-20 14:59:34,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=226832.83333333334, ans=0.025 2024-06-20 14:59:47,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=226851.16666666666, ans=0.125 2024-06-20 14:59:49,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=226869.5, ans=0.125 2024-06-20 14:59:51,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=226869.5, ans=0.125 2024-06-20 14:59:56,524 INFO [train.py:1028] (1/2) Epoch 13, batch 2350, loss[loss=0.212, simple_loss=0.2616, pruned_loss=0.08125, over 13269.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2766, pruned_loss=0.09036, over 2584772.69 frames. 
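`grad_scale` in these summaries doubles from 64 to 128 at batch 100 and from 128 to 256 at batch 2100, exactly 2000 batches later, then falls back to 128 by batch 2250: the signature of dynamic fp16 loss scaling, which doubles the scale after a run of overflow-free steps and halves it when an overflow is detected. A generic torch.cuda.amp pattern consistent with those records (placeholder model and optimizer, not the recipe's own loop):

```python
import torch

model = torch.nn.Linear(80, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(
    init_scale=1.0, growth_factor=2.0, backoff_factor=0.5,
    growth_interval=2000, enabled=use_cuda,
)

x, y = torch.randn(4, 80), torch.randn(4, 10)
with torch.cuda.amp.autocast(enabled=use_cuda):
    loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
scaler.scale(loss).backward()   # backward on the scaled loss
scaler.step(optimizer)          # unscales grads; skips the step on inf/nan
scaler.update()                 # grows the scale after enough clean steps
print(scaler.get_scale())       # the value logged as grad_scale
```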
], batch size: 67, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 15:00:13,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=226924.5, ans=0.1 2024-06-20 15:00:15,767 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:00:28,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=226961.16666666666, ans=0.125 2024-06-20 15:00:39,797 INFO [train.py:1028] (1/2) Epoch 13, batch 2400, loss[loss=0.243, simple_loss=0.2906, pruned_loss=0.09773, over 13282.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2769, pruned_loss=0.09077, over 2587773.91 frames. ], batch size: 46, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:00:41,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.54 vs. limit=22.5 2024-06-20 15:00:42,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=226979.5, ans=0.1 2024-06-20 15:00:49,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226979.5, ans=0.1 2024-06-20 15:00:53,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=226997.83333333334, ans=0.125 2024-06-20 15:00:54,323 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.862e+02 1.969e+02 2.205e+02 2.678e+02, threshold=3.939e+02, percent-clipped=0.0 2024-06-20 15:00:57,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=226997.83333333334, ans=0.125 2024-06-20 15:00:59,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227016.16666666666, ans=0.1 2024-06-20 15:01:04,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=22.5 2024-06-20 15:01:13,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=227034.5, ans=0.1 2024-06-20 15:01:22,144 INFO [train.py:1028] (1/2) Epoch 13, batch 2450, loss[loss=0.2129, simple_loss=0.2571, pruned_loss=0.08432, over 13255.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2756, pruned_loss=0.09047, over 2583364.07 frames. ], batch size: 63, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:01:30,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=227089.5, ans=0.0 2024-06-20 15:01:54,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=227144.5, ans=0.125 2024-06-20 15:02:00,919 INFO [train.py:1028] (1/2) Epoch 13, batch 2500, loss[loss=0.2394, simple_loss=0.2799, pruned_loss=0.09942, over 13190.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2742, pruned_loss=0.08999, over 2586168.46 frames. 
], batch size: 83, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:02:12,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 1.797e+02 1.907e+02 2.135e+02 3.272e+02, threshold=3.814e+02, percent-clipped=0.0 2024-06-20 15:02:12,668 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2024-06-20 15:02:14,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=227181.16666666666, ans=0.025 2024-06-20 15:02:18,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=227199.5, ans=0.125 2024-06-20 15:02:21,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=227199.5, ans=0.05 2024-06-20 15:02:22,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=227199.5, ans=0.125 2024-06-20 15:02:24,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=15.0 2024-06-20 15:02:32,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=227236.16666666666, ans=0.125 2024-06-20 15:02:40,439 INFO [train.py:1028] (1/2) Epoch 13, batch 2550, loss[loss=0.226, simple_loss=0.2765, pruned_loss=0.08775, over 12582.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2725, pruned_loss=0.08918, over 2587015.11 frames. ], batch size: 22, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:02:42,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227254.5, ans=0.1 2024-06-20 15:02:46,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227254.5, ans=0.1 2024-06-20 15:03:07,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=227291.16666666666, ans=0.125 2024-06-20 15:03:17,521 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2024-06-20 15:03:19,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=15.0 2024-06-20 15:03:31,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=227346.16666666666, ans=0.04949747468305833 2024-06-20 15:03:31,748 INFO [train.py:1028] (1/2) Epoch 13, batch 2600, loss[loss=0.2245, simple_loss=0.2782, pruned_loss=0.08542, over 13287.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2718, pruned_loss=0.08928, over 2586526.51 frames. 
], batch size: 52, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:03:39,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=227364.5, ans=0.05 2024-06-20 15:03:42,730 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.653e+02 1.910e+02 2.041e+02 2.210e+02 2.784e+02, threshold=4.081e+02, percent-clipped=0.0 2024-06-20 15:03:55,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=227401.16666666666, ans=0.5 2024-06-20 15:03:58,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=227401.16666666666, ans=0.025 2024-06-20 15:04:07,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.93 vs. limit=15.0 2024-06-20 15:04:11,126 INFO [train.py:1028] (1/2) Epoch 13, batch 2650, loss[loss=0.2163, simple_loss=0.2527, pruned_loss=0.08993, over 13039.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2707, pruned_loss=0.08887, over 2587096.62 frames. ], batch size: 144, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:04:12,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=227437.83333333334, ans=0.125 2024-06-20 15:04:12,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=227437.83333333334, ans=0.95 2024-06-20 15:04:18,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=227456.16666666666, ans=0.125 2024-06-20 15:04:23,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227456.16666666666, ans=0.1 2024-06-20 15:04:27,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2024-06-20 15:04:27,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=227474.5, ans=0.1 2024-06-20 15:04:35,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=227492.83333333334, ans=0.015 2024-06-20 15:04:39,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.38 vs. limit=22.5 2024-06-20 15:04:42,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=227511.16666666666, ans=0.125 2024-06-20 15:04:45,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=227511.16666666666, ans=0.125 2024-06-20 15:04:46,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=227511.16666666666, ans=0.125 2024-06-20 15:04:50,532 INFO [train.py:1028] (1/2) Epoch 13, batch 2700, loss[loss=0.2369, simple_loss=0.2792, pruned_loss=0.09735, over 13273.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2688, pruned_loss=0.08822, over 2584758.23 frames. 
], batch size: 89, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:04:52,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=227529.5, ans=0.2 2024-06-20 15:04:58,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=227547.83333333334, ans=0.2 2024-06-20 15:04:59,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=227547.83333333334, ans=0.125 2024-06-20 15:05:01,433 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.895e+02 2.106e+02 2.323e+02 3.880e+02, threshold=4.212e+02, percent-clipped=0.0 2024-06-20 15:05:06,076 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=15.0 2024-06-20 15:05:19,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=227566.16666666666, ans=0.09899494936611666 2024-06-20 15:05:34,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=227602.83333333334, ans=0.0 2024-06-20 15:05:36,780 INFO [train.py:1028] (1/2) Epoch 13, batch 2750, loss[loss=0.1987, simple_loss=0.244, pruned_loss=0.07667, over 13268.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2674, pruned_loss=0.08728, over 2582395.52 frames. ], batch size: 43, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:05:37,349 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.56 vs. limit=15.0 2024-06-20 15:05:44,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.23 vs. limit=10.0 2024-06-20 15:05:53,201 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.15 vs. limit=10.0 2024-06-20 15:06:02,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2024-06-20 15:06:02,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=227676.16666666666, ans=0.2 2024-06-20 15:06:04,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=227676.16666666666, ans=0.05 2024-06-20 15:06:16,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=227712.83333333334, ans=0.125 2024-06-20 15:06:17,344 INFO [train.py:1028] (1/2) Epoch 13, batch 2800, loss[loss=0.2369, simple_loss=0.2725, pruned_loss=0.1007, over 10961.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2673, pruned_loss=0.08753, over 2580238.02 frames. 
], batch size: 304, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:06:26,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=227731.16666666666, ans=0.125 2024-06-20 15:06:28,588 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.818e+02 1.926e+02 2.112e+02 2.648e+02, threshold=3.852e+02, percent-clipped=0.0 2024-06-20 15:06:44,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=227767.83333333334, ans=0.07 2024-06-20 15:06:50,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=22.5 2024-06-20 15:06:50,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=227786.16666666666, ans=15.0 2024-06-20 15:06:54,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=227786.16666666666, ans=0.2 2024-06-20 15:06:55,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.75 vs. limit=15.0 2024-06-20 15:06:56,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=227804.5, ans=0.125 2024-06-20 15:06:56,955 INFO [train.py:1028] (1/2) Epoch 13, batch 2850, loss[loss=0.2095, simple_loss=0.259, pruned_loss=0.08004, over 13348.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2669, pruned_loss=0.08756, over 2578339.69 frames. ], batch size: 49, lr: 4.55e-03, grad_scale: 64.0 2024-06-20 15:06:58,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.69 vs. limit=6.0 2024-06-20 15:06:59,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=227804.5, ans=0.1 2024-06-20 15:07:07,861 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.41 vs. limit=15.0 2024-06-20 15:07:10,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2024-06-20 15:07:38,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227877.83333333334, ans=0.1 2024-06-20 15:07:42,228 INFO [train.py:1028] (1/2) Epoch 13, batch 2900, loss[loss=0.2055, simple_loss=0.2583, pruned_loss=0.07632, over 13156.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.265, pruned_loss=0.08677, over 2586721.53 frames. ], batch size: 55, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:07:44,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.50 vs. limit=15.0 2024-06-20 15:07:45,900 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.52 vs. 
limit=22.5 2024-06-20 15:07:47,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=227896.16666666666, ans=0.2 2024-06-20 15:07:50,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=227914.5, ans=0.125 2024-06-20 15:07:54,192 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.790e+02 1.893e+02 2.052e+02 3.270e+02, threshold=3.787e+02, percent-clipped=0.0 2024-06-20 15:07:57,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=227932.83333333334, ans=0.025 2024-06-20 15:08:03,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2024-06-20 15:08:04,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=227932.83333333334, ans=0.125 2024-06-20 15:08:15,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.72 vs. limit=15.0 2024-06-20 15:08:21,292 INFO [train.py:1028] (1/2) Epoch 13, batch 2950, loss[loss=0.1889, simple_loss=0.2465, pruned_loss=0.06567, over 13204.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2645, pruned_loss=0.08628, over 2579923.52 frames. ], batch size: 43, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:08:32,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=228006.16666666666, ans=0.015 2024-06-20 15:08:39,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=228024.5, ans=0.125 2024-06-20 15:08:46,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=228042.83333333334, ans=0.0 2024-06-20 15:08:50,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=228042.83333333334, ans=0.1 2024-06-20 15:09:02,002 INFO [train.py:1028] (1/2) Epoch 13, batch 3000, loss[loss=0.244, simple_loss=0.2878, pruned_loss=0.1001, over 13214.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2637, pruned_loss=0.08603, over 2578831.23 frames. ], batch size: 59, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:09:02,003 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 15:09:11,011 INFO [train.py:1060] (1/2) Epoch 13, validation: loss=0.192, simple_loss=0.2563, pruned_loss=0.06384, over 351949.00 frames. 2024-06-20 15:09:11,011 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB 2024-06-20 15:09:12,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.16 vs. 
limit=22.5 2024-06-20 15:09:20,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=228097.83333333334, ans=0.2 2024-06-20 15:09:22,990 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.809e+02 1.925e+02 2.100e+02 4.409e+02, threshold=3.851e+02, percent-clipped=1.0 2024-06-20 15:09:29,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=228116.16666666666, ans=0.05 2024-06-20 15:09:32,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=228116.16666666666, ans=0.125 2024-06-20 15:09:38,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=228134.5, ans=0.0 2024-06-20 15:09:47,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=228152.83333333334, ans=0.125 2024-06-20 15:09:50,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.14 vs. limit=22.5 2024-06-20 15:09:54,241 INFO [train.py:1028] (1/2) Epoch 13, batch 3050, loss[loss=0.2077, simple_loss=0.2565, pruned_loss=0.07942, over 13290.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2631, pruned_loss=0.08606, over 2577883.09 frames. ], batch size: 46, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:10:09,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.93 vs. limit=15.0 2024-06-20 15:10:17,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=228226.16666666666, ans=0.125 2024-06-20 15:10:19,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=228226.16666666666, ans=0.0 2024-06-20 15:10:24,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=228226.16666666666, ans=0.125 2024-06-20 15:10:26,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=228244.5, ans=0.0 2024-06-20 15:10:34,025 INFO [train.py:1028] (1/2) Epoch 13, batch 3100, loss[loss=0.2178, simple_loss=0.2547, pruned_loss=0.0904, over 13037.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2624, pruned_loss=0.08575, over 2579166.49 frames. 
], batch size: 144, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:10:36,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=228262.83333333334, ans=0.015 2024-06-20 15:10:38,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=228262.83333333334, ans=0.0 2024-06-20 15:10:39,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=228262.83333333334, ans=0.125 2024-06-20 15:10:40,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=228262.83333333334, ans=0.2 2024-06-20 15:10:44,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=228281.16666666666, ans=0.09899494936611666 2024-06-20 15:10:46,086 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.836e+02 1.964e+02 2.137e+02 3.172e+02, threshold=3.928e+02, percent-clipped=0.0 2024-06-20 15:10:52,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=228299.5, ans=0.125 2024-06-20 15:10:59,487 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:11:13,172 INFO [train.py:1028] (1/2) Epoch 13, batch 3150, loss[loss=0.2113, simple_loss=0.2491, pruned_loss=0.0867, over 12923.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2606, pruned_loss=0.08499, over 2580530.86 frames. ], batch size: 158, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:11:14,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=228354.5, ans=0.07 2024-06-20 15:11:15,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.69 vs. limit=15.0 2024-06-20 15:11:21,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=228372.83333333334, ans=0.0 2024-06-20 15:11:30,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=228391.16666666666, ans=0.125 2024-06-20 15:11:38,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=228409.5, ans=0.0 2024-06-20 15:11:40,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228409.5, ans=0.1 2024-06-20 15:11:41,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228409.5, ans=0.125 2024-06-20 15:11:51,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=228427.83333333334, ans=0.125 2024-06-20 15:11:53,862 INFO [train.py:1028] (1/2) Epoch 13, batch 3200, loss[loss=0.2155, simple_loss=0.2667, pruned_loss=0.08213, over 13203.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2606, pruned_loss=0.08497, over 2581385.61 frames. ], batch size: 55, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:12:02,692 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. 
limit=6.0 2024-06-20 15:12:05,394 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.803e+02 1.967e+02 2.199e+02 2.722e+02, threshold=3.934e+02, percent-clipped=0.0 2024-06-20 15:12:23,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=228501.16666666666, ans=0.07 2024-06-20 15:12:24,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=228501.16666666666, ans=0.125 2024-06-20 15:12:39,088 INFO [train.py:1028] (1/2) Epoch 13, batch 3250, loss[loss=0.2031, simple_loss=0.2521, pruned_loss=0.07706, over 13292.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2602, pruned_loss=0.08478, over 2586347.85 frames. ], batch size: 72, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:12:50,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=228556.16666666666, ans=0.125 2024-06-20 15:12:54,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228556.16666666666, ans=0.1 2024-06-20 15:12:55,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=228574.5, ans=0.125 2024-06-20 15:12:56,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=228574.5, ans=0.2 2024-06-20 15:13:19,479 INFO [train.py:1028] (1/2) Epoch 13, batch 3300, loss[loss=0.2357, simple_loss=0.2738, pruned_loss=0.09878, over 12711.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2605, pruned_loss=0.08516, over 2583657.55 frames. ], batch size: 176, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:13:22,404 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0 2024-06-20 15:13:30,974 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.829e+02 2.005e+02 2.309e+02 2.933e+02, threshold=4.010e+02, percent-clipped=0.0 2024-06-20 15:13:45,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=228684.5, ans=0.0 2024-06-20 15:13:58,085 INFO [train.py:1028] (1/2) Epoch 13, batch 3350, loss[loss=0.202, simple_loss=0.2429, pruned_loss=0.08055, over 12907.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2597, pruned_loss=0.08511, over 2579308.26 frames. ], batch size: 158, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:14:07,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228739.5, ans=0.125 2024-06-20 15:14:29,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=228776.16666666666, ans=0.125 2024-06-20 15:14:36,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=228794.5, ans=0.2 2024-06-20 15:14:39,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.93 vs. 
limit=15.0 2024-06-20 15:14:39,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=228794.5, ans=0.04949747468305833 2024-06-20 15:14:42,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228794.5, ans=0.1 2024-06-20 15:14:44,321 INFO [train.py:1028] (1/2) Epoch 13, batch 3400, loss[loss=0.2299, simple_loss=0.2747, pruned_loss=0.09257, over 12454.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2593, pruned_loss=0.08516, over 2578120.82 frames. ], batch size: 22, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:14:52,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=228831.16666666666, ans=0.2 2024-06-20 15:14:53,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.93 vs. limit=22.5 2024-06-20 15:14:55,781 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.787e+02 1.935e+02 2.073e+02 2.749e+02, threshold=3.870e+02, percent-clipped=0.0 2024-06-20 15:14:57,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=228831.16666666666, ans=0.0 2024-06-20 15:15:10,324 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:15:11,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=228867.83333333334, ans=0.09899494936611666 2024-06-20 15:15:23,678 INFO [train.py:1028] (1/2) Epoch 13, batch 3450, loss[loss=0.2267, simple_loss=0.2691, pruned_loss=0.09217, over 12742.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2587, pruned_loss=0.08494, over 2577685.34 frames. ], batch size: 176, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:15:31,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=228922.83333333334, ans=0.125 2024-06-20 15:15:34,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=228922.83333333334, ans=0.125 2024-06-20 15:15:36,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=228922.83333333334, ans=0.2 2024-06-20 15:15:47,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=228959.5, ans=0.0 2024-06-20 15:15:54,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.97 vs. 
limit=15.0 2024-06-20 15:15:54,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=228977.83333333334, ans=0.035 2024-06-20 15:15:55,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=228977.83333333334, ans=0.0 2024-06-20 15:15:58,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=228977.83333333334, ans=0.0 2024-06-20 15:15:58,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=228977.83333333334, ans=0.0 2024-06-20 15:15:59,492 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.003e+00 2024-06-20 15:16:01,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.55 vs. limit=15.0 2024-06-20 15:16:01,699 INFO [train.py:1028] (1/2) Epoch 13, batch 3500, loss[loss=0.2112, simple_loss=0.2522, pruned_loss=0.08513, over 12934.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2585, pruned_loss=0.0846, over 2576579.36 frames. ], batch size: 33, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:16:13,394 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.881e+02 2.115e+02 2.331e+02 4.090e+02, threshold=4.229e+02, percent-clipped=1.0 2024-06-20 15:16:16,384 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.18 vs. limit=15.0 2024-06-20 15:16:18,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.36 vs. limit=10.0 2024-06-20 15:16:19,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.69 vs. limit=22.5 2024-06-20 15:16:24,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=229051.16666666666, ans=0.125 2024-06-20 15:16:36,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=229069.5, ans=0.2 2024-06-20 15:16:38,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=12.0 2024-06-20 15:16:44,244 INFO [train.py:1028] (1/2) Epoch 13, batch 3550, loss[loss=0.213, simple_loss=0.2509, pruned_loss=0.08752, over 13158.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2573, pruned_loss=0.08394, over 2576805.36 frames. ], batch size: 95, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:16:53,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=229087.83333333334, ans=0.2 2024-06-20 15:17:05,406 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.45 vs. limit=10.0 2024-06-20 15:17:07,505 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. 
limit=12.0 2024-06-20 15:17:11,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=229142.83333333334, ans=0.0 2024-06-20 15:17:26,559 INFO [train.py:1028] (1/2) Epoch 13, batch 3600, loss[loss=0.2045, simple_loss=0.2516, pruned_loss=0.07872, over 13272.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2566, pruned_loss=0.08373, over 2580205.48 frames. ], batch size: 49, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:17:28,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=229179.5, ans=0.0 2024-06-20 15:17:38,327 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.744e+02 1.894e+02 2.214e+02 2.668e+02, threshold=3.788e+02, percent-clipped=0.0 2024-06-20 15:17:43,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.83 vs. limit=15.0 2024-06-20 15:17:54,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=229234.5, ans=0.125 2024-06-20 15:17:57,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=229234.5, ans=0.0 2024-06-20 15:18:01,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=229252.83333333334, ans=0.125 2024-06-20 15:18:04,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=229252.83333333334, ans=0.0 2024-06-20 15:18:06,369 INFO [train.py:1028] (1/2) Epoch 13, batch 3650, loss[loss=0.2141, simple_loss=0.2576, pruned_loss=0.08534, over 13066.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2564, pruned_loss=0.08307, over 2578023.09 frames. ], batch size: 102, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:18:07,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2024-06-20 15:18:10,616 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.81 vs. limit=6.0 2024-06-20 15:18:20,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=229289.5, ans=0.125 2024-06-20 15:18:25,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=229307.83333333334, ans=0.125 2024-06-20 15:18:45,540 INFO [train.py:1028] (1/2) Epoch 13, batch 3700, loss[loss=0.2322, simple_loss=0.2807, pruned_loss=0.09181, over 13228.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2563, pruned_loss=0.08317, over 2582726.90 frames. ], batch size: 72, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:18:52,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=229362.83333333334, ans=0.2 2024-06-20 15:18:53,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.16 vs. 
limit=15.0 2024-06-20 15:18:55,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=229381.16666666666, ans=0.0 2024-06-20 15:18:56,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=229381.16666666666, ans=0.125 2024-06-20 15:18:56,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=229381.16666666666, ans=0.2 2024-06-20 15:18:57,332 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.760e+02 1.879e+02 2.033e+02 2.967e+02, threshold=3.757e+02, percent-clipped=0.0 2024-06-20 15:19:14,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=229417.83333333334, ans=0.2 2024-06-20 15:19:20,348 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.31 vs. limit=15.0 2024-06-20 15:19:31,745 INFO [train.py:1028] (1/2) Epoch 13, batch 3750, loss[loss=0.2192, simple_loss=0.2752, pruned_loss=0.08162, over 12864.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2557, pruned_loss=0.08265, over 2585570.01 frames. ], batch size: 22, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:19:45,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=229472.83333333334, ans=0.125 2024-06-20 15:19:46,231 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.87 vs. limit=15.0 2024-06-20 15:19:50,645 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:19:54,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=229509.5, ans=0.05 2024-06-20 15:19:55,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=229509.5, ans=0.125 2024-06-20 15:20:09,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.29 vs. limit=15.0 2024-06-20 15:20:10,839 INFO [train.py:1028] (1/2) Epoch 13, batch 3800, loss[loss=0.2084, simple_loss=0.2534, pruned_loss=0.08175, over 13216.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2554, pruned_loss=0.08267, over 2584582.84 frames. 
], batch size: 83, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:20:11,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=229546.16666666666, ans=0.0 2024-06-20 15:20:18,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229564.5, ans=0.1 2024-06-20 15:20:21,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=229564.5, ans=0.125 2024-06-20 15:20:22,486 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.766e+02 1.889e+02 2.083e+02 2.735e+02, threshold=3.778e+02, percent-clipped=0.0 2024-06-20 15:20:22,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=229564.5, ans=0.125 2024-06-20 15:20:23,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=229564.5, ans=0.0 2024-06-20 15:20:29,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229582.83333333334, ans=0.1 2024-06-20 15:20:31,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=229582.83333333334, ans=0.025 2024-06-20 15:20:33,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=229601.16666666666, ans=0.125 2024-06-20 15:20:38,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=229601.16666666666, ans=0.07 2024-06-20 15:20:45,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=229619.5, ans=0.1 2024-06-20 15:20:50,348 INFO [train.py:1028] (1/2) Epoch 13, batch 3850, loss[loss=0.2126, simple_loss=0.2499, pruned_loss=0.08766, over 12990.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2545, pruned_loss=0.08207, over 2585185.21 frames. ], batch size: 144, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:21:08,257 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.632e+00 2024-06-20 15:21:08,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=229674.5, ans=0.025 2024-06-20 15:21:19,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=229692.83333333334, ans=0.0 2024-06-20 15:21:24,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=229711.16666666666, ans=0.125 2024-06-20 15:21:27,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=229729.5, ans=0.125 2024-06-20 15:21:28,030 INFO [train.py:1028] (1/2) Epoch 13, batch 3900, loss[loss=0.1981, simple_loss=0.2403, pruned_loss=0.07802, over 13220.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2548, pruned_loss=0.08246, over 2588032.39 frames. 
], batch size: 83, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:21:38,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=229747.83333333334, ans=0.1 2024-06-20 15:21:43,211 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.782e+02 1.884e+02 1.984e+02 2.506e+02, threshold=3.768e+02, percent-clipped=0.0 2024-06-20 15:21:45,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=229747.83333333334, ans=0.015 2024-06-20 15:21:50,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=229766.16666666666, ans=0.2 2024-06-20 15:22:00,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=229784.5, ans=0.125 2024-06-20 15:22:00,792 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:22:01,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=229784.5, ans=0.0 2024-06-20 15:22:14,812 INFO [train.py:1028] (1/2) Epoch 13, batch 3950, loss[loss=0.2041, simple_loss=0.2445, pruned_loss=0.08186, over 13101.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2543, pruned_loss=0.08204, over 2588410.59 frames. ], batch size: 132, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:22:16,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=229821.16666666666, ans=0.0 2024-06-20 15:22:32,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=229857.83333333334, ans=0.0 2024-06-20 15:22:35,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.35 vs. limit=15.0 2024-06-20 15:22:36,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=229857.83333333334, ans=0.125 2024-06-20 15:22:50,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=229894.5, ans=0.125 2024-06-20 15:22:53,051 INFO [train.py:1028] (1/2) Epoch 13, batch 4000, loss[loss=0.222, simple_loss=0.2681, pruned_loss=0.08797, over 12858.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2541, pruned_loss=0.08211, over 2582988.97 frames. ], batch size: 39, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:22:55,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. 
limit=15.0 2024-06-20 15:23:04,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=229931.16666666666, ans=0.125 2024-06-20 15:23:04,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=229931.16666666666, ans=0.125 2024-06-20 15:23:05,000 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.798e+02 1.982e+02 2.122e+02 2.925e+02, threshold=3.963e+02, percent-clipped=0.0 2024-06-20 15:23:08,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=229949.5, ans=0.0 2024-06-20 15:23:28,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=15.0 2024-06-20 15:23:29,381 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.032e+01 2024-06-20 15:23:33,022 INFO [train.py:1028] (1/2) Epoch 13, batch 4050, loss[loss=0.2384, simple_loss=0.27, pruned_loss=0.1034, over 11145.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2539, pruned_loss=0.08218, over 2580542.27 frames. ], batch size: 303, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:23:33,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=230004.5, ans=0.025 2024-06-20 15:23:46,874 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:24:18,919 INFO [train.py:1028] (1/2) Epoch 13, batch 4100, loss[loss=0.2247, simple_loss=0.2634, pruned_loss=0.09305, over 13055.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2538, pruned_loss=0.08249, over 2577303.69 frames. ], batch size: 102, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:24:19,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=230096.16666666666, ans=0.2 2024-06-20 15:24:30,316 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.793e+02 1.970e+02 2.196e+02 2.662e+02, threshold=3.941e+02, percent-clipped=0.0 2024-06-20 15:24:47,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=230151.16666666666, ans=0.125 2024-06-20 15:24:58,300 INFO [train.py:1028] (1/2) Epoch 13, batch 4150, loss[loss=0.21, simple_loss=0.265, pruned_loss=0.07747, over 13178.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.253, pruned_loss=0.08209, over 2574387.87 frames. ], batch size: 55, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:25:14,898 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:25:17,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=230224.5, ans=0.2 2024-06-20 15:25:22,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=230242.83333333334, ans=0.0 2024-06-20 15:25:25,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.42 vs. 
limit=22.5
2024-06-20 15:25:28,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=230242.83333333334, ans=0.125
2024-06-20 15:25:38,308 INFO [train.py:1028] (1/2) Epoch 13, batch 4200, loss[loss=0.1964, simple_loss=0.2412, pruned_loss=0.07584, over 13004.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2526, pruned_loss=0.08186, over 2577246.38 frames. ], batch size: 102, lr: 4.52e-03, grad_scale: 64.0
2024-06-20 15:25:48,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=230297.83333333334, ans=0.125
2024-06-20 15:25:49,957 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.769e+02 1.944e+02 2.138e+02 2.693e+02, threshold=3.888e+02, percent-clipped=0.0
2024-06-20 15:26:00,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=230334.5, ans=0.125
2024-06-20 15:26:07,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=230334.5, ans=0.125
2024-06-20 15:26:16,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=230352.83333333334, ans=0.125
2024-06-20 15:26:17,565 INFO [train.py:1028] (1/2) Epoch 13, batch 4250, loss[loss=0.2016, simple_loss=0.2464, pruned_loss=0.07839, over 13265.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2527, pruned_loss=0.08189, over 2580073.03 frames. ], batch size: 46, lr: 4.52e-03, grad_scale: 64.0
2024-06-20 15:26:18,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.38 vs. limit=22.5
2024-06-20 15:26:21,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=230371.16666666666, ans=0.025
2024-06-20 15:26:23,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=230371.16666666666, ans=0.2
2024-06-20 15:26:28,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=230389.5, ans=10.0
2024-06-20 15:26:42,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=230407.83333333334, ans=0.1
2024-06-20 15:26:52,978 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.00 vs. limit=22.5
2024-06-20 15:26:55,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=230444.5, ans=0.0
2024-06-20 15:26:57,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=230444.5, ans=0.125
2024-06-20 15:27:00,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=230444.5, ans=0.025
2024-06-20 15:27:04,027 INFO [train.py:1028] (1/2) Epoch 13, batch 4300, loss[loss=0.2026, simple_loss=0.252, pruned_loss=0.07661, over 13181.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2519, pruned_loss=0.08157, over 2580085.55 frames. ], batch size: 59, lr: 4.52e-03, grad_scale: 64.0
2024-06-20 15:27:10,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=230462.83333333334, ans=0.025
2024-06-20 15:27:15,782 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.746e+02 1.862e+02 2.031e+02 2.664e+02, threshold=3.723e+02, percent-clipped=0.0
2024-06-20 15:27:22,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230499.5, ans=0.1
2024-06-20 15:27:22,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.11 vs. limit=15.0
2024-06-20 15:27:43,379 INFO [train.py:1028] (1/2) Epoch 13, batch 4350, loss[loss=0.2192, simple_loss=0.2667, pruned_loss=0.08583, over 13163.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2518, pruned_loss=0.08146, over 2585139.19 frames. ], batch size: 59, lr: 4.52e-03, grad_scale: 64.0
2024-06-20 15:27:50,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=230554.5, ans=0.0
2024-06-20 15:27:51,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=230572.83333333334, ans=0.0
2024-06-20 15:27:53,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=230572.83333333334, ans=0.05
2024-06-20 15:27:58,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=230591.16666666666, ans=0.1
2024-06-20 15:28:22,928 INFO [train.py:1028] (1/2) Epoch 13, batch 4400, loss[loss=0.2139, simple_loss=0.2517, pruned_loss=0.08805, over 13230.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2512, pruned_loss=0.08143, over 2585155.58 frames. ], batch size: 83, lr: 4.52e-03, grad_scale: 64.0
2024-06-20 15:28:24,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=230646.16666666666, ans=0.125
2024-06-20 15:28:29,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=230664.5, ans=0.125
2024-06-20 15:28:34,536 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.744e+02 1.908e+02 2.040e+02 2.905e+02, threshold=3.816e+02, percent-clipped=0.0
2024-06-20 15:28:56,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=230719.5, ans=0.125
2024-06-20 15:29:08,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=230719.5, ans=0.125
2024-06-20 15:29:09,684 INFO [train.py:1028] (1/2) Epoch 13, batch 4450, loss[loss=0.2077, simple_loss=0.2508, pruned_loss=0.08231, over 12984.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2512, pruned_loss=0.08146, over 2580889.64 frames. ], batch size: 33, lr: 4.52e-03, grad_scale: 64.0
2024-06-20 15:29:10,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=230737.83333333334, ans=0.1
2024-06-20 15:29:16,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.07 vs. limit=10.0
2024-06-20 15:29:17,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=230756.16666666666, ans=0.0
2024-06-20 15:29:22,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=230756.16666666666, ans=0.125
2024-06-20 15:29:22,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=230756.16666666666, ans=0.125
2024-06-20 15:29:23,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=230756.16666666666, ans=0.125
2024-06-20 15:29:26,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=230774.5, ans=0.0
2024-06-20 15:29:32,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.02 vs. limit=15.0
2024-06-20 15:29:34,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=15.0
2024-06-20 15:29:45,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=230811.16666666666, ans=0.125
2024-06-20 15:29:47,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=230829.5, ans=0.1
2024-06-20 15:29:47,968 INFO [train.py:1028] (1/2) Epoch 13, batch 4500, loss[loss=0.1834, simple_loss=0.2266, pruned_loss=0.07015, over 13241.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2505, pruned_loss=0.08109, over 2585290.92 frames. ], batch size: 89, lr: 4.52e-03, grad_scale: 64.0
2024-06-20 15:29:56,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=230847.83333333334, ans=0.125
2024-06-20 15:29:59,716 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.783e+02 1.901e+02 2.120e+02 2.684e+02, threshold=3.802e+02, percent-clipped=0.0
2024-06-20 15:30:02,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=230866.16666666666, ans=0.2
2024-06-20 15:30:11,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=230884.5, ans=0.2
2024-06-20 15:30:12,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=230884.5, ans=0.0
2024-06-20 15:30:21,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.64 vs. limit=22.5
2024-06-20 15:30:27,831 INFO [train.py:1028] (1/2) Epoch 13, batch 4550, loss[loss=0.2028, simple_loss=0.2502, pruned_loss=0.07776, over 13274.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2512, pruned_loss=0.08144, over 2588853.05 frames. ], batch size: 52, lr: 4.52e-03, grad_scale: 64.0
2024-06-20 15:30:30,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=230921.16666666666, ans=0.125
2024-06-20 15:31:12,596 INFO [train.py:1028] (1/2) Epoch 13, batch 4600, loss[loss=0.2332, simple_loss=0.2721, pruned_loss=0.09721, over 12620.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2512, pruned_loss=0.08099, over 2585446.91 frames. ], batch size: 202, lr: 4.51e-03, grad_scale: 64.0
2024-06-20 15:31:12,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=231012.83333333334, ans=0.125
2024-06-20 15:31:19,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=231012.83333333334, ans=0.125
2024-06-20 15:31:24,999 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.747e+02 1.893e+02 2.118e+02 3.705e+02, threshold=3.786e+02, percent-clipped=0.0
2024-06-20 15:31:33,874 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:31:38,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=231049.5, ans=0.125
2024-06-20 15:31:46,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=231067.83333333334, ans=0.125
2024-06-20 15:31:50,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=231086.16666666666, ans=0.0
2024-06-20 15:31:54,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=231086.16666666666, ans=0.125
2024-06-20 15:31:57,064 INFO [train.py:1028] (1/2) Epoch 13, batch 4650, loss[loss=0.2067, simple_loss=0.2478, pruned_loss=0.08281, over 13105.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2506, pruned_loss=0.0807, over 2588166.91 frames. ], batch size: 132, lr: 4.51e-03, grad_scale: 64.0
2024-06-20 15:32:01,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231104.5, ans=0.1
2024-06-20 15:32:06,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=231122.83333333334, ans=0.0
2024-06-20 15:32:10,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=231122.83333333334, ans=0.1
2024-06-20 15:32:13,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=231141.16666666666, ans=0.0
2024-06-20 15:32:19,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231141.16666666666, ans=0.1
2024-06-20 15:32:26,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=231159.5, ans=0.125
2024-06-20 15:32:29,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=231177.83333333334, ans=0.0
2024-06-20 15:32:37,449 INFO [train.py:1028] (1/2) Epoch 13, batch 4700, loss[loss=0.2008, simple_loss=0.2452, pruned_loss=0.07814, over 12765.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2505, pruned_loss=0.08077, over 2583452.11 frames. ], batch size: 26, lr: 4.51e-03, grad_scale: 64.0
2024-06-20 15:32:49,574 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.826e+02 1.960e+02 2.292e+02 3.430e+02, threshold=3.921e+02, percent-clipped=0.0
2024-06-20 15:33:04,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=231251.16666666666, ans=0.0
2024-06-20 15:33:15,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=231269.5, ans=0.2
2024-06-20 15:33:17,366 INFO [train.py:1028] (1/2) Epoch 13, batch 4750, loss[loss=0.219, simple_loss=0.2688, pruned_loss=0.0846, over 12534.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2507, pruned_loss=0.08123, over 2580741.31 frames. ], batch size: 202, lr: 4.51e-03, grad_scale: 64.0
2024-06-20 15:33:22,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=231287.83333333334, ans=0.0
2024-06-20 15:33:27,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0
2024-06-20 15:33:39,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0
2024-06-20 15:33:48,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=231342.83333333334, ans=0.1
2024-06-20 15:33:55,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=231342.83333333334, ans=0.125
2024-06-20 15:33:57,575 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:34:04,784 INFO [train.py:1028] (1/2) Epoch 13, batch 4800, loss[loss=0.2049, simple_loss=0.2503, pruned_loss=0.07976, over 13259.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2501, pruned_loss=0.08075, over 2577328.99 frames. ], batch size: 63, lr: 4.51e-03, grad_scale: 64.0
2024-06-20 15:34:15,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=231397.83333333334, ans=0.125
2024-06-20 15:34:17,278 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.776e+02 1.902e+02 2.059e+02 3.063e+02, threshold=3.804e+02, percent-clipped=0.0
2024-06-20 15:34:17,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=231397.83333333334, ans=0.0
2024-06-20 15:34:30,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=231434.5, ans=0.125
2024-06-20 15:34:36,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=231452.83333333334, ans=0.0
2024-06-20 15:34:40,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=231452.83333333334, ans=0.125
2024-06-20 15:34:44,927 INFO [train.py:1028] (1/2) Epoch 13, batch 4850, loss[loss=0.2, simple_loss=0.2397, pruned_loss=0.08017, over 13275.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2497, pruned_loss=0.08081, over 2575831.31 frames. ], batch size: 89, lr: 4.51e-03, grad_scale: 128.0
2024-06-20 15:34:45,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=231471.16666666666, ans=0.125
2024-06-20 15:35:00,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=231507.83333333334, ans=0.2
2024-06-20 15:35:01,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=231507.83333333334, ans=0.95
2024-06-20 15:35:08,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=231526.16666666666, ans=0.125
2024-06-20 15:35:16,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=231526.16666666666, ans=0.125
2024-06-20 15:35:23,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231544.5, ans=0.1
2024-06-20 15:35:26,236 INFO [train.py:1028] (1/2) Epoch 13, batch 4900, loss[loss=0.2002, simple_loss=0.2453, pruned_loss=0.0776, over 13253.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2497, pruned_loss=0.08085, over 2575837.76 frames. ], batch size: 59, lr: 4.51e-03, grad_scale: 128.0
2024-06-20 15:35:30,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0
2024-06-20 15:35:32,672 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.25 vs. limit=15.0
2024-06-20 15:35:38,726 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.770e+02 1.923e+02 2.109e+02 3.113e+02, threshold=3.845e+02, percent-clipped=0.0
2024-06-20 15:35:42,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=231599.5, ans=0.125
2024-06-20 15:35:45,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=231599.5, ans=0.0
2024-06-20 15:35:53,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=231599.5, ans=0.2
2024-06-20 15:35:53,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=231599.5, ans=0.1
2024-06-20 15:35:53,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=231599.5, ans=0.125
2024-06-20 15:36:10,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=231636.16666666666, ans=0.0
2024-06-20 15:36:13,782 INFO [train.py:1028] (1/2) Epoch 13, batch 4950, loss[loss=0.2263, simple_loss=0.255, pruned_loss=0.09874, over 11079.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2506, pruned_loss=0.08178, over 2569385.17 frames. ], batch size: 303, lr: 4.51e-03, grad_scale: 128.0
2024-06-20 15:36:14,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=231654.5, ans=0.125
2024-06-20 15:36:15,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=231654.5, ans=0.125
2024-06-20 15:36:19,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=231654.5, ans=0.0
2024-06-20 15:36:27,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=231672.83333333334, ans=0.07
2024-06-20 15:36:31,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.65 vs. limit=15.0
2024-06-20 15:37:00,313 INFO [train.py:1028] (1/2) Epoch 13, batch 5000, loss[loss=0.2014, simple_loss=0.2461, pruned_loss=0.07836, over 13143.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2511, pruned_loss=0.082, over 2572483.89 frames. ], batch size: 95, lr: 4.51e-03, grad_scale: 128.0
2024-06-20 15:37:10,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=231764.5, ans=0.125
2024-06-20 15:37:15,213 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.807e+02 1.898e+02 2.007e+02 2.497e+02, threshold=3.795e+02, percent-clipped=0.0
2024-06-20 15:37:49,572 INFO [train.py:1028] (1/2) Epoch 13, batch 5050, loss[loss=0.2075, simple_loss=0.2603, pruned_loss=0.07732, over 12955.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2511, pruned_loss=0.08168, over 2569922.61 frames. ], batch size: 36, lr: 4.51e-03, grad_scale: 128.0
2024-06-20 15:37:51,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=231837.83333333334, ans=0.0
2024-06-20 15:38:04,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=12.0
2024-06-20 15:38:07,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=231874.5, ans=0.125
2024-06-20 15:38:09,976 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.73 vs. limit=15.0
2024-06-20 15:38:19,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=231892.83333333334, ans=0.0
2024-06-20 15:38:25,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231911.16666666666, ans=0.1
2024-06-20 15:38:38,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231911.16666666666, ans=0.1
2024-06-20 15:38:42,164 INFO [train.py:1028] (1/2) Epoch 13, batch 5100, loss[loss=0.2007, simple_loss=0.2507, pruned_loss=0.07535, over 12948.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2508, pruned_loss=0.08165, over 2567793.32 frames. ], batch size: 39, lr: 4.51e-03, grad_scale: 128.0
2024-06-20 15:38:45,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=231929.5, ans=0.2
2024-06-20 15:38:55,512 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.65 vs. limit=15.0
2024-06-20 15:38:56,763 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.840e+02 2.082e+02 2.334e+02 3.611e+02, threshold=4.164e+02, percent-clipped=0.0
2024-06-20 15:39:11,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=231966.16666666666, ans=0.125
2024-06-20 15:39:22,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.87 vs. limit=15.0
2024-06-20 15:39:22,819 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.88 vs. limit=10.0
2024-06-20 15:39:35,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=232021.16666666666, ans=0.125
2024-06-20 15:39:36,109 INFO [train.py:1028] (1/2) Epoch 13, batch 5150, loss[loss=0.1898, simple_loss=0.2306, pruned_loss=0.07448, over 13101.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2502, pruned_loss=0.08164, over 2570190.35 frames. ], batch size: 132, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:39:54,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=232057.83333333334, ans=0.125
2024-06-20 15:39:58,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=232057.83333333334, ans=0.0
2024-06-20 15:40:04,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=232076.16666666666, ans=0.125
2024-06-20 15:40:05,575 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=12.0
2024-06-20 15:40:07,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=232076.16666666666, ans=0.0
2024-06-20 15:40:19,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=232094.5, ans=0.025
2024-06-20 15:40:24,679 INFO [train.py:1028] (1/2) Epoch 13, batch 5200, loss[loss=0.2036, simple_loss=0.2422, pruned_loss=0.0825, over 13174.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2501, pruned_loss=0.08138, over 2573197.94 frames. ], batch size: 95, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:40:29,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=232112.83333333334, ans=0.2
2024-06-20 15:40:33,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=232131.16666666666, ans=0.125
2024-06-20 15:40:39,107 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.789e+02 1.902e+02 2.145e+02 2.916e+02, threshold=3.804e+02, percent-clipped=0.0
2024-06-20 15:40:44,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.12 vs. limit=22.5
2024-06-20 15:40:50,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=232149.5, ans=0.0
2024-06-20 15:41:02,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=232167.83333333334, ans=0.0
2024-06-20 15:41:03,589 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5
2024-06-20 15:41:11,908 INFO [train.py:1028] (1/2) Epoch 13, batch 5250, loss[loss=0.2082, simple_loss=0.2466, pruned_loss=0.08487, over 13274.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2504, pruned_loss=0.08175, over 2568556.91 frames. ], batch size: 52, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:41:22,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.58 vs. limit=22.5
2024-06-20 15:41:34,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.80 vs. limit=15.0
2024-06-20 15:41:36,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=232241.16666666666, ans=0.125
2024-06-20 15:41:37,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=232241.16666666666, ans=0.125
2024-06-20 15:41:38,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=232241.16666666666, ans=0.0
2024-06-20 15:41:51,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=232277.83333333334, ans=0.125
2024-06-20 15:42:07,582 INFO [train.py:1028] (1/2) Epoch 13, batch 5300, loss[loss=0.1944, simple_loss=0.2306, pruned_loss=0.07908, over 13071.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2503, pruned_loss=0.08146, over 2565618.38 frames. ], batch size: 144, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:42:14,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=232296.16666666666, ans=0.125
2024-06-20 15:42:20,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=232314.5, ans=0.125
2024-06-20 15:42:22,566 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.758e+02 1.862e+02 2.002e+02 2.730e+02, threshold=3.725e+02, percent-clipped=0.0
2024-06-20 15:42:26,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=232332.83333333334, ans=0.125
2024-06-20 15:42:48,987 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0
2024-06-20 15:42:57,711 INFO [train.py:1028] (1/2) Epoch 13, batch 5350, loss[loss=0.2143, simple_loss=0.2706, pruned_loss=0.07903, over 11538.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2496, pruned_loss=0.08109, over 2572745.66 frames. ], batch size: 17, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:42:58,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=15.0
2024-06-20 15:43:00,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=232387.83333333334, ans=0.125
2024-06-20 15:43:03,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=232387.83333333334, ans=0.125
2024-06-20 15:43:08,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=232406.16666666666, ans=0.125
2024-06-20 15:43:10,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232406.16666666666, ans=0.1
2024-06-20 15:43:13,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=232406.16666666666, ans=0.125
2024-06-20 15:43:15,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=232406.16666666666, ans=0.0
2024-06-20 15:43:15,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232406.16666666666, ans=0.1
2024-06-20 15:43:19,050 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=12.0
2024-06-20 15:43:39,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=232461.16666666666, ans=0.125
2024-06-20 15:43:46,945 INFO [train.py:1028] (1/2) Epoch 13, batch 5400, loss[loss=0.2468, simple_loss=0.2722, pruned_loss=0.1107, over 12301.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2499, pruned_loss=0.0814, over 2566081.81 frames. ], batch size: 241, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:43:47,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=232479.5, ans=0.0
2024-06-20 15:43:49,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232479.5, ans=0.1
2024-06-20 15:43:49,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=232479.5, ans=0.125
2024-06-20 15:43:58,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232497.83333333334, ans=0.1
2024-06-20 15:44:01,825 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.835e+02 1.939e+02 2.149e+02 2.788e+02, threshold=3.879e+02, percent-clipped=0.0
2024-06-20 15:44:06,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232516.16666666666, ans=0.1
2024-06-20 15:44:08,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=232516.16666666666, ans=0.035
2024-06-20 15:44:09,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=232516.16666666666, ans=0.0
2024-06-20 15:44:25,006 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.77 vs. limit=6.0
2024-06-20 15:44:26,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.34 vs. limit=15.0
2024-06-20 15:44:26,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=232534.5, ans=0.0
2024-06-20 15:44:32,117 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:44:41,256 INFO [train.py:1028] (1/2) Epoch 13, batch 5450, loss[loss=0.1968, simple_loss=0.2364, pruned_loss=0.07855, over 12758.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2504, pruned_loss=0.08142, over 2569036.00 frames. ], batch size: 26, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:45:09,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0
2024-06-20 15:45:11,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=232607.83333333334, ans=0.2
2024-06-20 15:45:14,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=232607.83333333334, ans=0.125
2024-06-20 15:45:27,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=232644.5, ans=0.0
2024-06-20 15:45:32,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=232644.5, ans=0.125
2024-06-20 15:45:37,731 INFO [train.py:1028] (1/2) Epoch 13, batch 5500, loss[loss=0.2283, simple_loss=0.266, pruned_loss=0.0953, over 12278.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2502, pruned_loss=0.08135, over 2562010.94 frames. ], batch size: 241, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:45:49,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=232681.16666666666, ans=0.125
2024-06-20 15:45:52,342 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.797e+02 1.911e+02 2.166e+02 3.072e+02, threshold=3.821e+02, percent-clipped=0.0
2024-06-20 15:45:54,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=232681.16666666666, ans=0.1
2024-06-20 15:46:03,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=232699.5, ans=0.125
2024-06-20 15:46:04,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=232699.5, ans=0.125
2024-06-20 15:46:07,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=232717.83333333334, ans=0.125
2024-06-20 15:46:09,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232717.83333333334, ans=0.1
2024-06-20 15:46:23,906 INFO [train.py:1028] (1/2) Epoch 13, batch 5550, loss[loss=0.1899, simple_loss=0.2373, pruned_loss=0.07127, over 13218.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2491, pruned_loss=0.08051, over 2565013.22 frames. ], batch size: 43, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:46:25,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=232754.5, ans=0.0
2024-06-20 15:46:35,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=15.0
2024-06-20 15:47:20,392 INFO [train.py:1028] (1/2) Epoch 13, batch 5600, loss[loss=0.2211, simple_loss=0.2649, pruned_loss=0.08866, over 13186.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2486, pruned_loss=0.08042, over 2568771.24 frames. ], batch size: 89, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:47:36,402 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.745e+02 1.871e+02 2.102e+02 2.698e+02, threshold=3.741e+02, percent-clipped=0.0
2024-06-20 15:47:37,986 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.48 vs. limit=15.0
2024-06-20 15:47:38,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=232864.5, ans=0.95
2024-06-20 15:48:17,814 INFO [train.py:1028] (1/2) Epoch 13, batch 5650, loss[loss=0.2172, simple_loss=0.2554, pruned_loss=0.08948, over 12542.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2479, pruned_loss=0.07983, over 2575004.39 frames. ], batch size: 202, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:48:24,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=232937.83333333334, ans=0.125
2024-06-20 15:48:25,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=232937.83333333334, ans=0.2
2024-06-20 15:48:26,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=232956.16666666666, ans=0.0
2024-06-20 15:48:35,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=15.0
2024-06-20 15:48:38,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232974.5, ans=0.1
2024-06-20 15:48:50,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232992.83333333334, ans=0.1
2024-06-20 15:48:52,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=232992.83333333334, ans=0.2
2024-06-20 15:49:01,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=233011.16666666666, ans=0.125
2024-06-20 15:49:06,730 INFO [train.py:1028] (1/2) Epoch 13, batch 5700, loss[loss=0.1974, simple_loss=0.242, pruned_loss=0.07647, over 13278.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2479, pruned_loss=0.08002, over 2579974.52 frames. ], batch size: 63, lr: 4.49e-03, grad_scale: 128.0
2024-06-20 15:49:21,298 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.744e+02 1.871e+02 1.989e+02 2.447e+02, threshold=3.741e+02, percent-clipped=0.0
2024-06-20 15:49:48,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=233102.83333333334, ans=0.0
2024-06-20 15:49:51,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=233102.83333333334, ans=0.0
2024-06-20 15:49:54,282 INFO [train.py:1028] (1/2) Epoch 13, batch 5750, loss[loss=0.2192, simple_loss=0.2587, pruned_loss=0.08985, over 12739.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2489, pruned_loss=0.08028, over 2579796.17 frames. ], batch size: 176, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:49:55,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0
2024-06-20 15:50:00,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=233121.16666666666, ans=0.0
2024-06-20 15:50:24,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=233157.83333333334, ans=0.09899494936611666
2024-06-20 15:50:46,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=233194.5, ans=0.07
2024-06-20 15:50:46,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=233194.5, ans=0.04949747468305833
2024-06-20 15:50:49,455 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.03 vs. limit=22.5
2024-06-20 15:50:51,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=233194.5, ans=0.5
2024-06-20 15:50:51,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=233194.5, ans=0.0
2024-06-20 15:50:52,845 INFO [train.py:1028] (1/2) Epoch 13, batch 5800, loss[loss=0.2111, simple_loss=0.253, pruned_loss=0.08463, over 12758.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2502, pruned_loss=0.08123, over 2579666.40 frames. ], batch size: 176, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:50:59,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=233231.16666666666, ans=0.04949747468305833
2024-06-20 15:51:05,785 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.887e+02 2.078e+02 2.312e+02 3.160e+02, threshold=4.157e+02, percent-clipped=0.0
2024-06-20 15:51:11,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=233249.5, ans=0.125
2024-06-20 15:51:21,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.42 vs. limit=10.0
2024-06-20 15:51:24,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233267.83333333334, ans=0.1
2024-06-20 15:51:27,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=233286.16666666666, ans=0.1
2024-06-20 15:51:29,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=233286.16666666666, ans=0.0
2024-06-20 15:51:30,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.55 vs. limit=15.0
2024-06-20 15:51:36,221 INFO [train.py:1028] (1/2) Epoch 13, batch 5850, loss[loss=0.2497, simple_loss=0.2816, pruned_loss=0.1089, over 12602.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2521, pruned_loss=0.08213, over 2578292.45 frames. ], batch size: 202, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:51:37,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=233304.5, ans=0.125
2024-06-20 15:51:38,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=233304.5, ans=0.125
2024-06-20 15:51:40,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=233304.5, ans=0.0
2024-06-20 15:51:41,884 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:52:01,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.05 vs. limit=6.0
2024-06-20 15:52:11,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=233359.5, ans=0.035
2024-06-20 15:52:14,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233359.5, ans=0.1
2024-06-20 15:52:16,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233377.83333333334, ans=0.1
2024-06-20 15:52:24,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=233377.83333333334, ans=0.0
2024-06-20 15:52:26,144 INFO [train.py:1028] (1/2) Epoch 13, batch 5900, loss[loss=0.1944, simple_loss=0.2376, pruned_loss=0.07558, over 13160.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2545, pruned_loss=0.08292, over 2578799.85 frames. ], batch size: 121, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:52:27,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=233396.16666666666, ans=0.2
2024-06-20 15:52:28,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=233396.16666666666, ans=0.0
2024-06-20 15:52:33,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=233396.16666666666, ans=0.05
2024-06-20 15:52:36,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=233414.5, ans=0.2
2024-06-20 15:52:42,069 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.805e+02 1.964e+02 2.143e+02 2.868e+02, threshold=3.928e+02, percent-clipped=0.0
2024-06-20 15:52:44,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=233414.5, ans=0.125
2024-06-20 15:52:50,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=233432.83333333334, ans=0.125
2024-06-20 15:52:50,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=233432.83333333334, ans=0.125
2024-06-20 15:52:53,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=233432.83333333334, ans=0.0
2024-06-20 15:53:12,461 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:53:18,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=233469.5, ans=0.0
2024-06-20 15:53:21,227 INFO [train.py:1028] (1/2) Epoch 13, batch 5950, loss[loss=0.2053, simple_loss=0.2459, pruned_loss=0.08234, over 13148.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2558, pruned_loss=0.08351, over 2583939.67 frames. ], batch size: 121, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:53:25,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233487.83333333334, ans=0.1
2024-06-20 15:53:27,146 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.41 vs. limit=15.0
2024-06-20 15:53:28,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=233487.83333333334, ans=0.0
2024-06-20 15:53:31,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.52 vs. limit=15.0
2024-06-20 15:53:38,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=233506.16666666666, ans=0.0
2024-06-20 15:53:53,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=233524.5, ans=0.125
2024-06-20 15:53:56,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.77 vs. limit=15.0
2024-06-20 15:54:10,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.53 vs. limit=22.5
2024-06-20 15:54:20,414 INFO [train.py:1028] (1/2) Epoch 13, batch 6000, loss[loss=0.2476, simple_loss=0.2818, pruned_loss=0.1067, over 12315.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2572, pruned_loss=0.08419, over 2578047.18 frames. ], batch size: 241, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:54:20,417 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 15:54:27,439 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.6850, 1.8112, 2.0303, 1.6837], device='cuda:1')
2024-06-20 15:54:31,919 INFO [train.py:1060] (1/2) Epoch 13, validation: loss=0.1926, simple_loss=0.2569, pruned_loss=0.06418, over 351949.00 frames.
2024-06-20 15:54:31,919 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB
2024-06-20 15:54:33,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=233579.5, ans=0.125
2024-06-20 15:54:36,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=233579.5, ans=0.0
2024-06-20 15:54:42,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=233597.83333333334, ans=0.0
2024-06-20 15:54:47,816 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 1.852e+02 2.010e+02 2.218e+02 3.854e+02, threshold=4.020e+02, percent-clipped=0.0
2024-06-20 15:54:49,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=233597.83333333334, ans=0.0
2024-06-20 15:54:52,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=233616.16666666666, ans=0.125
2024-06-20 15:55:00,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.26 vs. limit=15.0
2024-06-20 15:55:20,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=233652.83333333334, ans=0.125
2024-06-20 15:55:21,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=233671.16666666666, ans=0.125
2024-06-20 15:55:22,802 INFO [train.py:1028] (1/2) Epoch 13, batch 6050, loss[loss=0.2214, simple_loss=0.2734, pruned_loss=0.08475, over 12995.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2587, pruned_loss=0.08464, over 2580302.90 frames. ], batch size: 39, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:55:26,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=233671.16666666666, ans=0.025
2024-06-20 15:55:35,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=233689.5, ans=0.125
2024-06-20 15:55:36,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=233689.5, ans=0.2
2024-06-20 15:55:42,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=233707.83333333334, ans=0.125
2024-06-20 15:55:44,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=233707.83333333334, ans=0.05
2024-06-20 15:56:05,776 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:56:07,225 INFO [train.py:1028] (1/2) Epoch 13, batch 6100, loss[loss=0.2055, simple_loss=0.246, pruned_loss=0.08246, over 13106.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2605, pruned_loss=0.0853, over 2581713.80 frames. ], batch size: 121, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:56:11,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=233762.83333333334, ans=0.125
2024-06-20 15:56:23,509 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.907e+02 2.030e+02 2.293e+02 3.277e+02, threshold=4.059e+02, percent-clipped=0.0
2024-06-20 15:56:27,242 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.39 vs. limit=15.0
2024-06-20 15:56:29,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=233799.5, ans=0.0
2024-06-20 15:57:04,550 INFO [train.py:1028] (1/2) Epoch 13, batch 6150, loss[loss=0.223, simple_loss=0.2645, pruned_loss=0.09073, over 10741.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2625, pruned_loss=0.08621, over 2579843.64 frames. ], batch size: 303, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:57:09,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=233854.5, ans=0.125
2024-06-20 15:57:09,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.07 vs. limit=6.0
2024-06-20 15:57:42,185 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.24 vs. limit=10.0
2024-06-20 15:57:44,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=233927.83333333334, ans=0.0
2024-06-20 15:57:52,655 INFO [train.py:1028] (1/2) Epoch 13, batch 6200, loss[loss=0.2493, simple_loss=0.2963, pruned_loss=0.1012, over 13248.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2648, pruned_loss=0.08725, over 2575618.63 frames. ], batch size: 89, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:57:55,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=233946.16666666666, ans=0.125
2024-06-20 15:58:00,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=233964.5, ans=0.125
2024-06-20 15:58:06,867 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.78 vs. limit=15.0
2024-06-20 15:58:07,082 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.909e+02 2.129e+02 2.448e+02 3.357e+02, threshold=4.258e+02, percent-clipped=0.0
2024-06-20 15:58:16,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233982.83333333334, ans=0.1
2024-06-20 15:58:31,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=234019.5, ans=0.035
2024-06-20 15:58:32,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=234019.5, ans=0.125
2024-06-20 15:58:36,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=234019.5, ans=0.025
2024-06-20 15:58:41,371 INFO [train.py:1028] (1/2) Epoch 13, batch 6250, loss[loss=0.2294, simple_loss=0.2761, pruned_loss=0.09138, over 13224.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2664, pruned_loss=0.08789, over 2569071.03 frames. ], batch size: 83, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:58:44,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.64 vs. limit=22.5
2024-06-20 15:58:45,867 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.14 vs. limit=22.5
2024-06-20 15:58:49,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=234037.83333333334, ans=0.0
2024-06-20 15:58:51,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=234056.16666666666, ans=0.125
2024-06-20 15:58:51,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=234056.16666666666, ans=0.125
2024-06-20 15:59:12,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.06 vs. limit=15.0
2024-06-20 15:59:17,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.26 vs. limit=15.0
2024-06-20 15:59:20,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=234092.83333333334, ans=0.025
2024-06-20 15:59:24,179 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:59:29,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=234111.16666666666, ans=0.1
2024-06-20 15:59:33,082 INFO [train.py:1028] (1/2) Epoch 13, batch 6300, loss[loss=0.2111, simple_loss=0.2569, pruned_loss=0.08265, over 11040.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2676, pruned_loss=0.08819, over 2563181.65 frames. ], batch size: 16, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 15:59:34,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=234129.5, ans=0.1
2024-06-20 15:59:35,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.73 vs. limit=15.0
2024-06-20 15:59:55,744 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.864e+02 1.979e+02 2.097e+02 2.752e+02, threshold=3.958e+02, percent-clipped=0.0
2024-06-20 15:59:59,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=234166.16666666666, ans=0.125
2024-06-20 16:00:00,319 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.83 vs. limit=15.0
2024-06-20 16:00:10,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=15.0
2024-06-20 16:00:24,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=234202.83333333334, ans=6.0
2024-06-20 16:00:27,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=234202.83333333334, ans=0.125
2024-06-20 16:00:29,069 INFO [train.py:1028] (1/2) Epoch 13, batch 6350, loss[loss=0.2609, simple_loss=0.2913, pruned_loss=0.1153, over 12515.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.269, pruned_loss=0.08836, over 2573183.47 frames. ], batch size: 202, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:00:44,683 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.680e+00
2024-06-20 16:00:48,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.57 vs. limit=15.0
2024-06-20 16:00:54,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=234257.83333333334, ans=0.125
2024-06-20 16:00:54,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=234257.83333333334, ans=0.125
2024-06-20 16:00:58,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234276.16666666666, ans=0.1
2024-06-20 16:01:11,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=234294.5, ans=0.125
2024-06-20 16:01:15,798 INFO [train.py:1028] (1/2) Epoch 13, batch 6400, loss[loss=0.2083, simple_loss=0.2561, pruned_loss=0.08028, over 13241.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2711, pruned_loss=0.08921, over 2573931.07 frames. ], batch size: 67, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:01:17,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=234312.83333333334, ans=0.0
2024-06-20 16:01:20,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=234312.83333333334, ans=0.125
2024-06-20 16:01:25,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.49 vs. limit=12.0
2024-06-20 16:01:27,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0
2024-06-20 16:01:31,713 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 1.943e+02 2.108e+02 2.389e+02 3.739e+02, threshold=4.215e+02, percent-clipped=0.0
2024-06-20 16:01:33,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0
2024-06-20 16:01:34,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.82 vs. limit=10.0
2024-06-20 16:01:35,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=234349.5, ans=0.2
2024-06-20 16:01:37,206 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.74 vs. limit=15.0
2024-06-20 16:01:39,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=234349.5, ans=0.2
2024-06-20 16:01:41,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=234349.5, ans=0.0
2024-06-20 16:01:49,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=234367.83333333334, ans=0.025
2024-06-20 16:01:50,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=234367.83333333334, ans=0.0
2024-06-20 16:01:51,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=12.0
2024-06-20 16:02:09,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=234386.16666666666, ans=0.125
2024-06-20 16:02:10,974 INFO [train.py:1028] (1/2) Epoch 13, batch 6450, loss[loss=0.2556, simple_loss=0.2933, pruned_loss=0.109, over 12610.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2726, pruned_loss=0.09024, over 2580726.24 frames. ], batch size: 202, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:02:11,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0
2024-06-20 16:02:22,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234422.83333333334, ans=0.1
2024-06-20 16:02:25,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=234422.83333333334, ans=0.125
2024-06-20 16:02:32,228 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.44 vs. limit=22.5
2024-06-20 16:02:33,145 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=6.0
2024-06-20 16:02:37,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=234459.5, ans=0.125
2024-06-20 16:02:37,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=234459.5, ans=0.125
2024-06-20 16:02:39,153 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:02:40,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.42 vs. limit=10.0
2024-06-20 16:02:58,652 INFO [train.py:1028] (1/2) Epoch 13, batch 6500, loss[loss=0.2496, simple_loss=0.2774, pruned_loss=0.1109, over 10948.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2744, pruned_loss=0.09074, over 2584055.54 frames. ], batch size: 304, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:02:59,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=234496.16666666666, ans=0.125
2024-06-20 16:03:08,764 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.66 vs. limit=22.5
2024-06-20 16:03:14,114 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 1.906e+02 2.069e+02 2.215e+02 3.011e+02, threshold=4.137e+02, percent-clipped=0.0
2024-06-20 16:03:16,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.40 vs. limit=15.0
2024-06-20 16:03:35,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.60 vs. limit=10.0
2024-06-20 16:03:36,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.91 vs. limit=10.0
2024-06-20 16:03:43,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=234569.5, ans=0.125
2024-06-20 16:03:45,875 INFO [train.py:1028] (1/2) Epoch 13, batch 6550, loss[loss=0.2325, simple_loss=0.2887, pruned_loss=0.08817, over 12607.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2749, pruned_loss=0.09063, over 2587879.90 frames. ], batch size: 22, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:03:49,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=234587.83333333334, ans=0.0
2024-06-20 16:03:50,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=234587.83333333334, ans=0.2
2024-06-20 16:04:02,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=234624.5, ans=0.025
2024-06-20 16:04:04,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5
2024-06-20 16:04:06,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234624.5, ans=0.1
2024-06-20 16:04:08,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.72 vs. limit=10.0
2024-06-20 16:04:09,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=234624.5, ans=0.125
2024-06-20 16:04:11,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.48 vs. limit=15.0
2024-06-20 16:04:30,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.11 vs. limit=15.0
2024-06-20 16:04:33,322 INFO [train.py:1028] (1/2) Epoch 13, batch 6600, loss[loss=0.2429, simple_loss=0.2906, pruned_loss=0.09763, over 13083.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2751, pruned_loss=0.09088, over 2590277.90 frames. ], batch size: 71, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:04:36,144 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.25 vs. limit=22.5
2024-06-20 16:04:43,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=234697.83333333334, ans=0.125
2024-06-20 16:04:45,807 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.930e+02 2.182e+02 2.480e+02 3.228e+02, threshold=4.364e+02, percent-clipped=0.0
2024-06-20 16:04:52,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=234716.16666666666, ans=10.0
2024-06-20 16:05:25,330 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.51 vs. limit=15.0
2024-06-20 16:05:25,455 INFO [train.py:1028] (1/2) Epoch 13, batch 6650, loss[loss=0.2498, simple_loss=0.2953, pruned_loss=0.1021, over 12914.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2769, pruned_loss=0.09133, over 2585751.19 frames.
], batch size: 158, lr: 4.48e-03, grad_scale: 64.0 2024-06-20 16:05:33,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.25 vs. limit=15.0 2024-06-20 16:05:48,471 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 16:06:02,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=234826.16666666666, ans=0.125 2024-06-20 16:06:04,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=234826.16666666666, ans=0.125 2024-06-20 16:06:11,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=234844.5, ans=0.2 2024-06-20 16:06:15,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=234844.5, ans=0.0 2024-06-20 16:06:17,610 INFO [train.py:1028] (1/2) Epoch 13, batch 6700, loss[loss=0.2545, simple_loss=0.2951, pruned_loss=0.107, over 12701.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2782, pruned_loss=0.0921, over 2584510.89 frames. ], batch size: 176, lr: 4.48e-03, grad_scale: 64.0 2024-06-20 16:06:20,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=234862.83333333334, ans=0.125 2024-06-20 16:06:25,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=234862.83333333334, ans=0.125 2024-06-20 16:06:33,047 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.926e+02 2.056e+02 2.356e+02 3.484e+02, threshold=4.111e+02, percent-clipped=0.0 2024-06-20 16:06:38,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=15.0 2024-06-20 16:06:38,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=234899.5, ans=0.125 2024-06-20 16:06:44,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.64 vs. limit=15.0 2024-06-20 16:06:51,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=234917.83333333334, ans=0.04949747468305833 2024-06-20 16:07:03,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=234936.16666666666, ans=0.125 2024-06-20 16:07:03,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=234936.16666666666, ans=0.0 2024-06-20 16:07:06,259 INFO [train.py:1028] (1/2) Epoch 13, batch 6750, loss[loss=0.3023, simple_loss=0.3329, pruned_loss=0.1358, over 12251.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2785, pruned_loss=0.09235, over 2578634.00 frames. 
], batch size: 241, lr: 4.48e-03, grad_scale: 64.0 2024-06-20 16:07:07,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234954.5, ans=0.1 2024-06-20 16:07:58,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=235027.83333333334, ans=10.0 2024-06-20 16:07:59,847 INFO [train.py:1028] (1/2) Epoch 13, batch 6800, loss[loss=0.2113, simple_loss=0.2614, pruned_loss=0.08057, over 13226.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2798, pruned_loss=0.09285, over 2581752.55 frames. ], batch size: 67, lr: 4.48e-03, grad_scale: 64.0 2024-06-20 16:08:15,737 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.927e+02 2.013e+02 2.200e+02 2.988e+02, threshold=4.025e+02, percent-clipped=0.0 2024-06-20 16:08:17,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=235064.5, ans=0.0 2024-06-20 16:08:18,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=235082.83333333334, ans=0.125 2024-06-20 16:08:20,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=235082.83333333334, ans=0.1 2024-06-20 16:08:21,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=235082.83333333334, ans=0.5 2024-06-20 16:08:22,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=235082.83333333334, ans=0.125 2024-06-20 16:08:28,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=235101.16666666666, ans=0.0 2024-06-20 16:08:34,790 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 16:08:37,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=33.92 vs. limit=15.0 2024-06-20 16:08:40,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.59 vs. limit=12.0 2024-06-20 16:08:54,718 INFO [train.py:1028] (1/2) Epoch 13, batch 6850, loss[loss=0.2549, simple_loss=0.3148, pruned_loss=0.09756, over 13293.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2796, pruned_loss=0.09233, over 2584546.18 frames. ], batch size: 63, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:09:04,676 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.23 vs. limit=10.0 2024-06-20 16:09:05,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=235156.16666666666, ans=0.0 2024-06-20 16:09:05,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=235156.16666666666, ans=0.2 2024-06-20 16:09:15,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.63 vs. 
limit=12.0 2024-06-20 16:09:33,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=235211.16666666666, ans=0.125 2024-06-20 16:09:39,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=235229.5, ans=0.2 2024-06-20 16:09:40,519 INFO [train.py:1028] (1/2) Epoch 13, batch 6900, loss[loss=0.2338, simple_loss=0.2835, pruned_loss=0.09208, over 13206.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2802, pruned_loss=0.09258, over 2585927.66 frames. ], batch size: 49, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:09:50,650 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=12.0 2024-06-20 16:09:55,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 1.911e+02 2.083e+02 2.290e+02 2.958e+02, threshold=4.167e+02, percent-clipped=0.0 2024-06-20 16:10:03,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=235266.16666666666, ans=0.0 2024-06-20 16:10:13,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=235284.5, ans=0.0 2024-06-20 16:10:18,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=235302.83333333334, ans=0.0 2024-06-20 16:10:25,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.74 vs. limit=22.5 2024-06-20 16:10:29,585 INFO [train.py:1028] (1/2) Epoch 13, batch 6950, loss[loss=0.212, simple_loss=0.2624, pruned_loss=0.08077, over 10875.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2802, pruned_loss=0.09241, over 2579316.14 frames. ], batch size: 16, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:10:33,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2024-06-20 16:11:00,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=15.0 2024-06-20 16:11:10,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=235376.16666666666, ans=0.125 2024-06-20 16:11:23,917 INFO [train.py:1028] (1/2) Epoch 13, batch 7000, loss[loss=0.2418, simple_loss=0.2843, pruned_loss=0.09963, over 12959.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.2808, pruned_loss=0.09246, over 2575885.57 frames. ], batch size: 158, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:11:24,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=235412.83333333334, ans=0.09899494936611666 2024-06-20 16:11:38,643 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.970e+02 2.143e+02 2.425e+02 3.357e+02, threshold=4.286e+02, percent-clipped=0.0 2024-06-20 16:11:41,599 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=12.0 2024-06-20 16:11:55,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=235449.5, ans=0.1 2024-06-20 16:11:59,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=235467.83333333334, ans=0.125 2024-06-20 16:12:06,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2024-06-20 16:12:07,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=235467.83333333334, ans=0.125 2024-06-20 16:12:17,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=235486.16666666666, ans=0.125 2024-06-20 16:12:20,943 INFO [train.py:1028] (1/2) Epoch 13, batch 7050, loss[loss=0.249, simple_loss=0.2879, pruned_loss=0.105, over 12680.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2822, pruned_loss=0.09313, over 2583372.93 frames. ], batch size: 176, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:12:33,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=235522.83333333334, ans=0.1 2024-06-20 16:12:41,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=235541.16666666666, ans=0.125 2024-06-20 16:12:45,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=235541.16666666666, ans=0.0 2024-06-20 16:12:46,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=235559.5, ans=0.2 2024-06-20 16:12:54,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=235559.5, ans=0.2 2024-06-20 16:13:02,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=235577.83333333334, ans=0.1 2024-06-20 16:13:05,003 INFO [train.py:1028] (1/2) Epoch 13, batch 7100, loss[loss=0.2628, simple_loss=0.3134, pruned_loss=0.1062, over 13163.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.283, pruned_loss=0.09362, over 2575213.58 frames. ], batch size: 112, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:13:19,979 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 2.022e+02 2.228e+02 2.469e+02 3.621e+02, threshold=4.455e+02, percent-clipped=0.0 2024-06-20 16:13:25,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=235632.83333333334, ans=0.2 2024-06-20 16:13:30,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=235632.83333333334, ans=0.1 2024-06-20 16:13:49,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=235669.5, ans=0.09899494936611666 2024-06-20 16:13:53,756 INFO [train.py:1028] (1/2) Epoch 13, batch 7150, loss[loss=0.2548, simple_loss=0.2907, pruned_loss=0.1095, over 12531.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2837, pruned_loss=0.09364, over 2572205.29 frames. 
], batch size: 202, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:13:55,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=235687.83333333334, ans=0.125 2024-06-20 16:14:17,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=235706.16666666666, ans=0.1 2024-06-20 16:14:24,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=235724.5, ans=0.125 2024-06-20 16:14:29,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=235742.83333333334, ans=0.125 2024-06-20 16:14:36,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=235761.16666666666, ans=0.1 2024-06-20 16:14:42,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=235761.16666666666, ans=0.5 2024-06-20 16:14:44,407 INFO [train.py:1028] (1/2) Epoch 13, batch 7200, loss[loss=0.2596, simple_loss=0.3085, pruned_loss=0.1053, over 13099.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2849, pruned_loss=0.09378, over 2577530.21 frames. ], batch size: 112, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:15:03,967 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.965e+02 2.146e+02 2.355e+02 3.295e+02, threshold=4.292e+02, percent-clipped=0.0 2024-06-20 16:15:06,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=235816.16666666666, ans=0.0 2024-06-20 16:15:27,990 INFO [train.py:1028] (1/2) Epoch 13, batch 7250, loss[loss=0.2327, simple_loss=0.2877, pruned_loss=0.08889, over 12948.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2851, pruned_loss=0.09363, over 2580005.81 frames. ], batch size: 36, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:15:30,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=235871.16666666666, ans=0.2 2024-06-20 16:15:38,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=235889.5, ans=0.2 2024-06-20 16:15:46,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=235907.83333333334, ans=0.0 2024-06-20 16:15:53,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=235926.16666666666, ans=0.0 2024-06-20 16:16:03,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=235944.5, ans=0.125 2024-06-20 16:16:07,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2024-06-20 16:16:09,468 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.39 vs. 
limit=22.5 2024-06-20 16:16:12,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=235962.83333333334, ans=0.0 2024-06-20 16:16:13,093 INFO [train.py:1028] (1/2) Epoch 13, batch 7300, loss[loss=0.2177, simple_loss=0.2706, pruned_loss=0.08236, over 12995.00 frames. ], tot_loss[loss=0.238, simple_loss=0.2868, pruned_loss=0.09462, over 2579773.61 frames. ], batch size: 36, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:16:20,246 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2024-06-20 16:16:27,633 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 1.987e+02 2.159e+02 2.332e+02 3.155e+02, threshold=4.318e+02, percent-clipped=0.0 2024-06-20 16:16:28,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=235981.16666666666, ans=0.2 2024-06-20 16:16:31,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=235999.5, ans=0.0 2024-06-20 16:16:40,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=235999.5, ans=0.125 2024-06-20 16:17:04,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2024-06-20 16:17:08,877 INFO [train.py:1028] (1/2) Epoch 13, batch 7350, loss[loss=0.2472, simple_loss=0.302, pruned_loss=0.09625, over 13340.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.2873, pruned_loss=0.09495, over 2583019.63 frames. ], batch size: 46, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:17:11,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236054.5, ans=0.1 2024-06-20 16:17:46,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=236127.83333333334, ans=0.2 2024-06-20 16:17:58,665 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=4.074e+00 2024-06-20 16:18:03,066 INFO [train.py:1028] (1/2) Epoch 13, batch 7400, loss[loss=0.2223, simple_loss=0.2833, pruned_loss=0.08062, over 13310.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2867, pruned_loss=0.0945, over 2587281.22 frames. 
], batch size: 63, lr: 4.47e-03, grad_scale: 64.0 2024-06-20 16:18:05,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=236146.16666666666, ans=0.0 2024-06-20 16:18:18,684 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 1.922e+02 2.090e+02 2.344e+02 3.456e+02, threshold=4.181e+02, percent-clipped=0.0 2024-06-20 16:18:36,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=236201.16666666666, ans=0.125 2024-06-20 16:18:38,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=236201.16666666666, ans=0.09899494936611666 2024-06-20 16:18:39,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=236219.5, ans=0.0 2024-06-20 16:18:41,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=236219.5, ans=0.025 2024-06-20 16:18:50,721 INFO [train.py:1028] (1/2) Epoch 13, batch 7450, loss[loss=0.2305, simple_loss=0.282, pruned_loss=0.08951, over 12821.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.2865, pruned_loss=0.09412, over 2581867.21 frames. ], batch size: 29, lr: 4.46e-03, grad_scale: 64.0 2024-06-20 16:18:54,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=236237.83333333334, ans=0.025 2024-06-20 16:19:13,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236274.5, ans=0.1 2024-06-20 16:19:15,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=236274.5, ans=0.125 2024-06-20 16:19:15,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=236274.5, ans=0.0 2024-06-20 16:19:17,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=236274.5, ans=0.125 2024-06-20 16:19:18,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=236274.5, ans=0.125 2024-06-20 16:19:19,815 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 16:19:21,687 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 16:19:24,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236292.83333333334, ans=0.1 2024-06-20 16:19:34,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=236311.16666666666, ans=0.2 2024-06-20 16:19:35,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=236311.16666666666, ans=0.1 2024-06-20 16:19:39,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.50 vs. 
limit=15.0 2024-06-20 16:19:39,945 INFO [train.py:1028] (1/2) Epoch 13, batch 7500, loss[loss=0.2541, simple_loss=0.2909, pruned_loss=0.1086, over 10515.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2883, pruned_loss=0.09522, over 2578807.89 frames. ], batch size: 304, lr: 4.46e-03, grad_scale: 64.0 2024-06-20 16:19:44,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=236329.5, ans=0.0 2024-06-20 16:20:00,006 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.951e+02 2.127e+02 2.339e+02 3.812e+02, threshold=4.253e+02, percent-clipped=0.0 2024-06-20 16:20:01,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=236347.83333333334, ans=0.0 2024-06-20 16:20:15,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=236384.5, ans=0.0 2024-06-20 16:20:26,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=236402.83333333334, ans=0.125 2024-06-20 16:20:28,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=236402.83333333334, ans=0.125 2024-06-20 16:20:30,262 INFO [train.py:1028] (1/2) Epoch 13, batch 7550, loss[loss=0.2403, simple_loss=0.284, pruned_loss=0.0983, over 12989.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2885, pruned_loss=0.0954, over 2578421.95 frames. ], batch size: 158, lr: 4.46e-03, grad_scale: 64.0 2024-06-20 16:20:47,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=236439.5, ans=0.125 2024-06-20 16:20:48,207 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.38 vs. limit=15.0 2024-06-20 16:21:04,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236457.83333333334, ans=0.1 2024-06-20 16:21:04,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=9.17 vs. limit=12.0 2024-06-20 16:21:10,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.26 vs. limit=15.0 2024-06-20 16:21:21,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=236494.5, ans=0.125 2024-06-20 16:21:26,148 INFO [train.py:1028] (1/2) Epoch 13, batch 7600, loss[loss=0.2695, simple_loss=0.3218, pruned_loss=0.1086, over 13223.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.2898, pruned_loss=0.09599, over 2577374.90 frames. ], batch size: 83, lr: 4.46e-03, grad_scale: 64.0 2024-06-20 16:21:26,495 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 16:21:37,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=236531.16666666666, ans=0.2 2024-06-20 16:21:39,181 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.22 vs. 
limit=15.0 2024-06-20 16:21:42,407 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 1.955e+02 2.099e+02 2.375e+02 4.096e+02, threshold=4.197e+02, percent-clipped=0.0 2024-06-20 16:21:44,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=236531.16666666666, ans=0.125 2024-06-20 16:21:47,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=236549.5, ans=0.125 2024-06-20 16:21:51,810 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.05 vs. limit=15.0 2024-06-20 16:22:11,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=236586.16666666666, ans=0.0 2024-06-20 16:22:14,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=236586.16666666666, ans=0.5 2024-06-20 16:22:16,517 INFO [train.py:1028] (1/2) Epoch 13, batch 7650, loss[loss=0.2474, simple_loss=0.2976, pruned_loss=0.09856, over 12891.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.2903, pruned_loss=0.09604, over 2573531.75 frames. ], batch size: 33, lr: 4.46e-03, grad_scale: 64.0 2024-06-20 16:22:27,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=236622.83333333334, ans=0.2 2024-06-20 16:22:54,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=236659.5, ans=0.0 2024-06-20 16:22:58,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=236659.5, ans=0.2 2024-06-20 16:23:08,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236677.83333333334, ans=0.1 2024-06-20 16:23:11,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=236696.16666666666, ans=0.05 2024-06-20 16:23:12,144 INFO [train.py:1028] (1/2) Epoch 13, batch 7700, loss[loss=0.2176, simple_loss=0.2713, pruned_loss=0.0819, over 13269.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.2907, pruned_loss=0.09614, over 2569220.07 frames. 
], batch size: 63, lr: 4.46e-03, grad_scale: 64.0 2024-06-20 16:23:17,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=236696.16666666666, ans=0.125 2024-06-20 16:23:23,564 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.592e+00 2024-06-20 16:23:25,152 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.680e+02 1.955e+02 2.120e+02 2.386e+02 3.256e+02, threshold=4.240e+02, percent-clipped=0.0 2024-06-20 16:23:35,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=236732.83333333334, ans=0.125 2024-06-20 16:23:35,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=236732.83333333334, ans=0.125 2024-06-20 16:23:53,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=236769.5, ans=0.2 2024-06-20 16:24:03,321 INFO [train.py:1028] (1/2) Epoch 13, batch 7750, loss[loss=0.2197, simple_loss=0.271, pruned_loss=0.08419, over 13268.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.2906, pruned_loss=0.09616, over 2573596.37 frames. ], batch size: 72, lr: 4.46e-03, grad_scale: 128.0 2024-06-20 16:24:06,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=236787.83333333334, ans=0.0 2024-06-20 16:24:19,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=236806.16666666666, ans=0.125 2024-06-20 16:24:33,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=236842.83333333334, ans=0.1 2024-06-20 16:24:43,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=236861.16666666666, ans=0.125 2024-06-20 16:24:52,578 INFO [train.py:1028] (1/2) Epoch 13, batch 7800, loss[loss=0.2511, simple_loss=0.3054, pruned_loss=0.09836, over 13131.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.2918, pruned_loss=0.09672, over 2578399.19 frames. ], batch size: 95, lr: 4.46e-03, grad_scale: 128.0 2024-06-20 16:24:56,952 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=15.0 2024-06-20 16:24:57,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=236879.5, ans=0.0 2024-06-20 16:24:57,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=236879.5, ans=0.125 2024-06-20 16:24:58,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236879.5, ans=0.1 2024-06-20 16:24:59,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. 
limit=6.0 2024-06-20 16:25:04,762 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 1.916e+02 2.044e+02 2.209e+02 2.950e+02, threshold=4.088e+02, percent-clipped=0.0 2024-06-20 16:25:15,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=236916.16666666666, ans=0.07 2024-06-20 16:25:44,085 INFO [train.py:1028] (1/2) Epoch 13, batch 7850, loss[loss=0.2433, simple_loss=0.2863, pruned_loss=0.1001, over 11649.00 frames. ], tot_loss[loss=0.243, simple_loss=0.2923, pruned_loss=0.09688, over 2573439.04 frames. ], batch size: 17, lr: 4.46e-03, grad_scale: 128.0 2024-06-20 16:26:27,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237026.16666666666, ans=0.1 2024-06-20 16:26:38,445 INFO [train.py:1028] (1/2) Epoch 13, batch 7900, loss[loss=0.24, simple_loss=0.2885, pruned_loss=0.09572, over 13155.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.2927, pruned_loss=0.09733, over 2572339.46 frames. ], batch size: 77, lr: 4.46e-03, grad_scale: 128.0 2024-06-20 16:26:38,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.80 vs. limit=15.0 2024-06-20 16:26:39,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=237062.83333333334, ans=15.0 2024-06-20 16:26:44,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=237062.83333333334, ans=0.05 2024-06-20 16:26:50,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=237081.16666666666, ans=0.0 2024-06-20 16:26:50,780 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 2.034e+02 2.314e+02 2.620e+02 3.741e+02, threshold=4.628e+02, percent-clipped=0.0 2024-06-20 16:26:56,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.82 vs. limit=12.0 2024-06-20 16:27:05,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=237117.83333333334, ans=0.125 2024-06-20 16:27:21,133 INFO [train.py:1028] (1/2) Epoch 13, batch 7950, loss[loss=0.2477, simple_loss=0.2851, pruned_loss=0.1052, over 10783.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.2928, pruned_loss=0.09714, over 2575692.07 frames. ], batch size: 304, lr: 4.46e-03, grad_scale: 128.0 2024-06-20 16:27:25,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.50 vs. 
limit=10.0 2024-06-20 16:27:26,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=237154.5, ans=0.125 2024-06-20 16:27:27,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=237154.5, ans=0.125 2024-06-20 16:27:36,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=237172.83333333334, ans=0.125 2024-06-20 16:27:40,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=237172.83333333334, ans=0.125 2024-06-20 16:27:46,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=237191.16666666666, ans=0.09899494936611666 2024-06-20 16:27:52,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=237209.5, ans=0.0 2024-06-20 16:27:57,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=237209.5, ans=0.0 2024-06-20 16:28:06,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=237227.83333333334, ans=0.025 2024-06-20 16:28:10,414 INFO [train.py:1028] (1/2) Epoch 13, batch 8000, loss[loss=0.2283, simple_loss=0.2855, pruned_loss=0.08556, over 12720.00 frames. ], tot_loss[loss=0.245, simple_loss=0.2944, pruned_loss=0.09781, over 2573031.66 frames. ], batch size: 29, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:28:26,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=237246.16666666666, ans=0.125 2024-06-20 16:28:27,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=237264.5, ans=0.125 2024-06-20 16:28:28,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=237264.5, ans=0.2 2024-06-20 16:28:33,153 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.710e+02 1.930e+02 2.131e+02 2.408e+02 3.572e+02, threshold=4.262e+02, percent-clipped=0.0 2024-06-20 16:28:34,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=237264.5, ans=0.125 2024-06-20 16:28:42,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=237282.83333333334, ans=0.125 2024-06-20 16:28:47,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=237301.16666666666, ans=0.025 2024-06-20 16:28:50,682 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.00 vs. 
limit=15.0 2024-06-20 16:28:53,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=237301.16666666666, ans=0.2 2024-06-20 16:29:02,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=237319.5, ans=0.0 2024-06-20 16:29:05,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=237337.83333333334, ans=0.0 2024-06-20 16:29:05,818 INFO [train.py:1028] (1/2) Epoch 13, batch 8050, loss[loss=0.2291, simple_loss=0.2794, pruned_loss=0.08944, over 13238.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.2935, pruned_loss=0.09753, over 2573322.31 frames. ], batch size: 83, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:29:08,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=237337.83333333334, ans=0.2 2024-06-20 16:29:14,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=237337.83333333334, ans=0.0 2024-06-20 16:29:17,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=237337.83333333334, ans=0.125 2024-06-20 16:29:19,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=237356.16666666666, ans=0.125 2024-06-20 16:29:21,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=237356.16666666666, ans=0.07 2024-06-20 16:29:23,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=237356.16666666666, ans=0.0 2024-06-20 16:29:56,570 INFO [train.py:1028] (1/2) Epoch 13, batch 8100, loss[loss=0.2384, simple_loss=0.296, pruned_loss=0.09037, over 13131.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.2942, pruned_loss=0.09759, over 2577180.40 frames. 
], batch size: 112, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:30:02,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=237429.5, ans=0.0 2024-06-20 16:30:05,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237447.83333333334, ans=0.1 2024-06-20 16:30:07,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=237447.83333333334, ans=0.125 2024-06-20 16:30:12,190 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 1.972e+02 2.080e+02 2.238e+02 3.114e+02, threshold=4.161e+02, percent-clipped=0.0 2024-06-20 16:30:20,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=237466.16666666666, ans=0.125 2024-06-20 16:30:22,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237466.16666666666, ans=0.1 2024-06-20 16:30:23,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237466.16666666666, ans=0.1 2024-06-20 16:30:32,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=237484.5, ans=0.0 2024-06-20 16:30:36,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=237502.83333333334, ans=0.125 2024-06-20 16:30:39,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=237502.83333333334, ans=0.0 2024-06-20 16:30:45,505 INFO [train.py:1028] (1/2) Epoch 13, batch 8150, loss[loss=0.2311, simple_loss=0.2774, pruned_loss=0.0924, over 13085.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2937, pruned_loss=0.09716, over 2580262.98 frames. ], batch size: 121, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:30:58,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=237539.5, ans=0.125 2024-06-20 16:31:08,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=15.0 2024-06-20 16:31:11,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=237557.83333333334, ans=0.04949747468305833 2024-06-20 16:31:12,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2024-06-20 16:31:36,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=237594.5, ans=0.0 2024-06-20 16:31:37,238 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.23 vs. limit=22.5 2024-06-20 16:31:41,276 INFO [train.py:1028] (1/2) Epoch 13, batch 8200, loss[loss=0.2678, simple_loss=0.3126, pruned_loss=0.1115, over 13147.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.2933, pruned_loss=0.09676, over 2583565.88 frames. 
], batch size: 112, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:31:42,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=237612.83333333334, ans=0.2 2024-06-20 16:31:58,378 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 1.964e+02 2.094e+02 2.266e+02 2.704e+02, threshold=4.187e+02, percent-clipped=0.0 2024-06-20 16:32:01,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=237649.5, ans=0.125 2024-06-20 16:32:05,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.60 vs. limit=22.5 2024-06-20 16:32:22,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=237667.83333333334, ans=0.125 2024-06-20 16:32:25,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2024-06-20 16:32:36,381 INFO [train.py:1028] (1/2) Epoch 13, batch 8250, loss[loss=0.2473, simple_loss=0.2974, pruned_loss=0.09856, over 13186.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.2941, pruned_loss=0.09746, over 2584119.80 frames. ], batch size: 52, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:32:38,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=237704.5, ans=0.0 2024-06-20 16:32:47,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.27 vs. limit=15.0 2024-06-20 16:32:52,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=237741.16666666666, ans=0.0 2024-06-20 16:32:54,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.83 vs. limit=10.0 2024-06-20 16:33:04,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=237759.5, ans=0.125 2024-06-20 16:33:05,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=237759.5, ans=15.0 2024-06-20 16:33:18,542 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 16:33:19,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=237777.83333333334, ans=0.125 2024-06-20 16:33:21,275 INFO [train.py:1028] (1/2) Epoch 13, batch 8300, loss[loss=0.2484, simple_loss=0.2964, pruned_loss=0.1002, over 13058.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.2933, pruned_loss=0.09678, over 2580239.25 frames. ], batch size: 103, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:33:37,042 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 1.936e+02 2.073e+02 2.269e+02 3.138e+02, threshold=4.147e+02, percent-clipped=0.0 2024-06-20 16:33:38,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.28 vs. 
limit=22.5 2024-06-20 16:33:40,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=237832.83333333334, ans=0.0 2024-06-20 16:33:42,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=237832.83333333334, ans=0.125 2024-06-20 16:33:46,694 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.75 vs. limit=15.0 2024-06-20 16:33:51,908 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=30.68 vs. limit=22.5 2024-06-20 16:33:53,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.79 vs. limit=15.0 2024-06-20 16:34:17,282 INFO [train.py:1028] (1/2) Epoch 13, batch 8350, loss[loss=0.2538, simple_loss=0.2974, pruned_loss=0.1051, over 13129.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2939, pruned_loss=0.09699, over 2581119.58 frames. ], batch size: 112, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:34:37,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=237924.5, ans=0.04949747468305833 2024-06-20 16:34:39,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=237924.5, ans=0.0 2024-06-20 16:35:14,034 INFO [train.py:1028] (1/2) Epoch 13, batch 8400, loss[loss=0.2292, simple_loss=0.2811, pruned_loss=0.08866, over 12961.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.2936, pruned_loss=0.09692, over 2577591.02 frames. ], batch size: 39, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:35:22,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=237979.5, ans=0.025 2024-06-20 16:35:25,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=237997.83333333334, ans=0.2 2024-06-20 16:35:30,460 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 1.993e+02 2.147e+02 2.339e+02 3.017e+02, threshold=4.294e+02, percent-clipped=0.0 2024-06-20 16:35:30,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=237997.83333333334, ans=0.0 2024-06-20 16:35:44,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=238034.5, ans=0.125 2024-06-20 16:35:51,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=238034.5, ans=0.0 2024-06-20 16:35:54,656 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.607e+01 2024-06-20 16:36:02,868 INFO [train.py:1028] (1/2) Epoch 13, batch 8450, loss[loss=0.2412, simple_loss=0.2882, pruned_loss=0.09707, over 13201.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.2946, pruned_loss=0.09729, over 2579059.34 frames. ], batch size: 112, lr: 4.45e-03, grad_scale: 128.0 2024-06-20 16:36:05,835 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.88 vs. 
2024-06-20 16:36:08,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=238071.16666666666, ans=0.125
2024-06-20 16:36:10,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=238089.5, ans=0.0
2024-06-20 16:36:12,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=238089.5, ans=0.0
2024-06-20 16:36:34,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=238144.5, ans=0.04949747468305833
2024-06-20 16:36:39,465 INFO [train.py:1028] (1/2) Epoch 13, batch 8500, loss[loss=0.2099, simple_loss=0.2716, pruned_loss=0.07413, over 12644.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.2958, pruned_loss=0.09757, over 2577956.81 frames. ], batch size: 29, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:36:54,781 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.075e+02 2.273e+02 2.463e+02 3.415e+02, threshold=4.547e+02, percent-clipped=0.0
2024-06-20 16:37:00,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.68 vs. limit=15.0
2024-06-20 16:37:18,915 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. limit=6.0
2024-06-20 16:37:22,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=238217.83333333334, ans=0.0
2024-06-20 16:37:24,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=238236.16666666666, ans=0.125
2024-06-20 16:37:34,943 INFO [train.py:1028] (1/2) Epoch 13, batch 8550, loss[loss=0.2362, simple_loss=0.2894, pruned_loss=0.09152, over 12425.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.2949, pruned_loss=0.09694, over 2576498.35 frames. ], batch size: 22, lr: 4.45e-03, grad_scale: 64.0
2024-06-20 16:37:35,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=238254.5, ans=0.0
2024-06-20 16:37:36,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=238254.5, ans=0.125
2024-06-20 16:37:50,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.19 vs. limit=15.0
2024-06-20 16:38:21,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=238327.83333333334, ans=0.0
2024-06-20 16:38:31,808 INFO [train.py:1028] (1/2) Epoch 13, batch 8600, loss[loss=0.2448, simple_loss=0.2893, pruned_loss=0.1001, over 13089.00 frames. ], tot_loss[loss=0.245, simple_loss=0.2956, pruned_loss=0.09723, over 2574743.50 frames. ], batch size: 121, lr: 4.44e-03, grad_scale: 64.0
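
Each train.py:1028 entry reports three values: simple_loss (the cheap linear-combiner transducer loss), pruned_loss (the full joiner evaluated only inside the pruned lattice), and loss, their weighted sum; loss[...] is the current batch, tot_loss[...] an aggregate over the recent window. The logged numbers are consistent with the combination below, using simple_loss_scale=0.5 from the config header, e.g. for batch 8250: 0.5 * 0.2974 + 1.0 * 0.09856 = 0.2473.

```python
import torch

def combine_transducer_losses(
    simple_loss: torch.Tensor,
    pruned_loss: torch.Tensor,
    simple_loss_scale: float = 0.5,   # from the config header of this log
    pruned_loss_scale: float = 1.0,   # inferred from the logged values; icefall
) -> torch.Tensor:                    # ramps this up during warm-up, and by
    # batch ~238k it appears to have reached 1.0.
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss
```
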
2024-06-20 16:38:35,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=238346.16666666666, ans=0.125
2024-06-20 16:38:47,937 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.022e+02 2.224e+02 2.393e+02 4.177e+02, threshold=4.447e+02, percent-clipped=0.0
2024-06-20 16:39:01,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=238401.16666666666, ans=0.2
2024-06-20 16:39:09,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=238419.5, ans=0.0
2024-06-20 16:39:09,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=238419.5, ans=0.0
2024-06-20 16:39:18,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.13 vs. limit=15.0
2024-06-20 16:39:20,249 INFO [train.py:1028] (1/2) Epoch 13, batch 8650, loss[loss=0.2148, simple_loss=0.2706, pruned_loss=0.07952, over 13108.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.2955, pruned_loss=0.09701, over 2577767.77 frames. ], batch size: 103, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:39:29,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.06 vs. limit=15.0
2024-06-20 16:39:29,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.06 vs. limit=15.0
2024-06-20 16:39:47,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0
2024-06-20 16:39:48,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238492.83333333334, ans=0.1
2024-06-20 16:39:50,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=238492.83333333334, ans=0.125
2024-06-20 16:39:58,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=238511.16666666666, ans=0.2
2024-06-20 16:40:07,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=238511.16666666666, ans=0.1
2024-06-20 16:40:14,401 INFO [train.py:1028] (1/2) Epoch 13, batch 8700, loss[loss=0.2303, simple_loss=0.2919, pruned_loss=0.08433, over 13196.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.2965, pruned_loss=0.09789, over 2573407.97 frames. ], batch size: 59, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:40:19,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.28 vs. limit=15.0
2024-06-20 16:40:19,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=238529.5, ans=0.0
2024-06-20 16:40:20,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.85 vs. limit=15.0
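
The optim.py WARNING lines summarize the distribution of gradient norms over a recent window: the five numbers are the min / 25% / median / 75% / max, and threshold equals Clipping_scale times the median (in the entry above, 2.0 x 2.224e+02 ≈ 4.447e+02); percent-clipped reports how many recent steps exceeded it. A sketch of how such a report can be produced (an assumed reconstruction; icefall's optimizer maintains the window and actually clips at the threshold):

```python
import torch

def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0) -> None:
    # grad_norms: 1-D float tensor of per-batch gradient norms from a recent window.
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]                        # scale x median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    print(
        f"Clipping_scale={clipping_scale}, grad-norm quartiles "
        + " ".join(f"{v:.3e}" for v in q.tolist())
        + f", threshold={threshold:.3e}, percent-clipped={percent_clipped:.1f}"
    )
```
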
2024-06-20 16:40:30,550 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 1.984e+02 2.125e+02 2.357e+02 3.828e+02, threshold=4.250e+02, percent-clipped=0.0
2024-06-20 16:40:37,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=238566.16666666666, ans=0.0
2024-06-20 16:40:40,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=238566.16666666666, ans=0.125
2024-06-20 16:40:41,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=238566.16666666666, ans=15.0
2024-06-20 16:40:52,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.41 vs. limit=15.0
2024-06-20 16:40:54,932 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.95 vs. limit=15.0
2024-06-20 16:40:57,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=238602.83333333334, ans=0.125
2024-06-20 16:41:00,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=238602.83333333334, ans=0.0
2024-06-20 16:41:06,643 INFO [train.py:1028] (1/2) Epoch 13, batch 8750, loss[loss=0.2385, simple_loss=0.2862, pruned_loss=0.09535, over 13147.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.2962, pruned_loss=0.09783, over 2570587.96 frames. ], batch size: 121, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:41:13,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0
2024-06-20 16:41:46,614 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5
2024-06-20 16:41:53,563 INFO [train.py:1028] (1/2) Epoch 13, batch 8800, loss[loss=0.251, simple_loss=0.3069, pruned_loss=0.09756, over 13246.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.2964, pruned_loss=0.09775, over 2575297.86 frames. ], batch size: 72, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:42:10,182 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 1.958e+02 2.091e+02 2.300e+02 3.040e+02, threshold=4.181e+02, percent-clipped=0.0
2024-06-20 16:42:18,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0
2024-06-20 16:42:18,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=238749.5, ans=0.125
2024-06-20 16:42:26,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=238767.83333333334, ans=0.2
2024-06-20 16:42:40,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=238786.16666666666, ans=0.0
2024-06-20 16:42:40,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=238786.16666666666, ans=0.125
2024-06-20 16:42:49,473 INFO [train.py:1028] (1/2) Epoch 13, batch 8850, loss[loss=0.277, simple_loss=0.3155, pruned_loss=0.1193, over 12565.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.2964, pruned_loss=0.09818, over 2563618.08 frames. ], batch size: 202, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:43:00,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=238822.83333333334, ans=0.0
2024-06-20 16:43:06,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=12.0
2024-06-20 16:43:07,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=238841.16666666666, ans=0.0
2024-06-20 16:43:12,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=238841.16666666666, ans=0.2
2024-06-20 16:43:17,158 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.82 vs. limit=15.0
2024-06-20 16:43:30,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=238877.83333333334, ans=0.0
2024-06-20 16:43:35,912 INFO [train.py:1028] (1/2) Epoch 13, batch 8900, loss[loss=0.2643, simple_loss=0.3136, pruned_loss=0.1075, over 12838.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.2969, pruned_loss=0.0986, over 2561813.29 frames. ], batch size: 33, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:43:41,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.68 vs. limit=6.0
2024-06-20 16:43:57,630 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.031e+02 2.197e+02 2.359e+02 2.909e+02, threshold=4.394e+02, percent-clipped=0.0
2024-06-20 16:44:06,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=238932.83333333334, ans=0.0
2024-06-20 16:44:08,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=238932.83333333334, ans=0.025
2024-06-20 16:44:30,412 INFO [train.py:1028] (1/2) Epoch 13, batch 8950, loss[loss=0.2708, simple_loss=0.3134, pruned_loss=0.1141, over 12528.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2979, pruned_loss=0.09874, over 2560645.82 frames. ], batch size: 202, lr: 4.44e-03, grad_scale: 64.0
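
grad_scale is the loss-scaling factor for mixed-precision training (use_fp16: True in the config header). It was 128.0 through batch 8500 and is 64.0 from batch 8550 onward: the scaler halves the scale whenever a step overflows in fp16 and periodically tries to grow it back. The standard PyTorch machinery behaves as sketched below; the growth parameters are illustrative, and icefall may manage the scaler in its own loop.

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=128.0,     # matches the grad_scale first seen in this window
    backoff_factor=0.5,   # a step with inf/nan grads halves the scale: 128 -> 64
    growth_factor=2.0,    # doubled again after `growth_interval` clean steps
    growth_interval=2000,
)

# Typical step (compute_loss is a hypothetical helper):
# optimizer.zero_grad()
# with torch.cuda.amp.autocast():
#     loss = compute_loss(model, batch)
# scaler.scale(loss).backward()
# scaler.step(optimizer)
# scaler.update()          # adjusts grad_scale as described above
```
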
2024-06-20 16:44:32,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=238987.83333333334, ans=0.125
2024-06-20 16:44:33,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0
2024-06-20 16:44:39,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239006.16666666666, ans=0.1
2024-06-20 16:44:44,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=239006.16666666666, ans=0.2
2024-06-20 16:44:49,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=239024.5, ans=0.125
2024-06-20 16:44:53,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=239024.5, ans=0.125
2024-06-20 16:45:03,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=239061.16666666666, ans=0.125
2024-06-20 16:45:11,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=239061.16666666666, ans=22.5
2024-06-20 16:45:13,370 INFO [train.py:1028] (1/2) Epoch 13, batch 9000, loss[loss=0.2463, simple_loss=0.2964, pruned_loss=0.09814, over 13272.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.2985, pruned_loss=0.09888, over 2567181.84 frames. ], batch size: 46, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:45:13,371 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 16:45:22,987 INFO [train.py:1060] (1/2) Epoch 13, validation: loss=0.1913, simple_loss=0.2561, pruned_loss=0.06321, over 351949.00 frames.
2024-06-20 16:45:22,988 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB
2024-06-20 16:45:23,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=239079.5, ans=0.04949747468305833
2024-06-20 16:45:30,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=239097.83333333334, ans=0.125
2024-06-20 16:45:36,948 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.002e+02 2.128e+02 2.271e+02 3.032e+02, threshold=4.256e+02, percent-clipped=0.0
2024-06-20 16:45:41,201 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0
2024-06-20 16:46:07,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=239152.83333333334, ans=0.0
2024-06-20 16:46:14,347 INFO [train.py:1028] (1/2) Epoch 13, batch 9050, loss[loss=0.2432, simple_loss=0.2975, pruned_loss=0.09446, over 11179.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.2985, pruned_loss=0.09887, over 2565981.21 frames. ], batch size: 16, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:46:14,954 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5
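
At every valid_interval (3000 batches, per the config header) the trainer pauses, as at batch 9000 above, and computes a dev-set loss; the constant "over 351949.00 frames." is the total frame count of the dev set, which is why it never changes. A minimal frame-weighted validation loop, with compute_loss as a hypothetical stand-in for the recipe's loss function:

```python
import torch

def validate(model: torch.nn.Module, dev_loader) -> float:
    # Frame-weighted average: compute_loss is assumed to return the loss summed
    # over the batch plus the batch's frame count, so short and long utterances
    # contribute in proportion to their duration. tot_frames accumulates to the
    # dev set's "over 351949.00 frames." seen above.
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss_sum, num_frames = compute_loss(model, batch)  # hypothetical helper
            tot_loss += float(loss_sum)
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames
```
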
2024-06-20 16:46:22,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=239171.16666666666, ans=0.0
2024-06-20 16:46:38,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239207.83333333334, ans=0.1
2024-06-20 16:46:46,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=239226.16666666666, ans=0.2
2024-06-20 16:46:57,104 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.069e+00
2024-06-20 16:47:02,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=239262.83333333334, ans=0.0
2024-06-20 16:47:02,809 INFO [train.py:1028] (1/2) Epoch 13, batch 9100, loss[loss=0.2475, simple_loss=0.2976, pruned_loss=0.0987, over 13255.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2981, pruned_loss=0.0986, over 2566828.69 frames. ], batch size: 72, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:47:02,905 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:47:03,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239262.83333333334, ans=0.1
2024-06-20 16:47:04,226 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:47:05,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239262.83333333334, ans=0.1
2024-06-20 16:47:11,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.46 vs. limit=22.5
2024-06-20 16:47:14,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0
2024-06-20 16:47:16,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=239281.16666666666, ans=0.125
2024-06-20 16:47:17,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=239281.16666666666, ans=0.2
2024-06-20 16:47:18,407 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 1.996e+02 2.179e+02 2.407e+02 3.355e+02, threshold=4.358e+02, percent-clipped=0.0
2024-06-20 16:47:23,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.15 vs. limit=15.0
2024-06-20 16:47:25,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=239299.5, ans=0.125
2024-06-20 16:47:27,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=239299.5, ans=0.125
2024-06-20 16:47:31,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=239317.83333333334, ans=0.125
2024-06-20 16:47:36,414 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0
2024-06-20 16:47:39,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=239336.16666666666, ans=0.125
2024-06-20 16:47:45,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=239336.16666666666, ans=0.2
2024-06-20 16:47:53,693 INFO [train.py:1028] (1/2) Epoch 13, batch 9150, loss[loss=0.2598, simple_loss=0.3126, pruned_loss=0.1035, over 13183.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2983, pruned_loss=0.09862, over 2568150.17 frames. ], batch size: 77, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:47:56,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=239354.5, ans=0.1
2024-06-20 16:47:58,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=239354.5, ans=0.0
2024-06-20 16:48:07,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=239372.83333333334, ans=0.125
2024-06-20 16:48:09,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=239372.83333333334, ans=0.125
2024-06-20 16:48:12,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=239391.16666666666, ans=0.125
2024-06-20 16:48:18,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=239391.16666666666, ans=0.0
2024-06-20 16:48:25,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=239409.5, ans=0.0
2024-06-20 16:48:39,056 INFO [train.py:1028] (1/2) Epoch 13, batch 9200, loss[loss=0.2391, simple_loss=0.2928, pruned_loss=0.09267, over 12929.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.2982, pruned_loss=0.09843, over 2571522.42 frames. ], batch size: 36, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:48:39,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=239446.16666666666, ans=0.125
2024-06-20 16:48:55,095 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 1.965e+02 2.099e+02 2.286e+02 3.165e+02, threshold=4.198e+02, percent-clipped=0.0
2024-06-20 16:48:58,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=239482.83333333334, ans=0.125
2024-06-20 16:49:08,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=239501.16666666666, ans=0.025
2024-06-20 16:49:09,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=239501.16666666666, ans=0.0
2024-06-20 16:49:25,597 INFO [train.py:1028] (1/2) Epoch 13, batch 9250, loss[loss=0.2244, simple_loss=0.2832, pruned_loss=0.08281, over 13208.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2978, pruned_loss=0.09787, over 2574647.43 frames. ], batch size: 67, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:49:28,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239537.83333333334, ans=0.1
2024-06-20 16:49:29,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=239537.83333333334, ans=0.125
2024-06-20 16:49:45,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=239574.5, ans=0.125
2024-06-20 16:49:50,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239574.5, ans=0.1
2024-06-20 16:49:55,522 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.66 vs. limit=22.5
2024-06-20 16:49:56,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=239592.83333333334, ans=0.0
2024-06-20 16:50:05,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. limit=5.0
2024-06-20 16:50:11,099 INFO [train.py:1028] (1/2) Epoch 13, batch 9300, loss[loss=0.2365, simple_loss=0.2894, pruned_loss=0.09174, over 13005.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.298, pruned_loss=0.09791, over 2573095.09 frames. ], batch size: 39, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:50:19,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=239647.83333333334, ans=0.0
2024-06-20 16:50:25,901 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.005e+02 2.145e+02 2.313e+02 3.312e+02, threshold=4.290e+02, percent-clipped=0.0
2024-06-20 16:50:31,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=239666.16666666666, ans=0.05
2024-06-20 16:50:36,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=239666.16666666666, ans=0.1
2024-06-20 16:50:41,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=239684.5, ans=0.125
2024-06-20 16:50:43,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239684.5, ans=0.1
2024-06-20 16:50:45,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239684.5, ans=0.1
2024-06-20 16:50:49,488 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.464e+01
2024-06-20 16:50:52,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=239702.83333333334, ans=0.025
2024-06-20 16:50:55,538 INFO [train.py:1028] (1/2) Epoch 13, batch 9350, loss[loss=0.2376, simple_loss=0.3059, pruned_loss=0.08463, over 12515.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2981, pruned_loss=0.09777, over 2569201.30 frames. ], batch size: 22, lr: 4.43e-03, grad_scale: 64.0
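
The "batch size" field swings between ~16 and ~300 across entries because batches are capped by total audio duration, not by example count: the DynamicBucketingSampler named at the top of this log groups cuts of similar length and packs each batch up to max_duration seconds, so a batch of long utterances holds few cuts and a batch of short ones holds many. Roughly (parameter values taken from the config header; train_cuts is an assumed lhotse CutSet):

```python
from lhotse.dataset import DynamicBucketingSampler

sampler = DynamicBucketingSampler(
    train_cuts,          # assumed: a lhotse CutSet of training utterances
    max_duration=550.0,  # seconds of audio per batch
    num_buckets=30,      # buckets of similar-length cuts reduce padding waste
    shuffle=True,
    drop_last=True,
)
```
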
2024-06-20 16:50:57,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0
2024-06-20 16:50:59,322 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:51:04,422 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:51:09,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.59 vs. limit=15.0
2024-06-20 16:51:36,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=239794.5, ans=0.07
2024-06-20 16:51:41,226 INFO [train.py:1028] (1/2) Epoch 13, batch 9400, loss[loss=0.252, simple_loss=0.3114, pruned_loss=0.0963, over 13217.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2988, pruned_loss=0.09828, over 2568276.24 frames. ], batch size: 52, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:51:51,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=239831.16666666666, ans=0.125
2024-06-20 16:51:54,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=239831.16666666666, ans=0.04949747468305833
2024-06-20 16:51:56,716 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.678e+02 1.992e+02 2.100e+02 2.328e+02 3.409e+02, threshold=4.200e+02, percent-clipped=0.0
2024-06-20 16:52:10,815 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.93 vs. limit=22.5
2024-06-20 16:52:13,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=239867.83333333334, ans=0.0
2024-06-20 16:52:25,332 INFO [train.py:1028] (1/2) Epoch 13, batch 9450, loss[loss=0.2505, simple_loss=0.3021, pruned_loss=0.09941, over 12673.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.2994, pruned_loss=0.09875, over 2568428.12 frames. ], batch size: 22, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:52:32,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=239922.83333333334, ans=0.1
2024-06-20 16:52:52,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.52 vs. limit=15.0
2024-06-20 16:52:57,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=239959.5, ans=0.0
2024-06-20 16:53:09,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=239977.83333333334, ans=0.025
2024-06-20 16:53:11,008 INFO [train.py:1028] (1/2) Epoch 13, batch 9500, loss[loss=0.2392, simple_loss=0.2963, pruned_loss=0.09101, over 13205.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2982, pruned_loss=0.09776, over 2578043.76 frames. ], batch size: 43, lr: 4.43e-03, grad_scale: 64.0
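
The WithLoss entries report an auxiliary penalty attached to the attention-weight tensors; loss-sum is its accumulated value since the last report, and most modules sit at 0.000e+00, i.e. their constraint is currently satisfied. In spirit it is an identity op whose backward injects an extra gradient term while the scalar is tallied for logging. The autograd sketch below is only an assumed illustration of that pattern, not icefall's actual WithLoss:

```python
import torch

class IdentityWithAuxGrad(torch.autograd.Function):
    # Forward is the identity; backward adds a precomputed penalty gradient.
    # A real implementation would also accumulate the penalty's scalar value
    # for the periodic "loss-sum=..." report seen above.
    @staticmethod
    def forward(ctx, x: torch.Tensor, penalty_grad: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(penalty_grad)
        return x

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        (penalty_grad,) = ctx.saved_tensors
        return grad_output + penalty_grad, None
```
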
2024-06-20 16:53:11,906 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:53:18,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=240014.5, ans=0.0
2024-06-20 16:53:21,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=240014.5, ans=0.0
2024-06-20 16:53:22,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=240014.5, ans=0.125
2024-06-20 16:53:23,492 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.009e+02 2.180e+02 2.373e+02 3.150e+02, threshold=4.359e+02, percent-clipped=0.0
2024-06-20 16:53:31,390 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0
2024-06-20 16:53:32,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=240032.83333333334, ans=0.0
2024-06-20 16:53:39,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. limit=15.0
2024-06-20 16:53:48,278 INFO [train.py:1028] (1/2) Epoch 13, batch 9550, loss[loss=0.2095, simple_loss=0.2747, pruned_loss=0.07218, over 12895.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.2981, pruned_loss=0.09787, over 2574113.70 frames. ], batch size: 39, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:53:56,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=240106.16666666666, ans=0.0
2024-06-20 16:54:00,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=240106.16666666666, ans=0.5
2024-06-20 16:54:19,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=240142.83333333334, ans=0.125
2024-06-20 16:54:27,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=240161.16666666666, ans=0.0
2024-06-20 16:54:29,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=240161.16666666666, ans=0.0
2024-06-20 16:54:30,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=240161.16666666666, ans=0.2
2024-06-20 16:54:32,935 INFO [train.py:1028] (1/2) Epoch 13, batch 9600, loss[loss=0.2624, simple_loss=0.2939, pruned_loss=0.1154, over 10662.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.2975, pruned_loss=0.09744, over 2574180.96 frames. ], batch size: 303, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:54:37,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.70 vs. limit=15.0
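
Balancer parameters (min_positive, max_positive, min_abs, max_abs, prob) recur throughout the ScheduledFloat entries: a Balancer nudges each channel's activation statistics into a target range, applying its correction only with probability prob and only to gradients in the backward pass. The smooth surrogate below is an illustration of what the named constraints mean, not the real mechanism, which reshapes gradients directly:

```python
import torch

def balancer_surrogate(x: torch.Tensor,
                       min_positive: float = 0.05,
                       max_abs: float = 10.0) -> torch.Tensor:
    # x: (num_frames, num_channels). Returns zero when every channel has at
    # least `min_positive` fraction of positive values and RMS magnitude at
    # most `max_abs`; positive (and differentiable) otherwise.
    frac_positive = torch.sigmoid(x / 0.1).mean(dim=0)  # smooth stand-in for (x > 0)
    rms = x.pow(2).mean(dim=0).sqrt()
    return ((min_positive - frac_positive).clamp(min=0.0).sum()
            + (rms - max_abs).clamp(min=0.0).sum())
```
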
2024-06-20 16:54:39,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=240179.5, ans=0.0
2024-06-20 16:54:39,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=240179.5, ans=0.125
2024-06-20 16:54:42,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=240197.83333333334, ans=0.2
2024-06-20 16:54:42,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=240197.83333333334, ans=0.09899494936611666
2024-06-20 16:54:49,036 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.969e+02 2.119e+02 2.334e+02 3.153e+02, threshold=4.237e+02, percent-clipped=0.0
2024-06-20 16:54:54,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=240216.16666666666, ans=0.0
2024-06-20 16:55:06,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0
2024-06-20 16:55:06,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=240234.5, ans=0.125
2024-06-20 16:55:10,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=240234.5, ans=0.2
2024-06-20 16:55:11,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=240234.5, ans=0.125
2024-06-20 16:55:17,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.87 vs. limit=22.5
2024-06-20 16:55:23,467 INFO [train.py:1028] (1/2) Epoch 13, batch 9650, loss[loss=0.2235, simple_loss=0.2698, pruned_loss=0.08856, over 13062.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.2971, pruned_loss=0.09784, over 2563907.88 frames. ], batch size: 132, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:55:31,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=240289.5, ans=10.0
2024-06-20 16:55:34,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=240289.5, ans=0.2
2024-06-20 16:55:37,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=240289.5, ans=0.125
2024-06-20 16:55:50,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.27 vs. limit=10.0
2024-06-20 16:55:55,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=240326.16666666666, ans=10.0
2024-06-20 16:55:59,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=240344.5, ans=0.025
2024-06-20 16:56:06,753 INFO [train.py:1028] (1/2) Epoch 13, batch 9700, loss[loss=0.2355, simple_loss=0.2864, pruned_loss=0.09229, over 13100.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.2968, pruned_loss=0.09782, over 2557902.23 frames. ], batch size: 145, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:56:17,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=240362.83333333334, ans=0.95
2024-06-20 16:56:24,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=240381.16666666666, ans=0.0
2024-06-20 16:56:25,107 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.994e+02 2.192e+02 2.512e+02 4.078e+02, threshold=4.384e+02, percent-clipped=0.0
2024-06-20 16:56:29,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=240399.5, ans=0.04949747468305833
2024-06-20 16:56:32,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=240399.5, ans=0.0
2024-06-20 16:56:37,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=240417.83333333334, ans=0.125
2024-06-20 16:56:53,794 INFO [train.py:1028] (1/2) Epoch 13, batch 9750, loss[loss=0.2102, simple_loss=0.2693, pruned_loss=0.0755, over 13071.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.2952, pruned_loss=0.09678, over 2554011.53 frames. ], batch size: 132, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:56:55,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=240454.5, ans=0.0
2024-06-20 16:56:55,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.62 vs. limit=15.0
2024-06-20 16:56:57,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=240454.5, ans=0.04949747468305833
2024-06-20 16:57:00,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=240454.5, ans=0.05
2024-06-20 16:57:01,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=240454.5, ans=0.125
2024-06-20 16:57:07,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=240472.83333333334, ans=0.0
2024-06-20 16:57:15,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240491.16666666666, ans=0.1
2024-06-20 16:57:21,180 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.22 vs. limit=15.0
2024-06-20 16:57:25,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=240527.83333333334, ans=0.125
2024-06-20 16:57:31,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=240527.83333333334, ans=0.2
2024-06-20 16:57:38,480 INFO [train.py:1028] (1/2) Epoch 13, batch 9800, loss[loss=0.2382, simple_loss=0.2969, pruned_loss=0.08975, over 12889.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2941, pruned_loss=0.09608, over 2546318.29 frames. ], batch size: 39, lr: 4.42e-03, grad_scale: 64.0
2024-06-20 16:57:47,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=240564.5, ans=0.125
2024-06-20 16:57:49,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.74 vs. limit=22.5
2024-06-20 16:57:49,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.81 vs. limit=15.0
2024-06-20 16:57:52,593 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.957e+02 2.166e+02 2.353e+02 3.246e+02, threshold=4.333e+02, percent-clipped=0.0
2024-06-20 16:57:57,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=240582.83333333334, ans=0.125
2024-06-20 16:58:01,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=240582.83333333334, ans=0.2
2024-06-20 16:58:20,427 INFO [train.py:1028] (1/2) Epoch 13, batch 9850, loss[loss=0.2595, simple_loss=0.3028, pruned_loss=0.1081, over 13068.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2946, pruned_loss=0.09669, over 2538900.25 frames. ], batch size: 102, lr: 4.42e-03, grad_scale: 64.0
2024-06-20 16:58:23,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=240637.83333333334, ans=0.125
2024-06-20 16:58:23,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=240637.83333333334, ans=0.2
2024-06-20 16:58:24,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=240637.83333333334, ans=0.125
2024-06-20 16:58:50,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=240692.83333333334, ans=0.125
2024-06-20 16:58:52,014 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:59:02,761 INFO [train.py:1028] (1/2) Epoch 13, batch 9900, loss[loss=0.217, simple_loss=0.2748, pruned_loss=0.07959, over 12954.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.2942, pruned_loss=0.09726, over 2530273.20 frames. ], batch size: 39, lr: 4.42e-03, grad_scale: 64.0
2024-06-20 16:59:07,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=240729.5, ans=0.0
2024-06-20 16:59:08,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=240729.5, ans=0.125
2024-06-20 16:59:17,198 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.017e+02 2.227e+02 2.496e+02 3.038e+02, threshold=4.454e+02, percent-clipped=0.0
2024-06-20 16:59:28,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.01 vs. limit=15.0
2024-06-20 16:59:31,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=22.5
2024-06-20 16:59:47,312 INFO [train.py:1028] (1/2) Epoch 13, batch 9950, loss[loss=0.2525, simple_loss=0.307, pruned_loss=0.09902, over 12591.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2934, pruned_loss=0.09733, over 2523403.01 frames. ], batch size: 29, lr: 4.42e-03, grad_scale: 64.0
2024-06-20 16:59:49,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=240821.16666666666, ans=0.125
2024-06-20 17:00:29,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=240894.5, ans=0.1
2024-06-20 17:00:31,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=15.0
2024-06-20 17:00:32,375 INFO [train.py:1028] (1/2) Epoch 13, batch 10000, loss[loss=0.2394, simple_loss=0.3011, pruned_loss=0.0889, over 12658.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.2942, pruned_loss=0.09825, over 2485172.61 frames. ], batch size: 22, lr: 4.42e-03, grad_scale: 64.0
2024-06-20 17:00:34,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.42 vs. limit=10.0
2024-06-20 17:00:47,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=240931.16666666666, ans=0.05
2024-06-20 17:00:47,968 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.021e+02 2.165e+02 2.334e+02 2.919e+02, threshold=4.331e+02, percent-clipped=0.0
2024-06-20 17:01:15,883 INFO [train.py:1028] (1/2) Epoch 13, batch 10050, loss[loss=0.2061, simple_loss=0.2532, pruned_loss=0.07944, over 12491.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.2946, pruned_loss=0.09915, over 2443944.74 frames. ], batch size: 22, lr: 4.42e-03, grad_scale: 64.0
2024-06-20 17:01:28,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=241022.83333333334, ans=0.1
2024-06-20 17:01:53,490 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.08 vs. limit=6.0
2024-06-20 17:01:57,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=241096.16666666666, ans=0.125
2024-06-20 17:01:58,403 INFO [train.py:1028] (1/2) Epoch 13, batch 10100, loss[loss=0.2344, simple_loss=0.2934, pruned_loss=0.08771, over 12431.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.2935, pruned_loss=0.09774, over 2428625.51 frames. ], batch size: 19, lr: 4.42e-03, grad_scale: 64.0
2024-06-20 17:02:05,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=241096.16666666666, ans=0.0
2024-06-20 17:05:20,043 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.960e+02 2.127e+02 2.308e+02 3.996e+02, threshold=4.254e+02, percent-clipped=0.0
2024-06-20 17:05:20,097 INFO [train.py:1028] (1/2) Epoch 14, batch 0, loss[loss=0.1907, simple_loss=0.2529, pruned_loss=0.06423, over 12968.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2529, pruned_loss=0.06423, over 12968.00 frames. ], batch size: 36, lr: 4.26e-03, grad_scale: 64.0
2024-06-20 17:05:20,098 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 17:05:30,714 INFO [train.py:1060] (1/2) Epoch 14, validation: loss=0.193, simple_loss=0.2578, pruned_loss=0.06414, over 351949.00 frames.
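
The learning rate decays smoothly within epoch 13 (4.45e-03 down to 4.42e-03) and drops to 4.26e-03 the moment epoch 14 starts, because the Eden schedule discounts by both a step term and an epoch term (base_lr=0.035, lr_batches=7500, lr_epochs=3.5 in the config header). The canonical form is sketched below; note the recipe also rescales its step count by duration ratios (ref_duration), so plugging raw batch indices from this log into the formula need not reproduce the printed values exactly.

```python
def eden_lr(base_lr: float, step: int, epoch: int,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Eden: two independent inverse-quartic-root decays, one in optimizer
    # steps and one in epochs. The epoch factor is what produces the visible
    # jump from lr: 4.42e-03 to lr: 4.26e-03 at the epoch 13 -> 14 boundary.
    batch_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```
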
2024-06-20 17:05:30,715 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB
2024-06-20 17:05:39,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=241127.33333333334, ans=0.1
2024-06-20 17:05:40,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=241145.66666666666, ans=0.1
2024-06-20 17:06:12,926 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.99 vs. limit=15.0
2024-06-20 17:06:15,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=241200.66666666666, ans=0.125
2024-06-20 17:06:27,123 INFO [train.py:1028] (1/2) Epoch 14, batch 50, loss[loss=0.1954, simple_loss=0.2503, pruned_loss=0.07026, over 12634.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2749, pruned_loss=0.09004, over 574931.24 frames. ], batch size: 29, lr: 4.26e-03, grad_scale: 64.0
2024-06-20 17:06:29,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.26 vs. limit=22.5
2024-06-20 17:06:33,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=241219.0, ans=0.1
2024-06-20 17:06:40,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=241237.33333333334, ans=0.05
2024-06-20 17:06:40,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=15.0
2024-06-20 17:06:47,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.47 vs. limit=15.0
2024-06-20 17:07:06,464 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 17:07:11,550 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 1.999e+02 2.159e+02 2.420e+02 3.006e+02, threshold=4.318e+02, percent-clipped=0.0
2024-06-20 17:07:11,585 INFO [train.py:1028] (1/2) Epoch 14, batch 100, loss[loss=0.2191, simple_loss=0.2763, pruned_loss=0.08098, over 13341.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2738, pruned_loss=0.08909, over 1018044.06 frames. ], batch size: 46, lr: 4.26e-03, grad_scale: 64.0
2024-06-20 17:07:12,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=241310.66666666666, ans=0.125
2024-06-20 17:07:13,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0
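
tot_loss restarts its accumulation at each epoch: at epoch 14 batch 50 it covers only ~575k frames, growing by roughly half a million frames per 50 batches until it saturates into the ~2.5M-frame window seen throughout epoch 13. That window size is consistent with reset_interval=200 from the config header (about 200 batches of ~13k frames each). A sketch of such a decayed, frame-weighted tracker, as an assumed simplification of icefall's MetricsTracker:

```python
class DecayedFrameWeightedLoss:
    # Exponentially-decayed accumulator of (loss * frames, frames), so the
    # reported value is a frame-weighted mean over an effective window of
    # roughly `reset_interval` batches -- printed as "tot_loss[... over N frames.]".
    def __init__(self, reset_interval: int = 200) -> None:
        self.decay = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
        self.frames = self.frames * self.decay + batch_frames

    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```
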
2024-06-20 17:07:27,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=241329.0, ans=0.025
2024-06-20 17:07:53,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241384.0, ans=0.1
2024-06-20 17:07:58,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=241384.0, ans=0.125
2024-06-20 17:08:02,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=241402.33333333334, ans=0.125
2024-06-20 17:08:02,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=241402.33333333334, ans=0.025
2024-06-20 17:08:02,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.68 vs. limit=15.0
2024-06-20 17:08:02,982 INFO [train.py:1028] (1/2) Epoch 14, batch 150, loss[loss=0.1939, simple_loss=0.249, pruned_loss=0.0694, over 12636.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.273, pruned_loss=0.08757, over 1364768.53 frames. ], batch size: 29, lr: 4.26e-03, grad_scale: 64.0
2024-06-20 17:08:07,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0
2024-06-20 17:08:15,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=241420.66666666666, ans=0.07
2024-06-20 17:08:28,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=241439.0, ans=0.0
2024-06-20 17:08:30,778 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.89 vs. limit=15.0
2024-06-20 17:08:48,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.93 vs. limit=15.0
2024-06-20 17:08:49,766 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.867e+02 2.017e+02 2.217e+02 2.995e+02, threshold=4.033e+02, percent-clipped=0.0
2024-06-20 17:08:49,802 INFO [train.py:1028] (1/2) Epoch 14, batch 200, loss[loss=0.2595, simple_loss=0.299, pruned_loss=0.11, over 12471.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2733, pruned_loss=0.08801, over 1634220.09 frames. ], batch size: 202, lr: 4.25e-03, grad_scale: 64.0
2024-06-20 17:08:51,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=241494.0, ans=0.1
2024-06-20 17:08:54,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=241494.0, ans=0.09899494936611666
2024-06-20 17:08:55,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=241494.0, ans=0.125
2024-06-20 17:08:58,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=241512.33333333334, ans=0.125
2024-06-20 17:09:10,825 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 17:09:25,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=241549.0, ans=0.125
2024-06-20 17:09:29,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=241549.0, ans=0.0
2024-06-20 17:09:40,607 INFO [train.py:1028] (1/2) Epoch 14, batch 250, loss[loss=0.2203, simple_loss=0.2623, pruned_loss=0.08913, over 13013.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2729, pruned_loss=0.08796, over 1845786.35 frames. ], batch size: 144, lr: 4.25e-03, grad_scale: 64.0
2024-06-20 17:09:40,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=241585.66666666666, ans=0.0
2024-06-20 17:09:49,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.55 vs. limit=22.5
2024-06-20 17:09:54,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=241604.0, ans=0.125
2024-06-20 17:10:08,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=241640.66666666666, ans=0.0
2024-06-20 17:10:12,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=241640.66666666666, ans=0.125
2024-06-20 17:10:23,358 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5
2024-06-20 17:10:29,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241659.0, ans=0.1
2024-06-20 17:10:31,515 INFO [train.py:1028] (1/2) Epoch 14, batch 300, loss[loss=0.2341, simple_loss=0.2739, pruned_loss=0.09713, over 13181.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2732, pruned_loss=0.08814, over 2008842.37 frames. ], batch size: 112, lr: 4.25e-03, grad_scale: 32.0
2024-06-20 17:10:32,099 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.856e+02 1.968e+02 2.112e+02 2.615e+02, threshold=3.936e+02, percent-clipped=0.0
2024-06-20 17:10:48,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.84 vs. limit=15.0
2024-06-20 17:10:56,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=241732.33333333334, ans=0.2
2024-06-20 17:11:05,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=241750.66666666666, ans=0.125
2024-06-20 17:11:06,742 INFO [train.py:1028] (1/2) Epoch 14, batch 350, loss[loss=0.2197, simple_loss=0.263, pruned_loss=0.08814, over 13032.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2729, pruned_loss=0.08796, over 2137970.38 frames. ], batch size: 33, lr: 4.25e-03, grad_scale: 32.0
2024-06-20 17:11:13,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.76 vs. limit=15.0
2024-06-20 17:11:24,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=241805.66666666666, ans=0.125
2024-06-20 17:11:24,206 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.46 vs. limit=15.0
2024-06-20 17:11:31,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=241824.0, ans=0.125
2024-06-20 17:11:45,236 INFO [train.py:1028] (1/2) Epoch 14, batch 400, loss[loss=0.2132, simple_loss=0.2665, pruned_loss=0.07991, over 13235.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2731, pruned_loss=0.08768, over 2239325.13 frames. ], batch size: 63, lr: 4.25e-03, grad_scale: 32.0
2024-06-20 17:11:45,951 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 1.924e+02 2.079e+02 2.305e+02 3.125e+02, threshold=4.157e+02, percent-clipped=0.0
2024-06-20 17:11:49,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=241860.66666666666, ans=0.0
2024-06-20 17:11:59,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=241879.0, ans=0.125
2024-06-20 17:11:59,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=241879.0, ans=0.0
2024-06-20 17:12:18,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=241934.0, ans=0.2
2024-06-20 17:12:26,946 INFO [train.py:1028] (1/2) Epoch 14, batch 450, loss[loss=0.2264, simple_loss=0.2799, pruned_loss=0.08644, over 13185.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2734, pruned_loss=0.08813, over 2313630.41 frames. ], batch size: 67, lr: 4.25e-03, grad_scale: 32.0
2024-06-20 17:12:31,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=241952.33333333334, ans=0.0
2024-06-20 17:12:59,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=242007.33333333334, ans=0.125
2024-06-20 17:13:02,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.45 vs. limit=12.0
limit=12.0 2024-06-20 17:13:08,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=242025.66666666666, ans=0.0 2024-06-20 17:13:14,574 INFO [train.py:1028] (1/2) Epoch 14, batch 500, loss[loss=0.224, simple_loss=0.2727, pruned_loss=0.08763, over 13070.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2736, pruned_loss=0.08805, over 2376749.25 frames. ], batch size: 121, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:13:15,212 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 1.863e+02 1.955e+02 2.145e+02 2.797e+02, threshold=3.909e+02, percent-clipped=0.0 2024-06-20 17:13:41,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=242099.0, ans=0.125 2024-06-20 17:13:42,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=242099.0, ans=0.125 2024-06-20 17:13:47,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=242117.33333333334, ans=0.125 2024-06-20 17:13:53,493 INFO [train.py:1028] (1/2) Epoch 14, batch 550, loss[loss=0.2437, simple_loss=0.2799, pruned_loss=0.1037, over 12950.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2733, pruned_loss=0.08762, over 2421254.76 frames. ], batch size: 158, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:13:54,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2024-06-20 17:13:57,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=242135.66666666666, ans=0.2 2024-06-20 17:13:57,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=242135.66666666666, ans=0.125 2024-06-20 17:14:14,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=242172.33333333334, ans=0.0 2024-06-20 17:14:17,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. limit=6.0 2024-06-20 17:14:19,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=242190.66666666666, ans=0.0 2024-06-20 17:14:29,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=242209.0, ans=0.125 2024-06-20 17:14:30,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=242209.0, ans=0.125 2024-06-20 17:14:34,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=242227.33333333334, ans=0.0 2024-06-20 17:14:34,889 INFO [train.py:1028] (1/2) Epoch 14, batch 600, loss[loss=0.2144, simple_loss=0.2555, pruned_loss=0.0867, over 13047.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2734, pruned_loss=0.08776, over 2458827.05 frames. 
], batch size: 144, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:14:35,675 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.980e+02 2.105e+02 2.345e+02 3.080e+02, threshold=4.210e+02, percent-clipped=0.0 2024-06-20 17:14:36,111 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.49 vs. limit=12.0 2024-06-20 17:14:40,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=242227.33333333334, ans=0.05 2024-06-20 17:14:45,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.72 vs. limit=15.0 2024-06-20 17:14:47,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242245.66666666666, ans=0.1 2024-06-20 17:14:50,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=242264.0, ans=0.0 2024-06-20 17:15:00,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=242282.33333333334, ans=0.09899494936611666 2024-06-20 17:15:04,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.50 vs. limit=10.0 2024-06-20 17:15:10,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=242300.66666666666, ans=0.2 2024-06-20 17:15:13,542 INFO [train.py:1028] (1/2) Epoch 14, batch 650, loss[loss=0.223, simple_loss=0.277, pruned_loss=0.08444, over 13152.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2732, pruned_loss=0.0874, over 2490050.28 frames. ], batch size: 59, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:15:22,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. limit=15.0 2024-06-20 17:15:27,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242337.33333333334, ans=0.1 2024-06-20 17:15:38,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=242355.66666666666, ans=10.0 2024-06-20 17:15:42,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=242374.0, ans=0.0 2024-06-20 17:15:43,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=242374.0, ans=0.125 2024-06-20 17:15:44,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=242374.0, ans=0.025 2024-06-20 17:15:46,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=242374.0, ans=0.125 2024-06-20 17:15:50,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=242392.33333333334, ans=0.0 2024-06-20 17:15:50,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.76 vs. 
limit=10.0 2024-06-20 17:15:51,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=242392.33333333334, ans=0.2 2024-06-20 17:15:55,716 INFO [train.py:1028] (1/2) Epoch 14, batch 700, loss[loss=0.2207, simple_loss=0.2847, pruned_loss=0.07839, over 13277.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2728, pruned_loss=0.08706, over 2512059.30 frames. ], batch size: 46, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:15:56,389 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 1.882e+02 2.032e+02 2.250e+02 3.227e+02, threshold=4.063e+02, percent-clipped=0.0 2024-06-20 17:16:01,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=242410.66666666666, ans=0.125 2024-06-20 17:16:06,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=242429.0, ans=0.125 2024-06-20 17:16:12,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.50 vs. limit=15.0 2024-06-20 17:16:15,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=242447.33333333334, ans=0.2 2024-06-20 17:16:27,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=242484.0, ans=0.0 2024-06-20 17:16:32,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242484.0, ans=0.1 2024-06-20 17:16:34,101 INFO [train.py:1028] (1/2) Epoch 14, batch 750, loss[loss=0.2234, simple_loss=0.2771, pruned_loss=0.08488, over 13251.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2724, pruned_loss=0.08654, over 2527815.52 frames. ], batch size: 63, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:16:35,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=242502.33333333334, ans=0.125 2024-06-20 17:16:45,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=242520.66666666666, ans=0.125 2024-06-20 17:16:48,677 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.75 vs. limit=10.0 2024-06-20 17:16:53,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.91 vs. limit=10.0 2024-06-20 17:17:02,396 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.84 vs. limit=15.0 2024-06-20 17:17:15,401 INFO [train.py:1028] (1/2) Epoch 14, batch 800, loss[loss=0.2018, simple_loss=0.2525, pruned_loss=0.07554, over 12847.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2724, pruned_loss=0.08658, over 2539900.56 frames. 
], batch size: 36, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:17:16,018 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.895e+02 2.039e+02 2.275e+02 3.849e+02, threshold=4.078e+02, percent-clipped=0.0 2024-06-20 17:17:29,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=242612.33333333334, ans=0.125 2024-06-20 17:17:40,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=242649.0, ans=0.2 2024-06-20 17:17:57,050 INFO [train.py:1028] (1/2) Epoch 14, batch 850, loss[loss=0.2321, simple_loss=0.2717, pruned_loss=0.09627, over 13153.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2721, pruned_loss=0.08655, over 2549708.03 frames. ], batch size: 95, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:18:11,635 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.87 vs. limit=15.0 2024-06-20 17:18:15,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=242722.33333333334, ans=0.2 2024-06-20 17:18:27,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=242759.0, ans=0.125 2024-06-20 17:18:30,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=242759.0, ans=0.125 2024-06-20 17:18:35,735 INFO [train.py:1028] (1/2) Epoch 14, batch 900, loss[loss=0.2099, simple_loss=0.2652, pruned_loss=0.07731, over 12903.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2712, pruned_loss=0.08601, over 2555073.67 frames. ], batch size: 36, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:18:36,426 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.817e+02 1.913e+02 2.086e+02 2.995e+02, threshold=3.826e+02, percent-clipped=0.0 2024-06-20 17:18:49,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.93 vs. limit=15.0 2024-06-20 17:18:55,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=242814.0, ans=0.07 2024-06-20 17:18:57,589 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0
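
The recurring `Whitening: name=..., num_groups=..., num_channels=..., metric=X vs. limit=Y` entries from scaling.py:1023 are diagnostics from the model's activation-whitening modules: the metric measures how unevenly the channel covariance of an activation is spread across directions, and the limit is the point past which the module penalizes the activation back toward a whiter distribution (the entry just above reports `metric=5.54 vs. limit=6.0` for a set of attention keys). The sketch below shows one plausible form of such a metric, assuming it is the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue (exactly 1.0 for perfectly white features); this is an illustration of the idea, not necessarily the exact formula in scaling.py:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Covariance-spread diagnostic for activations x of shape (frames, channels).

    Sketch assumption: metric = mean(eig^2) / mean(eig)^2 over the eigenvalues
    of the per-group channel covariance. 1.0 means perfectly whitened; larger
    values mean a few directions dominate. Compared against `limit` in the log.
    """
    num_channels = x.shape[-1]
    assert num_channels % num_groups == 0
    # Split channels into groups, e.g. num_groups=8 for the whiten_keys entry above.
    x = x.reshape(-1, num_groups, num_channels // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)                    # center per group
    cov = torch.matmul(x.transpose(1, 2), x) / x.shape[1]  # (groups, c, c)
    eigs = torch.linalg.eigvalsh(cov)                      # symmetric -> real spectrum
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()

# Near-white Gaussian features score close to 1.0, far below the
# limits seen in this log (6.0 up to 22.5):
print(whitening_metric(torch.randn(10000, 384)))
```
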
2024-06-20 17:18:59,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=242832.33333333334, ans=0.1 2024-06-20 17:19:00,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242832.33333333334, ans=0.1 2024-06-20 17:19:00,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=242832.33333333334, ans=0.125 2024-06-20 17:19:02,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=242832.33333333334, ans=0.1 2024-06-20 17:19:14,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=242869.0, ans=0.1 2024-06-20 17:19:14,747 INFO [train.py:1028] (1/2) Epoch 14, batch 950, loss[loss=0.2092, simple_loss=0.2612, pruned_loss=0.07857, over 12944.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2712, pruned_loss=0.08595, over 2558352.64 frames. ], batch size: 39, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:19:22,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=242869.0, ans=0.125 2024-06-20 17:19:25,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.96 vs. limit=10.0 2024-06-20 17:19:27,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=242887.33333333334, ans=0.025 2024-06-20 17:19:33,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=242905.66666666666, ans=0.125 2024-06-20 17:19:34,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242905.66666666666, ans=0.1 2024-06-20 17:19:35,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=242905.66666666666, ans=0.07 2024-06-20 17:19:37,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.94 vs. limit=15.0 2024-06-20 17:19:48,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=242942.33333333334, ans=0.125 2024-06-20 17:19:49,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2024-06-20 17:19:54,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242942.33333333334, ans=0.1 2024-06-20 17:19:56,730 INFO [train.py:1028] (1/2) Epoch 14, batch 1000, loss[loss=0.2292, simple_loss=0.2831, pruned_loss=0.0877, over 13268.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.271, pruned_loss=0.08586, over 2561106.99 frames. ], batch size: 49, lr: 4.24e-03, grad_scale: 32.0
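
In the `INFO [train.py:1028] Epoch 14, batch N` entries, the objective is reported twice: `loss[...]` is the pruned-transducer loss on the current batch, while `tot_loss[...]` aggregates over the epoch so far, weighted by the number of acoustic frames (`over ... frames.`); `simple_loss` and `pruned_loss` are the two terms of the pruned RNN-T objective. Consistent with that, the per-batch numbers jump around while `tot_loss` drifts slowly (0.2216 at batch 950 above, 0.2214 at batch 1000). A minimal sketch of frame-weighted aggregation that would produce such summary lines follows; the decay factor is an assumption made to account for the fractional cumulative frame counts in the log, not icefall's exact bookkeeping:

```python
class RunningLoss:
    """Frame-weighted running average of loss components (illustrative).

    Each batch contributes (value * frames) to a weighted sum; an optional
    decay down-weights old batches, which would explain cumulative frame
    counts like 2561106.99 coming out non-integral.
    """

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
        self.frames = 0.0

    def update(self, losses: dict, frames: float) -> None:
        self.frames = self.decay * self.frames + frames
        for key, value in losses.items():
            self.sums[key] = self.decay * self.sums[key] + value * frames

    def summary(self) -> str:
        avg = {k: v / self.frames for k, v in self.sums.items()}
        return (f"tot_loss[loss={avg['loss']:.4g}, "
                f"simple_loss={avg['simple_loss']:.4g}, "
                f"pruned_loss={avg['pruned_loss']:.4g}, "
                f"over {self.frames:.2f} frames. ]")

tracker = RunningLoss()
tracker.update({"loss": 0.2292, "simple_loss": 0.2831, "pruned_loss": 0.0877},
               frames=13268.0)  # numbers taken from the batch 1000 entry above
print(tracker.summary())
```
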
2024-06-20 17:19:57,404 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.868e+02 1.995e+02 2.239e+02 2.729e+02, threshold=3.990e+02, percent-clipped=0.0 2024-06-20 17:20:16,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=242997.33333333334, ans=0.125 2024-06-20 17:20:22,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=243015.66666666666, ans=0.2 2024-06-20 17:20:23,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=243015.66666666666, ans=0.125 2024-06-20 17:20:24,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=243015.66666666666, ans=0.2 2024-06-20 17:20:25,914 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:20:28,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.92 vs. limit=10.0 2024-06-20 17:20:39,064 INFO [train.py:1028] (1/2) Epoch 14, batch 1050, loss[loss=0.2245, simple_loss=0.2799, pruned_loss=0.08452, over 13159.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2715, pruned_loss=0.08624, over 2564942.52 frames. ], batch size: 77, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:20:39,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=243052.33333333334, ans=0.0 2024-06-20 17:20:57,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=243089.0, ans=0.125 2024-06-20 17:21:04,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=243107.33333333334, ans=0.125 2024-06-20 17:21:18,295 INFO [train.py:1028] (1/2) Epoch 14, batch 1100, loss[loss=0.208, simple_loss=0.2585, pruned_loss=0.07876, over 13258.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2725, pruned_loss=0.08686, over 2569093.34 frames. ], batch size: 52, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:21:18,957 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.890e+02 2.001e+02 2.214e+02 2.965e+02, threshold=4.001e+02, percent-clipped=0.0 2024-06-20 17:21:26,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=243162.33333333334, ans=0.0 2024-06-20 17:21:44,381 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:21:51,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.26 vs. limit=15.0 2024-06-20 17:21:57,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=243217.33333333334, ans=0.125 2024-06-20 17:21:59,884 INFO [train.py:1028] (1/2) Epoch 14, batch 1150, loss[loss=0.2298, simple_loss=0.2871, pruned_loss=0.0863, over 13250.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2729, pruned_loss=0.08734, over 2571661.40 frames. ], batch size: 52, lr: 4.24e-03, grad_scale: 32.0
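
The `WARNING [optim.py:487] Clipping_scale=2.0, grad-norm quartiles ...` entries (two appear in the line above) summarize the optimizer's gradient-clipping statistics. The five values read as the min/25%/median/75%/max of recently observed gradient norms, and the threshold is `clipping_scale` times the median: in the first warning above, 2.0 × 1.995e+02 = 3.990e+02. Throughout this stretch the threshold sits above the largest observed norm, so `percent-clipped` stays at 0.0. Below is a sketch of this style of adaptive clipping, assuming a fixed-size history buffer and that threshold rule; it mirrors the logged fields, not the actual ScaledAdam code in optim.py:

```python
import torch
from collections import deque

class GradNormClipper:
    """Median-based gradient clipping with quartile logging (illustrative).

    Assumption: threshold = clipping_scale * median of recent grad norms,
    which matches the numbers in the warnings above; the buffer size and
    logging cadence are invented here, and this is not icefall's ScaledAdam.
    """

    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)  # recent per-batch gradient norms
        self.batches = 0
        self.clipped = 0

    def clip_(self, params) -> None:
        grads = [p.grad.flatten() for p in params if p.grad is not None]
        norm = torch.cat(grads).norm().item()
        self.norms.append(norm)
        self.batches += 1

        history = torch.tensor(list(self.norms))
        # min / 25% / 50% / 75% / max -- the five "grad-norm quartiles".
        quartiles = torch.quantile(
            history, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * quartiles[2].item()

        if norm > threshold:  # scale the whole gradient down, don't truncate it
            self.clipped += 1
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)

        print(f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
              + " ".join(f"{q:.3e}" for q in quartiles.tolist())
              + f", threshold={threshold:.3e}, "
              f"percent-clipped={100.0 * self.clipped / self.batches:.1f}")
```
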
2024-06-20 17:22:03,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=243235.66666666666, ans=0.0 2024-06-20 17:22:07,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=243254.0, ans=0.125 2024-06-20 17:22:09,128 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.51 vs. limit=15.0 2024-06-20 17:22:12,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=243254.0, ans=0.125 2024-06-20 17:22:24,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243272.33333333334, ans=0.1 2024-06-20 17:22:41,856 INFO [train.py:1028] (1/2) Epoch 14, batch 1200, loss[loss=0.2211, simple_loss=0.2689, pruned_loss=0.08661, over 13179.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2726, pruned_loss=0.08741, over 2573466.24 frames. ], batch size: 77, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:22:42,738 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.901e+02 2.026e+02 2.241e+02 2.826e+02, threshold=4.053e+02, percent-clipped=0.0 2024-06-20 17:22:45,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=243327.33333333334, ans=0.025 2024-06-20 17:22:52,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=243345.66666666666, ans=0.2 2024-06-20 17:22:54,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.07 vs. limit=15.0 2024-06-20 17:22:56,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=243345.66666666666, ans=0.015 2024-06-20 17:23:00,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=243364.0, ans=0.125 2024-06-20 17:23:12,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=243400.66666666666, ans=10.0 2024-06-20 17:23:17,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=243400.66666666666, ans=10.0 2024-06-20 17:23:20,412 INFO [train.py:1028] (1/2) Epoch 14, batch 1250, loss[loss=0.2001, simple_loss=0.2437, pruned_loss=0.07823, over 13192.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2731, pruned_loss=0.08795, over 2583063.16 frames. ], batch size: 112, lr: 4.24e-03, grad_scale: 32.0
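
The bulk of this log is `ScheduledFloat: name=..., batch_count=..., ans=...` entries from scaling.py:214. In this recipe, many regularization constants (dropout rates such as `feed_forward1.out_proj.dropout_p`, skip rates such as `conv_skip_rate`, balancer probabilities and limits) are schedules rather than fixed numbers: the current value `ans` is a piecewise-linear function of `batch_count`, sampled into the log as the modules are exercised. By this point in training (batch_count around 243k) the schedules have long since flattened out, which is why each name keeps logging the same `ans`. A minimal sketch of piecewise-linear scheduling follows; the breakpoint numbers in the example are hypothetical, chosen only to show the mechanism:

```python
def scheduled_float(batch_count: float, schedule: list) -> float:
    """Piecewise-linear value over training progress (illustrative).

    `schedule` is a sorted list of (batch_count, value) breakpoints; the
    value is held constant before the first and after the last breakpoint.
    """
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)  # linear interpolation
            return y0 + t * (y1 - y0)
    return schedule[-1][1]

# Hypothetical dropout schedule: 0.3 at the start of training, annealed
# to 0.1 by batch 20000 and held there -- so at batch_count=243254.0 it
# has long since reached its floor, like the repeated ans=0.1 above.
dropout_schedule = [(0.0, 0.3), (20000.0, 0.1)]
print(scheduled_float(243254.0, dropout_schedule))  # -> 0.1
```
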
2024-06-20 17:23:36,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=243455.66666666666, ans=0.0 2024-06-20 17:23:39,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=243455.66666666666, ans=0.125 2024-06-20 17:23:41,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=243455.66666666666, ans=0.125 2024-06-20 17:23:43,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=243474.0, ans=0.0 2024-06-20 17:23:47,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=243474.0, ans=0.0 2024-06-20 17:23:51,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=243492.33333333334, ans=0.0 2024-06-20 17:23:54,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=243492.33333333334, ans=0.0 2024-06-20 17:23:58,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=243492.33333333334, ans=0.125 2024-06-20 17:23:59,576 INFO [train.py:1028] (1/2) Epoch 14, batch 1300, loss[loss=0.2192, simple_loss=0.2643, pruned_loss=0.08704, over 12802.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2731, pruned_loss=0.08791, over 2583130.36 frames. ], batch size: 176, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:24:00,246 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 1.853e+02 1.966e+02 2.092e+02 3.282e+02, threshold=3.931e+02, percent-clipped=0.0 2024-06-20 17:24:04,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243510.66666666666, ans=0.1 2024-06-20 17:24:09,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=243529.0, ans=0.0 2024-06-20 17:24:15,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=243529.0, ans=0.0 2024-06-20 17:24:16,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=243529.0, ans=0.0 2024-06-20 17:24:21,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=243547.33333333334, ans=0.1 2024-06-20 17:24:26,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=243565.66666666666, ans=0.025 2024-06-20 17:24:28,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=243565.66666666666, ans=0.125 2024-06-20 17:24:35,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=243584.0, ans=0.125 2024-06-20 17:24:41,461 INFO [train.py:1028] (1/2) Epoch 14, batch 1350, loss[loss=0.2288, simple_loss=0.2838, pruned_loss=0.08686, over 13209.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2724, pruned_loss=0.08736, over 2585674.99 frames.
], batch size: 59, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:24:41,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=243602.33333333334, ans=0.05 2024-06-20 17:24:53,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=243620.66666666666, ans=0.125 2024-06-20 17:24:59,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=243639.0, ans=0.125 2024-06-20 17:25:06,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=243639.0, ans=0.125 2024-06-20 17:25:24,019 INFO [train.py:1028] (1/2) Epoch 14, batch 1400, loss[loss=0.2143, simple_loss=0.2668, pruned_loss=0.0809, over 12713.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2731, pruned_loss=0.08787, over 2587492.25 frames. ], batch size: 25, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:25:24,667 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.858e+02 1.973e+02 2.145e+02 2.652e+02, threshold=3.946e+02, percent-clipped=0.0 2024-06-20 17:25:27,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=243694.0, ans=0.025 2024-06-20 17:25:39,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=243730.66666666666, ans=0.0 2024-06-20 17:25:54,992 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:26:03,168 INFO [train.py:1028] (1/2) Epoch 14, batch 1450, loss[loss=0.2095, simple_loss=0.2569, pruned_loss=0.08105, over 13104.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2733, pruned_loss=0.08794, over 2587253.99 frames. ], batch size: 121, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:26:13,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=243804.0, ans=0.125 2024-06-20 17:26:21,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=243822.33333333334, ans=0.0 2024-06-20 17:26:33,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=243840.66666666666, ans=0.125 2024-06-20 17:26:34,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243840.66666666666, ans=0.1 2024-06-20 17:26:37,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=243840.66666666666, ans=0.2 2024-06-20 17:26:39,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=243859.0, ans=0.125 2024-06-20 17:26:41,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=243859.0, ans=0.0 2024-06-20 17:26:46,735 INFO [train.py:1028] (1/2) Epoch 14, batch 1500, loss[loss=0.2245, simple_loss=0.2748, pruned_loss=0.08713, over 13226.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2733, pruned_loss=0.08799, over 2589620.18 frames. 
], batch size: 83, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:26:47,361 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.877e+02 1.995e+02 2.132e+02 3.022e+02, threshold=3.990e+02, percent-clipped=0.0 2024-06-20 17:26:47,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=243877.33333333334, ans=0.125 2024-06-20 17:26:49,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=243877.33333333334, ans=0.125 2024-06-20 17:27:00,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=243895.66666666666, ans=0.125 2024-06-20 17:27:15,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=243932.33333333334, ans=0.2 2024-06-20 17:27:20,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=243950.66666666666, ans=0.125 2024-06-20 17:27:23,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=243950.66666666666, ans=0.125 2024-06-20 17:27:25,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=243950.66666666666, ans=0.2 2024-06-20 17:27:28,630 INFO [train.py:1028] (1/2) Epoch 14, batch 1550, loss[loss=0.2362, simple_loss=0.2802, pruned_loss=0.09611, over 13021.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2732, pruned_loss=0.08796, over 2584580.52 frames. ], batch size: 102, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:27:31,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=243969.0, ans=0.2 2024-06-20 17:27:35,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=243987.33333333334, ans=0.125 2024-06-20 17:27:44,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=244005.66666666666, ans=0.125 2024-06-20 17:27:48,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=244005.66666666666, ans=0.05 2024-06-20 17:27:51,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=244005.66666666666, ans=0.125 2024-06-20 17:28:07,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=244060.66666666666, ans=0.2 2024-06-20 17:28:08,228 INFO [train.py:1028] (1/2) Epoch 14, batch 1600, loss[loss=0.2118, simple_loss=0.2566, pruned_loss=0.08346, over 13153.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2728, pruned_loss=0.08767, over 2579462.11 frames. 
], batch size: 77, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:28:08,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.888e+02 2.045e+02 2.263e+02 3.518e+02, threshold=4.089e+02, percent-clipped=0.0 2024-06-20 17:28:09,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244060.66666666666, ans=0.1 2024-06-20 17:28:12,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=244060.66666666666, ans=0.125 2024-06-20 17:28:18,321 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:28:44,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=244134.0, ans=0.025 2024-06-20 17:28:45,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=244134.0, ans=0.125 2024-06-20 17:28:46,735 INFO [train.py:1028] (1/2) Epoch 14, batch 1650, loss[loss=0.2398, simple_loss=0.2848, pruned_loss=0.09742, over 13152.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2733, pruned_loss=0.08812, over 2577106.47 frames. ], batch size: 95, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:28:48,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0 2024-06-20 17:28:54,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.12 vs. limit=22.5 2024-06-20 17:28:57,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.07 vs. limit=15.0 2024-06-20 17:29:00,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244170.66666666666, ans=0.1 2024-06-20 17:29:00,174 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.56 vs. limit=6.0 2024-06-20 17:29:17,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.80 vs. limit=22.5 2024-06-20 17:29:26,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=244225.66666666666, ans=0.125 2024-06-20 17:29:28,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=244244.0, ans=0.0 2024-06-20 17:29:29,398 INFO [train.py:1028] (1/2) Epoch 14, batch 1700, loss[loss=0.2234, simple_loss=0.2805, pruned_loss=0.08316, over 12503.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2736, pruned_loss=0.08788, over 2581655.37 frames. 
], batch size: 25, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:29:30,030 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.901e+02 2.097e+02 2.403e+02 3.396e+02, threshold=4.194e+02, percent-clipped=0.0 2024-06-20 17:29:32,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=244244.0, ans=10.0 2024-06-20 17:29:39,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=244244.0, ans=0.2 2024-06-20 17:29:39,780 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:29:40,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=15.0 2024-06-20 17:29:53,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=244280.66666666666, ans=0.0 2024-06-20 17:29:56,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=244299.0, ans=0.125 2024-06-20 17:30:00,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=244299.0, ans=0.125 2024-06-20 17:30:03,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=244317.33333333334, ans=0.0 2024-06-20 17:30:12,005 INFO [train.py:1028] (1/2) Epoch 14, batch 1750, loss[loss=0.2268, simple_loss=0.2822, pruned_loss=0.08573, over 12670.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.274, pruned_loss=0.08771, over 2582380.64 frames. ], batch size: 22, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:30:14,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244335.66666666666, ans=0.1 2024-06-20 17:30:32,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=244372.33333333334, ans=0.125 2024-06-20 17:30:34,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.24 vs. limit=22.5 2024-06-20 17:30:35,604 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=6.753e-01 2024-06-20 17:30:42,241 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2024-06-20 17:30:53,214 INFO [train.py:1028] (1/2) Epoch 14, batch 1800, loss[loss=0.2134, simple_loss=0.2663, pruned_loss=0.08018, over 13212.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2743, pruned_loss=0.08808, over 2582461.33 frames. 
], batch size: 67, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:30:53,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.831e+02 1.987e+02 2.105e+02 3.039e+02, threshold=3.973e+02, percent-clipped=0.0 2024-06-20 17:31:07,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244445.66666666666, ans=0.1 2024-06-20 17:31:16,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=244482.33333333334, ans=0.125 2024-06-20 17:31:24,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=244500.66666666666, ans=0.1 2024-06-20 17:31:36,892 INFO [train.py:1028] (1/2) Epoch 14, batch 1850, loss[loss=0.2344, simple_loss=0.2724, pruned_loss=0.09822, over 13267.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2742, pruned_loss=0.08805, over 2584114.90 frames. ], batch size: 83, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:31:42,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244519.0, ans=0.1 2024-06-20 17:31:51,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=244537.33333333334, ans=0.2 2024-06-20 17:31:57,276 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.570e+00 2024-06-20 17:32:09,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=244574.0, ans=0.125 2024-06-20 17:32:16,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=244592.33333333334, ans=15.0 2024-06-20 17:32:19,646 INFO [train.py:1028] (1/2) Epoch 14, batch 1900, loss[loss=0.2292, simple_loss=0.278, pruned_loss=0.09023, over 13170.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2736, pruned_loss=0.08801, over 2587503.94 frames. ], batch size: 95, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:32:20,462 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.830e+02 1.923e+02 2.091e+02 2.625e+02, threshold=3.846e+02, percent-clipped=0.0 2024-06-20 17:32:34,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.48 vs. limit=15.0 2024-06-20 17:32:52,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=244684.0, ans=0.1 2024-06-20 17:32:55,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=244684.0, ans=0.0 2024-06-20 17:32:59,420 INFO [train.py:1028] (1/2) Epoch 14, batch 1950, loss[loss=0.2257, simple_loss=0.2818, pruned_loss=0.08481, over 13270.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.273, pruned_loss=0.08785, over 2592335.01 frames. 
], batch size: 52, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:33:01,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=244702.33333333334, ans=0.0 2024-06-20 17:33:02,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=244702.33333333334, ans=0.0 2024-06-20 17:33:05,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244702.33333333334, ans=0.1 2024-06-20 17:33:20,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244739.0, ans=0.1 2024-06-20 17:33:33,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=244775.66666666666, ans=0.125 2024-06-20 17:33:38,809 INFO [train.py:1028] (1/2) Epoch 14, batch 2000, loss[loss=0.2226, simple_loss=0.2793, pruned_loss=0.08295, over 12637.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2729, pruned_loss=0.0876, over 2588000.86 frames. ], batch size: 22, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:33:42,876 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 1.909e+02 2.046e+02 2.282e+02 3.258e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 17:33:58,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.57 vs. limit=15.0 2024-06-20 17:34:03,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=244830.66666666666, ans=0.125 2024-06-20 17:34:10,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=244849.0, ans=0.125 2024-06-20 17:34:20,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=244867.33333333334, ans=0.0 2024-06-20 17:34:20,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=244867.33333333334, ans=0.025 2024-06-20 17:34:24,370 INFO [train.py:1028] (1/2) Epoch 14, batch 2050, loss[loss=0.1917, simple_loss=0.2487, pruned_loss=0.06736, over 12693.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.273, pruned_loss=0.08765, over 2582813.31 frames. ], batch size: 29, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:34:30,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=244885.66666666666, ans=0.125 2024-06-20 17:34:59,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=244959.0, ans=0.0 2024-06-20 17:35:04,224 INFO [train.py:1028] (1/2) Epoch 14, batch 2100, loss[loss=0.2074, simple_loss=0.2616, pruned_loss=0.07659, over 13211.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2732, pruned_loss=0.08735, over 2585430.28 frames. 
], batch size: 59, lr: 4.22e-03, grad_scale: 32.0 2024-06-20 17:35:04,935 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.880e+02 1.989e+02 2.121e+02 2.737e+02, threshold=3.979e+02, percent-clipped=0.0 2024-06-20 17:35:05,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=244977.33333333334, ans=0.125 2024-06-20 17:35:07,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=244977.33333333334, ans=0.0 2024-06-20 17:35:28,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=245032.33333333334, ans=0.0 2024-06-20 17:35:32,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.44 vs. limit=15.0 2024-06-20 17:35:43,349 INFO [train.py:1028] (1/2) Epoch 14, batch 2150, loss[loss=0.2214, simple_loss=0.2714, pruned_loss=0.08566, over 13269.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2728, pruned_loss=0.08704, over 2587345.51 frames. ], batch size: 52, lr: 4.22e-03, grad_scale: 32.0 2024-06-20 17:35:43,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245069.0, ans=0.1 2024-06-20 17:35:46,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245069.0, ans=0.1 2024-06-20 17:36:01,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=245105.66666666666, ans=0.125 2024-06-20 17:36:04,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=245105.66666666666, ans=0.125 2024-06-20 17:36:16,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=245142.33333333334, ans=0.125 2024-06-20 17:36:23,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.51 vs. limit=15.0 2024-06-20 17:36:26,480 INFO [train.py:1028] (1/2) Epoch 14, batch 2200, loss[loss=0.224, simple_loss=0.2751, pruned_loss=0.08647, over 13214.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2733, pruned_loss=0.08736, over 2588407.76 frames. 
], batch size: 83, lr: 4.22e-03, grad_scale: 32.0 2024-06-20 17:36:27,264 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.848e+02 1.980e+02 2.133e+02 3.134e+02, threshold=3.960e+02, percent-clipped=0.0 2024-06-20 17:36:30,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=245160.66666666666, ans=0.1 2024-06-20 17:36:36,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=245179.0, ans=0.125 2024-06-20 17:36:37,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=245179.0, ans=0.125 2024-06-20 17:36:37,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=245179.0, ans=0.0 2024-06-20 17:37:01,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245234.0, ans=0.1 2024-06-20 17:37:03,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=245234.0, ans=0.0 2024-06-20 17:37:09,199 INFO [train.py:1028] (1/2) Epoch 14, batch 2250, loss[loss=0.233, simple_loss=0.2827, pruned_loss=0.09163, over 13239.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2738, pruned_loss=0.08769, over 2587330.11 frames. ], batch size: 63, lr: 4.22e-03, grad_scale: 32.0 2024-06-20 17:37:10,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.12 vs. limit=15.0 2024-06-20 17:37:15,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=245252.33333333334, ans=0.035 2024-06-20 17:37:21,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245270.66666666666, ans=0.1 2024-06-20 17:37:22,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=12.0 2024-06-20 17:37:29,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=245289.0, ans=0.0 2024-06-20 17:37:31,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=245289.0, ans=0.125 2024-06-20 17:37:41,417 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2024-06-20 17:37:44,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=245325.66666666666, ans=0.0 2024-06-20 17:37:48,751 INFO [train.py:1028] (1/2) Epoch 14, batch 2300, loss[loss=0.2177, simple_loss=0.262, pruned_loss=0.0867, over 12781.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2741, pruned_loss=0.08755, over 2582050.71 frames. 
], batch size: 33, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:37:49,426 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 1.849e+02 1.988e+02 2.212e+02 3.408e+02, threshold=3.976e+02, percent-clipped=0.0 2024-06-20 17:37:52,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=245344.0, ans=0.125 2024-06-20 17:37:59,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=245362.33333333334, ans=0.0 2024-06-20 17:38:06,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=245380.66666666666, ans=0.025 2024-06-20 17:38:10,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=245380.66666666666, ans=0.0 2024-06-20 17:38:14,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=245399.0, ans=0.0 2024-06-20 17:38:19,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=245417.33333333334, ans=0.125 2024-06-20 17:38:27,364 INFO [train.py:1028] (1/2) Epoch 14, batch 2350, loss[loss=0.221, simple_loss=0.2737, pruned_loss=0.08414, over 13214.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2743, pruned_loss=0.08764, over 2585025.38 frames. ], batch size: 67, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:38:28,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.57 vs. limit=10.0 2024-06-20 17:38:57,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=245490.66666666666, ans=0.125 2024-06-20 17:39:02,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=245509.0, ans=0.125 2024-06-20 17:39:02,694 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.59 vs. limit=22.5 2024-06-20 17:39:08,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=245509.0, ans=0.2 2024-06-20 17:39:11,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=245509.0, ans=0.1 2024-06-20 17:39:13,557 INFO [train.py:1028] (1/2) Epoch 14, batch 2400, loss[loss=0.2287, simple_loss=0.2775, pruned_loss=0.09, over 13316.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2736, pruned_loss=0.08761, over 2587164.12 frames. 
], batch size: 46, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:39:14,273 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.862e+02 2.005e+02 2.168e+02 2.894e+02, threshold=4.011e+02, percent-clipped=0.0 2024-06-20 17:39:16,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=245527.33333333334, ans=0.125 2024-06-20 17:39:18,938 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:39:24,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245545.66666666666, ans=0.1 2024-06-20 17:39:47,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=245600.66666666666, ans=0.125 2024-06-20 17:39:52,064 INFO [train.py:1028] (1/2) Epoch 14, batch 2450, loss[loss=0.2109, simple_loss=0.268, pruned_loss=0.07686, over 13304.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2725, pruned_loss=0.08744, over 2584186.39 frames. ], batch size: 63, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:39:54,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=245619.0, ans=0.125 2024-06-20 17:39:55,966 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.32 vs. limit=22.5 2024-06-20 17:39:59,041 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2024-06-20 17:40:03,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=245637.33333333334, ans=0.125 2024-06-20 17:40:30,262 INFO [train.py:1028] (1/2) Epoch 14, batch 2500, loss[loss=0.2266, simple_loss=0.27, pruned_loss=0.09156, over 13238.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2713, pruned_loss=0.08701, over 2586838.77 frames. ], batch size: 83, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:40:30,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 1.909e+02 2.063e+02 2.265e+02 3.669e+02, threshold=4.126e+02, percent-clipped=0.0 2024-06-20 17:40:32,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=245710.66666666666, ans=0.0 2024-06-20 17:40:39,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245729.0, ans=0.1 2024-06-20 17:40:50,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.50 vs. 
limit=22.5 2024-06-20 17:40:53,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=245765.66666666666, ans=0.125 2024-06-20 17:40:59,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245765.66666666666, ans=0.1 2024-06-20 17:41:00,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245765.66666666666, ans=0.1 2024-06-20 17:41:07,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=245784.0, ans=0.125 2024-06-20 17:41:08,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245784.0, ans=0.1 2024-06-20 17:41:10,961 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=12.0 2024-06-20 17:41:12,743 INFO [train.py:1028] (1/2) Epoch 14, batch 2550, loss[loss=0.2073, simple_loss=0.2641, pruned_loss=0.07525, over 12723.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2701, pruned_loss=0.08661, over 2586411.57 frames. ], batch size: 22, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:41:23,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=245820.66666666666, ans=0.0 2024-06-20 17:41:33,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=245839.0, ans=0.0 2024-06-20 17:41:37,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=245857.33333333334, ans=0.0 2024-06-20 17:41:40,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.88 vs. limit=10.0 2024-06-20 17:41:46,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=245875.66666666666, ans=0.125 2024-06-20 17:41:54,272 INFO [train.py:1028] (1/2) Epoch 14, batch 2600, loss[loss=0.1809, simple_loss=0.2409, pruned_loss=0.06046, over 13210.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2688, pruned_loss=0.08606, over 2586715.98 frames. ], batch size: 52, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:41:54,991 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.867e+02 1.977e+02 2.136e+02 2.762e+02, threshold=3.955e+02, percent-clipped=0.0 2024-06-20 17:41:57,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=245894.0, ans=0.125 2024-06-20 17:41:59,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. 
limit=15.0 2024-06-20 17:41:59,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=245894.0, ans=0.05 2024-06-20 17:42:08,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=245912.33333333334, ans=0.0 2024-06-20 17:42:13,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=245930.66666666666, ans=0.1 2024-06-20 17:42:19,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=245949.0, ans=0.0 2024-06-20 17:42:20,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.26 vs. limit=15.0 2024-06-20 17:42:33,758 INFO [train.py:1028] (1/2) Epoch 14, batch 2650, loss[loss=0.2061, simple_loss=0.2491, pruned_loss=0.08155, over 13012.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2675, pruned_loss=0.08573, over 2586403.30 frames. ], batch size: 144, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:42:34,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=245985.66666666666, ans=0.125 2024-06-20 17:42:36,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245985.66666666666, ans=0.1 2024-06-20 17:42:39,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=245985.66666666666, ans=0.125 2024-06-20 17:42:48,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=246022.33333333334, ans=0.2 2024-06-20 17:42:50,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=246022.33333333334, ans=0.0 2024-06-20 17:43:10,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.12 vs. limit=22.5 2024-06-20 17:43:15,961 INFO [train.py:1028] (1/2) Epoch 14, batch 2700, loss[loss=0.1889, simple_loss=0.2364, pruned_loss=0.07071, over 13224.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2663, pruned_loss=0.08581, over 2584350.04 frames. 
], batch size: 89, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:43:16,828 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.870e+02 2.008e+02 2.247e+02 2.788e+02, threshold=4.017e+02, percent-clipped=0.0 2024-06-20 17:43:20,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=246077.33333333334, ans=0.125 2024-06-20 17:43:25,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246095.66666666666, ans=0.1 2024-06-20 17:43:29,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=246095.66666666666, ans=0.125 2024-06-20 17:43:30,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=246095.66666666666, ans=0.125 2024-06-20 17:43:53,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=246150.66666666666, ans=0.2 2024-06-20 17:43:56,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=246150.66666666666, ans=0.0 2024-06-20 17:43:58,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.98 vs. limit=15.0 2024-06-20 17:44:00,216 INFO [train.py:1028] (1/2) Epoch 14, batch 2750, loss[loss=0.2011, simple_loss=0.2478, pruned_loss=0.07724, over 13313.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.265, pruned_loss=0.08521, over 2580628.40 frames. ], batch size: 43, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:44:00,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=246169.0, ans=0.125 2024-06-20 17:44:20,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=246205.66666666666, ans=0.2 2024-06-20 17:44:22,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=246205.66666666666, ans=0.0 2024-06-20 17:44:36,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=246242.33333333334, ans=0.125 2024-06-20 17:44:40,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. limit=6.0 2024-06-20 17:44:40,458 INFO [train.py:1028] (1/2) Epoch 14, batch 2800, loss[loss=0.2269, simple_loss=0.2599, pruned_loss=0.09697, over 11097.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2645, pruned_loss=0.08506, over 2579016.12 frames. 
], batch size: 304, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:44:41,189 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.851e+02 2.043e+02 2.242e+02 2.881e+02, threshold=4.086e+02, percent-clipped=0.0 2024-06-20 17:44:41,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=246260.66666666666, ans=0.1 2024-06-20 17:44:42,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=246260.66666666666, ans=0.125 2024-06-20 17:44:43,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=246260.66666666666, ans=0.125 2024-06-20 17:44:51,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=246279.0, ans=0.125 2024-06-20 17:44:55,398 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.23 vs. limit=15.0 2024-06-20 17:44:58,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=246297.33333333334, ans=0.125 2024-06-20 17:45:03,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=246315.66666666666, ans=0.125 2024-06-20 17:45:07,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=246315.66666666666, ans=0.0 2024-06-20 17:45:10,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246334.0, ans=0.1 2024-06-20 17:45:10,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=246334.0, ans=0.025 2024-06-20 17:45:22,894 INFO [train.py:1028] (1/2) Epoch 14, batch 2850, loss[loss=0.2119, simple_loss=0.2683, pruned_loss=0.07776, over 13292.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2635, pruned_loss=0.0847, over 2577434.54 frames. ], batch size: 49, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:45:23,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2024-06-20 17:45:24,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=246352.33333333334, ans=0.125 2024-06-20 17:45:43,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=246389.0, ans=15.0 2024-06-20 17:45:51,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=246407.33333333334, ans=0.125 2024-06-20 17:45:52,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=246407.33333333334, ans=0.0 2024-06-20 17:46:03,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246425.66666666666, ans=0.1 2024-06-20 17:46:05,255 INFO [train.py:1028] (1/2) Epoch 14, batch 2900, loss[loss=0.2177, simple_loss=0.2717, pruned_loss=0.0819, over 13171.00 frames. 
], tot_loss[loss=0.215, simple_loss=0.262, pruned_loss=0.08402, over 2585362.27 frames. ], batch size: 55, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:46:05,949 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.816e+02 1.932e+02 2.083e+02 2.989e+02, threshold=3.864e+02, percent-clipped=0.0 2024-06-20 17:46:13,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.01 vs. limit=15.0 2024-06-20 17:46:22,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=246480.66666666666, ans=22.5 2024-06-20 17:46:26,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246480.66666666666, ans=0.1 2024-06-20 17:46:45,150 INFO [train.py:1028] (1/2) Epoch 14, batch 2950, loss[loss=0.2211, simple_loss=0.2669, pruned_loss=0.08765, over 13301.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2618, pruned_loss=0.08399, over 2580301.82 frames. ], batch size: 43, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:47:02,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.25 vs. limit=12.0 2024-06-20 17:47:15,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246590.66666666666, ans=0.1 2024-06-20 17:47:18,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=246609.0, ans=0.2 2024-06-20 17:47:20,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=246609.0, ans=0.125 2024-06-20 17:47:22,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=22.5 2024-06-20 17:47:25,119 INFO [train.py:1028] (1/2) Epoch 14, batch 3000, loss[loss=0.2184, simple_loss=0.2704, pruned_loss=0.08321, over 13266.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2604, pruned_loss=0.0831, over 2579386.59 frames. ], batch size: 59, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:47:25,121 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 17:47:34,045 INFO [train.py:1060] (1/2) Epoch 14, validation: loss=0.1902, simple_loss=0.2549, pruned_loss=0.06279, over 351949.00 frames. 2024-06-20 17:47:34,046 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB 2024-06-20 17:47:34,747 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.824e+02 1.907e+02 2.070e+02 3.060e+02, threshold=3.814e+02, percent-clipped=0.0 2024-06-20 17:47:46,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=246645.66666666666, ans=0.2 2024-06-20 17:47:46,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=246645.66666666666, ans=0.0 2024-06-20 17:47:48,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246645.66666666666, ans=0.1 2024-06-20 17:47:59,883 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.82 vs. 
limit=10.0 2024-06-20 17:48:02,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=246682.33333333334, ans=0.0 2024-06-20 17:48:13,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=246700.66666666666, ans=0.0 2024-06-20 17:48:16,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=246700.66666666666, ans=0.125 2024-06-20 17:48:17,603 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:48:20,359 INFO [train.py:1028] (1/2) Epoch 14, batch 3050, loss[loss=0.229, simple_loss=0.2798, pruned_loss=0.08905, over 13238.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2602, pruned_loss=0.08337, over 2579530.08 frames. ], batch size: 46, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:48:31,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=246737.33333333334, ans=0.125 2024-06-20 17:48:45,387 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.91 vs. limit=6.0 2024-06-20 17:48:48,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=246774.0, ans=0.07 2024-06-20 17:48:51,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=246792.33333333334, ans=0.0 2024-06-20 17:48:58,518 INFO [train.py:1028] (1/2) Epoch 14, batch 3100, loss[loss=0.2103, simple_loss=0.2481, pruned_loss=0.0863, over 13041.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2596, pruned_loss=0.08303, over 2580311.51 frames. ], batch size: 144, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:48:59,347 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.877e+02 2.020e+02 2.192e+02 2.658e+02, threshold=4.041e+02, percent-clipped=0.0 2024-06-20 17:49:01,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=246810.66666666666, ans=0.125 2024-06-20 17:49:22,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=246865.66666666666, ans=0.0 2024-06-20 17:49:36,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=246902.33333333334, ans=0.125 2024-06-20 17:49:36,764 INFO [train.py:1028] (1/2) Epoch 14, batch 3150, loss[loss=0.2248, simple_loss=0.2713, pruned_loss=0.08916, over 12948.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2584, pruned_loss=0.08229, over 2582729.97 frames. ], batch size: 158, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:49:37,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246902.33333333334, ans=0.1 2024-06-20 17:49:43,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=246902.33333333334, ans=0.0 2024-06-20 17:49:51,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.41 vs. 
limit=10.0 2024-06-20 17:49:53,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=246939.0, ans=0.1 2024-06-20 17:49:57,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246939.0, ans=0.1 2024-06-20 17:49:57,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2024-06-20 17:50:03,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=246957.33333333334, ans=0.05 2024-06-20 17:50:10,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=246957.33333333334, ans=0.0 2024-06-20 17:50:11,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=246975.66666666666, ans=0.125 2024-06-20 17:50:19,904 INFO [train.py:1028] (1/2) Epoch 14, batch 3200, loss[loss=0.2087, simple_loss=0.2545, pruned_loss=0.0814, over 13149.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2578, pruned_loss=0.08217, over 2583294.43 frames. ], batch size: 55, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:50:20,682 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.739e+02 1.839e+02 1.976e+02 2.345e+02, threshold=3.678e+02, percent-clipped=0.0 2024-06-20 17:50:30,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=247012.33333333334, ans=0.125 2024-06-20 17:50:32,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.74 vs. limit=22.5 2024-06-20 17:50:50,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=247049.0, ans=0.1 2024-06-20 17:50:56,330 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.73 vs. limit=15.0 2024-06-20 17:51:02,124 INFO [train.py:1028] (1/2) Epoch 14, batch 3250, loss[loss=0.2037, simple_loss=0.2498, pruned_loss=0.07878, over 13240.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2574, pruned_loss=0.08221, over 2587233.25 frames. ], batch size: 72, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:51:17,484 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.93 vs. limit=15.0 2024-06-20 17:51:18,837 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:51:22,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.25 vs. 
limit=15.0 2024-06-20 17:51:24,737 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:51:34,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=247159.0, ans=0.125 2024-06-20 17:51:42,970 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.20 vs. limit=10.0 2024-06-20 17:51:43,130 INFO [train.py:1028] (1/2) Epoch 14, batch 3300, loss[loss=0.2095, simple_loss=0.2479, pruned_loss=0.08561, over 12820.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2564, pruned_loss=0.08173, over 2583992.72 frames. ], batch size: 177, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:51:43,930 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.809e+02 1.956e+02 2.108e+02 2.658e+02, threshold=3.912e+02, percent-clipped=0.0 2024-06-20 17:51:52,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=247195.66666666666, ans=0.07 2024-06-20 17:52:23,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=247250.66666666666, ans=0.0 2024-06-20 17:52:25,710 INFO [train.py:1028] (1/2) Epoch 14, batch 3350, loss[loss=0.1991, simple_loss=0.2455, pruned_loss=0.07631, over 12906.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2563, pruned_loss=0.08199, over 2579113.44 frames. ], batch size: 158, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:52:37,605 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:52:40,882 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.16 vs. limit=22.5 2024-06-20 17:52:47,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=247305.66666666666, ans=0.0 2024-06-20 17:53:08,256 INFO [train.py:1028] (1/2) Epoch 14, batch 3400, loss[loss=0.2046, simple_loss=0.2486, pruned_loss=0.0803, over 12529.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2559, pruned_loss=0.08219, over 2576885.26 frames. ], batch size: 22, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:53:08,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247360.66666666666, ans=0.1 2024-06-20 17:53:08,890 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.844e+02 1.985e+02 2.161e+02 2.619e+02, threshold=3.970e+02, percent-clipped=0.0 2024-06-20 17:53:17,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.11 vs. limit=22.5 2024-06-20 17:53:29,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=247397.33333333334, ans=0.0 2024-06-20 17:53:30,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.26 vs. 
limit=10.0 2024-06-20 17:53:33,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=247415.66666666666, ans=0.05 2024-06-20 17:53:34,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=247415.66666666666, ans=0.1 2024-06-20 17:53:34,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=247415.66666666666, ans=0.07 2024-06-20 17:53:36,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247415.66666666666, ans=0.1 2024-06-20 17:53:44,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=247434.0, ans=0.0 2024-06-20 17:53:47,461 INFO [train.py:1028] (1/2) Epoch 14, batch 3450, loss[loss=0.2241, simple_loss=0.2624, pruned_loss=0.09289, over 12662.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2554, pruned_loss=0.08204, over 2577421.62 frames. ], batch size: 176, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:54:04,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=247489.0, ans=0.0 2024-06-20 17:54:09,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=247489.0, ans=0.0 2024-06-20 17:54:22,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=247525.66666666666, ans=0.125 2024-06-20 17:54:27,518 INFO [train.py:1028] (1/2) Epoch 14, batch 3500, loss[loss=0.1917, simple_loss=0.2454, pruned_loss=0.06905, over 12909.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2553, pruned_loss=0.08194, over 2575671.74 frames. ], batch size: 33, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:54:28,267 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.800e+02 1.877e+02 2.003e+02 3.203e+02, threshold=3.754e+02, percent-clipped=0.0 2024-06-20 17:54:33,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247544.0, ans=0.1 2024-06-20 17:54:42,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=247562.33333333334, ans=0.125 2024-06-20 17:54:44,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=247562.33333333334, ans=0.09899494936611666 2024-06-20 17:54:45,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.29 vs. limit=15.0 2024-06-20 17:54:47,438 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.44 vs. 
limit=15.0 2024-06-20 17:55:06,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=247617.33333333334, ans=0.125 2024-06-20 17:55:12,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=247617.33333333334, ans=0.125 2024-06-20 17:55:13,878 INFO [train.py:1028] (1/2) Epoch 14, batch 3550, loss[loss=0.2102, simple_loss=0.2585, pruned_loss=0.08098, over 13177.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2547, pruned_loss=0.08126, over 2578045.90 frames. ], batch size: 95, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:55:16,336 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.475e+00 2024-06-20 17:55:20,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.13 vs. limit=12.0 2024-06-20 17:55:23,979 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.50 vs. limit=15.0 2024-06-20 17:55:28,966 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.92 vs. limit=15.0 2024-06-20 17:55:35,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=247672.33333333334, ans=0.125 2024-06-20 17:55:37,966 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.72 vs. limit=10.0 2024-06-20 17:55:45,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=247709.0, ans=0.125 2024-06-20 17:55:47,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=12.0 2024-06-20 17:55:49,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=247709.0, ans=0.125 2024-06-20 17:55:51,709 INFO [train.py:1028] (1/2) Epoch 14, batch 3600, loss[loss=0.1974, simple_loss=0.2482, pruned_loss=0.07328, over 13298.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2538, pruned_loss=0.08076, over 2581082.28 frames. ], batch size: 49, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:55:52,439 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.829e+02 1.993e+02 2.209e+02 3.683e+02, threshold=3.987e+02, percent-clipped=0.0 2024-06-20 17:56:11,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=247764.0, ans=0.0 2024-06-20 17:56:13,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=247764.0, ans=0.0 2024-06-20 17:56:14,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=247764.0, ans=0.0 2024-06-20 17:56:19,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=247782.33333333334, ans=0.125 2024-06-20 17:56:20,249 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.24 vs. 
limit=15.0 2024-06-20 17:56:31,496 INFO [train.py:1028] (1/2) Epoch 14, batch 3650, loss[loss=0.2172, simple_loss=0.2634, pruned_loss=0.08546, over 13041.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2537, pruned_loss=0.08031, over 2579139.05 frames. ], batch size: 102, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:56:32,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=247819.0, ans=0.125 2024-06-20 17:56:34,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.32 vs. limit=15.0 2024-06-20 17:56:43,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=247837.33333333334, ans=0.035 2024-06-20 17:56:56,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=247874.0, ans=0.0 2024-06-20 17:56:58,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=247874.0, ans=0.125 2024-06-20 17:57:01,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=247874.0, ans=0.125 2024-06-20 17:57:01,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=247874.0, ans=0.125 2024-06-20 17:57:12,685 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:57:14,932 INFO [train.py:1028] (1/2) Epoch 14, batch 3700, loss[loss=0.1987, simple_loss=0.2524, pruned_loss=0.07253, over 13158.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2528, pruned_loss=0.08005, over 2583823.62 frames. ], batch size: 72, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:57:15,646 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.802e+02 1.902e+02 2.098e+02 3.278e+02, threshold=3.804e+02, percent-clipped=0.0 2024-06-20 17:57:26,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=247929.0, ans=0.0 2024-06-20 17:57:33,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=247947.33333333334, ans=0.0 2024-06-20 17:57:44,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=15.0 2024-06-20 17:57:50,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247984.0, ans=0.1 2024-06-20 17:57:52,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247984.0, ans=0.1 2024-06-20 17:57:56,590 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.90 vs. limit=15.0 2024-06-20 17:57:57,775 INFO [train.py:1028] (1/2) Epoch 14, batch 3750, loss[loss=0.2192, simple_loss=0.2754, pruned_loss=0.08146, over 12495.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2526, pruned_loss=0.07991, over 2585938.40 frames. 
], batch size: 22, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:58:12,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=248020.66666666666, ans=0.2 2024-06-20 17:58:19,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=248039.0, ans=0.125 2024-06-20 17:58:21,606 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2024-06-20 17:58:24,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248057.33333333334, ans=0.1 2024-06-20 17:58:34,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=248075.66666666666, ans=0.125 2024-06-20 17:58:36,485 INFO [train.py:1028] (1/2) Epoch 14, batch 3800, loss[loss=0.1962, simple_loss=0.2423, pruned_loss=0.07508, over 13216.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2525, pruned_loss=0.07996, over 2583803.90 frames. ], batch size: 83, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:58:37,166 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.805e+02 1.945e+02 2.090e+02 2.810e+02, threshold=3.890e+02, percent-clipped=0.0 2024-06-20 17:58:38,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.00 vs. limit=22.5 2024-06-20 17:58:44,943 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.74 vs. limit=22.5 2024-06-20 17:58:49,497 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.34 vs. limit=15.0 2024-06-20 17:59:01,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=248149.0, ans=0.125 2024-06-20 17:59:03,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=248149.0, ans=0.0 2024-06-20 17:59:04,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=248149.0, ans=0.0 2024-06-20 17:59:06,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=248149.0, ans=0.0 2024-06-20 17:59:16,291 INFO [train.py:1028] (1/2) Epoch 14, batch 3850, loss[loss=0.2055, simple_loss=0.247, pruned_loss=0.08194, over 13010.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2512, pruned_loss=0.07924, over 2584327.69 frames. 
], batch size: 144, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:59:16,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=248185.66666666666, ans=0.125 2024-06-20 17:59:20,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=248185.66666666666, ans=0.125 2024-06-20 17:59:23,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=248204.0, ans=0.125 2024-06-20 17:59:32,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=248222.33333333334, ans=0.1 2024-06-20 17:59:36,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248222.33333333334, ans=0.1 2024-06-20 17:59:42,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=248240.66666666666, ans=0.0 2024-06-20 17:59:50,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248259.0, ans=0.1 2024-06-20 17:59:58,666 INFO [train.py:1028] (1/2) Epoch 14, batch 3900, loss[loss=0.2198, simple_loss=0.2641, pruned_loss=0.08773, over 13206.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2514, pruned_loss=0.07973, over 2588324.19 frames. ], batch size: 83, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:59:59,348 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.746e+02 1.871e+02 2.041e+02 2.633e+02, threshold=3.742e+02, percent-clipped=0.0 2024-06-20 18:00:21,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.49 vs. limit=15.0 2024-06-20 18:00:21,915 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.73 vs. limit=22.5 2024-06-20 18:00:32,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2024-06-20 18:00:35,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=248350.66666666666, ans=0.125 2024-06-20 18:00:38,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=248350.66666666666, ans=0.125 2024-06-20 18:00:42,460 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:00:42,951 INFO [train.py:1028] (1/2) Epoch 14, batch 3950, loss[loss=0.2029, simple_loss=0.2431, pruned_loss=0.08135, over 13124.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2502, pruned_loss=0.07875, over 2589220.69 frames. 
], batch size: 132, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 18:00:49,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=248369.0, ans=0.0 2024-06-20 18:01:16,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248442.33333333334, ans=0.1 2024-06-20 18:01:16,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.40 vs. limit=22.5 2024-06-20 18:01:18,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248442.33333333334, ans=0.1 2024-06-20 18:01:23,464 INFO [train.py:1028] (1/2) Epoch 14, batch 4000, loss[loss=0.196, simple_loss=0.2456, pruned_loss=0.07322, over 12824.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2498, pruned_loss=0.07863, over 2583913.39 frames. ], batch size: 39, lr: 4.19e-03, grad_scale: 64.0 2024-06-20 18:01:24,235 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.829e+02 2.110e+02 2.284e+02 3.853e+02, threshold=4.220e+02, percent-clipped=1.0 2024-06-20 18:01:24,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248460.66666666666, ans=0.1 2024-06-20 18:01:25,755 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.72 vs. limit=10.0 2024-06-20 18:01:35,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=248479.0, ans=0.025 2024-06-20 18:01:43,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=248497.33333333334, ans=0.07 2024-06-20 18:01:50,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=248515.66666666666, ans=0.0 2024-06-20 18:02:01,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=248534.0, ans=0.125 2024-06-20 18:02:02,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=248534.0, ans=0.0 2024-06-20 18:02:04,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=248552.33333333334, ans=0.0 2024-06-20 18:02:04,819 INFO [train.py:1028] (1/2) Epoch 14, batch 4050, loss[loss=0.2064, simple_loss=0.2422, pruned_loss=0.0853, over 10990.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2493, pruned_loss=0.07871, over 2582133.86 frames. ], batch size: 304, lr: 4.19e-03, grad_scale: 64.0 2024-06-20 18:02:06,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.31 vs. 
limit=22.5 2024-06-20 18:02:10,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=248552.33333333334, ans=0.0 2024-06-20 18:02:11,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248552.33333333334, ans=0.1 2024-06-20 18:02:17,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248570.66666666666, ans=0.1 2024-06-20 18:02:20,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=248570.66666666666, ans=0.125 2024-06-20 18:02:21,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2024-06-20 18:02:41,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=248607.33333333334, ans=0.125 2024-06-20 18:02:41,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=248607.33333333334, ans=0.125 2024-06-20 18:02:46,705 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.17 vs. limit=10.0 2024-06-20 18:02:52,171 INFO [train.py:1028] (1/2) Epoch 14, batch 4100, loss[loss=0.2271, simple_loss=0.2545, pruned_loss=0.09983, over 13008.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2495, pruned_loss=0.07902, over 2576469.58 frames. ], batch size: 102, lr: 4.19e-03, grad_scale: 64.0 2024-06-20 18:02:52,894 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.810e+02 1.962e+02 2.166e+02 3.347e+02, threshold=3.924e+02, percent-clipped=0.0 2024-06-20 18:03:00,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=9.62 vs. limit=12.0 2024-06-20 18:03:03,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248662.33333333334, ans=0.1 2024-06-20 18:03:12,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=248680.66666666666, ans=0.0 2024-06-20 18:03:16,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=248699.0, ans=0.125 2024-06-20 18:03:26,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=248717.33333333334, ans=0.0 2024-06-20 18:03:29,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=248717.33333333334, ans=0.125 2024-06-20 18:03:29,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248717.33333333334, ans=0.1 2024-06-20 18:03:32,683 INFO [train.py:1028] (1/2) Epoch 14, batch 4150, loss[loss=0.1893, simple_loss=0.2426, pruned_loss=0.06803, over 13152.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2487, pruned_loss=0.07877, over 2576196.32 frames. 
], batch size: 55, lr: 4.19e-03, grad_scale: 64.0 2024-06-20 18:03:35,330 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:03:37,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=248735.66666666666, ans=0.125 2024-06-20 18:04:07,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=248809.0, ans=0.0 2024-06-20 18:04:11,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=248827.33333333334, ans=0.0 2024-06-20 18:04:11,940 INFO [train.py:1028] (1/2) Epoch 14, batch 4200, loss[loss=0.1909, simple_loss=0.2291, pruned_loss=0.07637, over 13013.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2482, pruned_loss=0.0786, over 2580309.69 frames. ], batch size: 102, lr: 4.19e-03, grad_scale: 64.0 2024-06-20 18:04:12,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.31 vs. limit=15.0 2024-06-20 18:04:12,604 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.774e+02 1.842e+02 1.990e+02 2.594e+02, threshold=3.684e+02, percent-clipped=0.0 2024-06-20 18:04:29,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=248864.0, ans=0.2 2024-06-20 18:04:41,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=248882.33333333334, ans=0.125 2024-06-20 18:04:42,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=248882.33333333334, ans=0.125 2024-06-20 18:04:46,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=248900.66666666666, ans=0.05 2024-06-20 18:04:55,677 INFO [train.py:1028] (1/2) Epoch 14, batch 4250, loss[loss=0.2003, simple_loss=0.2472, pruned_loss=0.07668, over 13297.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2485, pruned_loss=0.07867, over 2582484.83 frames. ], batch size: 46, lr: 4.19e-03, grad_scale: 64.0 2024-06-20 18:05:15,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=248955.66666666666, ans=22.5 2024-06-20 18:05:27,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=248974.0, ans=0.125 2024-06-20 18:05:27,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=248974.0, ans=0.025 2024-06-20 18:05:29,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=248974.0, ans=0.125 2024-06-20 18:05:37,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.55 vs. limit=15.0 2024-06-20 18:05:38,341 INFO [train.py:1028] (1/2) Epoch 14, batch 4300, loss[loss=0.2248, simple_loss=0.2674, pruned_loss=0.09114, over 13187.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2486, pruned_loss=0.07884, over 2582438.40 frames. 
], batch size: 59, lr: 4.19e-03, grad_scale: 128.0 2024-06-20 18:05:39,072 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.829e+02 1.977e+02 2.267e+02 3.051e+02, threshold=3.953e+02, percent-clipped=0.0 2024-06-20 18:06:16,812 INFO [train.py:1028] (1/2) Epoch 14, batch 4350, loss[loss=0.212, simple_loss=0.2586, pruned_loss=0.08265, over 13164.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2483, pruned_loss=0.07874, over 2586348.22 frames. ], batch size: 59, lr: 4.19e-03, grad_scale: 128.0 2024-06-20 18:06:17,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=249102.33333333334, ans=0.025 2024-06-20 18:06:59,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=249194.0, ans=0.125 2024-06-20 18:07:00,206 INFO [train.py:1028] (1/2) Epoch 14, batch 4400, loss[loss=0.1937, simple_loss=0.24, pruned_loss=0.0737, over 13198.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2481, pruned_loss=0.07877, over 2586599.17 frames. ], batch size: 83, lr: 4.19e-03, grad_scale: 128.0 2024-06-20 18:07:01,100 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.737e+02 1.848e+02 2.010e+02 2.772e+02, threshold=3.696e+02, percent-clipped=0.0 2024-06-20 18:07:01,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=249194.0, ans=0.125 2024-06-20 18:07:41,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=249267.33333333334, ans=0.025 2024-06-20 18:07:43,392 INFO [train.py:1028] (1/2) Epoch 14, batch 4450, loss[loss=0.2211, simple_loss=0.2704, pruned_loss=0.08595, over 12967.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2491, pruned_loss=0.07928, over 2580899.37 frames. ], batch size: 33, lr: 4.19e-03, grad_scale: 128.0 2024-06-20 18:08:15,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=249340.66666666666, ans=0.05 2024-06-20 18:08:21,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.27 vs. limit=22.5 2024-06-20 18:08:22,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=249359.0, ans=0.125 2024-06-20 18:08:25,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=249359.0, ans=0.0 2024-06-20 18:08:27,692 INFO [train.py:1028] (1/2) Epoch 14, batch 4500, loss[loss=0.1884, simple_loss=0.2354, pruned_loss=0.07065, over 13246.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2482, pruned_loss=0.07869, over 2584791.99 frames. ], batch size: 89, lr: 4.19e-03, grad_scale: 128.0 2024-06-20 18:08:28,422 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.796e+02 1.938e+02 2.168e+02 3.017e+02, threshold=3.877e+02, percent-clipped=0.0 2024-06-20 18:08:33,491 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.88 vs. 
limit=22.5 2024-06-20 18:09:03,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=249450.66666666666, ans=0.2 2024-06-20 18:09:07,298 INFO [train.py:1028] (1/2) Epoch 14, batch 4550, loss[loss=0.2005, simple_loss=0.2416, pruned_loss=0.07967, over 13236.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2482, pruned_loss=0.07879, over 2587929.11 frames. ], batch size: 52, lr: 4.19e-03, grad_scale: 128.0 2024-06-20 18:09:09,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=249469.0, ans=0.0 2024-06-20 18:09:15,376 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.43 vs. limit=15.0 2024-06-20 18:09:21,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=249487.33333333334, ans=0.125 2024-06-20 18:09:51,203 INFO [train.py:1028] (1/2) Epoch 14, batch 4600, loss[loss=0.213, simple_loss=0.2507, pruned_loss=0.08764, over 12499.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2485, pruned_loss=0.07878, over 2584224.77 frames. ], batch size: 202, lr: 4.19e-03, grad_scale: 128.0 2024-06-20 18:09:51,902 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.845e+02 1.990e+02 2.231e+02 3.373e+02, threshold=3.979e+02, percent-clipped=0.0 2024-06-20 18:09:56,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=249560.66666666666, ans=0.0 2024-06-20 18:09:59,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=249560.66666666666, ans=0.125 2024-06-20 18:10:01,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=249560.66666666666, ans=0.0 2024-06-20 18:10:07,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=249579.0, ans=0.2 2024-06-20 18:10:17,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=249615.66666666666, ans=0.2 2024-06-20 18:10:25,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=249634.0, ans=0.025 2024-06-20 18:10:26,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=249634.0, ans=0.2 2024-06-20 18:10:27,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.39 vs. limit=10.0 2024-06-20 18:10:31,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=249634.0, ans=0.07 2024-06-20 18:10:34,400 INFO [train.py:1028] (1/2) Epoch 14, batch 4650, loss[loss=0.1911, simple_loss=0.2369, pruned_loss=0.07271, over 13021.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.248, pruned_loss=0.07873, over 2587314.82 frames. 
], batch size: 132, lr: 4.18e-03, grad_scale: 128.0 2024-06-20 18:10:36,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249652.33333333334, ans=0.1 2024-06-20 18:10:37,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=249652.33333333334, ans=0.0 2024-06-20 18:10:49,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=249689.0, ans=0.125 2024-06-20 18:11:00,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=249707.33333333334, ans=0.0 2024-06-20 18:11:15,414 INFO [train.py:1028] (1/2) Epoch 14, batch 4700, loss[loss=0.1754, simple_loss=0.2279, pruned_loss=0.06145, over 12521.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2483, pruned_loss=0.07905, over 2583081.00 frames. ], batch size: 25, lr: 4.18e-03, grad_scale: 128.0 2024-06-20 18:11:16,188 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.841e+02 1.961e+02 2.198e+02 2.972e+02, threshold=3.922e+02, percent-clipped=0.0 2024-06-20 18:11:27,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=249762.33333333334, ans=0.0 2024-06-20 18:11:39,719 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. limit=15.0 2024-06-20 18:11:41,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=249799.0, ans=0.0 2024-06-20 18:11:42,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=249799.0, ans=0.125 2024-06-20 18:11:47,197 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.46 vs. limit=15.0 2024-06-20 18:11:50,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.59 vs. limit=22.5 2024-06-20 18:12:06,867 INFO [train.py:1028] (1/2) Epoch 14, batch 4750, loss[loss=0.2145, simple_loss=0.2571, pruned_loss=0.08598, over 12582.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2478, pruned_loss=0.07886, over 2579888.21 frames. 
], batch size: 202, lr: 4.18e-03, grad_scale: 128.0 2024-06-20 18:12:09,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=249835.66666666666, ans=0.0 2024-06-20 18:12:11,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=249835.66666666666, ans=0.05 2024-06-20 18:12:16,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=249854.0, ans=0.125 2024-06-20 18:12:47,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=249890.66666666666, ans=0.0 2024-06-20 18:12:51,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=249909.0, ans=0.2 2024-06-20 18:12:58,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=249909.0, ans=0.125 2024-06-20 18:13:02,195 INFO [train.py:1028] (1/2) Epoch 14, batch 4800, loss[loss=0.1996, simple_loss=0.248, pruned_loss=0.07561, over 13246.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2476, pruned_loss=0.07857, over 2577789.20 frames. ], batch size: 63, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:13:04,174 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.855e+02 2.078e+02 2.342e+02 3.031e+02, threshold=4.156e+02, percent-clipped=0.0 2024-06-20 18:13:06,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=249927.33333333334, ans=0.125 2024-06-20 18:13:13,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=249945.66666666666, ans=0.125 2024-06-20 18:13:17,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=249945.66666666666, ans=0.04949747468305833 2024-06-20 18:13:21,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=249964.0, ans=0.125 2024-06-20 18:13:37,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249982.33333333334, ans=0.1 2024-06-20 18:13:37,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=249982.33333333334, ans=0.125 2024-06-20 18:13:45,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=250000.66666666666, ans=0.125 2024-06-20 18:13:49,213 INFO [train.py:1028] (1/2) Epoch 14, batch 4850, loss[loss=0.2058, simple_loss=0.2454, pruned_loss=0.08312, over 13229.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2474, pruned_loss=0.07857, over 2575656.40 frames. ], batch size: 89, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:13:49,545 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:13:51,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.69 vs. 
limit=15.0 2024-06-20 18:13:53,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=250019.0, ans=0.125 2024-06-20 18:13:53,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=250019.0, ans=0.125 2024-06-20 18:13:54,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=250019.0, ans=0.125 2024-06-20 18:13:56,986 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0 2024-06-20 18:14:02,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=250037.33333333334, ans=0.125 2024-06-20 18:14:08,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=250055.66666666666, ans=0.0 2024-06-20 18:14:12,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=250055.66666666666, ans=0.125 2024-06-20 18:14:38,530 INFO [train.py:1028] (1/2) Epoch 14, batch 4900, loss[loss=0.2116, simple_loss=0.2591, pruned_loss=0.08201, over 13241.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2475, pruned_loss=0.07876, over 2576447.77 frames. ], batch size: 59, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:14:38,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=250110.66666666666, ans=0.125 2024-06-20 18:14:40,325 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.761e+02 1.892e+02 2.045e+02 2.723e+02, threshold=3.784e+02, percent-clipped=0.0 2024-06-20 18:14:56,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=250129.0, ans=0.2 2024-06-20 18:15:10,726 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2024-06-20 18:15:33,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=250184.0, ans=0.125 2024-06-20 18:15:39,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=250184.0, ans=0.125 2024-06-20 18:15:41,478 INFO [train.py:1028] (1/2) Epoch 14, batch 4950, loss[loss=0.2121, simple_loss=0.2474, pruned_loss=0.08843, over 10961.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2475, pruned_loss=0.07913, over 2570843.08 frames. 
], batch size: 304, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:15:47,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=250202.33333333334, ans=0.0 2024-06-20 18:15:48,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=250202.33333333334, ans=0.125 2024-06-20 18:15:59,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=250239.0, ans=0.0 2024-06-20 18:16:06,113 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:16:18,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250275.66666666666, ans=0.1 2024-06-20 18:16:19,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.76 vs. limit=22.5 2024-06-20 18:16:24,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=250275.66666666666, ans=0.0 2024-06-20 18:16:26,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=250275.66666666666, ans=0.125 2024-06-20 18:16:27,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=250294.0, ans=0.125 2024-06-20 18:16:28,122 INFO [train.py:1028] (1/2) Epoch 14, batch 5000, loss[loss=0.2044, simple_loss=0.2459, pruned_loss=0.08141, over 13217.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2471, pruned_loss=0.07888, over 2574679.88 frames. ], batch size: 95, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:16:28,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.42 vs. limit=15.0 2024-06-20 18:16:29,762 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.773e+02 1.909e+02 2.061e+02 2.857e+02, threshold=3.818e+02, percent-clipped=0.0 2024-06-20 18:16:34,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=250294.0, ans=0.09899494936611666 2024-06-20 18:16:37,327 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=12.0 2024-06-20 18:16:37,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=250312.33333333334, ans=0.125 2024-06-20 18:16:44,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=250312.33333333334, ans=0.0 2024-06-20 18:16:59,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250349.0, ans=0.1 2024-06-20 18:16:59,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2024-06-20 18:17:06,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=250349.0, ans=0.035 2024-06-20 18:17:20,270 INFO [train.py:1028] (1/2) Epoch 14, batch 5050, loss[loss=0.1928, simple_loss=0.2407, pruned_loss=0.07245, over 12921.00 frames. 
], tot_loss[loss=0.2023, simple_loss=0.2474, pruned_loss=0.07863, over 2573391.97 frames. ], batch size: 36, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:17:35,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=250404.0, ans=0.125 2024-06-20 18:17:36,955 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-20 18:17:38,029 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.24 vs. limit=15.0 2024-06-20 18:17:45,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=250422.33333333334, ans=0.0 2024-06-20 18:17:52,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=250440.66666666666, ans=0.125 2024-06-20 18:18:01,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=250440.66666666666, ans=0.125 2024-06-20 18:18:02,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=250440.66666666666, ans=0.125 2024-06-20 18:18:12,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=250459.0, ans=0.125 2024-06-20 18:18:18,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=250459.0, ans=0.125 2024-06-20 18:18:19,930 INFO [train.py:1028] (1/2) Epoch 14, batch 5100, loss[loss=0.1961, simple_loss=0.2401, pruned_loss=0.07599, over 12925.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2475, pruned_loss=0.07884, over 2570307.45 frames. ], batch size: 39, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:18:21,727 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.779e+02 1.925e+02 2.149e+02 3.112e+02, threshold=3.850e+02, percent-clipped=0.0 2024-06-20 18:18:23,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250477.33333333334, ans=0.1 2024-06-20 18:18:32,088 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.39 vs. limit=15.0 2024-06-20 18:18:32,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=250495.66666666666, ans=0.0 2024-06-20 18:18:38,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=250514.0, ans=0.2 2024-06-20 18:19:08,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=250569.0, ans=15.0 2024-06-20 18:19:08,633 INFO [train.py:1028] (1/2) Epoch 14, batch 5150, loss[loss=0.2018, simple_loss=0.2377, pruned_loss=0.08292, over 13127.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2475, pruned_loss=0.07922, over 2571843.96 frames. 
], batch size: 132, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:19:09,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250569.0, ans=0.1 2024-06-20 18:19:24,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=250587.33333333334, ans=0.0 2024-06-20 18:19:32,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=250605.66666666666, ans=0.125 2024-06-20 18:19:33,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=250605.66666666666, ans=0.5 2024-06-20 18:19:37,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2024-06-20 18:19:54,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=250642.33333333334, ans=0.0 2024-06-20 18:19:55,833 INFO [train.py:1028] (1/2) Epoch 14, batch 5200, loss[loss=0.1792, simple_loss=0.2241, pruned_loss=0.06708, over 13180.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2471, pruned_loss=0.07888, over 2575437.32 frames. ], batch size: 95, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:19:56,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=12.0 2024-06-20 18:19:57,645 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.828e+02 1.960e+02 2.126e+02 3.107e+02, threshold=3.919e+02, percent-clipped=0.0 2024-06-20 18:20:01,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=250660.66666666666, ans=0.025 2024-06-20 18:20:05,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=250679.0, ans=10.0 2024-06-20 18:20:09,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=250679.0, ans=0.125 2024-06-20 18:20:18,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=250697.33333333334, ans=0.0 2024-06-20 18:20:36,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=250734.0, ans=0.0 2024-06-20 18:20:52,207 INFO [train.py:1028] (1/2) Epoch 14, batch 5250, loss[loss=0.1999, simple_loss=0.2459, pruned_loss=0.07694, over 13188.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2477, pruned_loss=0.0792, over 2569947.10 frames. ], batch size: 52, lr: 4.18e-03, grad_scale: 64.0 2024-06-20 18:20:55,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=250752.33333333334, ans=0.125 2024-06-20 18:21:14,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=250789.0, ans=0.0 2024-06-20 18:21:42,713 INFO [train.py:1028] (1/2) Epoch 14, batch 5300, loss[loss=0.1987, simple_loss=0.2377, pruned_loss=0.07986, over 13000.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2473, pruned_loss=0.07878, over 2566738.46 frames. 
], batch size: 144, lr: 4.17e-03, grad_scale: 64.0 2024-06-20 18:21:42,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=250844.0, ans=0.125 2024-06-20 18:21:44,954 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.809e+02 1.959e+02 2.124e+02 3.333e+02, threshold=3.918e+02, percent-clipped=0.0 2024-06-20 18:21:56,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250862.33333333334, ans=0.1 2024-06-20 18:21:59,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=250862.33333333334, ans=0.125 2024-06-20 18:22:06,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250880.66666666666, ans=0.1 2024-06-20 18:22:13,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.93 vs. limit=15.0 2024-06-20 18:22:26,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=250917.33333333334, ans=0.125 2024-06-20 18:22:30,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250917.33333333334, ans=0.1 2024-06-20 18:22:32,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=250935.66666666666, ans=0.09899494936611666 2024-06-20 18:22:32,919 INFO [train.py:1028] (1/2) Epoch 14, batch 5350, loss[loss=0.1769, simple_loss=0.2337, pruned_loss=0.06008, over 11581.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2468, pruned_loss=0.07859, over 2574322.65 frames. ], batch size: 16, lr: 4.17e-03, grad_scale: 64.0 2024-06-20 18:22:45,383 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0 2024-06-20 18:22:52,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250972.33333333334, ans=0.1 2024-06-20 18:22:54,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=250972.33333333334, ans=0.025 2024-06-20 18:22:56,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=250972.33333333334, ans=0.125 2024-06-20 18:23:06,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250990.66666666666, ans=0.1 2024-06-20 18:23:11,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=251009.0, ans=0.0 2024-06-20 18:23:19,086 INFO [train.py:1028] (1/2) Epoch 14, batch 5400, loss[loss=0.2153, simple_loss=0.2521, pruned_loss=0.08922, over 12062.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2469, pruned_loss=0.07894, over 2566701.83 frames. 
], batch size: 240, lr: 4.17e-03, grad_scale: 64.0 2024-06-20 18:23:21,047 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.763e+02 1.882e+02 2.068e+02 2.790e+02, threshold=3.764e+02, percent-clipped=0.0 2024-06-20 18:23:25,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=251027.33333333334, ans=0.2 2024-06-20 18:23:27,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=8.86 vs. limit=12.0 2024-06-20 18:23:36,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=251045.66666666666, ans=0.0 2024-06-20 18:23:45,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251064.0, ans=0.1 2024-06-20 18:23:47,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251064.0, ans=0.1 2024-06-20 18:24:03,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=251082.33333333334, ans=0.125 2024-06-20 18:24:11,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=251100.66666666666, ans=0.125 2024-06-20 18:24:22,501 INFO [train.py:1028] (1/2) Epoch 14, batch 5450, loss[loss=0.1846, simple_loss=0.2244, pruned_loss=0.07235, over 12793.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2466, pruned_loss=0.07887, over 2572493.10 frames. ], batch size: 26, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:24:23,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=251119.0, ans=0.0 2024-06-20 18:24:25,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=251119.0, ans=0.1 2024-06-20 18:24:26,292 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.22 vs. limit=10.0 2024-06-20 18:24:37,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=251137.33333333334, ans=0.125 2024-06-20 18:24:45,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.82 vs. limit=22.5 2024-06-20 18:24:59,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251192.33333333334, ans=0.1 2024-06-20 18:25:00,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.17 vs. limit=15.0 2024-06-20 18:25:08,391 INFO [train.py:1028] (1/2) Epoch 14, batch 5500, loss[loss=0.2177, simple_loss=0.2518, pruned_loss=0.09181, over 12294.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2469, pruned_loss=0.07906, over 2565640.75 frames. ], batch size: 240, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:25:10,142 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2024-06-20 18:25:11,332 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.761e+02 1.888e+02 2.090e+02 2.925e+02, threshold=3.776e+02, percent-clipped=0.0 2024-06-20 18:25:27,437 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.05 vs. limit=15.0 2024-06-20 18:25:56,948 INFO [train.py:1028] (1/2) Epoch 14, batch 5550, loss[loss=0.2273, simple_loss=0.2706, pruned_loss=0.09194, over 13297.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2465, pruned_loss=0.07854, over 2568858.14 frames. ], batch size: 43, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:25:59,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=251302.33333333334, ans=0.2 2024-06-20 18:26:00,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=251302.33333333334, ans=0.125 2024-06-20 18:26:28,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251339.0, ans=0.1 2024-06-20 18:26:36,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=251357.33333333334, ans=0.0 2024-06-20 18:26:41,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.29 vs. limit=22.5 2024-06-20 18:26:41,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=251375.66666666666, ans=0.125 2024-06-20 18:26:50,706 INFO [train.py:1028] (1/2) Epoch 14, batch 5600, loss[loss=0.1903, simple_loss=0.2386, pruned_loss=0.071, over 13229.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2463, pruned_loss=0.07841, over 2570585.29 frames. ], batch size: 89, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:26:53,341 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.786e+02 1.897e+02 2.090e+02 3.050e+02, threshold=3.794e+02, percent-clipped=0.0 2024-06-20 18:27:02,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=251394.0, ans=0.02 2024-06-20 18:27:20,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=251430.66666666666, ans=0.125 2024-06-20 18:27:24,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=251449.0, ans=0.125 2024-06-20 18:27:44,427 INFO [train.py:1028] (1/2) Epoch 14, batch 5650, loss[loss=0.2254, simple_loss=0.2619, pruned_loss=0.09447, over 12484.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2464, pruned_loss=0.07822, over 2575389.43 frames. 
], batch size: 202, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:27:53,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=251504.0, ans=0.1 2024-06-20 18:27:53,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=251504.0, ans=0.0 2024-06-20 18:28:10,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=251522.33333333334, ans=0.0 2024-06-20 18:28:20,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=251540.66666666666, ans=0.0 2024-06-20 18:28:25,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=251559.0, ans=0.1 2024-06-20 18:28:30,441 INFO [train.py:1028] (1/2) Epoch 14, batch 5700, loss[loss=0.1767, simple_loss=0.2339, pruned_loss=0.05974, over 13317.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2458, pruned_loss=0.07792, over 2578545.45 frames. ], batch size: 63, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:28:32,755 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.761e+02 1.841e+02 2.020e+02 2.713e+02, threshold=3.682e+02, percent-clipped=0.0 2024-06-20 18:28:35,803 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:28:45,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=251595.66666666666, ans=0.2 2024-06-20 18:28:47,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=251614.0, ans=0.5 2024-06-20 18:28:51,374 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.61 vs. limit=15.0 2024-06-20 18:29:14,921 INFO [train.py:1028] (1/2) Epoch 14, batch 5750, loss[loss=0.2271, simple_loss=0.261, pruned_loss=0.09665, over 12759.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2462, pruned_loss=0.07795, over 2579480.71 frames. ], batch size: 176, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:29:18,202 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=15.0 2024-06-20 18:29:46,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0 2024-06-20 18:29:55,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=251705.66666666666, ans=0.02 2024-06-20 18:29:56,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=251724.0, ans=0.0 2024-06-20 18:30:13,476 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.778e+00 2024-06-20 18:30:14,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=251760.66666666666, ans=0.0 2024-06-20 18:30:14,646 INFO [train.py:1028] (1/2) Epoch 14, batch 5800, loss[loss=0.2058, simple_loss=0.245, pruned_loss=0.08324, over 12782.00 frames. 
], tot_loss[loss=0.203, simple_loss=0.248, pruned_loss=0.07899, over 2578300.66 frames. ], batch size: 176, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:30:17,318 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.822e+02 1.912e+02 2.053e+02 2.592e+02, threshold=3.823e+02, percent-clipped=0.0 2024-06-20 18:30:17,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=251760.66666666666, ans=10.0 2024-06-20 18:30:30,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=251779.0, ans=0.05 2024-06-20 18:30:36,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=251797.33333333334, ans=10.0 2024-06-20 18:30:36,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=251797.33333333334, ans=0.2 2024-06-20 18:30:36,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=251797.33333333334, ans=0.2 2024-06-20 18:30:51,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=251834.0, ans=0.0 2024-06-20 18:30:57,280 INFO [train.py:1028] (1/2) Epoch 14, batch 5850, loss[loss=0.2188, simple_loss=0.2619, pruned_loss=0.08786, over 12596.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2497, pruned_loss=0.0797, over 2576846.09 frames. ], batch size: 202, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:31:06,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=251870.66666666666, ans=0.125 2024-06-20 18:31:07,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=251870.66666666666, ans=0.125 2024-06-20 18:31:07,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251870.66666666666, ans=0.1 2024-06-20 18:31:09,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=251870.66666666666, ans=0.125 2024-06-20 18:31:23,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=251907.33333333334, ans=0.2 2024-06-20 18:31:25,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=251907.33333333334, ans=0.025 2024-06-20 18:31:43,625 INFO [train.py:1028] (1/2) Epoch 14, batch 5900, loss[loss=0.1802, simple_loss=0.2204, pruned_loss=0.07004, over 13109.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2516, pruned_loss=0.0805, over 2577041.16 frames. 
], batch size: 121, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:31:47,015 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 1.859e+02 2.040e+02 2.234e+02 3.471e+02, threshold=4.080e+02, percent-clipped=0.0 2024-06-20 18:32:22,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=251999.0, ans=0.125 2024-06-20 18:32:23,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=251999.0, ans=0.05 2024-06-20 18:32:25,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=251999.0, ans=0.025 2024-06-20 18:32:39,297 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.77 vs. limit=10.0 2024-06-20 18:32:42,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252017.33333333334, ans=0.1 2024-06-20 18:32:46,926 INFO [train.py:1028] (1/2) Epoch 14, batch 5950, loss[loss=0.2146, simple_loss=0.254, pruned_loss=0.08758, over 13095.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2533, pruned_loss=0.08105, over 2581210.16 frames. ], batch size: 121, lr: 4.17e-03, grad_scale: 32.0 2024-06-20 18:32:47,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252035.66666666666, ans=0.1 2024-06-20 18:32:48,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. limit=6.0 2024-06-20 18:32:56,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=252054.0, ans=0.025 2024-06-20 18:33:04,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=252054.0, ans=0.125 2024-06-20 18:33:11,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=252072.33333333334, ans=0.125 2024-06-20 18:33:12,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=252072.33333333334, ans=0.125 2024-06-20 18:33:14,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=252072.33333333334, ans=0.125 2024-06-20 18:33:26,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2024-06-20 18:33:27,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=252109.0, ans=0.2 2024-06-20 18:33:27,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=252109.0, ans=15.0 2024-06-20 18:33:37,178 INFO [train.py:1028] (1/2) Epoch 14, batch 6000, loss[loss=0.2622, simple_loss=0.2944, pruned_loss=0.115, over 12231.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2545, pruned_loss=0.08147, over 2574616.65 frames. 
], batch size: 240, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:33:37,180 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 18:33:48,558 INFO [train.py:1060] (1/2) Epoch 14, validation: loss=0.1905, simple_loss=0.255, pruned_loss=0.06294, over 351949.00 frames. 2024-06-20 18:33:48,558 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB 2024-06-20 18:33:51,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=252127.33333333334, ans=0.125 2024-06-20 18:33:51,852 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.829e+02 1.927e+02 2.147e+02 2.908e+02, threshold=3.854e+02, percent-clipped=0.0 2024-06-20 18:34:00,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.38 vs. limit=10.0 2024-06-20 18:34:06,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2024-06-20 18:34:08,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252164.0, ans=0.1 2024-06-20 18:34:14,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.76 vs. limit=15.0 2024-06-20 18:34:16,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=252164.0, ans=0.2 2024-06-20 18:34:31,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=252200.66666666666, ans=0.125 2024-06-20 18:34:31,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=252200.66666666666, ans=0.125 2024-06-20 18:34:37,921 INFO [train.py:1028] (1/2) Epoch 14, batch 6050, loss[loss=0.2039, simple_loss=0.2542, pruned_loss=0.07682, over 12916.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2567, pruned_loss=0.08259, over 2577780.04 frames. ], batch size: 39, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:34:56,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=252237.33333333334, ans=0.0 2024-06-20 18:34:57,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=252255.66666666666, ans=0.025 2024-06-20 18:35:16,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=252274.0, ans=0.2 2024-06-20 18:35:19,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=252292.33333333334, ans=0.0 2024-06-20 18:35:35,585 INFO [train.py:1028] (1/2) Epoch 14, batch 6100, loss[loss=0.2168, simple_loss=0.2558, pruned_loss=0.08891, over 13125.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2573, pruned_loss=0.08262, over 2580785.44 frames. 
], batch size: 121, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:35:38,348 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.865e+02 2.039e+02 2.256e+02 2.881e+02, threshold=4.078e+02, percent-clipped=0.0 2024-06-20 18:35:59,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=252347.33333333334, ans=0.125 2024-06-20 18:36:00,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=252347.33333333334, ans=0.125 2024-06-20 18:36:03,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=252347.33333333334, ans=0.125 2024-06-20 18:36:16,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=252365.66666666666, ans=0.0 2024-06-20 18:36:28,884 INFO [train.py:1028] (1/2) Epoch 14, batch 6150, loss[loss=0.2267, simple_loss=0.2672, pruned_loss=0.09311, over 10736.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2591, pruned_loss=0.0833, over 2578726.48 frames. ], batch size: 303, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:36:31,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=252402.33333333334, ans=0.0 2024-06-20 18:36:45,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=252420.66666666666, ans=0.125 2024-06-20 18:36:50,504 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:36:59,936 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:37:03,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=252457.33333333334, ans=0.125 2024-06-20 18:37:03,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.11 vs. limit=6.0 2024-06-20 18:37:03,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.49 vs. limit=15.0 2024-06-20 18:37:18,158 INFO [train.py:1028] (1/2) Epoch 14, batch 6200, loss[loss=0.2485, simple_loss=0.3024, pruned_loss=0.09731, over 13257.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2615, pruned_loss=0.08433, over 2576964.40 frames. 
], batch size: 89, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:37:21,159 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.916e+02 2.075e+02 2.395e+02 3.221e+02, threshold=4.151e+02, percent-clipped=0.0 2024-06-20 18:37:27,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=252512.33333333334, ans=0.125 2024-06-20 18:37:36,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=252530.66666666666, ans=0.1 2024-06-20 18:37:39,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=252530.66666666666, ans=0.125 2024-06-20 18:37:39,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=252530.66666666666, ans=0.0 2024-06-20 18:37:57,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=252567.33333333334, ans=0.0 2024-06-20 18:38:03,088 INFO [train.py:1028] (1/2) Epoch 14, batch 6250, loss[loss=0.208, simple_loss=0.2618, pruned_loss=0.07708, over 13258.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2624, pruned_loss=0.08477, over 2569791.84 frames. ], batch size: 83, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:38:08,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=252585.66666666666, ans=0.025 2024-06-20 18:38:36,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=252604.0, ans=0.1 2024-06-20 18:39:04,930 INFO [train.py:1028] (1/2) Epoch 14, batch 6300, loss[loss=0.2079, simple_loss=0.2517, pruned_loss=0.08205, over 11500.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2635, pruned_loss=0.08519, over 2564461.16 frames. ], batch size: 17, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:39:07,397 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.936e+02 2.114e+02 2.455e+02 3.862e+02, threshold=4.229e+02, percent-clipped=0.0 2024-06-20 18:39:16,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=252695.66666666666, ans=0.0 2024-06-20 18:39:26,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=252714.0, ans=0.0 2024-06-20 18:39:32,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2024-06-20 18:39:45,830 INFO [train.py:1028] (1/2) Epoch 14, batch 6350, loss[loss=0.2378, simple_loss=0.2806, pruned_loss=0.09743, over 12523.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2651, pruned_loss=0.08544, over 2573558.43 frames. ], batch size: 202, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:40:04,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=252805.66666666666, ans=0.0 2024-06-20 18:40:29,991 INFO [train.py:1028] (1/2) Epoch 14, batch 6400, loss[loss=0.1856, simple_loss=0.2402, pruned_loss=0.0655, over 13284.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2664, pruned_loss=0.08591, over 2574660.96 frames. 
], batch size: 67, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:40:32,796 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.934e+02 2.095e+02 2.383e+02 3.300e+02, threshold=4.190e+02, percent-clipped=0.0 2024-06-20 18:40:36,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=252860.66666666666, ans=0.125 2024-06-20 18:40:41,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=252879.0, ans=0.025 2024-06-20 18:40:43,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=252879.0, ans=0.125 2024-06-20 18:40:45,241 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.45 vs. limit=22.5 2024-06-20 18:41:02,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=252915.66666666666, ans=0.125 2024-06-20 18:41:27,433 INFO [train.py:1028] (1/2) Epoch 14, batch 6450, loss[loss=0.2398, simple_loss=0.2779, pruned_loss=0.1008, over 12510.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2681, pruned_loss=0.08652, over 2580543.81 frames. ], batch size: 202, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:41:27,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=252952.33333333334, ans=0.1 2024-06-20 18:41:31,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=252952.33333333334, ans=0.1 2024-06-20 18:41:36,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=252970.66666666666, ans=0.04949747468305833 2024-06-20 18:42:14,922 INFO [train.py:1028] (1/2) Epoch 14, batch 6500, loss[loss=0.2175, simple_loss=0.2544, pruned_loss=0.09031, over 10729.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2696, pruned_loss=0.08688, over 2584825.33 frames. 
], batch size: 304, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:42:16,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=253044.0, ans=0.125 2024-06-20 18:42:18,000 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.907e+02 2.117e+02 2.353e+02 3.080e+02, threshold=4.235e+02, percent-clipped=0.0 2024-06-20 18:42:28,511 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:42:33,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=253080.66666666666, ans=0.125 2024-06-20 18:42:46,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=253099.0, ans=0.07 2024-06-20 18:42:47,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253099.0, ans=0.1 2024-06-20 18:42:48,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=253099.0, ans=0.0 2024-06-20 18:42:55,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=253117.33333333334, ans=0.0 2024-06-20 18:42:55,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=253117.33333333334, ans=0.125 2024-06-20 18:42:57,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=253117.33333333334, ans=0.125 2024-06-20 18:42:58,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=253117.33333333334, ans=0.1 2024-06-20 18:43:00,322 INFO [train.py:1028] (1/2) Epoch 14, batch 6550, loss[loss=0.1955, simple_loss=0.2526, pruned_loss=0.06925, over 12478.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2704, pruned_loss=0.0871, over 2588409.17 frames. ], batch size: 22, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:43:18,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=253172.33333333334, ans=0.0 2024-06-20 18:43:21,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=253172.33333333334, ans=0.5 2024-06-20 18:43:33,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253190.66666666666, ans=0.1 2024-06-20 18:43:33,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=253190.66666666666, ans=0.125 2024-06-20 18:43:46,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=253209.0, ans=0.0 2024-06-20 18:43:47,911 INFO [train.py:1028] (1/2) Epoch 14, batch 6600, loss[loss=0.2357, simple_loss=0.2945, pruned_loss=0.08844, over 13264.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2707, pruned_loss=0.0868, over 2590602.54 frames. 
], batch size: 72, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:43:50,952 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.959e+02 2.119e+02 2.253e+02 3.068e+02, threshold=4.238e+02, percent-clipped=0.0 2024-06-20 18:43:59,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=253245.66666666666, ans=0.0 2024-06-20 18:44:34,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=253282.33333333334, ans=0.0 2024-06-20 18:44:47,955 INFO [train.py:1028] (1/2) Epoch 14, batch 6650, loss[loss=0.24, simple_loss=0.2761, pruned_loss=0.102, over 12940.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2722, pruned_loss=0.08752, over 2583316.87 frames. ], batch size: 158, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:44:52,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=253319.0, ans=0.2 2024-06-20 18:44:56,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=253319.0, ans=0.125 2024-06-20 18:45:31,159 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.31 vs. limit=15.0 2024-06-20 18:45:33,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=253392.33333333334, ans=0.125 2024-06-20 18:45:37,470 INFO [train.py:1028] (1/2) Epoch 14, batch 6700, loss[loss=0.2502, simple_loss=0.2909, pruned_loss=0.1048, over 12745.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2738, pruned_loss=0.08839, over 2582131.79 frames. ], batch size: 176, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:45:40,194 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.976e+02 2.151e+02 2.424e+02 4.483e+02, threshold=4.302e+02, percent-clipped=1.0 2024-06-20 18:45:58,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=253447.33333333334, ans=0.025 2024-06-20 18:46:03,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=253447.33333333334, ans=0.125 2024-06-20 18:46:23,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=253484.0, ans=0.125 2024-06-20 18:46:23,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253484.0, ans=0.1 2024-06-20 18:46:26,127 INFO [train.py:1028] (1/2) Epoch 14, batch 6750, loss[loss=0.2764, simple_loss=0.3109, pruned_loss=0.121, over 12221.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2747, pruned_loss=0.08909, over 2576518.72 frames. ], batch size: 241, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:46:38,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=253520.66666666666, ans=0.125 2024-06-20 18:46:39,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.57 vs. 
limit=15.0 2024-06-20 18:47:01,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=253557.33333333334, ans=0.0 2024-06-20 18:47:03,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=253575.66666666666, ans=0.125 2024-06-20 18:47:04,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.96 vs. limit=22.5 2024-06-20 18:47:21,755 INFO [train.py:1028] (1/2) Epoch 14, batch 6800, loss[loss=0.1961, simple_loss=0.2533, pruned_loss=0.06948, over 13283.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2753, pruned_loss=0.08901, over 2578912.86 frames. ], batch size: 67, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:47:22,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253594.0, ans=0.1 2024-06-20 18:47:24,657 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.916e+02 2.061e+02 2.317e+02 3.955e+02, threshold=4.123e+02, percent-clipped=0.0 2024-06-20 18:47:24,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=253594.0, ans=0.125 2024-06-20 18:47:35,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=253594.0, ans=0.125 2024-06-20 18:47:36,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=253594.0, ans=0.125 2024-06-20 18:47:47,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.43 vs. limit=15.0 2024-06-20 18:48:11,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=253667.33333333334, ans=0.0 2024-06-20 18:48:14,034 INFO [train.py:1028] (1/2) Epoch 14, batch 6850, loss[loss=0.2392, simple_loss=0.2953, pruned_loss=0.09157, over 13287.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2758, pruned_loss=0.08909, over 2583464.13 frames. ], batch size: 63, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:48:20,806 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.10 vs. limit=22.5 2024-06-20 18:48:27,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=253704.0, ans=0.125 2024-06-20 18:48:42,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=253722.33333333334, ans=15.0 2024-06-20 18:48:43,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.75 vs. limit=15.0 2024-06-20 18:48:54,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.90 vs. 
limit=6.0 2024-06-20 18:48:59,703 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:49:03,150 INFO [train.py:1028] (1/2) Epoch 14, batch 6900, loss[loss=0.2299, simple_loss=0.2839, pruned_loss=0.08797, over 13040.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2771, pruned_loss=0.0895, over 2585049.89 frames. ], batch size: 48, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:49:05,680 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 1.943e+02 2.084e+02 2.319e+02 3.046e+02, threshold=4.169e+02, percent-clipped=0.0 2024-06-20 18:49:12,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=253795.66666666666, ans=0.0 2024-06-20 18:49:15,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=253795.66666666666, ans=0.2 2024-06-20 18:49:20,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.30 vs. limit=15.0 2024-06-20 18:49:31,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=253832.33333333334, ans=0.125 2024-06-20 18:49:41,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=253850.66666666666, ans=0.2 2024-06-20 18:49:42,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=253850.66666666666, ans=0.0 2024-06-20 18:49:46,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.58 vs. limit=15.0 2024-06-20 18:49:49,502 INFO [train.py:1028] (1/2) Epoch 14, batch 6950, loss[loss=0.1904, simple_loss=0.238, pruned_loss=0.0714, over 11455.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2772, pruned_loss=0.08936, over 2580496.79 frames. ], batch size: 17, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:49:53,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=253869.0, ans=0.07 2024-06-20 18:50:01,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.91 vs. limit=12.0 2024-06-20 18:50:20,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=253924.0, ans=0.05 2024-06-20 18:50:28,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=253924.0, ans=0.125 2024-06-20 18:50:29,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=253924.0, ans=0.125 2024-06-20 18:50:36,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2024-06-20 18:50:37,127 INFO [train.py:1028] (1/2) Epoch 14, batch 7000, loss[loss=0.2479, simple_loss=0.2931, pruned_loss=0.1014, over 12895.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2774, pruned_loss=0.08922, over 2575765.13 frames. 
], batch size: 158, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:50:39,098 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.963e+02 2.089e+02 2.283e+02 3.555e+02, threshold=4.179e+02, percent-clipped=0.0 2024-06-20 18:50:42,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253960.66666666666, ans=0.1 2024-06-20 18:50:46,237 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.61 vs. limit=15.0 2024-06-20 18:51:05,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=254015.66666666666, ans=0.0 2024-06-20 18:51:17,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=254034.0, ans=0.125 2024-06-20 18:51:17,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=254034.0, ans=0.2 2024-06-20 18:51:23,594 INFO [train.py:1028] (1/2) Epoch 14, batch 7050, loss[loss=0.2396, simple_loss=0.2912, pruned_loss=0.09396, over 12823.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2785, pruned_loss=0.08957, over 2581678.68 frames. ], batch size: 176, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:51:29,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=254052.33333333334, ans=0.0 2024-06-20 18:51:33,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=254070.66666666666, ans=0.125 2024-06-20 18:51:39,636 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.94 vs. limit=15.0 2024-06-20 18:51:41,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=254070.66666666666, ans=15.0 2024-06-20 18:51:43,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=254089.0, ans=0.125 2024-06-20 18:51:51,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=254107.33333333334, ans=0.1 2024-06-20 18:52:11,369 INFO [train.py:1028] (1/2) Epoch 14, batch 7100, loss[loss=0.2606, simple_loss=0.309, pruned_loss=0.1061, over 13136.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2801, pruned_loss=0.09094, over 2574471.98 frames. 
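The recurring optim.py WARNING lines report five quantiles (min, 25%, median, 75%, max) of recently observed gradient norms together with the clipping threshold in force. The printed numbers are consistent with the threshold being derived as clipping_scale times the median norm (e.g. 2.0 x 2.089e+02 is roughly the 4.179e+02 in the warning above, up to rounding). A minimal sketch of that bookkeeping follows; the window size and quantile method are assumptions, not ScaledAdam's exact internals.

import torch

# Sketch: track recent grad norms, derive threshold = clipping_scale
# * median, and report how often clipping fired. Window size and
# quantile method are assumptions for illustration.
class GradNormTracker:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms: list[float] = []
        self.num_seen = 0
        self.num_clipped = 0

    def update(self, grad_norm: float) -> float:
        self.norms = (self.norms + [grad_norm])[-self.window:]
        q = torch.quantile(
            torch.tensor(self.norms),
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
        )
        threshold = self.clipping_scale * q[2].item()
        self.num_seen += 1
        self.num_clipped += grad_norm > threshold
        print(
            "grad-norm quartiles "
            + " ".join(f"{v:.3e}" for v in q.tolist())
            + f", threshold={threshold:.3e}"
            + f", percent-clipped={100.0 * self.num_clipped / self.num_seen:.1f}"
        )
        # The caller would scale gradients by min(1, threshold / norm).
        return min(1.0, threshold / max(grad_norm, 1e-20))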
], batch size: 112, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:52:14,090 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 2.013e+02 2.250e+02 2.530e+02 3.490e+02, threshold=4.500e+02, percent-clipped=0.0 2024-06-20 18:52:23,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254162.33333333334, ans=0.1 2024-06-20 18:52:23,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=254162.33333333334, ans=0.0 2024-06-20 18:52:50,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=254217.33333333334, ans=0.0 2024-06-20 18:52:52,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254217.33333333334, ans=0.1 2024-06-20 18:52:59,094 INFO [train.py:1028] (1/2) Epoch 14, batch 7150, loss[loss=0.2704, simple_loss=0.31, pruned_loss=0.1154, over 12475.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2818, pruned_loss=0.0912, over 2573687.39 frames. ], batch size: 202, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:53:03,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=254235.66666666666, ans=0.125 2024-06-20 18:53:08,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=254254.0, ans=0.125 2024-06-20 18:53:27,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=254254.0, ans=0.125 2024-06-20 18:53:34,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=254272.33333333334, ans=0.2 2024-06-20 18:53:34,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=254272.33333333334, ans=0.125 2024-06-20 18:53:40,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=254272.33333333334, ans=0.0 2024-06-20 18:54:02,660 INFO [train.py:1028] (1/2) Epoch 14, batch 7200, loss[loss=0.2305, simple_loss=0.2856, pruned_loss=0.08765, over 13142.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2829, pruned_loss=0.09161, over 2579133.51 frames. 
], batch size: 112, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:54:05,256 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.013e+02 2.149e+02 2.413e+02 3.309e+02, threshold=4.299e+02, percent-clipped=0.0 2024-06-20 18:54:06,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=254327.33333333334, ans=0.025 2024-06-20 18:54:11,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=254345.66666666666, ans=0.025 2024-06-20 18:54:15,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=254345.66666666666, ans=0.2 2024-06-20 18:54:25,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=254364.0, ans=0.125 2024-06-20 18:54:32,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=254382.33333333334, ans=0.025 2024-06-20 18:54:39,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=254382.33333333334, ans=0.125 2024-06-20 18:54:41,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=254400.66666666666, ans=0.125 2024-06-20 18:54:41,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=254400.66666666666, ans=0.2 2024-06-20 18:54:50,756 INFO [train.py:1028] (1/2) Epoch 14, batch 7250, loss[loss=0.228, simple_loss=0.2776, pruned_loss=0.08921, over 12965.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2836, pruned_loss=0.09172, over 2579488.52 frames. ], batch size: 36, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:54:52,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=254419.0, ans=0.125 2024-06-20 18:55:05,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.54 vs. limit=15.0 2024-06-20 18:55:12,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.10 vs. limit=15.0 2024-06-20 18:55:13,202 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.57 vs. limit=10.0 2024-06-20 18:55:16,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=254455.66666666666, ans=0.1 2024-06-20 18:55:24,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=254474.0, ans=0.2 2024-06-20 18:55:24,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=254474.0, ans=0.0 2024-06-20 18:55:26,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=254474.0, ans=0.125 2024-06-20 18:55:37,680 INFO [train.py:1028] (1/2) Epoch 14, batch 7300, loss[loss=0.2287, simple_loss=0.2842, pruned_loss=0.08664, over 12945.00 frames. 
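Each train.py:1028 line breaks the objective into a simple (full-summation) transducer loss and a pruned transducer loss; the headline loss is a weighted sum of the two. With this run well past warm-up, the logged values are consistent with loss = 0.5 * simple_loss + pruned_loss (for the batch-7300 entry above: 0.5 * 0.2842 + 0.08664 = 0.2287). A sketch of that combination in the style of icefall's pruned-transducer recipes; treat the warm-up ramp constants as illustrative assumptions.

# Sketch: weighted combination of the two RNN-T loss terms the log
# reports. The 0.1 -> 1.0 ramp on the pruned term during warm-up is
# an assumption for illustration.
def combine_transducer_losses(simple_loss: float,
                              pruned_loss: float,
                              batch_idx_train: int,
                              warm_step: int = 2000,
                              simple_loss_scale: float = 0.5) -> float:
    if batch_idx_train >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        frac = batch_idx_train / warm_step
        # Lean on the simple (full-summation) loss early, then shift
        # weight onto the pruned loss as training warms up.
        s = 1.0 - frac * (1.0 - simple_loss_scale)
        p = 0.1 + 0.9 * frac
    return s * simple_loss + p * pruned_loss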
], tot_loss[loss=0.2344, simple_loss=0.2846, pruned_loss=0.09212, over 2578663.21 frames. ], batch size: 36, lr: 4.14e-03, grad_scale: 32.0 2024-06-20 18:55:40,208 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.911e+02 2.056e+02 2.245e+02 3.705e+02, threshold=4.112e+02, percent-clipped=0.0 2024-06-20 18:55:46,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254529.0, ans=0.1 2024-06-20 18:56:04,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=254547.33333333334, ans=0.1 2024-06-20 18:56:22,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=254584.0, ans=0.0 2024-06-20 18:56:35,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=254584.0, ans=22.5 2024-06-20 18:56:38,309 INFO [train.py:1028] (1/2) Epoch 14, batch 7350, loss[loss=0.2398, simple_loss=0.2932, pruned_loss=0.09323, over 13366.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2844, pruned_loss=0.09212, over 2579844.16 frames. ], batch size: 46, lr: 4.14e-03, grad_scale: 32.0 2024-06-20 18:57:04,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=254639.0, ans=0.125 2024-06-20 18:57:09,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=254657.33333333334, ans=0.125 2024-06-20 18:57:15,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=254657.33333333334, ans=15.0 2024-06-20 18:57:21,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=254675.66666666666, ans=0.0 2024-06-20 18:57:27,204 INFO [train.py:1028] (1/2) Epoch 14, batch 7400, loss[loss=0.2542, simple_loss=0.3106, pruned_loss=0.09888, over 13228.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2846, pruned_loss=0.0919, over 2585326.95 frames. ], batch size: 63, lr: 4.14e-03, grad_scale: 32.0 2024-06-20 18:57:28,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=254694.0, ans=0.125 2024-06-20 18:57:30,121 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 1.941e+02 2.018e+02 2.195e+02 3.257e+02, threshold=4.036e+02, percent-clipped=0.0 2024-06-20 18:57:33,601 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.34 vs. 
limit=15.0 2024-06-20 18:57:34,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=254694.0, ans=0.0 2024-06-20 18:57:44,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=254712.33333333334, ans=0.2 2024-06-20 18:57:50,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=254730.66666666666, ans=0.0 2024-06-20 18:57:57,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=254749.0, ans=0.2 2024-06-20 18:58:10,536 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:58:19,420 INFO [train.py:1028] (1/2) Epoch 14, batch 7450, loss[loss=0.2048, simple_loss=0.2609, pruned_loss=0.07434, over 12689.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2841, pruned_loss=0.09146, over 2580672.24 frames. ], batch size: 29, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 18:58:27,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=254785.66666666666, ans=0.1 2024-06-20 18:58:41,301 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2024-06-20 18:58:47,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=254840.66666666666, ans=0.0 2024-06-20 18:58:48,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=254840.66666666666, ans=0.2 2024-06-20 18:59:04,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.71 vs. limit=15.0 2024-06-20 18:59:07,424 INFO [train.py:1028] (1/2) Epoch 14, batch 7500, loss[loss=0.2509, simple_loss=0.2859, pruned_loss=0.108, over 10916.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2847, pruned_loss=0.09188, over 2577823.39 frames. ], batch size: 303, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 18:59:10,259 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.914e+02 2.015e+02 2.177e+02 3.048e+02, threshold=4.030e+02, percent-clipped=0.0 2024-06-20 18:59:13,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=254877.33333333334, ans=0.125 2024-06-20 18:59:47,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=254932.33333333334, ans=0.025 2024-06-20 19:00:07,219 INFO [train.py:1028] (1/2) Epoch 14, batch 7550, loss[loss=0.2312, simple_loss=0.2848, pruned_loss=0.0888, over 12955.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2856, pruned_loss=0.09256, over 2577011.41 frames. 
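The grad_scale field stepping from 32.0 to 64.0 at batch 7450 above is the dynamic fp16 loss scale: torch's GradScaler doubles the scale after a long enough run of overflow-free optimizer steps and halves it when scaled gradients overflow. A minimal sketch of those mechanics; init_scale and growth_interval here are illustrative values, not this run's settings.

import torch
from torch.cuda.amp import GradScaler, autocast

# Sketch of the fp16 loss-scaling loop behind the logged grad_scale.
# init_scale/growth_interval are assumptions for illustration.
scaler = GradScaler(init_scale=1.0, growth_factor=2.0,
                    backoff_factor=0.5, growth_interval=2000)

def train_step(model, optimizer, batch, criterion):
    optimizer.zero_grad()
    with autocast():  # run the forward pass in fp16 where safe
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales; skips the step on overflow
    scaler.update()                # grow or back off the scale
    return loss.item(), scaler.get_scale()  # latter is the logged grad_scale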
], batch size: 158, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:00:07,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=254969.0, ans=0.0 2024-06-20 19:00:07,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=254969.0, ans=0.125 2024-06-20 19:00:25,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=255005.66666666666, ans=0.0 2024-06-20 19:00:36,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.04 vs. limit=22.5 2024-06-20 19:00:52,185 INFO [train.py:1028] (1/2) Epoch 14, batch 7600, loss[loss=0.2481, simple_loss=0.2963, pruned_loss=0.09995, over 13210.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.2859, pruned_loss=0.09272, over 2576059.21 frames. ], batch size: 83, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:00:55,330 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 1.979e+02 2.176e+02 2.438e+02 3.465e+02, threshold=4.352e+02, percent-clipped=0.0 2024-06-20 19:01:00,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=255060.66666666666, ans=0.0 2024-06-20 19:01:07,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=255079.0, ans=0.125 2024-06-20 19:01:18,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=255097.33333333334, ans=0.025 2024-06-20 19:01:43,837 INFO [train.py:1028] (1/2) Epoch 14, batch 7650, loss[loss=0.2075, simple_loss=0.2609, pruned_loss=0.077, over 12903.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.2861, pruned_loss=0.0925, over 2572234.96 frames. ], batch size: 33, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:01:45,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=255152.33333333334, ans=0.5 2024-06-20 19:02:28,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=255207.33333333334, ans=0.125 2024-06-20 19:02:28,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=255207.33333333334, ans=0.0 2024-06-20 19:02:38,890 INFO [train.py:1028] (1/2) Epoch 14, batch 7700, loss[loss=0.2334, simple_loss=0.2896, pruned_loss=0.08861, over 13280.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.2867, pruned_loss=0.09273, over 2568762.59 frames. ], batch size: 63, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:02:40,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.17 vs. limit=22.5 2024-06-20 19:02:41,980 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.996e+02 2.113e+02 2.316e+02 3.307e+02, threshold=4.226e+02, percent-clipped=0.0 2024-06-20 19:02:52,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=255262.33333333334, ans=0.1 2024-06-20 19:03:01,998 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. 
limit=15.0 2024-06-20 19:03:04,246 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.12 vs. limit=15.0 2024-06-20 19:03:26,892 INFO [train.py:1028] (1/2) Epoch 14, batch 7750, loss[loss=0.2369, simple_loss=0.2905, pruned_loss=0.09168, over 13262.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.2876, pruned_loss=0.09336, over 2573616.41 frames. ], batch size: 72, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:03:40,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=255354.0, ans=0.1 2024-06-20 19:03:43,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.40 vs. limit=10.0 2024-06-20 19:04:08,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=255409.0, ans=0.125 2024-06-20 19:04:15,316 INFO [train.py:1028] (1/2) Epoch 14, batch 7800, loss[loss=0.2322, simple_loss=0.2872, pruned_loss=0.0886, over 13116.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.2882, pruned_loss=0.09364, over 2578361.14 frames. ], batch size: 95, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:04:17,988 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.003e+02 2.206e+02 2.477e+02 3.692e+02, threshold=4.412e+02, percent-clipped=0.0 2024-06-20 19:04:26,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=255445.66666666666, ans=0.025 2024-06-20 19:04:26,949 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.98 vs. limit=15.0 2024-06-20 19:04:45,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=255482.33333333334, ans=0.1 2024-06-20 19:04:52,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=255482.33333333334, ans=0.2 2024-06-20 19:04:54,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=255482.33333333334, ans=0.125 2024-06-20 19:05:13,552 INFO [train.py:1028] (1/2) Epoch 14, batch 7850, loss[loss=0.1971, simple_loss=0.2538, pruned_loss=0.07018, over 10913.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.2892, pruned_loss=0.09414, over 2571582.72 frames. ], batch size: 16, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:05:20,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.37 vs. limit=10.0 2024-06-20 19:05:21,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=255519.0, ans=0.07 2024-06-20 19:05:29,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. limit=6.0 2024-06-20 19:05:36,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.23 vs. 
limit=15.0 2024-06-20 19:05:41,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=255555.66666666666, ans=0.1 2024-06-20 19:05:47,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=255574.0, ans=0.0 2024-06-20 19:05:47,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=255574.0, ans=0.125 2024-06-20 19:06:02,093 INFO [train.py:1028] (1/2) Epoch 14, batch 7900, loss[loss=0.2498, simple_loss=0.3004, pruned_loss=0.09959, over 13190.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2897, pruned_loss=0.09435, over 2571168.44 frames. ], batch size: 77, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:06:05,212 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 2.009e+02 2.195e+02 2.375e+02 3.072e+02, threshold=4.391e+02, percent-clipped=0.0 2024-06-20 19:06:11,283 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.35 vs. limit=15.0 2024-06-20 19:06:14,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=255629.0, ans=0.125 2024-06-20 19:06:32,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=255665.66666666666, ans=0.0 2024-06-20 19:06:48,034 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=15.0 2024-06-20 19:06:51,352 INFO [train.py:1028] (1/2) Epoch 14, batch 7950, loss[loss=0.247, simple_loss=0.2911, pruned_loss=0.1014, over 10646.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.29, pruned_loss=0.09448, over 2574257.97 frames. ], batch size: 304, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:06:55,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=255702.33333333334, ans=0.0 2024-06-20 19:07:02,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=255720.66666666666, ans=0.0 2024-06-20 19:07:08,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=255720.66666666666, ans=0.0 2024-06-20 19:07:18,762 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:07:20,958 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.70 vs. limit=15.0 2024-06-20 19:07:28,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=255757.33333333334, ans=0.0 2024-06-20 19:07:35,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2024-06-20 19:07:36,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=255775.66666666666, ans=0.0 2024-06-20 19:07:40,819 INFO [train.py:1028] (1/2) Epoch 14, batch 8000, loss[loss=0.2421, simple_loss=0.2889, pruned_loss=0.09762, over 12676.00 frames. 
], tot_loss[loss=0.2397, simple_loss=0.2903, pruned_loss=0.09453, over 2571574.52 frames. ], batch size: 29, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:07:43,856 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.074e+02 2.204e+02 2.421e+02 3.264e+02, threshold=4.407e+02, percent-clipped=0.0 2024-06-20 19:07:45,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.35 vs. limit=15.0 2024-06-20 19:07:56,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=255812.33333333334, ans=0.125 2024-06-20 19:08:11,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=255830.66666666666, ans=0.125 2024-06-20 19:08:24,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=255849.0, ans=0.0 2024-06-20 19:08:32,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=255867.33333333334, ans=0.125 2024-06-20 19:08:41,530 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=8.19 vs. limit=12.0 2024-06-20 19:08:42,085 INFO [train.py:1028] (1/2) Epoch 14, batch 8050, loss[loss=0.2343, simple_loss=0.2816, pruned_loss=0.09352, over 13234.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.29, pruned_loss=0.09408, over 2571890.09 frames. ], batch size: 83, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:08:42,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=255885.66666666666, ans=0.07 2024-06-20 19:09:07,792 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:09:18,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.63 vs. limit=22.5 2024-06-20 19:09:28,692 INFO [train.py:1028] (1/2) Epoch 14, batch 8100, loss[loss=0.2363, simple_loss=0.2879, pruned_loss=0.09238, over 13156.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2906, pruned_loss=0.09463, over 2575662.14 frames. 
], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:09:31,748 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.952e+02 2.074e+02 2.243e+02 3.015e+02, threshold=4.147e+02, percent-clipped=0.0 2024-06-20 19:09:35,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=255977.33333333334, ans=0.125 2024-06-20 19:09:38,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=255995.66666666666, ans=0.1 2024-06-20 19:09:42,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=255995.66666666666, ans=0.125 2024-06-20 19:09:57,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=256032.33333333334, ans=0.125 2024-06-20 19:10:03,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256050.66666666666, ans=0.1 2024-06-20 19:10:12,765 INFO [train.py:1028] (1/2) Epoch 14, batch 8150, loss[loss=0.2568, simple_loss=0.3044, pruned_loss=0.1046, over 13114.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.29, pruned_loss=0.09384, over 2578944.26 frames. ], batch size: 121, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:10:15,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.95 vs. limit=15.0 2024-06-20 19:10:18,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=256069.0, ans=0.125 2024-06-20 19:10:20,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=256069.0, ans=0.2 2024-06-20 19:10:31,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=256105.66666666666, ans=0.125 2024-06-20 19:10:33,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=256105.66666666666, ans=0.1 2024-06-20 19:10:33,895 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:10:37,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=256105.66666666666, ans=0.125 2024-06-20 19:10:50,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=256124.0, ans=0.5 2024-06-20 19:10:54,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=256124.0, ans=0.125 2024-06-20 19:10:55,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2024-06-20 19:11:10,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=256142.33333333334, ans=0.125 2024-06-20 19:11:11,909 INFO [train.py:1028] (1/2) Epoch 14, batch 8200, loss[loss=0.2484, simple_loss=0.2932, pruned_loss=0.1018, over 13146.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2907, pruned_loss=0.09419, over 2582867.09 frames. 
], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:11:14,588 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.958e+02 2.114e+02 2.345e+02 3.112e+02, threshold=4.228e+02, percent-clipped=0.0 2024-06-20 19:11:29,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=12.0 2024-06-20 19:11:31,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=15.0 2024-06-20 19:11:35,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=256197.33333333334, ans=0.2 2024-06-20 19:11:51,082 INFO [train.py:1028] (1/2) Epoch 14, batch 8250, loss[loss=0.2378, simple_loss=0.2987, pruned_loss=0.0885, over 13232.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.2909, pruned_loss=0.09408, over 2584031.67 frames. ], batch size: 52, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:11:53,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=256252.33333333334, ans=0.125 2024-06-20 19:11:55,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256252.33333333334, ans=0.1 2024-06-20 19:11:55,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2024-06-20 19:12:04,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=256270.66666666666, ans=0.0 2024-06-20 19:12:08,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=256289.0, ans=0.0 2024-06-20 19:12:17,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=256307.33333333334, ans=0.2 2024-06-20 19:12:33,842 INFO [train.py:1028] (1/2) Epoch 14, batch 8300, loss[loss=0.225, simple_loss=0.2731, pruned_loss=0.08841, over 13021.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2906, pruned_loss=0.09393, over 2581172.66 frames. ], batch size: 102, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:12:36,563 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.000e+02 2.132e+02 2.417e+02 3.636e+02, threshold=4.264e+02, percent-clipped=0.0 2024-06-20 19:12:38,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=8.0 2024-06-20 19:12:57,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=256380.66666666666, ans=0.2 2024-06-20 19:13:18,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.62 vs. limit=15.0 2024-06-20 19:13:29,606 INFO [train.py:1028] (1/2) Epoch 14, batch 8350, loss[loss=0.2286, simple_loss=0.2809, pruned_loss=0.08813, over 13199.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.2906, pruned_loss=0.09358, over 2581273.87 frames. 
], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:13:38,788 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0 2024-06-20 19:14:13,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256509.0, ans=0.1 2024-06-20 19:14:15,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=256509.0, ans=0.2 2024-06-20 19:14:19,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=256509.0, ans=0.125 2024-06-20 19:14:21,941 INFO [train.py:1028] (1/2) Epoch 14, batch 8400, loss[loss=0.2246, simple_loss=0.2844, pruned_loss=0.08241, over 13199.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.2905, pruned_loss=0.09367, over 2579070.20 frames. ], batch size: 40, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:14:24,457 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 1.985e+02 2.142e+02 2.354e+02 3.037e+02, threshold=4.284e+02, percent-clipped=0.0 2024-06-20 19:14:27,500 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:14:34,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=256545.66666666666, ans=0.2 2024-06-20 19:14:44,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=256564.0, ans=0.025 2024-06-20 19:14:51,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=12.0 2024-06-20 19:14:52,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=256582.33333333334, ans=10.0 2024-06-20 19:14:53,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=256582.33333333334, ans=0.125 2024-06-20 19:14:56,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=256600.66666666666, ans=0.025 2024-06-20 19:14:59,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=256600.66666666666, ans=0.125 2024-06-20 19:15:00,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=256600.66666666666, ans=0.1 2024-06-20 19:15:07,206 INFO [train.py:1028] (1/2) Epoch 14, batch 8450, loss[loss=0.2469, simple_loss=0.3003, pruned_loss=0.09675, over 13183.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.2904, pruned_loss=0.09352, over 2581049.82 frames. 
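The scaling.py:1023 lines compare a per-module whitening metric against its limit (e.g. metric=4.77 vs. limit=12.0 above). The metric gauges how far the module's feature covariance is from a multiple of the identity: 1.0 for perfectly white features, larger as a few directions dominate. One way to compute such a metric, assuming a mean-squared over squared-mean eigenvalue ratio; icefall's exact formula and its choice of aggregation over groups may differ.

import torch

# Sketch: an assumed covariance-uniformity metric. Equals 1.0 when all
# covariance eigenvalues are equal (fully white), grows with anisotropy.
def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels); channels split into groups,
    # assuming group channels are contiguous.
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    metrics = []
    for g in range(num_groups):
        cov = (x[:, g, :].t() @ x[:, g, :]) / num_frames
        eigs = torch.linalg.eigvalsh(cov)
        metrics.append(((eigs ** 2).mean() / (eigs.mean() ** 2)).item())
    return max(metrics)

# Training code would compare this against the scheduled limit and,
# for instance, attach a corrective gradient only while metric > limit.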
], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:15:07,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=256619.0, ans=0.04949747468305833 2024-06-20 19:15:17,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=256637.33333333334, ans=0.125 2024-06-20 19:15:23,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256637.33333333334, ans=0.1 2024-06-20 19:15:31,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=256655.66666666666, ans=0.125 2024-06-20 19:15:40,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=256655.66666666666, ans=0.0 2024-06-20 19:15:42,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=256674.0, ans=0.05 2024-06-20 19:15:46,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=256674.0, ans=0.125 2024-06-20 19:15:48,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=256674.0, ans=0.0 2024-06-20 19:16:02,870 INFO [train.py:1028] (1/2) Epoch 14, batch 8500, loss[loss=0.2454, simple_loss=0.3018, pruned_loss=0.09444, over 12521.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.292, pruned_loss=0.09411, over 2578188.05 frames. ], batch size: 29, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:16:04,253 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2024-06-20 19:16:05,691 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.023e+02 2.176e+02 2.396e+02 3.327e+02, threshold=4.352e+02, percent-clipped=0.0 2024-06-20 19:16:30,310 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.92 vs. limit=15.0 2024-06-20 19:16:57,113 INFO [train.py:1028] (1/2) Epoch 14, batch 8550, loss[loss=0.2593, simple_loss=0.32, pruned_loss=0.09931, over 12467.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.2928, pruned_loss=0.09451, over 2576049.31 frames. ], batch size: 22, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:17:04,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=256802.33333333334, ans=0.125 2024-06-20 19:17:21,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=256839.0, ans=0.035 2024-06-20 19:17:23,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=256839.0, ans=0.0 2024-06-20 19:17:23,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. 
limit=15.0 2024-06-20 19:17:38,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=256875.66666666666, ans=0.125 2024-06-20 19:17:40,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=256875.66666666666, ans=0.2 2024-06-20 19:17:45,791 INFO [train.py:1028] (1/2) Epoch 14, batch 8600, loss[loss=0.2644, simple_loss=0.3103, pruned_loss=0.1092, over 13084.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.2935, pruned_loss=0.0949, over 2575225.28 frames. ], batch size: 121, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:17:46,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=256894.0, ans=0.125 2024-06-20 19:17:48,603 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.985e+02 2.114e+02 2.262e+02 3.380e+02, threshold=4.228e+02, percent-clipped=0.0 2024-06-20 19:17:52,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=256894.0, ans=0.2 2024-06-20 19:17:57,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=256912.33333333334, ans=0.125 2024-06-20 19:18:01,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=256912.33333333334, ans=0.025 2024-06-20 19:18:04,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=256930.66666666666, ans=0.125 2024-06-20 19:18:05,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=256930.66666666666, ans=0.125 2024-06-20 19:18:10,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=256930.66666666666, ans=0.0 2024-06-20 19:18:10,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2024-06-20 19:18:10,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=256930.66666666666, ans=0.05 2024-06-20 19:18:15,568 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.88 vs. limit=22.5 2024-06-20 19:18:21,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=256949.0, ans=0.04949747468305833 2024-06-20 19:18:31,386 INFO [train.py:1028] (1/2) Epoch 14, batch 8650, loss[loss=0.2207, simple_loss=0.2698, pruned_loss=0.0858, over 13182.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.2934, pruned_loss=0.09479, over 2577991.84 frames. 
], batch size: 103, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:18:31,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=256985.66666666666, ans=0.0 2024-06-20 19:18:32,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=256985.66666666666, ans=0.125 2024-06-20 19:19:13,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=257059.0, ans=0.0 2024-06-20 19:19:19,857 INFO [train.py:1028] (1/2) Epoch 14, batch 8700, loss[loss=0.2393, simple_loss=0.2948, pruned_loss=0.09195, over 13174.00 frames. ], tot_loss[loss=0.242, simple_loss=0.2936, pruned_loss=0.09515, over 2573687.84 frames. ], batch size: 59, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:19:22,562 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 1.970e+02 2.087e+02 2.261e+02 2.793e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-20 19:19:23,170 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.86 vs. limit=15.0 2024-06-20 19:19:46,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=257095.66666666666, ans=0.125 2024-06-20 19:19:48,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=257095.66666666666, ans=0.125 2024-06-20 19:19:57,912 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.83 vs. limit=15.0 2024-06-20 19:20:04,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=257132.33333333334, ans=0.02 2024-06-20 19:20:21,355 INFO [train.py:1028] (1/2) Epoch 14, batch 8750, loss[loss=0.2303, simple_loss=0.2787, pruned_loss=0.091, over 13056.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.2934, pruned_loss=0.09518, over 2569835.39 frames. 
], batch size: 121, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:20:21,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=257169.0, ans=0.125 2024-06-20 19:20:33,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=257187.33333333334, ans=0.025 2024-06-20 19:20:35,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=257187.33333333334, ans=0.1 2024-06-20 19:20:49,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=257224.0, ans=0.125 2024-06-20 19:20:50,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=257224.0, ans=0.02 2024-06-20 19:20:56,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=257242.33333333334, ans=0.125 2024-06-20 19:20:57,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257242.33333333334, ans=0.1 2024-06-20 19:21:04,146 INFO [train.py:1028] (1/2) Epoch 14, batch 8800, loss[loss=0.246, simple_loss=0.3039, pruned_loss=0.09401, over 13221.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.2943, pruned_loss=0.09554, over 2574163.30 frames. ], batch size: 72, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:21:06,732 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.005e+02 2.127e+02 2.367e+02 3.102e+02, threshold=4.255e+02, percent-clipped=0.0 2024-06-20 19:21:08,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257260.66666666666, ans=0.1 2024-06-20 19:21:54,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=257352.33333333334, ans=0.125 2024-06-20 19:21:55,330 INFO [train.py:1028] (1/2) Epoch 14, batch 8850, loss[loss=0.2477, simple_loss=0.2956, pruned_loss=0.09991, over 12566.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.2941, pruned_loss=0.09582, over 2562696.32 frames. ], batch size: 202, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:22:36,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=257389.0, ans=0.125 2024-06-20 19:22:38,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=257407.33333333334, ans=15.0 2024-06-20 19:22:50,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257425.66666666666, ans=0.1 2024-06-20 19:22:58,202 INFO [train.py:1028] (1/2) Epoch 14, batch 8900, loss[loss=0.2515, simple_loss=0.3053, pruned_loss=0.09889, over 12923.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.2951, pruned_loss=0.0964, over 2561553.72 frames. 
], batch size: 33, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:23:00,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.021e+02 2.216e+02 2.416e+02 3.364e+02, threshold=4.433e+02, percent-clipped=0.0 2024-06-20 19:23:20,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.39 vs. limit=6.0 2024-06-20 19:23:23,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=257499.0, ans=0.125 2024-06-20 19:23:40,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=257517.33333333334, ans=0.02 2024-06-20 19:23:43,425 INFO [train.py:1028] (1/2) Epoch 14, batch 8950, loss[loss=0.2788, simple_loss=0.3214, pruned_loss=0.1181, over 12474.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.2952, pruned_loss=0.09599, over 2562503.38 frames. ], batch size: 202, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:23:48,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=257535.66666666666, ans=0.125 2024-06-20 19:23:54,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.42 vs. limit=15.0 2024-06-20 19:24:25,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=257609.0, ans=0.0 2024-06-20 19:24:30,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=257609.0, ans=0.125 2024-06-20 19:24:32,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=257609.0, ans=0.125 2024-06-20 19:24:35,254 INFO [train.py:1028] (1/2) Epoch 14, batch 9000, loss[loss=0.2361, simple_loss=0.2877, pruned_loss=0.09231, over 13324.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.2954, pruned_loss=0.09591, over 2567155.28 frames. ], batch size: 46, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:24:35,255 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 19:24:46,192 INFO [train.py:1060] (1/2) Epoch 14, validation: loss=0.1901, simple_loss=0.255, pruned_loss=0.06264, over 351949.00 frames. 2024-06-20 19:24:46,194 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB 2024-06-20 19:24:49,190 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.039e+02 2.186e+02 2.416e+02 3.658e+02, threshold=4.372e+02, percent-clipped=0.0 2024-06-20 19:25:07,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=257664.0, ans=0.125 2024-06-20 19:25:19,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=257682.33333333334, ans=0.0 2024-06-20 19:25:19,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=257682.33333333334, ans=0.2 2024-06-20 19:25:24,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=15.0 2024-06-20 19:25:26,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=257700.66666666666, ans=0.125 2024-06-20 19:25:37,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=257700.66666666666, ans=0.2 2024-06-20 19:25:39,958 INFO [train.py:1028] (1/2) Epoch 14, batch 9050, loss[loss=0.2376, simple_loss=0.2874, pruned_loss=0.09393, over 11565.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.2963, pruned_loss=0.09605, over 2566709.92 frames. ], batch size: 17, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:26:28,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=257792.33333333334, ans=0.0 2024-06-20 19:26:33,056 INFO [train.py:1028] (1/2) Epoch 14, batch 9100, loss[loss=0.2299, simple_loss=0.2898, pruned_loss=0.08502, over 13257.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.2949, pruned_loss=0.09512, over 2568085.34 frames. ], batch size: 72, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:26:36,018 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.932e+02 2.085e+02 2.234e+02 3.364e+02, threshold=4.170e+02, percent-clipped=0.0 2024-06-20 19:26:36,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=257810.66666666666, ans=0.2 2024-06-20 19:26:36,656 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.80 vs. limit=22.5 2024-06-20 19:26:48,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=257829.0, ans=0.2 2024-06-20 19:27:15,666 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.66 vs. limit=22.5 2024-06-20 19:27:16,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=257884.0, ans=0.125 2024-06-20 19:27:19,247 INFO [train.py:1028] (1/2) Epoch 14, batch 9150, loss[loss=0.2344, simple_loss=0.2906, pruned_loss=0.08911, over 13162.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2952, pruned_loss=0.09547, over 2568466.51 frames. ], batch size: 77, lr: 4.12e-03, grad_scale: 32.0 2024-06-20 19:27:21,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=257902.33333333334, ans=15.0 2024-06-20 19:27:33,029 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=15.0 2024-06-20 19:27:33,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.64 vs. limit=15.0 2024-06-20 19:27:49,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2024-06-20 19:27:50,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.88 vs. 
limit=6.0 2024-06-20 19:27:51,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=257957.33333333334, ans=0.125 2024-06-20 19:27:55,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=257957.33333333334, ans=0.125 2024-06-20 19:27:56,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=257957.33333333334, ans=0.125 2024-06-20 19:27:59,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=257975.66666666666, ans=0.0 2024-06-20 19:28:07,187 INFO [train.py:1028] (1/2) Epoch 14, batch 9200, loss[loss=0.2388, simple_loss=0.2989, pruned_loss=0.08937, over 12966.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.2946, pruned_loss=0.09485, over 2571334.21 frames. ], batch size: 36, lr: 4.12e-03, grad_scale: 32.0 2024-06-20 19:28:09,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.13 vs. limit=12.0 2024-06-20 19:28:10,608 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.086e+02 2.246e+02 2.501e+02 3.798e+02, threshold=4.492e+02, percent-clipped=0.0 2024-06-20 19:28:16,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=258012.33333333334, ans=0.125 2024-06-20 19:28:28,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=258030.66666666666, ans=0.125 2024-06-20 19:28:32,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=258049.0, ans=0.125 2024-06-20 19:28:33,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.84 vs. limit=15.0 2024-06-20 19:28:33,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=258049.0, ans=0.125 2024-06-20 19:28:42,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258067.33333333334, ans=0.1 2024-06-20 19:28:48,512 INFO [train.py:1028] (1/2) Epoch 14, batch 9250, loss[loss=0.2547, simple_loss=0.3134, pruned_loss=0.09806, over 13230.00 frames. ], tot_loss[loss=0.242, simple_loss=0.2944, pruned_loss=0.09475, over 2572999.52 frames. ], batch size: 67, lr: 4.12e-03, grad_scale: 32.0 2024-06-20 19:29:06,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258122.33333333334, ans=0.1 2024-06-20 19:29:19,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=258140.66666666666, ans=0.0 2024-06-20 19:29:31,538 INFO [train.py:1028] (1/2) Epoch 14, batch 9300, loss[loss=0.2317, simple_loss=0.2869, pruned_loss=0.08822, over 12975.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.2945, pruned_loss=0.09487, over 2570288.82 frames. 
], batch size: 39, lr: 4.12e-03, grad_scale: 32.0 2024-06-20 19:29:35,238 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.044e+02 2.199e+02 2.506e+02 3.756e+02, threshold=4.397e+02, percent-clipped=0.0 2024-06-20 19:29:37,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258177.33333333334, ans=0.1 2024-06-20 19:29:48,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=258214.0, ans=10.0 2024-06-20 19:29:48,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.46 vs. limit=15.0 2024-06-20 19:29:55,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=258214.0, ans=0.125 2024-06-20 19:30:16,868 INFO [train.py:1028] (1/2) Epoch 14, batch 9350, loss[loss=0.2419, simple_loss=0.2953, pruned_loss=0.09426, over 12551.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.2949, pruned_loss=0.09488, over 2567139.34 frames. ], batch size: 22, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:30:23,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258269.0, ans=0.1 2024-06-20 19:30:25,601 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.86 vs. limit=15.0 2024-06-20 19:30:27,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=258287.33333333334, ans=0.125 2024-06-20 19:30:29,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=15.0 2024-06-20 19:30:36,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2024-06-20 19:30:40,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=258305.66666666666, ans=0.2 2024-06-20 19:30:40,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=258305.66666666666, ans=0.125 2024-06-20 19:30:41,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=258305.66666666666, ans=0.0 2024-06-20 19:30:47,820 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:30:58,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=258342.33333333334, ans=0.125 2024-06-20 19:31:04,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.60 vs. limit=12.0 2024-06-20 19:31:05,972 INFO [train.py:1028] (1/2) Epoch 14, batch 9400, loss[loss=0.2524, simple_loss=0.3046, pruned_loss=0.1001, over 13273.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.2957, pruned_loss=0.09556, over 2566834.42 frames. 
], batch size: 52, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:31:13,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258360.66666666666, ans=0.1 2024-06-20 19:31:14,227 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 1.993e+02 2.233e+02 2.411e+02 3.028e+02, threshold=4.466e+02, percent-clipped=0.0 2024-06-20 19:31:24,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=258379.0, ans=0.0 2024-06-20 19:31:32,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=258397.33333333334, ans=0.125 2024-06-20 19:31:33,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=258397.33333333334, ans=0.0 2024-06-20 19:31:35,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=258415.66666666666, ans=0.125 2024-06-20 19:31:49,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=258434.0, ans=0.025 2024-06-20 19:31:53,908 INFO [train.py:1028] (1/2) Epoch 14, batch 9450, loss[loss=0.2607, simple_loss=0.3067, pruned_loss=0.1074, over 12569.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.2966, pruned_loss=0.096, over 2566373.01 frames. ], batch size: 22, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:31:58,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=258452.33333333334, ans=0.0 2024-06-20 19:31:59,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=258452.33333333334, ans=0.0 2024-06-20 19:32:19,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=258507.33333333334, ans=0.125 2024-06-20 19:32:28,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258525.66666666666, ans=0.1 2024-06-20 19:32:30,299 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.52 vs. limit=6.0 2024-06-20 19:32:32,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=258525.66666666666, ans=0.125 2024-06-20 19:32:32,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=258525.66666666666, ans=0.125 2024-06-20 19:32:36,634 INFO [train.py:1028] (1/2) Epoch 14, batch 9500, loss[loss=0.2258, simple_loss=0.2884, pruned_loss=0.08159, over 13213.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.2959, pruned_loss=0.09549, over 2576124.50 frames. 
], batch size: 43, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:32:39,573 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 1.974e+02 2.064e+02 2.227e+02 2.891e+02, threshold=4.129e+02, percent-clipped=0.0 2024-06-20 19:32:41,899 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:32:47,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=258562.33333333334, ans=0.2 2024-06-20 19:32:53,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=258580.66666666666, ans=0.0 2024-06-20 19:32:56,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=258580.66666666666, ans=0.0 2024-06-20 19:32:56,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=258580.66666666666, ans=0.0 2024-06-20 19:33:17,810 INFO [train.py:1028] (1/2) Epoch 14, batch 9550, loss[loss=0.2113, simple_loss=0.2714, pruned_loss=0.07559, over 12869.00 frames. ], tot_loss[loss=0.243, simple_loss=0.2953, pruned_loss=0.09539, over 2570905.90 frames. ], batch size: 39, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:33:21,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=258635.66666666666, ans=0.07 2024-06-20 19:33:26,219 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2024-06-20 19:33:26,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=258654.0, ans=0.125 2024-06-20 19:33:30,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=258654.0, ans=0.125 2024-06-20 19:33:36,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=258672.33333333334, ans=0.125 2024-06-20 19:33:56,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=258709.0, ans=0.2 2024-06-20 19:33:58,537 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=12.0 2024-06-20 19:34:01,557 INFO [train.py:1028] (1/2) Epoch 14, batch 9600, loss[loss=0.2312, simple_loss=0.2744, pruned_loss=0.09401, over 10517.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.2949, pruned_loss=0.09536, over 2570921.75 frames. ], batch size: 303, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:34:05,119 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.982e+02 2.122e+02 2.367e+02 3.147e+02, threshold=4.243e+02, percent-clipped=0.0 2024-06-20 19:34:07,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.53 vs. 
limit=6.0 2024-06-20 19:34:09,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=258745.66666666666, ans=0.125 2024-06-20 19:34:11,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=258745.66666666666, ans=0.125 2024-06-20 19:34:14,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2024-06-20 19:34:23,585 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.90 vs. limit=15.0 2024-06-20 19:34:42,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=258800.66666666666, ans=0.125 2024-06-20 19:34:43,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=258800.66666666666, ans=0.125 2024-06-20 19:34:46,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=258800.66666666666, ans=0.125 2024-06-20 19:34:49,260 INFO [train.py:1028] (1/2) Epoch 14, batch 9650, loss[loss=0.2386, simple_loss=0.2904, pruned_loss=0.09338, over 13138.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.2951, pruned_loss=0.09596, over 2560262.87 frames. ], batch size: 132, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:35:07,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=258855.66666666666, ans=0.125 2024-06-20 19:35:31,660 INFO [train.py:1028] (1/2) Epoch 14, batch 9700, loss[loss=0.2407, simple_loss=0.2867, pruned_loss=0.09733, over 13042.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.2951, pruned_loss=0.09628, over 2555815.88 frames. ], batch size: 144, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:35:34,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=258910.66666666666, ans=0.1 2024-06-20 19:35:34,867 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.017e+02 2.177e+02 2.428e+02 3.338e+02, threshold=4.355e+02, percent-clipped=0.0 2024-06-20 19:35:50,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=258947.33333333334, ans=0.015 2024-06-20 19:35:51,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=258947.33333333334, ans=0.0 2024-06-20 19:35:51,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=258947.33333333334, ans=0.125 2024-06-20 19:36:12,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=259002.33333333334, ans=0.125 2024-06-20 19:36:13,738 INFO [train.py:1028] (1/2) Epoch 14, batch 9750, loss[loss=0.2246, simple_loss=0.2734, pruned_loss=0.08794, over 13114.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.294, pruned_loss=0.09578, over 2552748.88 frames. 
], batch size: 132, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:36:14,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=259002.33333333334, ans=0.125 2024-06-20 19:36:20,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=259002.33333333334, ans=0.125 2024-06-20 19:36:23,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=259020.66666666666, ans=0.5 2024-06-20 19:36:31,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259039.0, ans=0.1 2024-06-20 19:36:52,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=259075.66666666666, ans=0.125 2024-06-20 19:36:58,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259075.66666666666, ans=0.1 2024-06-20 19:36:59,462 INFO [train.py:1028] (1/2) Epoch 14, batch 9800, loss[loss=0.2255, simple_loss=0.2738, pruned_loss=0.08863, over 12867.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.2931, pruned_loss=0.0949, over 2545405.25 frames. ], batch size: 39, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:37:03,160 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.016e+02 2.191e+02 2.436e+02 3.252e+02, threshold=4.381e+02, percent-clipped=0.0 2024-06-20 19:37:06,558 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2024-06-20 19:37:20,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=259130.66666666666, ans=0.0 2024-06-20 19:37:27,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=259130.66666666666, ans=22.5 2024-06-20 19:37:42,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=259167.33333333334, ans=0.0 2024-06-20 19:37:45,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=259167.33333333334, ans=0.0 2024-06-20 19:37:47,557 INFO [train.py:1028] (1/2) Epoch 14, batch 9850, loss[loss=0.2303, simple_loss=0.2769, pruned_loss=0.09187, over 13037.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.2929, pruned_loss=0.09482, over 2539169.72 frames. ], batch size: 102, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:37:59,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=259204.0, ans=0.2 2024-06-20 19:37:59,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.84 vs. limit=10.0 2024-06-20 19:38:08,003 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.87 vs. 
limit=15.0 2024-06-20 19:38:12,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=259222.33333333334, ans=0.1 2024-06-20 19:38:16,089 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.21 vs. limit=15.0 2024-06-20 19:38:18,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=259240.66666666666, ans=0.2 2024-06-20 19:38:30,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=259277.33333333334, ans=0.0 2024-06-20 19:38:31,363 INFO [train.py:1028] (1/2) Epoch 14, batch 9900, loss[loss=0.2291, simple_loss=0.2859, pruned_loss=0.0862, over 12861.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2914, pruned_loss=0.09456, over 2530998.74 frames. ], batch size: 39, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:38:31,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=259277.33333333334, ans=0.125 2024-06-20 19:38:34,436 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 2.009e+02 2.184e+02 2.415e+02 3.341e+02, threshold=4.369e+02, percent-clipped=0.0 2024-06-20 19:38:34,959 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.06 vs. limit=15.0 2024-06-20 19:38:41,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=259295.66666666666, ans=0.125 2024-06-20 19:38:42,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=259295.66666666666, ans=0.125 2024-06-20 19:38:44,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=259295.66666666666, ans=0.0 2024-06-20 19:38:44,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=259295.66666666666, ans=0.125 2024-06-20 19:38:52,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=259314.0, ans=0.1 2024-06-20 19:39:01,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=259332.33333333334, ans=10.0 2024-06-20 19:39:07,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=259350.66666666666, ans=0.125 2024-06-20 19:39:11,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2024-06-20 19:39:12,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=259369.0, ans=0.125 2024-06-20 19:39:13,203 INFO [train.py:1028] (1/2) Epoch 14, batch 9950, loss[loss=0.248, simple_loss=0.3004, pruned_loss=0.09776, over 12654.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2905, pruned_loss=0.09445, over 2523174.43 frames. 
], batch size: 29, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:39:30,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=259405.66666666666, ans=0.125 2024-06-20 19:39:31,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=259405.66666666666, ans=0.0 2024-06-20 19:39:35,006 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.61 vs. limit=22.5 2024-06-20 19:39:35,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=259405.66666666666, ans=0.125 2024-06-20 19:39:41,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=259424.0, ans=0.125 2024-06-20 19:39:53,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=259442.33333333334, ans=0.125 2024-06-20 19:39:55,503 INFO [train.py:1028] (1/2) Epoch 14, batch 10000, loss[loss=0.2375, simple_loss=0.2995, pruned_loss=0.08775, over 12959.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.2909, pruned_loss=0.09466, over 2486137.78 frames. ], batch size: 23, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:40:00,766 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.067e+02 2.229e+02 2.383e+02 3.389e+02, threshold=4.459e+02, percent-clipped=0.0 2024-06-20 19:40:07,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=259479.0, ans=0.125 2024-06-20 19:40:16,076 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.01 vs. limit=10.0 2024-06-20 19:40:24,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2024-06-20 19:40:25,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=259515.66666666666, ans=0.2 2024-06-20 19:40:41,499 INFO [train.py:1028] (1/2) Epoch 14, batch 10050, loss[loss=0.2609, simple_loss=0.3032, pruned_loss=0.1093, over 12760.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.2912, pruned_loss=0.09557, over 2444450.14 frames. ], batch size: 22, lr: 4.10e-03, grad_scale: 32.0 2024-06-20 19:40:43,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.09 vs. limit=10.0 2024-06-20 19:40:48,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=259570.66666666666, ans=0.125 2024-06-20 19:41:06,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=259607.33333333334, ans=0.125 2024-06-20 19:41:06,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=259607.33333333334, ans=0.125 2024-06-20 19:41:14,158 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.74 vs. 
limit=15.0 2024-06-20 19:41:16,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.51 vs. limit=15.0 2024-06-20 19:41:18,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=259625.66666666666, ans=0.0 2024-06-20 19:41:19,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=259625.66666666666, ans=0.125 2024-06-20 19:41:21,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=259625.66666666666, ans=0.0 2024-06-20 19:41:25,628 INFO [train.py:1028] (1/2) Epoch 14, batch 10100, loss[loss=0.2165, simple_loss=0.2676, pruned_loss=0.08268, over 11355.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.2909, pruned_loss=0.09498, over 2423632.79 frames. ], batch size: 17, lr: 4.10e-03, grad_scale: 32.0 2024-06-20 19:41:29,422 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 2.008e+02 2.174e+02 2.390e+02 3.197e+02, threshold=4.349e+02, percent-clipped=0.0 2024-06-20 19:41:36,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=259662.33333333334, ans=0.2 2024-06-20 19:44:43,745 INFO [train.py:1028] (1/2) Epoch 15, batch 0, loss[loss=0.2102, simple_loss=0.2635, pruned_loss=0.07842, over 13027.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2635, pruned_loss=0.07842, over 13027.00 frames. ], batch size: 36, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:44:43,748 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 19:44:55,096 INFO [train.py:1060] (1/2) Epoch 15, validation: loss=0.1909, simple_loss=0.2562, pruned_loss=0.06283, over 351949.00 frames. 2024-06-20 19:44:55,096 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB 2024-06-20 19:44:56,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=259673.33333333334, ans=0.0 2024-06-20 19:44:59,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.97 vs. limit=6.0 2024-06-20 19:45:02,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.56 vs. limit=12.0 2024-06-20 19:45:04,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=259673.33333333334, ans=0.125 2024-06-20 19:45:22,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.02 vs. 
limit=15.0 2024-06-20 19:45:26,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=259710.0, ans=0.125 2024-06-20 19:45:35,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=259728.33333333334, ans=0.0 2024-06-20 19:45:39,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=259746.66666666666, ans=0.0 2024-06-20 19:45:43,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=259746.66666666666, ans=0.025 2024-06-20 19:45:48,095 INFO [train.py:1028] (1/2) Epoch 15, batch 50, loss[loss=0.2084, simple_loss=0.2647, pruned_loss=0.07599, over 12751.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2704, pruned_loss=0.0869, over 575330.62 frames. ], batch size: 29, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:45:51,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=259765.0, ans=0.125 2024-06-20 19:46:03,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.08 vs. limit=10.0 2024-06-20 19:46:04,266 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:46:16,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=259820.0, ans=0.0 2024-06-20 19:46:20,158 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.926e+02 2.151e+02 2.418e+02 3.077e+02, threshold=4.303e+02, percent-clipped=0.0 2024-06-20 19:46:31,720 INFO [train.py:1028] (1/2) Epoch 15, batch 100, loss[loss=0.2279, simple_loss=0.2832, pruned_loss=0.08631, over 13358.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2705, pruned_loss=0.08666, over 1018655.44 frames. ], batch size: 46, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:46:37,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=259856.66666666666, ans=0.0 2024-06-20 19:46:49,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259893.33333333334, ans=0.1 2024-06-20 19:47:18,226 INFO [train.py:1028] (1/2) Epoch 15, batch 150, loss[loss=0.219, simple_loss=0.275, pruned_loss=0.08151, over 12635.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2699, pruned_loss=0.08522, over 1365417.29 frames. ], batch size: 29, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:47:39,993 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.30 vs. limit=15.0 2024-06-20 19:48:04,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=260003.33333333334, ans=0.125 2024-06-20 19:48:06,267 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.899e+02 2.029e+02 2.225e+02 2.992e+02, threshold=4.058e+02, percent-clipped=0.0 2024-06-20 19:48:16,813 INFO [train.py:1028] (1/2) Epoch 15, batch 200, loss[loss=0.2485, simple_loss=0.2893, pruned_loss=0.1039, over 12552.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2708, pruned_loss=0.08576, over 1634611.92 frames. 
], batch size: 202, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:48:21,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=260040.0, ans=0.0 2024-06-20 19:48:23,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.95 vs. limit=15.0 2024-06-20 19:48:35,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=260076.66666666666, ans=0.0 2024-06-20 19:48:35,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=260076.66666666666, ans=0.95 2024-06-20 19:48:50,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.99 vs. limit=22.5 2024-06-20 19:48:53,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. limit=6.0 2024-06-20 19:49:02,230 INFO [train.py:1028] (1/2) Epoch 15, batch 250, loss[loss=0.1969, simple_loss=0.2433, pruned_loss=0.07523, over 13022.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2703, pruned_loss=0.08541, over 1845712.63 frames. ], batch size: 144, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:49:03,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=260131.66666666666, ans=0.125 2024-06-20 19:49:04,305 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.16 vs. limit=15.0 2024-06-20 19:49:21,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.07 vs. limit=15.0 2024-06-20 19:49:39,315 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 1.915e+02 2.085e+02 2.392e+02 3.311e+02, threshold=4.169e+02, percent-clipped=0.0 2024-06-20 19:49:50,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=260223.33333333334, ans=0.0 2024-06-20 19:49:51,278 INFO [train.py:1028] (1/2) Epoch 15, batch 300, loss[loss=0.2056, simple_loss=0.2491, pruned_loss=0.08109, over 13168.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2701, pruned_loss=0.08544, over 2008874.84 frames. ], batch size: 112, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:49:57,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=260223.33333333334, ans=0.0 2024-06-20 19:50:02,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.86 vs. limit=22.5 2024-06-20 19:50:05,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=260241.66666666666, ans=0.2 2024-06-20 19:50:08,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2024-06-20 19:50:18,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=260278.33333333334, ans=0.125 2024-06-20 19:50:27,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=260278.33333333334, ans=10.0 2024-06-20 19:50:34,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.83 vs. limit=10.0 2024-06-20 19:50:34,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=260296.66666666666, ans=0.125 2024-06-20 19:50:37,227 INFO [train.py:1028] (1/2) Epoch 15, batch 350, loss[loss=0.2337, simple_loss=0.2856, pruned_loss=0.09089, over 12818.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2702, pruned_loss=0.08547, over 2138263.01 frames. ], batch size: 33, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:50:51,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=22.5 2024-06-20 19:51:10,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=260351.66666666666, ans=10.0 2024-06-20 19:51:14,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2024-06-20 19:51:16,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=260370.0, ans=0.125 2024-06-20 19:51:20,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.29 vs. limit=12.0 2024-06-20 19:51:24,625 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.858e+02 1.982e+02 2.170e+02 2.582e+02, threshold=3.964e+02, percent-clipped=0.0 2024-06-20 19:51:33,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=260388.33333333334, ans=0.125 2024-06-20 19:51:36,027 INFO [train.py:1028] (1/2) Epoch 15, batch 400, loss[loss=0.1995, simple_loss=0.2549, pruned_loss=0.07208, over 13267.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2703, pruned_loss=0.08496, over 2238440.87 frames. ], batch size: 63, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:51:37,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=260406.66666666666, ans=0.125 2024-06-20 19:52:01,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=260461.66666666666, ans=0.125 2024-06-20 19:52:06,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.89 vs. limit=15.0 2024-06-20 19:52:18,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=260498.33333333334, ans=0.0 2024-06-20 19:52:19,425 INFO [train.py:1028] (1/2) Epoch 15, batch 450, loss[loss=0.2153, simple_loss=0.2673, pruned_loss=0.08168, over 13195.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2699, pruned_loss=0.08471, over 2313657.77 frames. 
], batch size: 67, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:52:29,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=260516.66666666666, ans=0.0 2024-06-20 19:52:43,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=260535.0, ans=0.2 2024-06-20 19:52:44,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.67 vs. limit=22.5 2024-06-20 19:52:53,593 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 1.904e+02 2.022e+02 2.165e+02 2.681e+02, threshold=4.044e+02, percent-clipped=0.0 2024-06-20 19:53:00,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=260571.66666666666, ans=0.125 2024-06-20 19:53:04,219 INFO [train.py:1028] (1/2) Epoch 15, batch 500, loss[loss=0.2075, simple_loss=0.2527, pruned_loss=0.08117, over 13082.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2701, pruned_loss=0.08439, over 2375897.37 frames. ], batch size: 121, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:53:17,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=260608.33333333334, ans=0.125 2024-06-20 19:53:20,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=260608.33333333334, ans=0.125 2024-06-20 19:53:23,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=260626.66666666666, ans=0.07 2024-06-20 19:53:43,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=260645.0, ans=0.125 2024-06-20 19:53:45,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260645.0, ans=0.1 2024-06-20 19:53:48,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=260663.33333333334, ans=0.125 2024-06-20 19:53:51,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=260663.33333333334, ans=0.2 2024-06-20 19:54:01,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=260663.33333333334, ans=0.0 2024-06-20 19:54:03,727 INFO [train.py:1028] (1/2) Epoch 15, batch 550, loss[loss=0.2259, simple_loss=0.2736, pruned_loss=0.08912, over 12936.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2696, pruned_loss=0.08431, over 2420878.39 frames. ], batch size: 158, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:54:04,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. 
limit=15.0 2024-06-20 19:54:07,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=260681.66666666666, ans=0.125 2024-06-20 19:54:07,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=260681.66666666666, ans=0.0 2024-06-20 19:54:08,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=260681.66666666666, ans=0.125 2024-06-20 19:54:35,863 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.848e+02 1.989e+02 2.135e+02 3.037e+02, threshold=3.979e+02, percent-clipped=0.0 2024-06-20 19:54:47,074 INFO [train.py:1028] (1/2) Epoch 15, batch 600, loss[loss=0.2143, simple_loss=0.2528, pruned_loss=0.08792, over 13029.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2695, pruned_loss=0.08445, over 2458467.99 frames. ], batch size: 144, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:54:52,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=260773.33333333334, ans=10.0 2024-06-20 19:55:05,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=260810.0, ans=0.09899494936611666 2024-06-20 19:55:11,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.69 vs. limit=15.0 2024-06-20 19:55:18,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.95 vs. limit=15.0 2024-06-20 19:55:34,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2024-06-20 19:55:34,856 INFO [train.py:1028] (1/2) Epoch 15, batch 650, loss[loss=0.233, simple_loss=0.2812, pruned_loss=0.09243, over 13171.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2701, pruned_loss=0.0845, over 2490347.81 frames. ], batch size: 59, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:55:46,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=260883.33333333334, ans=0.125 2024-06-20 19:55:56,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=260901.66666666666, ans=0.125 2024-06-20 19:56:01,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=260920.0, ans=0.07 2024-06-20 19:56:07,229 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 1.924e+02 2.087e+02 2.236e+02 3.152e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-20 19:56:08,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260938.33333333334, ans=0.1 2024-06-20 19:56:17,513 INFO [train.py:1028] (1/2) Epoch 15, batch 700, loss[loss=0.2278, simple_loss=0.2847, pruned_loss=0.08544, over 13319.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.27, pruned_loss=0.08454, over 2512642.43 frames. ], batch size: 46, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:56:26,795 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.31 vs. 
limit=12.0 2024-06-20 19:56:28,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=260975.0, ans=0.125 2024-06-20 19:56:28,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=260975.0, ans=0.125 2024-06-20 19:56:50,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=260993.33333333334, ans=0.0 2024-06-20 19:57:16,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=261030.0, ans=0.125 2024-06-20 19:57:17,737 INFO [train.py:1028] (1/2) Epoch 15, batch 750, loss[loss=0.2044, simple_loss=0.2664, pruned_loss=0.07123, over 13248.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2703, pruned_loss=0.08451, over 2526914.49 frames. ], batch size: 63, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:57:19,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=261048.33333333334, ans=0.125 2024-06-20 19:57:20,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=261048.33333333334, ans=0.125 2024-06-20 19:57:21,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=261048.33333333334, ans=0.2 2024-06-20 19:57:27,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=261066.66666666666, ans=0.035 2024-06-20 19:57:32,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=261066.66666666666, ans=0.125 2024-06-20 19:57:35,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=261085.0, ans=0.125 2024-06-20 19:57:45,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=261103.33333333334, ans=0.1 2024-06-20 19:57:51,514 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.880e+02 1.964e+02 2.128e+02 3.494e+02, threshold=3.927e+02, percent-clipped=0.0 2024-06-20 19:58:02,801 INFO [train.py:1028] (1/2) Epoch 15, batch 800, loss[loss=0.2036, simple_loss=0.2622, pruned_loss=0.07252, over 12903.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2704, pruned_loss=0.08465, over 2540175.53 frames. ], batch size: 36, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:58:26,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=261176.66666666666, ans=0.0 2024-06-20 19:58:50,653 INFO [train.py:1028] (1/2) Epoch 15, batch 850, loss[loss=0.2001, simple_loss=0.2538, pruned_loss=0.07324, over 13128.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2697, pruned_loss=0.08418, over 2551822.48 frames. ], batch size: 95, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:58:54,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=261231.66666666666, ans=0.0 2024-06-20 19:58:55,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.93 vs. 
limit=15.0 2024-06-20 19:58:59,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=261250.0, ans=0.0 2024-06-20 19:59:01,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=261250.0, ans=0.125 2024-06-20 19:59:15,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=261268.33333333334, ans=0.125 2024-06-20 19:59:17,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=261268.33333333334, ans=0.0 2024-06-20 19:59:20,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=261286.66666666666, ans=0.05 2024-06-20 19:59:25,705 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.894e+02 2.082e+02 2.268e+02 3.992e+02, threshold=4.164e+02, percent-clipped=1.0 2024-06-20 19:59:35,095 INFO [train.py:1028] (1/2) Epoch 15, batch 900, loss[loss=0.2033, simple_loss=0.2498, pruned_loss=0.07843, over 12875.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2692, pruned_loss=0.08429, over 2556460.60 frames. ], batch size: 36, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:59:59,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=261360.0, ans=0.1 2024-06-20 20:00:12,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.55 vs. limit=15.0 2024-06-20 20:00:21,421 INFO [train.py:1028] (1/2) Epoch 15, batch 950, loss[loss=0.2102, simple_loss=0.2627, pruned_loss=0.07885, over 13250.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2696, pruned_loss=0.08448, over 2559905.72 frames. ], batch size: 40, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 20:00:21,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=261415.0, ans=0.125 2024-06-20 20:00:21,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=261415.0, ans=0.0 2024-06-20 20:00:22,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=261415.0, ans=0.125 2024-06-20 20:00:27,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261415.0, ans=0.1 2024-06-20 20:00:35,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=261433.33333333334, ans=0.125 2024-06-20 20:00:46,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=261451.66666666666, ans=0.2 2024-06-20 20:00:56,555 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.926e+02 2.023e+02 2.207e+02 2.761e+02, threshold=4.046e+02, percent-clipped=0.0 2024-06-20 20:01:05,806 INFO [train.py:1028] (1/2) Epoch 15, batch 1000, loss[loss=0.2036, simple_loss=0.2549, pruned_loss=0.07615, over 13304.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.27, pruned_loss=0.08486, over 2561972.99 frames. 
], batch size: 49, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 20:01:08,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261506.66666666666, ans=0.1 2024-06-20 20:01:12,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=261506.66666666666, ans=0.05 2024-06-20 20:01:23,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=261543.33333333334, ans=0.125 2024-06-20 20:01:26,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=261543.33333333334, ans=0.0 2024-06-20 20:01:36,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=261561.66666666666, ans=0.125 2024-06-20 20:01:45,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.54 vs. limit=22.5 2024-06-20 20:01:48,463 INFO [train.py:1028] (1/2) Epoch 15, batch 1050, loss[loss=0.1946, simple_loss=0.2499, pruned_loss=0.06966, over 13216.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2703, pruned_loss=0.08487, over 2564614.90 frames. ], batch size: 77, lr: 3.95e-03, grad_scale: 64.0 2024-06-20 20:01:48,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=261598.33333333334, ans=0.0 2024-06-20 20:02:07,261 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.87 vs. limit=22.5 2024-06-20 20:02:13,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=261635.0, ans=0.125 2024-06-20 20:02:20,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261635.0, ans=0.1 2024-06-20 20:02:21,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.21 vs. limit=22.5 2024-06-20 20:02:36,109 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 1.913e+02 2.045e+02 2.185e+02 3.085e+02, threshold=4.089e+02, percent-clipped=0.0 2024-06-20 20:02:45,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261671.66666666666, ans=0.1 2024-06-20 20:02:48,000 INFO [train.py:1028] (1/2) Epoch 15, batch 1100, loss[loss=0.2302, simple_loss=0.2789, pruned_loss=0.09072, over 13218.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2709, pruned_loss=0.08502, over 2570583.57 frames. ], batch size: 52, lr: 3.95e-03, grad_scale: 64.0 2024-06-20 20:02:48,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=261690.0, ans=10.0 2024-06-20 20:02:59,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.34 vs. limit=22.5 2024-06-20 20:03:04,354 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.37 vs. 
limit=15.0 2024-06-20 20:03:08,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=261726.66666666666, ans=0.125 2024-06-20 20:03:19,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.81 vs. limit=10.0 2024-06-20 20:03:26,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=261763.33333333334, ans=0.0 2024-06-20 20:03:31,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.52 vs. limit=10.0 2024-06-20 20:03:31,476 INFO [train.py:1028] (1/2) Epoch 15, batch 1150, loss[loss=0.221, simple_loss=0.2783, pruned_loss=0.08187, over 13304.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2706, pruned_loss=0.08472, over 2572187.69 frames. ], batch size: 52, lr: 3.95e-03, grad_scale: 64.0 2024-06-20 20:03:31,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=261781.66666666666, ans=0.04949747468305833 2024-06-20 20:03:31,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=261781.66666666666, ans=0.02 2024-06-20 20:03:36,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=261781.66666666666, ans=0.0 2024-06-20 20:03:39,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=261800.0, ans=0.0 2024-06-20 20:03:41,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=261800.0, ans=0.025 2024-06-20 20:04:04,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=261836.66666666666, ans=0.0 2024-06-20 20:04:06,581 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 1.935e+02 2.092e+02 2.352e+02 3.092e+02, threshold=4.184e+02, percent-clipped=0.0 2024-06-20 20:04:18,368 INFO [train.py:1028] (1/2) Epoch 15, batch 1200, loss[loss=0.2137, simple_loss=0.2688, pruned_loss=0.07933, over 13193.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2714, pruned_loss=0.08545, over 2574449.64 frames. 
], batch size: 77, lr: 3.95e-03, grad_scale: 64.0 2024-06-20 20:04:27,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=261891.66666666666, ans=0.125 2024-06-20 20:04:30,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=261891.66666666666, ans=0.0 2024-06-20 20:04:34,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=261891.66666666666, ans=0.2 2024-06-20 20:04:37,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=261910.0, ans=0.2 2024-06-20 20:04:45,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261928.33333333334, ans=0.1 2024-06-20 20:04:47,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=261928.33333333334, ans=0.125 2024-06-20 20:04:52,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=261946.66666666666, ans=0.125 2024-06-20 20:05:07,641 INFO [train.py:1028] (1/2) Epoch 15, batch 1250, loss[loss=0.219, simple_loss=0.2653, pruned_loss=0.08638, over 13164.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2702, pruned_loss=0.08493, over 2583767.21 frames. ], batch size: 112, lr: 3.95e-03, grad_scale: 64.0 2024-06-20 20:05:32,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=262001.66666666666, ans=0.125 2024-06-20 20:05:35,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=262001.66666666666, ans=0.125 2024-06-20 20:05:37,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=262001.66666666666, ans=0.125 2024-06-20 20:05:48,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.65 vs. limit=15.0 2024-06-20 20:05:49,144 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 1.894e+02 1.992e+02 2.183e+02 2.926e+02, threshold=3.983e+02, percent-clipped=0.0 2024-06-20 20:05:59,645 INFO [train.py:1028] (1/2) Epoch 15, batch 1300, loss[loss=0.247, simple_loss=0.2895, pruned_loss=0.1022, over 12764.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2702, pruned_loss=0.08491, over 2583640.18 frames. ], batch size: 176, lr: 3.95e-03, grad_scale: 64.0 2024-06-20 20:06:03,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=262056.66666666666, ans=0.05 2024-06-20 20:06:09,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=262075.0, ans=0.0 2024-06-20 20:06:14,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.21 vs. 
limit=12.0 2024-06-20 20:06:19,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=262093.33333333334, ans=0.125 2024-06-20 20:06:41,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=262130.0, ans=0.125 2024-06-20 20:06:43,982 INFO [train.py:1028] (1/2) Epoch 15, batch 1350, loss[loss=0.2116, simple_loss=0.2653, pruned_loss=0.07898, over 13234.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2698, pruned_loss=0.08489, over 2584856.99 frames. ], batch size: 59, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:06:47,350 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=28.17 vs. limit=22.5 2024-06-20 20:06:54,301 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.04 vs. limit=22.5 2024-06-20 20:07:05,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=262185.0, ans=0.0 2024-06-20 20:07:20,306 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.887e+02 2.003e+02 2.204e+02 2.767e+02, threshold=4.006e+02, percent-clipped=0.0 2024-06-20 20:07:32,001 INFO [train.py:1028] (1/2) Epoch 15, batch 1400, loss[loss=0.2179, simple_loss=0.2653, pruned_loss=0.08528, over 12600.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2698, pruned_loss=0.08495, over 2586891.50 frames. ], batch size: 25, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:07:34,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=262240.0, ans=10.0 2024-06-20 20:07:40,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.27 vs. limit=10.0 2024-06-20 20:07:49,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=262276.6666666667, ans=0.125 2024-06-20 20:08:09,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=262313.3333333333, ans=0.125 2024-06-20 20:08:16,810 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:08:19,036 INFO [train.py:1028] (1/2) Epoch 15, batch 1450, loss[loss=0.1943, simple_loss=0.2484, pruned_loss=0.07008, over 13131.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2702, pruned_loss=0.08515, over 2587445.12 frames. ], batch size: 121, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:08:22,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=262331.6666666667, ans=0.0 2024-06-20 20:08:27,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.93 vs. 
limit=15.0 2024-06-20 20:08:37,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=262368.3333333333, ans=0.125 2024-06-20 20:08:49,856 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.940e+02 2.041e+02 2.225e+02 3.024e+02, threshold=4.082e+02, percent-clipped=0.0 2024-06-20 20:08:53,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=262405.0, ans=0.125 2024-06-20 20:08:58,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=262405.0, ans=0.125 2024-06-20 20:08:59,957 INFO [train.py:1028] (1/2) Epoch 15, batch 1500, loss[loss=0.2196, simple_loss=0.2751, pruned_loss=0.08201, over 13201.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2705, pruned_loss=0.08558, over 2589363.38 frames. ], batch size: 83, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:09:05,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=262423.3333333333, ans=0.125 2024-06-20 20:09:05,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=262423.3333333333, ans=0.0 2024-06-20 20:09:20,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=262460.0, ans=0.0 2024-06-20 20:09:21,282 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.83 vs. limit=15.0 2024-06-20 20:09:33,623 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:09:40,008 INFO [train.py:1028] (1/2) Epoch 15, batch 1550, loss[loss=0.2203, simple_loss=0.2698, pruned_loss=0.08542, over 13150.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2703, pruned_loss=0.08535, over 2585550.10 frames. ], batch size: 103, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:09:59,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=262551.6666666667, ans=0.0 2024-06-20 20:10:03,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262551.6666666667, ans=0.1 2024-06-20 20:10:22,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=262570.0, ans=0.125 2024-06-20 20:10:22,556 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.886e+02 2.023e+02 2.217e+02 2.941e+02, threshold=4.046e+02, percent-clipped=0.0 2024-06-20 20:10:22,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=262570.0, ans=0.0 2024-06-20 20:10:23,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=262588.3333333333, ans=0.0 2024-06-20 20:10:24,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=262588.3333333333, ans=0.015 2024-06-20 20:10:34,014 INFO [train.py:1028] (1/2) Epoch 15, batch 1600, loss[loss=0.2038, simple_loss=0.26, pruned_loss=0.07384, over 13193.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2708, pruned_loss=0.08557, over 2580302.34 frames. 
], batch size: 77, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:10:37,474 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.17 vs. limit=22.5 2024-06-20 20:11:00,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.66 vs. limit=22.5 2024-06-20 20:11:02,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=262643.3333333333, ans=0.2 2024-06-20 20:11:03,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=262643.3333333333, ans=0.125 2024-06-20 20:11:26,164 INFO [train.py:1028] (1/2) Epoch 15, batch 1650, loss[loss=0.2173, simple_loss=0.266, pruned_loss=0.08433, over 13134.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2706, pruned_loss=0.08563, over 2576058.78 frames. ], batch size: 95, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:11:27,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=262698.3333333333, ans=0.0 2024-06-20 20:11:49,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=262735.0, ans=0.125 2024-06-20 20:11:59,439 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.935e+02 2.155e+02 2.394e+02 3.573e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-20 20:12:11,165 INFO [train.py:1028] (1/2) Epoch 15, batch 1700, loss[loss=0.2217, simple_loss=0.2824, pruned_loss=0.08052, over 12526.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2706, pruned_loss=0.08516, over 2581815.94 frames. ], batch size: 25, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:12:31,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=262826.6666666667, ans=0.0 2024-06-20 20:12:35,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=262826.6666666667, ans=0.09899494936611666 2024-06-20 20:12:44,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.59 vs. limit=15.0 2024-06-20 20:12:46,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=262845.0, ans=0.125 2024-06-20 20:12:59,174 INFO [train.py:1028] (1/2) Epoch 15, batch 1750, loss[loss=0.2131, simple_loss=0.2674, pruned_loss=0.07941, over 12503.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2705, pruned_loss=0.08499, over 2582313.53 frames. ], batch size: 22, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:13:09,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262900.0, ans=0.1 2024-06-20 20:13:16,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. 
limit=10.0 2024-06-20 20:13:35,045 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.928e+02 2.050e+02 2.259e+02 3.168e+02, threshold=4.101e+02, percent-clipped=0.0 2024-06-20 20:13:36,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=262955.0, ans=0.0 2024-06-20 20:13:40,409 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0 2024-06-20 20:13:54,730 INFO [train.py:1028] (1/2) Epoch 15, batch 1800, loss[loss=0.2236, simple_loss=0.2752, pruned_loss=0.08601, over 13226.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2699, pruned_loss=0.08483, over 2582716.08 frames. ], batch size: 67, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:13:58,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=262973.3333333333, ans=0.2 2024-06-20 20:14:02,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.36 vs. limit=15.0 2024-06-20 20:14:08,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=262991.6666666667, ans=0.125 2024-06-20 20:14:37,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2024-06-20 20:14:38,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=263046.6666666667, ans=0.0 2024-06-20 20:14:40,775 INFO [train.py:1028] (1/2) Epoch 15, batch 1850, loss[loss=0.2069, simple_loss=0.2596, pruned_loss=0.07706, over 13184.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2705, pruned_loss=0.08509, over 2583966.46 frames. ], batch size: 83, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:14:46,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263065.0, ans=0.1 2024-06-20 20:15:07,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=263101.6666666667, ans=0.0 2024-06-20 20:15:14,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=263120.0, ans=0.125 2024-06-20 20:15:15,926 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.904e+02 2.041e+02 2.213e+02 2.706e+02, threshold=4.081e+02, percent-clipped=0.0 2024-06-20 20:15:21,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=263138.3333333333, ans=0.0 2024-06-20 20:15:23,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=263138.3333333333, ans=0.0 2024-06-20 20:15:28,000 INFO [train.py:1028] (1/2) Epoch 15, batch 1900, loss[loss=0.2174, simple_loss=0.2641, pruned_loss=0.08533, over 13200.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2702, pruned_loss=0.08513, over 2586230.00 frames. ], batch size: 95, lr: 3.94e-03, grad_scale: 64.0 2024-06-20 20:15:43,031 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.54 vs. 
limit=15.0
2024-06-20 20:15:52,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=263193.3333333333, ans=0.04949747468305833
2024-06-20 20:15:56,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=263211.6666666667, ans=0.2
2024-06-20 20:16:20,276 INFO [train.py:1028] (1/2) Epoch 15, batch 1950, loss[loss=0.2015, simple_loss=0.2558, pruned_loss=0.07366, over 13295.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2697, pruned_loss=0.08525, over 2591651.71 frames. ], batch size: 52, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:16:23,668 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.36 vs. limit=22.5
2024-06-20 20:16:46,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=263303.3333333333, ans=0.04949747468305833
2024-06-20 20:16:51,990 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.924e+02 2.038e+02 2.224e+02 2.942e+02, threshold=4.076e+02, percent-clipped=0.0
2024-06-20 20:16:53,181 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=15.0
2024-06-20 20:17:01,450 INFO [train.py:1028] (1/2) Epoch 15, batch 2000, loss[loss=0.2204, simple_loss=0.282, pruned_loss=0.07937, over 12524.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2696, pruned_loss=0.08525, over 2587763.03 frames. ], batch size: 22, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:17:31,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=263395.0, ans=0.125
2024-06-20 20:17:41,676 INFO [train.py:1028] (1/2) Epoch 15, batch 2050, loss[loss=0.2265, simple_loss=0.2765, pruned_loss=0.08821, over 12588.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2696, pruned_loss=0.08528, over 2583382.53 frames. ], batch size: 29, lr: 3.94e-03, grad_scale: 64.0
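The WARNING [optim.py:487] lines are periodic gradient-clipping diagnostics. The five numbers after "grad-norm quartiles" read as the minimum, 25th percentile, median, 75th percentile and maximum of recently observed gradient norms, and in every instance in this section the printed threshold equals Clipping_scale times the median, up to display rounding (here 2.0 x 2.038e+02 = 4.076e+02). Below is a minimal sketch of how such a diagnostic could be produced; the window size, reporting interval and exact clipping rule are illustrative assumptions, not the actual icefall/ScaledAdam implementation.

```python
import torch

class GradNormClipper:
    """Median-based gradient clipping with periodic diagnostics.

    Sketch only: keep the most recent gradient norms, clip to
    clipping_scale * median, and report quantiles in the style of the
    optim.py WARNING lines. Window size and reporting interval are
    assumptions for illustration.
    """

    def __init__(self, clipping_scale=2.0, window=1000, report_every=50):
        self.clipping_scale = clipping_scale
        self.window = window
        self.report_every = report_every
        self.norms = []        # recent total gradient norms
        self.step_count = 0
        self.num_clipped = 0

    def step(self, parameters):
        parameters = [p for p in parameters if p.grad is not None]
        # Total norm = L2 norm of the per-parameter gradient norms.
        norm = torch.norm(torch.stack([p.grad.norm() for p in parameters])).item()
        self.norms = (self.norms + [norm])[-self.window:]
        median = sorted(self.norms)[len(self.norms) // 2]
        threshold = self.clipping_scale * median
        if norm > threshold:
            self.num_clipped += 1
            for p in parameters:
                p.grad.mul_(threshold / norm)
        self.step_count += 1
        if self.step_count % self.report_every == 0:
            q = torch.quantile(torch.tensor(self.norms),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            print(f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
                  + " ".join(f"{v:.3e}" for v in q.tolist())
                  + f", threshold={threshold:.3e}, "
                  + f"percent-clipped={100 * self.num_clipped / self.step_count:.1f}")
```

Clipping against a running median rather than a fixed constant keeps the threshold meaningful as gradient magnitudes drift over training, which is consistent with the printed thresholds here tracking the quartiles as they move.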
2024-06-20 20:17:50,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=263450.0, ans=0.0
2024-06-20 20:17:56,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=263450.0, ans=0.125
2024-06-20 20:18:03,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=263468.3333333333, ans=0.125
2024-06-20 20:18:12,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=263486.6666666667, ans=0.125
2024-06-20 20:18:17,390 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 1.889e+02 2.063e+02 2.171e+02 2.727e+02, threshold=4.126e+02, percent-clipped=0.0
2024-06-20 20:18:20,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=263505.0, ans=0.0
2024-06-20 20:18:21,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=263505.0, ans=0.025
2024-06-20 20:18:25,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=263505.0, ans=0.0
2024-06-20 20:18:28,516 INFO [train.py:1028] (1/2) Epoch 15, batch 2100, loss[loss=0.2224, simple_loss=0.2784, pruned_loss=0.08319, over 13207.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2705, pruned_loss=0.08529, over 2586213.45 frames. ], batch size: 59, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:18:38,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=263541.6666666667, ans=0.0
2024-06-20 20:18:48,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=263560.0, ans=0.0
2024-06-20 20:19:02,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.62 vs. limit=15.0
2024-06-20 20:19:22,948 INFO [train.py:1028] (1/2) Epoch 15, batch 2150, loss[loss=0.2223, simple_loss=0.278, pruned_loss=0.08336, over 13243.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2705, pruned_loss=0.085, over 2589422.06 frames. ], batch size: 52, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:19:34,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=263615.0, ans=0.0
2024-06-20 20:19:41,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=263633.3333333333, ans=0.0
2024-06-20 20:19:45,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=263651.6666666667, ans=0.0
2024-06-20 20:19:47,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=263651.6666666667, ans=0.125
2024-06-20 20:19:49,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=263651.6666666667, ans=0.1
2024-06-20 20:19:58,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=263670.0, ans=0.2
2024-06-20 20:20:01,807 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 1.960e+02 2.157e+02 2.365e+02 2.902e+02, threshold=4.314e+02, percent-clipped=0.0
2024-06-20 20:20:06,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=263688.3333333333, ans=0.0
2024-06-20 20:20:11,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=263688.3333333333, ans=0.1
2024-06-20 20:20:13,700 INFO [train.py:1028] (1/2) Epoch 15, batch 2200, loss[loss=0.2328, simple_loss=0.2742, pruned_loss=0.09573, over 13231.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2711, pruned_loss=0.0856, over 2589767.38 frames. ], batch size: 83, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:20:17,606 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 20:20:29,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.70 vs. limit=22.5
2024-06-20 20:20:49,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=263761.6666666667, ans=0.125
2024-06-20 20:20:53,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=263780.0, ans=0.125
2024-06-20 20:21:01,260 INFO [train.py:1028] (1/2) Epoch 15, batch 2250, loss[loss=0.2101, simple_loss=0.2608, pruned_loss=0.07972, over 13256.00 frames. ], tot_loss[loss=0.221, simple_loss=0.271, pruned_loss=0.08549, over 2587846.48 frames. ], batch size: 63, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:21:13,246 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5
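The Whitening lines from scaling.py:1023 track, per module, how far the covariance of that module's output is from isotropic ("white"). On a plausible reading of the numbers (an assumption about the implementation, not something this log states), the metric is 1.0 when all channel directions carry equal variance and grows as variance concentrates in fewer directions; each line simply records the current value against the limit the whitening penalty enforces. A sketch of such an anisotropy metric:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Anisotropy of the channel covariance of ``x``.

    ``x`` has shape (..., num_channels); channels are split into
    ``num_groups`` groups, mirroring the num_groups/num_channels fields
    in the log lines. Returns 1.0 when each group's covariance is
    proportional to the identity, larger as variance concentrates in
    fewer directions. A sketch in the spirit of scaling.py, not a
    verbatim copy of it.
    """
    num_channels = x.shape[-1]
    x = x.reshape(-1, num_groups, num_channels // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)
    # Per-group channel covariance, shape (num_groups, c, c).
    cov = torch.matmul(x.transpose(1, 2), x) / x.shape[1]
    # trace(C^2) * c / trace(C)^2 >= 1, with equality iff C = const * I
    # (Cauchy-Schwarz on the eigenvalues).
    num = (cov * cov).sum(dim=(1, 2)) * cov.shape[-1]
    den = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) ** 2
    return (num / den).mean().item()

# Nearly-white features give a metric close to 1, far below a limit like 22.5:
print(whitening_metric(torch.randn(1000, 256), num_groups=1))
```

Under this reading, a value like metric=21.70 vs. limit=22.5 means the self-attention output is close to, but still inside, the allowed degree of anisotropy.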
2024-06-20 20:21:34,969 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 1.883e+02 2.003e+02 2.199e+02 2.864e+02, threshold=4.005e+02, percent-clipped=0.0
2024-06-20 20:21:35,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=263853.3333333333, ans=0.0
2024-06-20 20:21:43,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=263871.6666666667, ans=0.1
2024-06-20 20:21:46,720 INFO [train.py:1028] (1/2) Epoch 15, batch 2300, loss[loss=0.2435, simple_loss=0.2846, pruned_loss=0.1012, over 12942.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.271, pruned_loss=0.08539, over 2581328.44 frames. ], batch size: 33, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:21:50,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=263890.0, ans=0.1
2024-06-20 20:22:01,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=263890.0, ans=0.125
2024-06-20 20:22:02,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=263908.3333333333, ans=0.125
2024-06-20 20:22:11,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=263926.6666666667, ans=0.025
2024-06-20 20:22:13,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=263926.6666666667, ans=0.02
2024-06-20 20:22:44,010 INFO [train.py:1028] (1/2) Epoch 15, batch 2350, loss[loss=0.216, simple_loss=0.2707, pruned_loss=0.08066, over 13254.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2709, pruned_loss=0.08546, over 2584022.84 frames. ], batch size: 67, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:22:49,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=263981.6666666667, ans=0.0
2024-06-20 20:22:51,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=263981.6666666667, ans=0.125
2024-06-20 20:23:10,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=264018.3333333333, ans=0.2
2024-06-20 20:23:13,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=264018.3333333333, ans=0.2
2024-06-20 20:23:25,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=264036.6666666667, ans=0.125
2024-06-20 20:23:27,971 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.676e+02 1.876e+02 2.011e+02 2.183e+02 3.099e+02, threshold=4.022e+02, percent-clipped=0.0
2024-06-20 20:23:28,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=264036.6666666667, ans=0.125
2024-06-20 20:23:32,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=264055.0, ans=0.125
2024-06-20 20:23:39,487 INFO [train.py:1028] (1/2) Epoch 15, batch 2400, loss[loss=0.2397, simple_loss=0.286, pruned_loss=0.09668, over 13350.00 frames.
], tot_loss[loss=0.2206, simple_loss=0.2702, pruned_loss=0.08555, over 2587231.04 frames. ], batch size: 46, lr: 3.93e-03, grad_scale: 64.0 2024-06-20 20:23:40,990 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.24 vs. limit=15.0 2024-06-20 20:23:48,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=264091.6666666667, ans=0.125 2024-06-20 20:23:49,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=264091.6666666667, ans=0.125 2024-06-20 20:24:06,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.81 vs. limit=10.0 2024-06-20 20:24:10,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=264128.3333333333, ans=0.125 2024-06-20 20:24:19,286 INFO [train.py:1028] (1/2) Epoch 15, batch 2450, loss[loss=0.223, simple_loss=0.2718, pruned_loss=0.08708, over 13297.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2697, pruned_loss=0.08543, over 2584067.75 frames. ], batch size: 63, lr: 3.93e-03, grad_scale: 64.0 2024-06-20 20:24:19,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=264165.0, ans=0.125 2024-06-20 20:24:21,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.66 vs. limit=15.0 2024-06-20 20:24:46,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=264220.0, ans=0.09899494936611666 2024-06-20 20:24:53,811 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.953e+02 2.081e+02 2.249e+02 3.079e+02, threshold=4.162e+02, percent-clipped=0.0 2024-06-20 20:24:57,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=264238.3333333333, ans=10.0 2024-06-20 20:24:59,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=264238.3333333333, ans=0.125 2024-06-20 20:25:03,862 INFO [train.py:1028] (1/2) Epoch 15, batch 2500, loss[loss=0.2088, simple_loss=0.2547, pruned_loss=0.08149, over 13216.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2681, pruned_loss=0.08452, over 2586274.25 frames. ], batch size: 83, lr: 3.93e-03, grad_scale: 64.0 2024-06-20 20:25:04,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=264256.6666666667, ans=0.0 2024-06-20 20:25:04,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.10 vs. 
limit=22.5 2024-06-20 20:25:07,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=264256.6666666667, ans=0.0 2024-06-20 20:25:12,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=264275.0, ans=0.025 2024-06-20 20:25:25,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=264293.3333333333, ans=0.0 2024-06-20 20:25:26,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=264293.3333333333, ans=0.025 2024-06-20 20:25:27,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=264293.3333333333, ans=0.125 2024-06-20 20:25:35,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=264311.6666666667, ans=0.2 2024-06-20 20:25:45,679 INFO [train.py:1028] (1/2) Epoch 15, batch 2550, loss[loss=0.2492, simple_loss=0.3036, pruned_loss=0.09734, over 12526.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2671, pruned_loss=0.08412, over 2586216.52 frames. ], batch size: 22, lr: 3.93e-03, grad_scale: 64.0 2024-06-20 20:26:01,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=264366.6666666667, ans=0.125 2024-06-20 20:26:10,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=264385.0, ans=0.0 2024-06-20 20:26:22,645 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 1.954e+02 2.071e+02 2.323e+02 3.838e+02, threshold=4.141e+02, percent-clipped=0.0 2024-06-20 20:26:29,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=264421.6666666667, ans=0.0 2024-06-20 20:26:30,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2024-06-20 20:26:33,457 INFO [train.py:1028] (1/2) Epoch 15, batch 2600, loss[loss=0.1965, simple_loss=0.2464, pruned_loss=0.07332, over 13281.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2653, pruned_loss=0.08361, over 2585784.20 frames. ], batch size: 52, lr: 3.93e-03, grad_scale: 64.0 2024-06-20 20:26:52,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264476.6666666667, ans=0.125 2024-06-20 20:26:53,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=264476.6666666667, ans=0.035 2024-06-20 20:27:09,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.74 vs. limit=22.5 2024-06-20 20:27:26,792 INFO [train.py:1028] (1/2) Epoch 15, batch 2650, loss[loss=0.1993, simple_loss=0.2408, pruned_loss=0.07892, over 13030.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2646, pruned_loss=0.08349, over 2585678.75 frames. 
], batch size: 144, lr: 3.93e-03, grad_scale: 32.0 2024-06-20 20:27:27,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=264531.6666666667, ans=0.1 2024-06-20 20:27:37,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.78 vs. limit=15.0 2024-06-20 20:27:39,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=264550.0, ans=0.2 2024-06-20 20:27:50,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=264586.6666666667, ans=0.2 2024-06-20 20:27:57,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=264586.6666666667, ans=0.125 2024-06-20 20:27:59,595 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 1.920e+02 2.110e+02 2.307e+02 2.909e+02, threshold=4.220e+02, percent-clipped=0.0 2024-06-20 20:28:10,824 INFO [train.py:1028] (1/2) Epoch 15, batch 2700, loss[loss=0.2027, simple_loss=0.2501, pruned_loss=0.07763, over 13217.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2635, pruned_loss=0.08334, over 2583057.63 frames. ], batch size: 89, lr: 3.93e-03, grad_scale: 32.0 2024-06-20 20:28:16,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=264623.3333333333, ans=0.125 2024-06-20 20:28:33,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264660.0, ans=0.1 2024-06-20 20:28:35,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.36 vs. limit=15.0 2024-06-20 20:28:41,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=264678.3333333333, ans=0.2 2024-06-20 20:28:52,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264696.6666666667, ans=0.1 2024-06-20 20:28:58,419 INFO [train.py:1028] (1/2) Epoch 15, batch 2750, loss[loss=0.219, simple_loss=0.2644, pruned_loss=0.08683, over 13206.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2627, pruned_loss=0.08293, over 2580198.54 frames. 
], batch size: 43, lr: 3.93e-03, grad_scale: 32.0 2024-06-20 20:28:58,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=264715.0, ans=0.125 2024-06-20 20:29:00,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=264715.0, ans=0.07 2024-06-20 20:29:13,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=264733.3333333333, ans=0.125 2024-06-20 20:29:32,419 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.865e+02 1.995e+02 2.156e+02 3.232e+02, threshold=3.990e+02, percent-clipped=0.0 2024-06-20 20:29:42,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=264788.3333333333, ans=0.0 2024-06-20 20:29:47,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=15.0 2024-06-20 20:29:48,949 INFO [train.py:1028] (1/2) Epoch 15, batch 2800, loss[loss=0.2269, simple_loss=0.2674, pruned_loss=0.09314, over 10919.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2619, pruned_loss=0.0826, over 2578040.83 frames. ], batch size: 303, lr: 3.93e-03, grad_scale: 32.0 2024-06-20 20:30:02,416 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:30:03,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264825.0, ans=0.1 2024-06-20 20:30:07,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=264825.0, ans=0.025 2024-06-20 20:30:43,468 INFO [train.py:1028] (1/2) Epoch 15, batch 2850, loss[loss=0.2132, simple_loss=0.2595, pruned_loss=0.08345, over 13332.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2614, pruned_loss=0.08247, over 2576305.66 frames. ], batch size: 49, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:31:12,997 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.68 vs. limit=15.0 2024-06-20 20:31:18,163 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.907e+02 2.062e+02 2.339e+02 3.362e+02, threshold=4.124e+02, percent-clipped=0.0 2024-06-20 20:31:23,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=264971.6666666667, ans=0.07 2024-06-20 20:31:28,107 INFO [train.py:1028] (1/2) Epoch 15, batch 2900, loss[loss=0.2151, simple_loss=0.2703, pruned_loss=0.07991, over 13165.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2599, pruned_loss=0.08207, over 2584394.82 frames. ], batch size: 55, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:31:38,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.66 vs. 
limit=22.5
2024-06-20 20:32:10,479 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 20:32:15,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=265063.3333333333, ans=0.1
2024-06-20 20:32:17,846 INFO [train.py:1028] (1/2) Epoch 15, batch 2950, loss[loss=0.2367, simple_loss=0.2867, pruned_loss=0.09335, over 13244.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2598, pruned_loss=0.08186, over 2578364.91 frames. ], batch size: 43, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:32:33,063 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 20:33:01,648 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.831e+02 1.923e+02 2.084e+02 2.737e+02, threshold=3.847e+02, percent-clipped=0.0
2024-06-20 20:33:10,887 INFO [train.py:1028] (1/2) Epoch 15, batch 3000, loss[loss=0.21, simple_loss=0.2632, pruned_loss=0.07836, over 13219.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2591, pruned_loss=0.0814, over 2578469.11 frames. ], batch size: 59, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:33:10,888 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 20:33:20,959 INFO [train.py:1060] (1/2) Epoch 15, validation: loss=0.1888, simple_loss=0.2537, pruned_loss=0.06193, over 351949.00 frames.
2024-06-20 20:33:20,961 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17479MB
2024-06-20 20:33:21,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=265173.3333333333, ans=0.0
2024-06-20 20:33:27,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=265173.3333333333, ans=0.125
2024-06-20 20:33:55,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.97 vs. limit=10.0
2024-06-20 20:34:07,469 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0
2024-06-20 20:34:12,538 INFO [train.py:1028] (1/2) Epoch 15, batch 3050, loss[loss=0.2044, simple_loss=0.2604, pruned_loss=0.07421, over 13260.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2586, pruned_loss=0.08143, over 2578863.49 frames. ], batch size: 46, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:34:12,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=265265.0, ans=0.125
2024-06-20 20:34:27,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=265283.3333333333, ans=0.0
2024-06-20 20:34:34,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=265301.6666666667, ans=0.2
2024-06-20 20:34:38,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=15.0
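Interleaved with the batch logs above is one of the periodic validation passes (train.py:1051/1060): training pauses, the dev set is scored, and the result (validation: loss=0.1888 ... over 351949.00 frames) is reported the same way as tot_loss, as an average weighted by the number of frames it was computed over, which is why frame counts are printed alongside every figure. A sketch of such a frame-weighted pass follows; the model/batch interface is assumed for illustration, though the "inputs"/"supervisions" batch keys match the lhotse dataset convention this recipe uses.

```python
import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader, device):
    """Frame-weighted average loss over the dev set.

    Assumes ``model`` returns (per_frame_loss, num_frames) for a batch;
    the real train.py tracks several loss components the same way.
    """
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_loader:
        feats = batch["inputs"].to(device)
        supervisions = batch["supervisions"]
        loss, num_frames = model(feats, supervisions)  # assumed interface
        # Weight each batch's loss by its frame count so short and long
        # batches contribute proportionally, matching the log's
        # "loss=... over N frames" convention.
        tot_loss += loss.item() * num_frames
        tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames, tot_frames
```

The same weighting explains the tot_loss[... over 2578469.11 frames.] entries: they are running averages over the frames seen in the recent training window, not single-batch numbers.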
2024-06-20 20:34:41,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=265320.0, ans=0.2
2024-06-20 20:34:46,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=265320.0, ans=0.2
2024-06-20 20:34:50,091 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 20:34:50,776 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.642e+02 1.906e+02 2.004e+02 2.143e+02 3.096e+02, threshold=4.008e+02, percent-clipped=0.0
2024-06-20 20:34:59,738 INFO [train.py:1028] (1/2) Epoch 15, batch 3100, loss[loss=0.2071, simple_loss=0.2484, pruned_loss=0.08285, over 13087.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2579, pruned_loss=0.08094, over 2578553.72 frames. ], batch size: 144, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:35:09,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=265356.6666666667, ans=0.0
2024-06-20 20:35:16,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=265375.0, ans=0.125
2024-06-20 20:35:18,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265393.3333333333, ans=0.1
2024-06-20 20:35:48,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=265430.0, ans=0.025
2024-06-20 20:35:52,002 INFO [train.py:1028] (1/2) Epoch 15, batch 3150, loss[loss=0.2144, simple_loss=0.2588, pruned_loss=0.08503, over 12937.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2571, pruned_loss=0.08085, over 2581351.35 frames. ], batch size: 158, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:36:14,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=265485.0, ans=0.04949747468305833
2024-06-20 20:36:29,505 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.853e+02 1.979e+02 2.200e+02 2.868e+02, threshold=3.958e+02, percent-clipped=0.0
2024-06-20 20:36:29,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=265521.6666666667, ans=0.09899494936611666
2024-06-20 20:36:37,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=265521.6666666667, ans=0.09899494936611666
2024-06-20 20:36:40,313 INFO [train.py:1028] (1/2) Epoch 15, batch 3200, loss[loss=0.2074, simple_loss=0.2563, pruned_loss=0.07924, over 13122.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2561, pruned_loss=0.0803, over 2582036.41 frames. ], batch size: 55, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:36:41,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=265540.0, ans=0.5
2024-06-20 20:36:42,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=265540.0, ans=0.125
2024-06-20 20:37:14,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.34 vs.
limit=15.0 2024-06-20 20:37:16,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2024-06-20 20:37:17,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.22 vs. limit=22.5 2024-06-20 20:37:25,087 INFO [train.py:1028] (1/2) Epoch 15, batch 3250, loss[loss=0.2089, simple_loss=0.2587, pruned_loss=0.07953, over 13048.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2551, pruned_loss=0.07999, over 2585622.14 frames. ], batch size: 71, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:37:38,190 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:37:57,336 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.65 vs. limit=10.0 2024-06-20 20:38:08,218 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.923e+02 2.066e+02 2.257e+02 3.249e+02, threshold=4.133e+02, percent-clipped=0.0 2024-06-20 20:38:12,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.72 vs. limit=22.5 2024-06-20 20:38:19,022 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.76 vs. limit=12.0 2024-06-20 20:38:19,413 INFO [train.py:1028] (1/2) Epoch 15, batch 3300, loss[loss=0.2142, simple_loss=0.2537, pruned_loss=0.08734, over 12754.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2543, pruned_loss=0.07948, over 2582748.55 frames. ], batch size: 176, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:38:33,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=265723.3333333333, ans=0.125 2024-06-20 20:38:43,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=265760.0, ans=10.0 2024-06-20 20:38:46,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=265760.0, ans=0.0 2024-06-20 20:38:50,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=265778.3333333333, ans=0.125 2024-06-20 20:39:01,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.91 vs. limit=15.0 2024-06-20 20:39:02,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=265796.6666666667, ans=0.125 2024-06-20 20:39:08,930 INFO [train.py:1028] (1/2) Epoch 15, batch 3350, loss[loss=0.2163, simple_loss=0.2544, pruned_loss=0.08915, over 12875.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2542, pruned_loss=0.07999, over 2578009.29 frames. 
], batch size: 158, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:39:18,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=265833.3333333333, ans=0.95 2024-06-20 20:39:21,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=265833.3333333333, ans=0.0 2024-06-20 20:39:41,562 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=12.0 2024-06-20 20:39:42,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=265870.0, ans=0.2 2024-06-20 20:39:44,445 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.852e+02 2.022e+02 2.197e+02 3.049e+02, threshold=4.045e+02, percent-clipped=0.0 2024-06-20 20:39:46,088 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.36 vs. limit=15.0 2024-06-20 20:39:52,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=265888.3333333333, ans=0.125 2024-06-20 20:39:54,106 INFO [train.py:1028] (1/2) Epoch 15, batch 3400, loss[loss=0.2119, simple_loss=0.2601, pruned_loss=0.08188, over 12562.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.254, pruned_loss=0.08014, over 2576369.66 frames. ], batch size: 22, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:40:01,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=265906.6666666667, ans=0.05 2024-06-20 20:40:03,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=18.79 vs. limit=15.0 2024-06-20 20:40:04,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=265925.0, ans=0.125 2024-06-20 20:40:04,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.10 vs. limit=10.0 2024-06-20 20:40:06,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=265925.0, ans=0.0 2024-06-20 20:40:35,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265980.0, ans=0.1 2024-06-20 20:40:36,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=265980.0, ans=0.035 2024-06-20 20:40:37,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=265980.0, ans=0.0 2024-06-20 20:40:45,957 INFO [train.py:1028] (1/2) Epoch 15, batch 3450, loss[loss=0.225, simple_loss=0.2654, pruned_loss=0.09231, over 12746.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2537, pruned_loss=0.07992, over 2578301.09 frames. 
], batch size: 176, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:40:47,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265998.3333333333, ans=0.1
2024-06-20 20:40:56,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=266016.6666666667, ans=0.125
2024-06-20 20:41:02,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=266016.6666666667, ans=0.025
2024-06-20 20:41:25,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=15.0
2024-06-20 20:41:28,978 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.795e+02 1.918e+02 2.065e+02 2.612e+02, threshold=3.836e+02, percent-clipped=0.0
2024-06-20 20:41:33,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=266071.6666666667, ans=0.025
2024-06-20 20:41:39,788 INFO [train.py:1028] (1/2) Epoch 15, batch 3500, loss[loss=0.1961, simple_loss=0.2466, pruned_loss=0.07277, over 12912.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2533, pruned_loss=0.07947, over 2576486.78 frames. ], batch size: 33, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:41:42,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.70 vs. limit=15.0
2024-06-20 20:41:43,904 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.115e+00
2024-06-20 20:41:51,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=266108.3333333333, ans=0.125
2024-06-20 20:41:58,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=266126.6666666667, ans=0.125
2024-06-20 20:42:04,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=266145.0, ans=0.125
2024-06-20 20:42:20,729 INFO [train.py:1028] (1/2) Epoch 15, batch 3550, loss[loss=0.1949, simple_loss=0.2433, pruned_loss=0.07327, over 13125.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2527, pruned_loss=0.07925, over 2577902.94 frames. ], batch size: 95, lr: 3.91e-03, grad_scale: 32.0
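Most of the INFO [scaling.py:214] traffic samples ScheduledFloat values: regularization hyper-parameters (skip rates, dropout probabilities, balancer targets) that are functions of the training step rather than constants, so regularization can be strong early in training and relax later. The ans=... field is the value in force at the given batch_count. A sketch of the idea is below; the breakpoints are illustrative assumptions, since the recipe's actual schedules are not shown in this log, and the real icefall class carries more machinery (defaults, arithmetic operators, per-module batch_count propagation).

```python
class ScheduledFloat:
    """A float-like hyper-parameter that is a piecewise-linear
    function of the training step: the idea behind the
    'ScheduledFloat: name=..., batch_count=..., ans=...' log lines."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs; kept sorted by batch_count.
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        (x0, y0), *rest = self.points
        if batch_count <= x0:
            return y0
        for x1, y1 in rest:
            if batch_count <= x1:
                # Linear interpolation between adjacent breakpoints.
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
            x0, y0 = x1, y1
        return y0  # past the last breakpoint: hold the final value


# Illustrative: a skip-rate that decays from 0.5 to 0.0 over the first
# 20k steps, then holds; by this section's batch_count (~2.66e+05) it
# would read ans=0.0, like many of the skip_rate lines above.
skip_rate = ScheduledFloat((0.0, 0.5), (20000.0, 0.0))
print(skip_rate.value(266016.6666666667))  # -> 0.0
```

That most lines here report small, stable values (ans=0.0, 0.025, 0.125) is consistent with schedules that reached their final settings long before epoch 15.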
2024-06-20 20:42:22,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=266181.6666666667, ans=0.2
2024-06-20 20:42:23,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=266181.6666666667, ans=0.125
2024-06-20 20:42:45,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=266218.3333333333, ans=0.2
2024-06-20 20:42:50,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=266236.6666666667, ans=0.125
2024-06-20 20:42:56,743 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.804e+02 1.962e+02 2.114e+02 2.570e+02, threshold=3.923e+02, percent-clipped=0.0
2024-06-20 20:43:07,576 INFO [train.py:1028] (1/2) Epoch 15, batch 3600, loss[loss=0.1922, simple_loss=0.2464, pruned_loss=0.06898, over 12999.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2522, pruned_loss=0.07917, over 2580489.09 frames. ], batch size: 48, lr: 3.91e-03, grad_scale: 32.0
2024-06-20 20:43:21,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=266291.6666666667, ans=0.0
2024-06-20 20:43:33,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=266310.0, ans=0.0
2024-06-20 20:43:41,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=266328.3333333333, ans=0.1
2024-06-20 20:43:46,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=266328.3333333333, ans=0.2
2024-06-20 20:43:50,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.96 vs. limit=10.0
2024-06-20 20:43:52,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=266346.6666666667, ans=0.0
2024-06-20 20:44:03,001 INFO [train.py:1028] (1/2) Epoch 15, batch 3650, loss[loss=0.2099, simple_loss=0.2537, pruned_loss=0.08302, over 13065.00 frames. ], tot_loss[loss=0.205, simple_loss=0.252, pruned_loss=0.07896, over 2579887.52 frames.
], batch size: 102, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:44:05,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=266365.0, ans=0.125 2024-06-20 20:44:07,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=266365.0, ans=0.125 2024-06-20 20:44:18,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=266401.6666666667, ans=0.0 2024-06-20 20:44:28,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=266420.0, ans=0.025 2024-06-20 20:44:28,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=266420.0, ans=0.1 2024-06-20 20:44:35,340 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.851e+02 1.940e+02 2.110e+02 2.656e+02, threshold=3.879e+02, percent-clipped=0.0 2024-06-20 20:44:38,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=266438.3333333333, ans=0.025 2024-06-20 20:44:43,417 INFO [train.py:1028] (1/2) Epoch 15, batch 3700, loss[loss=0.2046, simple_loss=0.2585, pruned_loss=0.07531, over 13258.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2513, pruned_loss=0.07862, over 2584131.13 frames. ], batch size: 72, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:44:49,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=266475.0, ans=0.125 2024-06-20 20:44:51,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=266475.0, ans=0.1 2024-06-20 20:45:00,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=266493.3333333333, ans=0.0 2024-06-20 20:45:14,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.03 vs. limit=15.0 2024-06-20 20:45:15,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=266530.0, ans=0.2 2024-06-20 20:45:17,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266530.0, ans=0.1 2024-06-20 20:45:19,030 INFO [train.py:1028] (1/2) Epoch 15, batch 3750, loss[loss=0.2077, simple_loss=0.2613, pruned_loss=0.07707, over 12396.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2502, pruned_loss=0.07789, over 2585678.44 frames. ], batch size: 22, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:45:22,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=266548.3333333333, ans=0.0 2024-06-20 20:45:30,414 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.47 vs. 
limit=22.5 2024-06-20 20:45:45,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=266603.3333333333, ans=0.125 2024-06-20 20:45:48,413 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.12 vs. limit=15.0 2024-06-20 20:45:54,339 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.847e+02 2.007e+02 2.248e+02 3.608e+02, threshold=4.015e+02, percent-clipped=0.0 2024-06-20 20:45:57,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=266621.6666666667, ans=0.125 2024-06-20 20:46:10,843 INFO [train.py:1028] (1/2) Epoch 15, batch 3800, loss[loss=0.2247, simple_loss=0.2689, pruned_loss=0.0902, over 13287.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2502, pruned_loss=0.07776, over 2583897.72 frames. ], batch size: 83, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:46:11,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266640.0, ans=0.1 2024-06-20 20:46:14,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=266640.0, ans=0.125 2024-06-20 20:46:39,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=266695.0, ans=0.125 2024-06-20 20:46:41,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=266695.0, ans=0.125 2024-06-20 20:46:42,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=266695.0, ans=0.2 2024-06-20 20:46:56,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.63 vs. limit=15.0 2024-06-20 20:46:56,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=266713.3333333333, ans=0.0 2024-06-20 20:46:57,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=266713.3333333333, ans=0.0 2024-06-20 20:47:04,973 INFO [train.py:1028] (1/2) Epoch 15, batch 3850, loss[loss=0.1985, simple_loss=0.2373, pruned_loss=0.07982, over 13010.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2498, pruned_loss=0.07746, over 2582833.15 frames. ], batch size: 144, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:47:05,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=266731.6666666667, ans=0.5 2024-06-20 20:47:13,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=266750.0, ans=0.125 2024-06-20 20:47:23,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=266768.3333333333, ans=0.125 2024-06-20 20:47:24,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.59 vs. 
limit=10.0 2024-06-20 20:47:27,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=266768.3333333333, ans=0.125 2024-06-20 20:47:28,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0 2024-06-20 20:47:40,121 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.804e+02 1.942e+02 2.106e+02 2.490e+02, threshold=3.884e+02, percent-clipped=0.0 2024-06-20 20:47:43,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=266805.0, ans=0.125 2024-06-20 20:47:46,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=266805.0, ans=0.0 2024-06-20 20:47:51,255 INFO [train.py:1028] (1/2) Epoch 15, batch 3900, loss[loss=0.2167, simple_loss=0.2545, pruned_loss=0.08944, over 13230.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.249, pruned_loss=0.07715, over 2585561.75 frames. ], batch size: 83, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:48:02,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=266841.6666666667, ans=0.125 2024-06-20 20:48:05,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=266841.6666666667, ans=0.125 2024-06-20 20:48:06,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266841.6666666667, ans=0.1 2024-06-20 20:48:08,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=266860.0, ans=0.2 2024-06-20 20:48:12,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=266860.0, ans=0.0 2024-06-20 20:48:12,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=266860.0, ans=0.0 2024-06-20 20:48:25,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2024-06-20 20:48:39,457 INFO [train.py:1028] (1/2) Epoch 15, batch 3950, loss[loss=0.1863, simple_loss=0.2253, pruned_loss=0.07365, over 13043.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2483, pruned_loss=0.07663, over 2587017.59 frames. ], batch size: 132, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:48:40,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266915.0, ans=0.1 2024-06-20 20:48:49,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=266933.3333333333, ans=0.0 2024-06-20 20:48:58,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=266933.3333333333, ans=0.125 2024-06-20 20:49:00,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=266951.6666666667, ans=0.125 2024-06-20 20:49:02,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.37 vs. 
limit=22.5 2024-06-20 20:49:04,638 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.72 vs. limit=15.0 2024-06-20 20:49:25,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=266970.0, ans=0.0 2024-06-20 20:49:25,812 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.908e+02 2.063e+02 2.324e+02 3.106e+02, threshold=4.127e+02, percent-clipped=0.0 2024-06-20 20:49:28,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.52 vs. limit=15.0 2024-06-20 20:49:29,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=266988.3333333333, ans=0.125 2024-06-20 20:49:32,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=266988.3333333333, ans=0.0 2024-06-20 20:49:36,416 INFO [train.py:1028] (1/2) Epoch 15, batch 4000, loss[loss=0.1893, simple_loss=0.2446, pruned_loss=0.06697, over 13003.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2483, pruned_loss=0.07676, over 2582522.57 frames. ], batch size: 39, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:49:40,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=267006.6666666667, ans=0.07 2024-06-20 20:49:52,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=267025.0, ans=0.125 2024-06-20 20:50:01,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=267043.3333333333, ans=0.0 2024-06-20 20:50:10,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=267043.3333333333, ans=0.0 2024-06-20 20:50:12,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=267061.6666666667, ans=0.05 2024-06-20 20:50:13,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=267061.6666666667, ans=0.0 2024-06-20 20:50:24,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.02 vs. limit=15.0 2024-06-20 20:50:27,595 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.53 vs. limit=22.5 2024-06-20 20:50:30,852 INFO [train.py:1028] (1/2) Epoch 15, batch 4050, loss[loss=0.2268, simple_loss=0.2644, pruned_loss=0.0946, over 11054.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2478, pruned_loss=0.07661, over 2579848.48 frames. 
], batch size: 304, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:50:36,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=267098.3333333333, ans=0.1 2024-06-20 20:50:40,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=267116.6666666667, ans=0.025 2024-06-20 20:50:51,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=267135.0, ans=10.0 2024-06-20 20:51:06,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=267153.3333333333, ans=0.125 2024-06-20 20:51:08,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.633e+02 1.896e+02 2.070e+02 2.284e+02 3.019e+02, threshold=4.140e+02, percent-clipped=0.0 2024-06-20 20:51:19,756 INFO [train.py:1028] (1/2) Epoch 15, batch 4100, loss[loss=0.1947, simple_loss=0.2439, pruned_loss=0.07279, over 13059.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2473, pruned_loss=0.07661, over 2576076.43 frames. ], batch size: 102, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:51:21,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=267190.0, ans=0.0 2024-06-20 20:51:24,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=267190.0, ans=0.2 2024-06-20 20:51:34,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=267208.3333333333, ans=0.0 2024-06-20 20:51:36,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=267226.6666666667, ans=0.125 2024-06-20 20:51:49,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267245.0, ans=0.1 2024-06-20 20:51:54,778 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2024-06-20 20:52:11,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.48 vs. limit=10.0 2024-06-20 20:52:11,829 INFO [train.py:1028] (1/2) Epoch 15, batch 4150, loss[loss=0.188, simple_loss=0.2344, pruned_loss=0.0708, over 13099.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2469, pruned_loss=0.07637, over 2575655.48 frames. ], batch size: 55, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:52:16,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2024-06-20 20:52:35,882 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:52:39,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. 
limit=15.0 2024-06-20 20:52:41,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=267318.3333333333, ans=0.2 2024-06-20 20:52:48,065 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.05 vs. limit=10.0 2024-06-20 20:52:50,982 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.794e+02 1.959e+02 2.119e+02 3.540e+02, threshold=3.917e+02, percent-clipped=0.0 2024-06-20 20:52:52,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=267355.0, ans=0.125 2024-06-20 20:52:52,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.92 vs. limit=15.0 2024-06-20 20:53:08,956 INFO [train.py:1028] (1/2) Epoch 15, batch 4200, loss[loss=0.1857, simple_loss=0.2292, pruned_loss=0.07103, over 13158.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.246, pruned_loss=0.07606, over 2577985.10 frames. ], batch size: 103, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:53:16,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=267391.6666666667, ans=0.125 2024-06-20 20:53:51,879 INFO [train.py:1028] (1/2) Epoch 15, batch 4250, loss[loss=0.1913, simple_loss=0.2435, pruned_loss=0.06954, over 13286.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2468, pruned_loss=0.07661, over 2581121.60 frames. ], batch size: 46, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:54:01,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.45 vs. limit=22.5 2024-06-20 20:54:02,557 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:54:20,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=267520.0, ans=0.2 2024-06-20 20:54:23,948 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.869e+02 1.988e+02 2.186e+02 3.425e+02, threshold=3.977e+02, percent-clipped=0.0 2024-06-20 20:54:28,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=267538.3333333333, ans=0.0 2024-06-20 20:54:34,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=267556.6666666667, ans=0.2 2024-06-20 20:54:34,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=267556.6666666667, ans=0.125 2024-06-20 20:54:35,085 INFO [train.py:1028] (1/2) Epoch 15, batch 4300, loss[loss=0.2046, simple_loss=0.2553, pruned_loss=0.07691, over 13185.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2462, pruned_loss=0.07629, over 2580837.06 frames. ], batch size: 59, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:54:43,993 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2024-06-20 20:55:27,290 INFO [train.py:1028] (1/2) Epoch 15, batch 4350, loss[loss=0.1989, simple_loss=0.246, pruned_loss=0.07586, over 13215.00 frames. 
], tot_loss[loss=0.1985, simple_loss=0.2453, pruned_loss=0.07584, over 2585910.94 frames. ], batch size: 59, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:55:31,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=267648.3333333333, ans=0.0 2024-06-20 20:55:35,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=267666.6666666667, ans=0.035 2024-06-20 20:55:59,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=15.0 2024-06-20 20:56:09,721 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2024-06-20 20:56:10,652 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.816e+02 1.995e+02 2.176e+02 2.755e+02, threshold=3.991e+02, percent-clipped=0.0 2024-06-20 20:56:16,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.61 vs. limit=22.5 2024-06-20 20:56:20,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=267740.0, ans=0.125 2024-06-20 20:56:20,952 INFO [train.py:1028] (1/2) Epoch 15, batch 4400, loss[loss=0.2018, simple_loss=0.2463, pruned_loss=0.07864, over 13241.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2455, pruned_loss=0.07599, over 2586066.01 frames. ], batch size: 83, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:56:35,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=267758.3333333333, ans=0.125 2024-06-20 20:56:43,226 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.97 vs. limit=15.0 2024-06-20 20:56:56,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267813.3333333333, ans=0.1 2024-06-20 20:57:08,063 INFO [train.py:1028] (1/2) Epoch 15, batch 4450, loss[loss=0.1974, simple_loss=0.2495, pruned_loss=0.07264, over 12911.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2457, pruned_loss=0.07604, over 2580841.00 frames. ], batch size: 33, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:57:12,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=267831.6666666667, ans=0.125 2024-06-20 20:57:15,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=267831.6666666667, ans=0.2 2024-06-20 20:57:17,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=267850.0, ans=0.125 2024-06-20 20:57:23,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.22 vs. 
limit=15.0 2024-06-20 20:57:27,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=267868.3333333333, ans=0.125 2024-06-20 20:57:29,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=267868.3333333333, ans=0.0 2024-06-20 20:57:44,573 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.851e+02 1.990e+02 2.169e+02 3.063e+02, threshold=3.981e+02, percent-clipped=0.0 2024-06-20 20:58:03,580 INFO [train.py:1028] (1/2) Epoch 15, batch 4500, loss[loss=0.1731, simple_loss=0.2239, pruned_loss=0.06117, over 13234.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2451, pruned_loss=0.0759, over 2585500.88 frames. ], batch size: 89, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:58:05,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=267923.3333333333, ans=0.125 2024-06-20 20:58:12,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=267941.6666666667, ans=0.1 2024-06-20 20:58:56,042 INFO [train.py:1028] (1/2) Epoch 15, batch 4550, loss[loss=0.1797, simple_loss=0.232, pruned_loss=0.06364, over 13209.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2451, pruned_loss=0.07595, over 2588314.32 frames. ], batch size: 52, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:59:04,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=268015.0, ans=22.5 2024-06-20 20:59:04,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=268033.3333333333, ans=0.1 2024-06-20 20:59:12,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=268033.3333333333, ans=0.125 2024-06-20 20:59:23,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=268051.6666666667, ans=0.0 2024-06-20 20:59:29,296 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.92 vs. limit=12.0 2024-06-20 20:59:34,356 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.781e+02 1.882e+02 1.992e+02 2.555e+02, threshold=3.764e+02, percent-clipped=0.0 2024-06-20 20:59:39,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.42 vs. limit=22.5 2024-06-20 20:59:44,883 INFO [train.py:1028] (1/2) Epoch 15, batch 4600, loss[loss=0.2163, simple_loss=0.2565, pruned_loss=0.08802, over 12536.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2455, pruned_loss=0.07611, over 2584441.68 frames. ], batch size: 202, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 21:00:17,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=268161.6666666667, ans=0.0 2024-06-20 21:00:21,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=268180.0, ans=0.125 2024-06-20 21:00:31,982 INFO [train.py:1028] (1/2) Epoch 15, batch 4650, loss[loss=0.1877, simple_loss=0.2272, pruned_loss=0.07407, over 13128.00 frames. 
], tot_loss[loss=0.1981, simple_loss=0.2446, pruned_loss=0.07584, over 2587970.37 frames. ], batch size: 132, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:00:37,959 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2024-06-20 21:00:47,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=268216.6666666667, ans=0.0 2024-06-20 21:01:03,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=268235.0, ans=0.0 2024-06-20 21:01:07,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.98 vs. limit=15.0 2024-06-20 21:01:08,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=268253.3333333333, ans=0.2 2024-06-20 21:01:08,237 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.559e+01 2024-06-20 21:01:11,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.31 vs. limit=10.0 2024-06-20 21:01:16,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=268253.3333333333, ans=0.125 2024-06-20 21:01:17,925 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.822e+02 1.940e+02 2.148e+02 3.190e+02, threshold=3.880e+02, percent-clipped=0.0 2024-06-20 21:01:18,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=268271.6666666667, ans=0.125 2024-06-20 21:01:20,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=268271.6666666667, ans=0.1 2024-06-20 21:01:23,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=268271.6666666667, ans=0.125 2024-06-20 21:01:28,903 INFO [train.py:1028] (1/2) Epoch 15, batch 4700, loss[loss=0.1857, simple_loss=0.2386, pruned_loss=0.0664, over 12396.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2452, pruned_loss=0.07616, over 2584463.95 frames. ], batch size: 25, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:01:32,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=268290.0, ans=0.125 2024-06-20 21:01:41,847 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.16 vs. limit=10.0 2024-06-20 21:01:43,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=268308.3333333333, ans=0.2 2024-06-20 21:01:51,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=268326.6666666667, ans=0.0 2024-06-20 21:02:08,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=268363.3333333333, ans=0.2 2024-06-20 21:02:13,271 INFO [train.py:1028] (1/2) Epoch 15, batch 4750, loss[loss=0.2161, simple_loss=0.2528, pruned_loss=0.08971, over 12556.00 frames. 
], tot_loss[loss=0.1989, simple_loss=0.2453, pruned_loss=0.07628, over 2580419.41 frames. ], batch size: 202, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:02:16,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268381.6666666667, ans=0.1 2024-06-20 21:02:39,817 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.868e+02 2.024e+02 2.212e+02 3.155e+02, threshold=4.048e+02, percent-clipped=0.0 2024-06-20 21:02:42,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=268455.0, ans=0.09899494936611666 2024-06-20 21:02:48,165 INFO [train.py:1028] (1/2) Epoch 15, batch 4800, loss[loss=0.1824, simple_loss=0.2344, pruned_loss=0.0652, over 13248.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2451, pruned_loss=0.07627, over 2577260.88 frames. ], batch size: 63, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:02:54,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268473.3333333333, ans=0.1 2024-06-20 21:03:01,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=268491.6666666667, ans=0.04949747468305833 2024-06-20 21:03:02,354 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2024-06-20 21:03:13,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=268510.0, ans=0.025 2024-06-20 21:03:14,670 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.157e+00 2024-06-20 21:03:16,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=268528.3333333333, ans=0.125 2024-06-20 21:03:21,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=268528.3333333333, ans=0.125 2024-06-20 21:03:31,554 INFO [train.py:1028] (1/2) Epoch 15, batch 4850, loss[loss=0.1876, simple_loss=0.2344, pruned_loss=0.07034, over 13210.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2445, pruned_loss=0.07584, over 2575219.88 frames. ], batch size: 89, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:03:38,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=268565.0, ans=0.125 2024-06-20 21:03:39,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2024-06-20 21:03:40,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=268583.3333333333, ans=0.125 2024-06-20 21:03:41,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2024-06-20 21:03:46,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.30 vs. 
limit=15.0 2024-06-20 21:03:46,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=268583.3333333333, ans=0.125 2024-06-20 21:03:46,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=268583.3333333333, ans=0.0 2024-06-20 21:03:49,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=268601.6666666667, ans=0.1 2024-06-20 21:03:58,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=268601.6666666667, ans=0.0 2024-06-20 21:04:04,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=15.0 2024-06-20 21:04:08,149 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.849e+02 1.965e+02 2.191e+02 3.139e+02, threshold=3.929e+02, percent-clipped=0.0 2024-06-20 21:04:13,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=268638.3333333333, ans=0.04949747468305833 2024-06-20 21:04:16,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=268656.6666666667, ans=0.0 2024-06-20 21:04:16,769 INFO [train.py:1028] (1/2) Epoch 15, batch 4900, loss[loss=0.1889, simple_loss=0.2395, pruned_loss=0.06917, over 13242.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.245, pruned_loss=0.07621, over 2575330.36 frames. ], batch size: 59, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:04:19,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=268656.6666666667, ans=0.0 2024-06-20 21:04:24,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.65 vs. limit=15.0 2024-06-20 21:04:31,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.27 vs. limit=15.0 2024-06-20 21:04:32,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.90 vs. limit=12.0 2024-06-20 21:04:56,864 INFO [train.py:1028] (1/2) Epoch 15, batch 4950, loss[loss=0.2122, simple_loss=0.2505, pruned_loss=0.08697, over 10855.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2446, pruned_loss=0.07629, over 2569345.57 frames. ], batch size: 303, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:05:07,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.98 vs. limit=15.0 2024-06-20 21:05:12,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.41 vs. limit=22.5 2024-06-20 21:05:13,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=268785.0, ans=0.04949747468305833 2024-06-20 21:05:14,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.83 vs. 
limit=22.5 2024-06-20 21:05:28,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.964e+02 2.122e+02 2.352e+02 3.045e+02, threshold=4.244e+02, percent-clipped=0.0 2024-06-20 21:05:35,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=268821.6666666667, ans=0.05 2024-06-20 21:05:40,947 INFO [train.py:1028] (1/2) Epoch 15, batch 5000, loss[loss=0.185, simple_loss=0.233, pruned_loss=0.06852, over 13119.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2451, pruned_loss=0.07627, over 2573951.83 frames. ], batch size: 95, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:05:54,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=268858.3333333333, ans=0.1 2024-06-20 21:06:13,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=268895.0, ans=0.2 2024-06-20 21:06:15,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=268913.3333333333, ans=0.125 2024-06-20 21:06:16,471 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.54 vs. limit=12.0 2024-06-20 21:06:22,569 INFO [train.py:1028] (1/2) Epoch 15, batch 5050, loss[loss=0.1927, simple_loss=0.241, pruned_loss=0.07225, over 12871.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2449, pruned_loss=0.07623, over 2572881.62 frames. ], batch size: 36, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:06:34,249 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.17 vs. limit=15.0 2024-06-20 21:06:42,157 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 21:06:42,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=268968.3333333333, ans=0.125 2024-06-20 21:06:45,748 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.18 vs. 
limit=15.0 2024-06-20 21:06:48,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=268968.3333333333, ans=0.125 2024-06-20 21:06:49,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=268968.3333333333, ans=0.0 2024-06-20 21:06:52,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=268986.6666666667, ans=0.125 2024-06-20 21:06:58,291 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.813e+02 1.942e+02 2.089e+02 2.974e+02, threshold=3.885e+02, percent-clipped=0.0 2024-06-20 21:07:00,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=269005.0, ans=0.125 2024-06-20 21:07:06,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=269005.0, ans=0.0 2024-06-20 21:07:07,975 INFO [train.py:1028] (1/2) Epoch 15, batch 5100, loss[loss=0.2049, simple_loss=0.253, pruned_loss=0.07843, over 12940.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2448, pruned_loss=0.07654, over 2569835.87 frames. ], batch size: 39, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:07:12,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=269023.3333333333, ans=0.125 2024-06-20 21:07:30,508 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0 2024-06-20 21:07:32,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=269060.0, ans=0.95 2024-06-20 21:07:37,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269078.3333333333, ans=0.1 2024-06-20 21:07:42,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=269096.6666666667, ans=0.1 2024-06-20 21:07:48,933 INFO [train.py:1028] (1/2) Epoch 15, batch 5150, loss[loss=0.1853, simple_loss=0.2275, pruned_loss=0.07154, over 13032.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2444, pruned_loss=0.07648, over 2571083.74 frames. ], batch size: 132, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:07:52,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=269115.0, ans=0.1 2024-06-20 21:08:05,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=269133.3333333333, ans=0.125 2024-06-20 21:08:05,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.56 vs. limit=10.0 2024-06-20 21:08:09,968 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. 
limit=15.0 2024-06-20 21:08:17,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=269170.0, ans=0.125 2024-06-20 21:08:22,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=269170.0, ans=0.2 2024-06-20 21:08:24,102 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.819e+02 1.957e+02 2.123e+02 3.044e+02, threshold=3.914e+02, percent-clipped=0.0 2024-06-20 21:08:32,714 INFO [train.py:1028] (1/2) Epoch 15, batch 5200, loss[loss=0.2314, simple_loss=0.2755, pruned_loss=0.09361, over 13182.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2444, pruned_loss=0.07667, over 2574002.93 frames. ], batch size: 95, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:08:36,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=269206.6666666667, ans=0.125 2024-06-20 21:08:36,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=269206.6666666667, ans=0.0 2024-06-20 21:08:41,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=269225.0, ans=0.125 2024-06-20 21:09:14,429 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2024-06-20 21:09:16,457 INFO [train.py:1028] (1/2) Epoch 15, batch 5250, loss[loss=0.1877, simple_loss=0.2375, pruned_loss=0.06889, over 13283.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2449, pruned_loss=0.07689, over 2570306.89 frames. ], batch size: 52, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:09:16,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=269298.3333333333, ans=0.0 2024-06-20 21:09:29,055 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.33 vs. limit=15.0 2024-06-20 21:09:37,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=269335.0, ans=0.0 2024-06-20 21:09:47,755 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.849e+02 1.975e+02 2.214e+02 2.998e+02, threshold=3.951e+02, percent-clipped=0.0 2024-06-20 21:09:50,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269371.6666666667, ans=0.1 2024-06-20 21:09:51,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=269371.6666666667, ans=0.2 2024-06-20 21:09:56,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=269390.0, ans=0.125 2024-06-20 21:09:56,669 INFO [train.py:1028] (1/2) Epoch 15, batch 5300, loss[loss=0.2041, simple_loss=0.243, pruned_loss=0.08261, over 13038.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2444, pruned_loss=0.07653, over 2565904.29 frames. ], batch size: 144, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:10:00,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.95 vs. 
limit=15.0 2024-06-20 21:10:12,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.56 vs. limit=10.0 2024-06-20 21:10:41,917 INFO [train.py:1028] (1/2) Epoch 15, batch 5350, loss[loss=0.1819, simple_loss=0.2336, pruned_loss=0.06514, over 11777.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2438, pruned_loss=0.07601, over 2572437.85 frames. ], batch size: 17, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:10:53,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=269500.0, ans=0.025 2024-06-20 21:10:53,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.25 vs. limit=6.0 2024-06-20 21:10:56,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=269500.0, ans=0.125 2024-06-20 21:11:13,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=269536.6666666667, ans=0.125 2024-06-20 21:11:16,931 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.823e+02 1.901e+02 2.035e+02 3.037e+02, threshold=3.802e+02, percent-clipped=0.0 2024-06-20 21:11:18,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=269555.0, ans=0.2 2024-06-20 21:11:22,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=269555.0, ans=0.0 2024-06-20 21:11:24,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=269555.0, ans=0.125 2024-06-20 21:11:25,743 INFO [train.py:1028] (1/2) Epoch 15, batch 5400, loss[loss=0.2113, simple_loss=0.2483, pruned_loss=0.08715, over 12215.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2443, pruned_loss=0.07661, over 2564962.43 frames. 
], batch size: 241, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:11:26,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=269573.3333333333, ans=0.125 2024-06-20 21:11:40,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=269610.0, ans=0.2 2024-06-20 21:11:41,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=269610.0, ans=0.125 2024-06-20 21:11:45,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=269610.0, ans=0.0 2024-06-20 21:11:48,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=269628.3333333333, ans=0.125 2024-06-20 21:11:50,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=269628.3333333333, ans=0.125 2024-06-20 21:11:51,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=269628.3333333333, ans=0.125 2024-06-20 21:11:59,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=269646.6666666667, ans=0.125 2024-06-20 21:12:05,794 INFO [train.py:1028] (1/2) Epoch 15, batch 5450, loss[loss=0.1907, simple_loss=0.2445, pruned_loss=0.06843, over 12984.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.244, pruned_loss=0.07627, over 2569158.39 frames. ], batch size: 26, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:12:12,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=269665.0, ans=0.0 2024-06-20 21:12:18,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=269683.3333333333, ans=0.125 2024-06-20 21:12:20,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=269683.3333333333, ans=0.0 2024-06-20 21:12:23,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=269701.6666666667, ans=0.125 2024-06-20 21:12:25,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2024-06-20 21:12:26,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=269701.6666666667, ans=0.0 2024-06-20 21:12:41,290 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.821e+02 1.941e+02 2.083e+02 2.803e+02, threshold=3.881e+02, percent-clipped=0.0 2024-06-20 21:12:50,238 INFO [train.py:1028] (1/2) Epoch 15, batch 5500, loss[loss=0.2187, simple_loss=0.2543, pruned_loss=0.09152, over 12246.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2447, pruned_loss=0.07664, over 2562460.77 frames. 
], batch size: 240, lr: 3.89e-03, grad_scale: 64.0
2024-06-20 21:13:14,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=269811.6666666667, ans=0.125
2024-06-20 21:13:23,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=269830.0, ans=0.125
2024-06-20 21:13:29,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=269848.3333333333, ans=0.125
2024-06-20 21:13:29,734 INFO [train.py:1028] (1/2) Epoch 15, batch 5550, loss[loss=0.2133, simple_loss=0.2508, pruned_loss=0.08795, over 13227.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2446, pruned_loss=0.07651, over 2566424.63 frames. ], batch size: 43, lr: 3.89e-03, grad_scale: 64.0
2024-06-20 21:13:37,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.81 vs. limit=15.0
2024-06-20 21:13:39,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=269848.3333333333, ans=0.125
2024-06-20 21:13:41,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=269866.6666666667, ans=0.125
2024-06-20 21:13:49,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.73 vs. limit=12.0
2024-06-20 21:14:04,333 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.838e+02 1.948e+02 2.151e+02 3.081e+02, threshold=3.897e+02, percent-clipped=0.0
2024-06-20 21:14:10,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=269921.6666666667, ans=0.0
2024-06-20 21:14:12,992 INFO [train.py:1028] (1/2) Epoch 15, batch 5600, loss[loss=0.1821, simple_loss=0.221, pruned_loss=0.07159, over 13222.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2446, pruned_loss=0.07685, over 2568830.82 frames. ], batch size: 89, lr: 3.89e-03, grad_scale: 64.0
2024-06-20 21:14:13,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=269940.0, ans=0.0
2024-06-20 21:14:26,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.77 vs. limit=22.5
2024-06-20 21:14:35,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=269976.6666666667, ans=0.125
2024-06-20 21:14:36,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=269976.6666666667, ans=0.125
2024-06-20 21:14:43,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269995.0, ans=0.1
2024-06-20 21:14:45,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269995.0, ans=0.1
2024-06-20 21:14:50,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.78 vs. limit=15.0
2024-06-20 21:14:54,186 INFO [train.py:1028] (1/2) Epoch 15, batch 5650, loss[loss=0.2221, simple_loss=0.2599, pruned_loss=0.09216, over 12547.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2446, pruned_loss=0.07663, over 2575551.78 frames. ], batch size: 202, lr: 3.89e-03, grad_scale: 64.0
2024-06-20 21:15:19,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=270068.3333333333, ans=0.0
2024-06-20 21:15:20,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=270068.3333333333, ans=0.125
2024-06-20 21:15:22,691 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:15:28,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=270086.6666666667, ans=0.0
2024-06-20 21:15:30,240 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.800e+02 1.928e+02 2.106e+02 2.711e+02, threshold=3.856e+02, percent-clipped=0.0
2024-06-20 21:15:39,482 INFO [train.py:1028] (1/2) Epoch 15, batch 5700, loss[loss=0.1999, simple_loss=0.2556, pruned_loss=0.07208, over 13272.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2445, pruned_loss=0.07667, over 2578749.90 frames. ], batch size: 63, lr: 3.89e-03, grad_scale: 64.0
2024-06-20 21:15:40,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. limit=6.0
2024-06-20 21:15:41,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=270123.3333333333, ans=0.025
2024-06-20 21:15:49,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=270141.6666666667, ans=0.0
2024-06-20 21:15:51,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=270141.6666666667, ans=0.125
2024-06-20 21:15:58,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.36 vs. limit=10.0
2024-06-20 21:16:02,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=270160.0, ans=0.0
2024-06-20 21:16:06,802 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5
2024-06-20 21:16:12,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=270178.3333333333, ans=0.0
2024-06-20 21:16:12,928 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.44 vs. limit=15.0
2024-06-20 21:16:20,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270196.6666666667, ans=0.1
2024-06-20 21:16:23,835 INFO [train.py:1028] (1/2) Epoch 15, batch 5750, loss[loss=0.2057, simple_loss=0.2448, pruned_loss=0.08329, over 12857.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2452, pruned_loss=0.07681, over 2578285.50 frames. ], batch size: 177, lr: 3.89e-03, grad_scale: 64.0
2024-06-20 21:16:26,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=270215.0, ans=0.0
2024-06-20 21:16:28,336 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=7.82 vs. limit=12.0
2024-06-20 21:16:37,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=270233.3333333333, ans=0.125
2024-06-20 21:16:43,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=270251.6666666667, ans=0.0
2024-06-20 21:16:48,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=270270.0, ans=0.125
2024-06-20 21:16:55,926 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.842e+02 1.984e+02 2.208e+02 3.218e+02, threshold=3.967e+02, percent-clipped=0.0
2024-06-20 21:17:05,078 INFO [train.py:1028] (1/2) Epoch 15, batch 5800, loss[loss=0.2092, simple_loss=0.2531, pruned_loss=0.08269, over 12668.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2457, pruned_loss=0.07729, over 2577442.32 frames. ], batch size: 176, lr: 3.89e-03, grad_scale: 64.0
2024-06-20 21:17:05,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=270306.6666666667, ans=0.2
2024-06-20 21:17:06,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=270306.6666666667, ans=0.2
2024-06-20 21:17:06,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=270306.6666666667, ans=0.125
2024-06-20 21:17:08,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=270306.6666666667, ans=0.125
2024-06-20 21:17:35,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=270361.6666666667, ans=0.0
2024-06-20 21:17:39,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270361.6666666667, ans=0.1
2024-06-20 21:17:39,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0
2024-06-20 21:17:49,394 INFO [train.py:1028] (1/2) Epoch 15, batch 5850, loss[loss=0.2372, simple_loss=0.2825, pruned_loss=0.09593, over 12572.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2475, pruned_loss=0.07804, over 2575955.37 frames. ], batch size: 202, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:17:58,498 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.13 vs. limit=15.0
2024-06-20 21:18:01,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=270416.6666666667, ans=0.0
2024-06-20 21:18:11,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=270435.0, ans=0.0
2024-06-20 21:18:16,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=270453.3333333333, ans=0.125
2024-06-20 21:18:20,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.53 vs. limit=15.0
2024-06-20 21:18:25,653 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.849e+02 1.988e+02 2.143e+02 3.516e+02, threshold=3.975e+02, percent-clipped=0.0
2024-06-20 21:18:30,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=270471.6666666667, ans=0.2
2024-06-20 21:18:34,558 INFO [train.py:1028] (1/2) Epoch 15, batch 5900, loss[loss=0.203, simple_loss=0.2399, pruned_loss=0.083, over 13147.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2496, pruned_loss=0.07867, over 2576997.17 frames. ], batch size: 121, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:18:35,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0
2024-06-20 21:18:36,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=270490.0, ans=0.0
2024-06-20 21:18:46,590 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=15.0
2024-06-20 21:18:48,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=270508.3333333333, ans=0.0
2024-06-20 21:19:01,690 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.17 vs. limit=10.0
2024-06-20 21:19:03,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0
2024-06-20 21:19:03,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=270545.0, ans=0.0
2024-06-20 21:19:14,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=270581.6666666667, ans=0.125
2024-06-20 21:19:15,243 INFO [train.py:1028] (1/2) Epoch 15, batch 5950, loss[loss=0.1914, simple_loss=0.2358, pruned_loss=0.07353, over 13138.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2513, pruned_loss=0.07927, over 2581735.77 frames. ], batch size: 121, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:19:41,070 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=15.0
2024-06-20 21:19:41,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=270636.6666666667, ans=0.125
2024-06-20 21:19:45,168 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.948e+02 2.080e+02 2.302e+02 2.912e+02, threshold=4.160e+02, percent-clipped=0.0
2024-06-20 21:19:50,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=270655.0, ans=0.125
2024-06-20 21:19:53,839 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.59 vs. limit=12.0
2024-06-20 21:19:54,027 INFO [train.py:1028] (1/2) Epoch 15, batch 6000, loss[loss=0.2675, simple_loss=0.3022, pruned_loss=0.1164, over 12163.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2523, pruned_loss=0.07965, over 2574960.73 frames. ], batch size: 240, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:19:54,029 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-20 21:20:04,531 INFO [train.py:1060] (1/2) Epoch 15, validation: loss=0.1895, simple_loss=0.2543, pruned_loss=0.06236, over 351949.00 frames.
2024-06-20 21:20:04,532 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-20 21:20:19,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270691.6666666667, ans=0.1
2024-06-20 21:20:25,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=270710.0, ans=0.125
2024-06-20 21:20:30,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=270728.3333333333, ans=0.1
2024-06-20 21:20:37,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=270746.6666666667, ans=0.2
2024-06-20 21:20:43,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=270746.6666666667, ans=0.0
2024-06-20 21:20:45,896 INFO [train.py:1028] (1/2) Epoch 15, batch 6050, loss[loss=0.2019, simple_loss=0.2472, pruned_loss=0.07828, over 13013.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2535, pruned_loss=0.07994, over 2577538.37 frames. ], batch size: 39, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:21:09,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=270801.6666666667, ans=0.09899494936611666
2024-06-20 21:21:16,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=270820.0, ans=0.125
2024-06-20 21:21:17,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270820.0, ans=0.1
2024-06-20 21:21:21,642 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 1.909e+02 2.013e+02 2.202e+02 2.856e+02, threshold=4.025e+02, percent-clipped=0.0
2024-06-20 21:21:26,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=270838.3333333333, ans=0.2
2024-06-20 21:21:29,211 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:21:30,671 INFO [train.py:1028] (1/2) Epoch 15, batch 6100, loss[loss=0.1879, simple_loss=0.2341, pruned_loss=0.0708, over 13099.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2554, pruned_loss=0.08066, over 2579939.23 frames. ], batch size: 121, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:21:36,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=270856.6666666667, ans=0.0
2024-06-20 21:21:41,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=270875.0, ans=0.125
2024-06-20 21:21:50,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270893.3333333333, ans=0.1
2024-06-20 21:22:10,972 INFO [train.py:1028] (1/2) Epoch 15, batch 6150, loss[loss=0.2414, simple_loss=0.2768, pruned_loss=0.103, over 11040.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2573, pruned_loss=0.08139, over 2578393.15 frames. ], batch size: 303, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:22:11,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=12.0
2024-06-20 21:22:13,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270948.3333333333, ans=0.1
2024-06-20 21:22:14,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0
2024-06-20 21:22:16,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=270948.3333333333, ans=0.125
2024-06-20 21:22:18,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=270966.6666666667, ans=0.125
2024-06-20 21:22:19,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=270966.6666666667, ans=0.0
2024-06-20 21:22:46,298 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.925e+02 2.140e+02 2.560e+02 4.089e+02, threshold=4.279e+02, percent-clipped=1.0
2024-06-20 21:22:53,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271021.6666666667, ans=0.1
2024-06-20 21:22:54,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=271040.0, ans=0.025
2024-06-20 21:22:55,147 INFO [train.py:1028] (1/2) Epoch 15, batch 6200, loss[loss=0.2234, simple_loss=0.2737, pruned_loss=0.08655, over 13279.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2592, pruned_loss=0.08239, over 2575776.15 frames. ], batch size: 89, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:23:01,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=271040.0, ans=0.0
2024-06-20 21:23:05,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=271058.3333333333, ans=0.025
2024-06-20 21:23:08,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=271058.3333333333, ans=0.07
2024-06-20 21:23:11,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=271076.6666666667, ans=0.0
2024-06-20 21:23:26,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=271095.0, ans=0.125
2024-06-20 21:23:33,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=271113.3333333333, ans=0.0
2024-06-20 21:23:40,269 INFO [train.py:1028] (1/2) Epoch 15, batch 6250, loss[loss=0.2226, simple_loss=0.2714, pruned_loss=0.08691, over 13185.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2608, pruned_loss=0.08315, over 2569039.48 frames. ], batch size: 83, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:23:43,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=271131.6666666667, ans=0.0
2024-06-20 21:23:49,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=271150.0, ans=0.0
2024-06-20 21:23:51,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.74 vs. limit=12.0
2024-06-20 21:23:52,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=271150.0, ans=0.015
2024-06-20 21:24:04,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271186.6666666667, ans=0.1
2024-06-20 21:24:06,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0
2024-06-20 21:24:10,605 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 1.994e+02 2.192e+02 2.660e+02 4.379e+02, threshold=4.384e+02, percent-clipped=1.0
2024-06-20 21:24:19,103 INFO [train.py:1028] (1/2) Epoch 15, batch 6300, loss[loss=0.2123, simple_loss=0.2647, pruned_loss=0.07994, over 11459.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2624, pruned_loss=0.08356, over 2563947.65 frames. ], batch size: 17, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:24:21,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=271223.3333333333, ans=0.125
2024-06-20 21:24:36,692 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=15.0
2024-06-20 21:24:36,833 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0
2024-06-20 21:24:46,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=271278.3333333333, ans=0.0
2024-06-20 21:24:52,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=271278.3333333333, ans=0.2
2024-06-20 21:25:02,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.94 vs. limit=22.5
2024-06-20 21:25:02,604 INFO [train.py:1028] (1/2) Epoch 15, batch 6350, loss[loss=0.2658, simple_loss=0.2975, pruned_loss=0.117, over 12470.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2637, pruned_loss=0.08361, over 2573623.57 frames. ], batch size: 202, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:25:03,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=271315.0, ans=0.125
2024-06-20 21:25:08,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=271315.0, ans=0.125
2024-06-20 21:25:25,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=271351.6666666667, ans=0.125
2024-06-20 21:25:27,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=271351.6666666667, ans=0.2
2024-06-20 21:25:35,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=271370.0, ans=0.125
2024-06-20 21:25:38,073 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 1.903e+02 2.037e+02 2.257e+02 3.021e+02, threshold=4.074e+02, percent-clipped=0.0
2024-06-20 21:25:45,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=271388.3333333333, ans=0.035
2024-06-20 21:25:46,685 INFO [train.py:1028] (1/2) Epoch 15, batch 6400, loss[loss=0.2084, simple_loss=0.2641, pruned_loss=0.07636, over 13243.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2653, pruned_loss=0.08425, over 2575378.65 frames. ], batch size: 67, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:26:01,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=271425.0, ans=0.0
2024-06-20 21:26:15,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271461.6666666667, ans=0.1
2024-06-20 21:26:22,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=271480.0, ans=0.125
2024-06-20 21:26:23,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=271480.0, ans=0.125
2024-06-20 21:26:26,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=271480.0, ans=0.125
2024-06-20 21:26:29,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271498.3333333333, ans=0.1
2024-06-20 21:26:29,843 INFO [train.py:1028] (1/2) Epoch 15, batch 6450, loss[loss=0.2566, simple_loss=0.2986, pruned_loss=0.1073, over 12521.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2672, pruned_loss=0.08515, over 2581402.57 frames. ], batch size: 202, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:26:43,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.39 vs. limit=22.5
2024-06-20 21:26:48,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.54 vs. limit=15.0
2024-06-20 21:26:48,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=271535.0, ans=0.2
2024-06-20 21:27:01,186 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.012e+02 2.165e+02 2.335e+02 3.505e+02, threshold=4.330e+02, percent-clipped=0.0
2024-06-20 21:27:03,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=271571.6666666667, ans=0.125
2024-06-20 21:27:10,225 INFO [train.py:1028] (1/2) Epoch 15, batch 6500, loss[loss=0.2366, simple_loss=0.2705, pruned_loss=0.1013, over 10733.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2683, pruned_loss=0.08532, over 2585231.13 frames. ], batch size: 304, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:27:22,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=271608.3333333333, ans=0.0
2024-06-20 21:27:42,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271645.0, ans=0.1
2024-06-20 21:27:42,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=271645.0, ans=0.125
2024-06-20 21:27:52,988 INFO [train.py:1028] (1/2) Epoch 15, batch 6550, loss[loss=0.2214, simple_loss=0.2723, pruned_loss=0.08527, over 12750.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.269, pruned_loss=0.08521, over 2588384.81 frames. ], batch size: 22, lr: 3.88e-03, grad_scale: 64.0
2024-06-20 21:27:53,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.35 vs. limit=15.0
2024-06-20 21:27:57,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=271681.6666666667, ans=0.04949747468305833
2024-06-20 21:28:13,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=271718.3333333333, ans=0.05
2024-06-20 21:28:16,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=271736.6666666667, ans=0.2
2024-06-20 21:28:17,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=271736.6666666667, ans=0.0
2024-06-20 21:28:24,385 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 1.969e+02 2.084e+02 2.276e+02 3.903e+02, threshold=4.168e+02, percent-clipped=0.0
2024-06-20 21:28:36,625 INFO [train.py:1028] (1/2) Epoch 15, batch 6600, loss[loss=0.205, simple_loss=0.2528, pruned_loss=0.07859, over 13259.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2688, pruned_loss=0.08525, over 2590920.61 frames. ], batch size: 72, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:28:36,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271773.3333333333, ans=0.1
2024-06-20 21:28:57,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=271810.0, ans=10.0
2024-06-20 21:29:03,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.35 vs. limit=15.0
2024-06-20 21:29:04,276 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:29:18,224 INFO [train.py:1028] (1/2) Epoch 15, batch 6650, loss[loss=0.2402, simple_loss=0.2864, pruned_loss=0.09701, over 12957.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2704, pruned_loss=0.08597, over 2584608.13 frames. ], batch size: 158, lr: 3.87e-03, grad_scale: 128.0
2024-06-20 21:29:28,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=271883.3333333333, ans=0.125
2024-06-20 21:29:33,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=271883.3333333333, ans=0.125
2024-06-20 21:29:34,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=271901.6666666667, ans=0.025
2024-06-20 21:29:36,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=271901.6666666667, ans=0.025
2024-06-20 21:29:36,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=271901.6666666667, ans=0.125
2024-06-20 21:29:48,866 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.678e+02 2.001e+02 2.220e+02 2.440e+02 3.486e+02, threshold=4.441e+02, percent-clipped=0.0
2024-06-20 21:29:55,586 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.62 vs. limit=10.0
2024-06-20 21:29:58,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=271938.3333333333, ans=0.015
2024-06-20 21:30:01,810 INFO [train.py:1028] (1/2) Epoch 15, batch 6700, loss[loss=0.229, simple_loss=0.2735, pruned_loss=0.09228, over 12815.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2712, pruned_loss=0.08615, over 2584423.82 frames. ], batch size: 177, lr: 3.87e-03, grad_scale: 128.0
2024-06-20 21:30:05,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=271956.6666666667, ans=0.0
2024-06-20 21:30:06,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=271956.6666666667, ans=0.0
2024-06-20 21:30:12,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=271975.0, ans=0.0
2024-06-20 21:30:30,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=272011.6666666667, ans=0.015
2024-06-20 21:30:42,995 INFO [train.py:1028] (1/2) Epoch 15, batch 6750, loss[loss=0.3084, simple_loss=0.3375, pruned_loss=0.1396, over 12224.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2719, pruned_loss=0.08649, over 2578915.21 frames. ], batch size: 241, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:31:00,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.67 vs. limit=12.0
2024-06-20 21:31:01,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=272066.6666666667, ans=0.125
2024-06-20 21:31:05,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=272085.0, ans=0.125
2024-06-20 21:31:18,996 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 1.979e+02 2.127e+02 2.240e+02 3.296e+02, threshold=4.254e+02, percent-clipped=0.0
2024-06-20 21:31:27,002 INFO [train.py:1028] (1/2) Epoch 15, batch 6800, loss[loss=0.2234, simple_loss=0.2758, pruned_loss=0.0855, over 13173.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2734, pruned_loss=0.0869, over 2579839.34 frames. ], batch size: 67, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:31:28,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272140.0, ans=0.1
2024-06-20 21:31:29,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=272140.0, ans=0.025
2024-06-20 21:32:06,736 INFO [train.py:1028] (1/2) Epoch 15, batch 6850, loss[loss=0.2356, simple_loss=0.3001, pruned_loss=0.0856, over 13302.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2741, pruned_loss=0.08696, over 2583835.38 frames. ], batch size: 63, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:32:11,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=272231.6666666667, ans=0.125
2024-06-20 21:32:16,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=272250.0, ans=0.125
2024-06-20 21:32:24,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=272268.3333333333, ans=0.025
2024-06-20 21:32:39,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=272286.6666666667, ans=0.0
2024-06-20 21:32:39,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272286.6666666667, ans=0.0
2024-06-20 21:32:42,200 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 1.945e+02 2.117e+02 2.305e+02 3.142e+02, threshold=4.235e+02, percent-clipped=0.0
2024-06-20 21:32:50,659 INFO [train.py:1028] (1/2) Epoch 15, batch 6900, loss[loss=0.2267, simple_loss=0.2824, pruned_loss=0.08554, over 13280.00 frames. ], tot_loss[loss=0.225, simple_loss=0.275, pruned_loss=0.08755, over 2585978.91 frames. ], batch size: 49, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:33:05,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=272341.6666666667, ans=0.125
2024-06-20 21:33:06,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=272360.0, ans=0.0
2024-06-20 21:33:17,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=272378.3333333333, ans=0.125
2024-06-20 21:33:20,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=272378.3333333333, ans=0.04949747468305833
2024-06-20 21:33:23,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=272396.6666666667, ans=0.2
2024-06-20 21:33:27,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=15.0
2024-06-20 21:33:35,300 INFO [train.py:1028] (1/2) Epoch 15, batch 6950, loss[loss=0.1994, simple_loss=0.2461, pruned_loss=0.07633, over 11215.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2754, pruned_loss=0.08735, over 2579640.31 frames. ], batch size: 16, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:33:47,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=272433.3333333333, ans=0.125
2024-06-20 21:34:08,475 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 1.996e+02 2.117e+02 2.330e+02 3.249e+02, threshold=4.233e+02, percent-clipped=0.0
2024-06-20 21:34:09,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=272488.3333333333, ans=0.125
2024-06-20 21:34:14,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=272488.3333333333, ans=0.0
2024-06-20 21:34:16,132 INFO [train.py:1028] (1/2) Epoch 15, batch 7000, loss[loss=0.2337, simple_loss=0.2804, pruned_loss=0.09348, over 12961.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2762, pruned_loss=0.08736, over 2575122.28 frames. ], batch size: 158, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:34:33,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.32 vs. limit=15.0
2024-06-20 21:34:53,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272580.0, ans=0.0
2024-06-20 21:34:55,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=272580.0, ans=0.0
2024-06-20 21:34:57,120 INFO [train.py:1028] (1/2) Epoch 15, batch 7050, loss[loss=0.2441, simple_loss=0.2943, pruned_loss=0.0969, over 12806.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2766, pruned_loss=0.08731, over 2581994.73 frames. ], batch size: 176, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:35:22,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.62 vs. limit=15.0
2024-06-20 21:35:32,810 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.992e+02 2.131e+02 2.309e+02 2.932e+02, threshold=4.261e+02, percent-clipped=0.0
2024-06-20 21:35:33,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.82 vs. limit=15.0
2024-06-20 21:35:33,408 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.11 vs. limit=10.0
2024-06-20 21:35:33,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=272671.6666666667, ans=0.025
2024-06-20 21:35:37,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=272671.6666666667, ans=0.125
2024-06-20 21:35:39,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=272671.6666666667, ans=0.2
2024-06-20 21:35:40,719 INFO [train.py:1028] (1/2) Epoch 15, batch 7100, loss[loss=0.2443, simple_loss=0.2929, pruned_loss=0.09784, over 13177.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2773, pruned_loss=0.08788, over 2574869.60 frames. ], batch size: 112, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:35:49,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=272708.3333333333, ans=0.0
2024-06-20 21:35:55,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=272726.6666666667, ans=0.125
2024-06-20 21:35:58,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=272726.6666666667, ans=0.125
2024-06-20 21:35:59,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272726.6666666667, ans=0.1
2024-06-20 21:36:01,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=272726.6666666667, ans=0.025
2024-06-20 21:36:10,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=272745.0, ans=0.2
2024-06-20 21:36:16,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=272763.3333333333, ans=10.0
2024-06-20 21:36:23,937 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.28 vs. limit=15.0
2024-06-20 21:36:24,208 INFO [train.py:1028] (1/2) Epoch 15, batch 7150, loss[loss=0.2639, simple_loss=0.311, pruned_loss=0.1084, over 12574.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2783, pruned_loss=0.08775, over 2573199.45 frames. ], batch size: 202, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:36:26,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=12.0
2024-06-20 21:36:38,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.48 vs. limit=15.0
2024-06-20 21:36:45,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.91 vs. limit=15.0
2024-06-20 21:36:56,617 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 1.969e+02 2.138e+02 2.340e+02 3.254e+02, threshold=4.275e+02, percent-clipped=0.0
2024-06-20 21:37:03,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272873.3333333333, ans=0.1
2024-06-20 21:37:04,527 INFO [train.py:1028] (1/2) Epoch 15, batch 7200, loss[loss=0.2634, simple_loss=0.312, pruned_loss=0.1073, over 13121.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2801, pruned_loss=0.08856, over 2578713.61 frames. ], batch size: 112, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:37:16,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.80 vs. limit=22.5
2024-06-20 21:37:19,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=272891.6666666667, ans=0.125
2024-06-20 21:37:26,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=272910.0, ans=0.0
2024-06-20 21:37:40,342 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0
2024-06-20 21:37:48,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=272965.0, ans=0.125
2024-06-20 21:37:48,829 INFO [train.py:1028] (1/2) Epoch 15, batch 7250, loss[loss=0.2042, simple_loss=0.2587, pruned_loss=0.0748, over 12937.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2804, pruned_loss=0.08857, over 2580460.13 frames. ], batch size: 36, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:37:49,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=272965.0, ans=0.0
2024-06-20 21:37:52,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=272965.0, ans=0.025
2024-06-20 21:37:58,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=272983.3333333333, ans=22.5
2024-06-20 21:38:00,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=272983.3333333333, ans=0.125
2024-06-20 21:38:00,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=272983.3333333333, ans=0.125
2024-06-20 21:38:12,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273020.0, ans=0.1
2024-06-20 21:38:14,386 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0
2024-06-20 21:38:20,695 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 2.004e+02 2.165e+02 2.430e+02 2.934e+02, threshold=4.330e+02, percent-clipped=0.0
2024-06-20 21:38:28,233 INFO [train.py:1028] (1/2) Epoch 15, batch 7300, loss[loss=0.2517, simple_loss=0.3104, pruned_loss=0.09648, over 13023.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2816, pruned_loss=0.0891, over 2580399.38 frames. ], batch size: 36, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:38:40,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=273075.0, ans=0.0
2024-06-20 21:38:51,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273093.3333333333, ans=0.1
2024-06-20 21:38:55,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=273111.6666666667, ans=0.0
2024-06-20 21:39:02,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=273111.6666666667, ans=0.95
2024-06-20 21:39:05,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=15.0
2024-06-20 21:39:12,274 INFO [train.py:1028] (1/2) Epoch 15, batch 7350, loss[loss=0.2489, simple_loss=0.299, pruned_loss=0.09945, over 13348.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2817, pruned_loss=0.08914, over 2582019.31 frames. ], batch size: 46, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:39:19,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0
2024-06-20 21:39:24,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=273166.6666666667, ans=0.125
2024-06-20 21:39:43,951 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.976e+02 2.081e+02 2.234e+02 3.296e+02, threshold=4.163e+02, percent-clipped=0.0
2024-06-20 21:39:47,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=273221.6666666667, ans=0.125
2024-06-20 21:39:48,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=273221.6666666667, ans=0.0
2024-06-20 21:39:48,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=273221.6666666667, ans=0.125
2024-06-20 21:39:51,895 INFO [train.py:1028] (1/2) Epoch 15, batch 7400, loss[loss=0.2275, simple_loss=0.2901, pruned_loss=0.08243, over 13281.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.282, pruned_loss=0.08923, over 2587523.92 frames. ], batch size: 63, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:39:56,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273240.0, ans=0.1
2024-06-20 21:40:00,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=273258.3333333333, ans=0.125
2024-06-20 21:40:03,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=273258.3333333333, ans=0.125
2024-06-20 21:40:23,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273295.0, ans=0.1
2024-06-20 21:40:24,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=273295.0, ans=0.5
2024-06-20 21:40:28,961 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.95 vs. limit=15.0
2024-06-20 21:40:32,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=15.0
2024-06-20 21:40:36,612 INFO [train.py:1028] (1/2) Epoch 15, batch 7450, loss[loss=0.2258, simple_loss=0.2795, pruned_loss=0.08609, over 12744.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2817, pruned_loss=0.08911, over 2580243.59 frames. ], batch size: 29, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:40:39,593 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:41:00,643 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0
2024-06-20 21:41:09,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=273386.6666666667, ans=0.125
2024-06-20 21:41:11,680 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0
2024-06-20 21:41:12,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=273386.6666666667, ans=0.1
2024-06-20 21:41:13,587 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.136e+02 2.454e+02 2.818e+02 4.281e+02, threshold=4.907e+02, percent-clipped=1.0
2024-06-20 21:41:19,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273405.0, ans=0.1
2024-06-20 21:41:21,344 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.486e+00
2024-06-20 21:41:21,782 INFO [train.py:1028] (1/2) Epoch 15, batch 7500, loss[loss=0.2405, simple_loss=0.2838, pruned_loss=0.0986, over 10660.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2832, pruned_loss=0.08984, over 2577659.91 frames. ], batch size: 303, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:41:22,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=273423.3333333333, ans=0.0
2024-06-20 21:41:34,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=273441.6666666667, ans=0.125
2024-06-20 21:41:35,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=273441.6666666667, ans=0.125
2024-06-20 21:41:44,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=273460.0, ans=0.125
2024-06-20 21:41:54,867 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.013e+00
2024-06-20 21:42:01,031 INFO [train.py:1028] (1/2) Epoch 15, batch 7550, loss[loss=0.2283, simple_loss=0.2774, pruned_loss=0.08961, over 12980.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2838, pruned_loss=0.09044, over 2577108.37 frames. ], batch size: 158, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:42:01,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.96 vs. limit=10.0
2024-06-20 21:42:28,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.02 vs. limit=15.0
2024-06-20 21:42:31,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=273570.0, ans=0.0
2024-06-20 21:42:36,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=273570.0, ans=0.05
2024-06-20 21:42:37,466 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 1.971e+02 2.074e+02 2.300e+02 2.935e+02, threshold=4.147e+02, percent-clipped=0.0
2024-06-20 21:42:45,771 INFO [train.py:1028] (1/2) Epoch 15, batch 7600, loss[loss=0.2279, simple_loss=0.2894, pruned_loss=0.08325, over 13245.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.2843, pruned_loss=0.09079, over 2576315.06 frames. ], batch size: 83, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:42:46,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=273606.6666666667, ans=0.2
2024-06-20 21:43:03,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=273643.3333333333, ans=0.05
2024-06-20 21:43:05,149 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.755e+01
2024-06-20 21:43:13,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=273661.6666666667, ans=0.125
2024-06-20 21:43:30,486 INFO [train.py:1028] (1/2) Epoch 15, batch 7650, loss[loss=0.2395, simple_loss=0.2838, pruned_loss=0.0976, over 12934.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2844, pruned_loss=0.09088, over 2572803.87 frames. ], batch size: 33, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:43:33,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=273698.3333333333, ans=0.0
2024-06-20 21:43:33,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=273698.3333333333, ans=0.2
2024-06-20 21:43:55,706 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.64 vs. limit=15.0
2024-06-20 21:43:56,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=273753.3333333333, ans=0.125
2024-06-20 21:44:04,031 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 2.018e+02 2.299e+02 2.611e+02 3.832e+02, threshold=4.598e+02, percent-clipped=0.0
2024-06-20 21:44:12,015 INFO [train.py:1028] (1/2) Epoch 15, batch 7700, loss[loss=0.2434, simple_loss=0.3097, pruned_loss=0.0885, over 13253.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.285, pruned_loss=0.09111, over 2569253.08 frames. ], batch size: 63, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:44:15,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=273790.0, ans=0.2
2024-06-20 21:44:18,828 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.496e+01
2024-06-20 21:44:29,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=273826.6666666667, ans=0.125
2024-06-20 21:44:30,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=273826.6666666667, ans=0.125
2024-06-20 21:44:43,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0
2024-06-20 21:44:46,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=273863.3333333333, ans=0.125
2024-06-20 21:44:48,751 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:44:49,840 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.06 vs. limit=22.5
2024-06-20 21:44:51,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=273863.3333333333, ans=0.125
2024-06-20 21:44:54,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=273881.6666666667, ans=0.125
2024-06-20 21:44:54,751 INFO [train.py:1028] (1/2) Epoch 15, batch 7750, loss[loss=0.2242, simple_loss=0.2828, pruned_loss=0.08281, over 13045.00 frames. ], tot_loss[loss=0.235, simple_loss=0.2861, pruned_loss=0.09194, over 2572965.47 frames. ], batch size: 71, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:45:07,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273900.0, ans=0.1
2024-06-20 21:45:11,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=273918.3333333333, ans=0.025
2024-06-20 21:45:11,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=273918.3333333333, ans=0.0
2024-06-20 21:45:26,857 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.023e+02 2.214e+02 2.393e+02 2.988e+02, threshold=4.429e+02, percent-clipped=0.0
2024-06-20 21:45:29,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=273955.0, ans=0.0
2024-06-20 21:45:30,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=273955.0, ans=0.0
2024-06-20 21:45:34,898 INFO [train.py:1028] (1/2) Epoch 15, batch 7800, loss[loss=0.2375, simple_loss=0.2932, pruned_loss=0.09093, over 13127.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2869, pruned_loss=0.09199, over 2576402.62 frames. ], batch size: 95, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:45:38,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=273973.3333333333, ans=0.0
2024-06-20 21:45:39,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.56 vs. limit=15.0
2024-06-20 21:45:48,118 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:46:00,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=274010.0, ans=0.2
2024-06-20 21:46:07,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=274028.3333333333, ans=0.2
2024-06-20 21:46:20,028 INFO [train.py:1028] (1/2) Epoch 15, batch 7850, loss[loss=0.201, simple_loss=0.2512, pruned_loss=0.07547, over 11853.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.2871, pruned_loss=0.09214, over 2570888.90 frames. ], batch size: 17, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:46:32,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=274083.3333333333, ans=0.125
2024-06-20 21:46:37,821 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.07 vs. limit=22.5
2024-06-20 21:46:42,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.30 vs. limit=10.0
2024-06-20 21:46:51,514 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.018e+02 2.217e+02 2.411e+02 3.643e+02, threshold=4.434e+02, percent-clipped=0.0
2024-06-20 21:47:01,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=274138.3333333333, ans=0.0
2024-06-20 21:47:03,312 INFO [train.py:1028] (1/2) Epoch 15, batch 7900, loss[loss=0.2405, simple_loss=0.2881, pruned_loss=0.09645, over 13191.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2868, pruned_loss=0.0919, over 2569995.48 frames. ], batch size: 77, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:47:09,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=274156.6666666667, ans=0.2
2024-06-20 21:47:36,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=274230.0, ans=0.0
2024-06-20 21:47:42,859 INFO [train.py:1028] (1/2) Epoch 15, batch 7950, loss[loss=0.2197, simple_loss=0.2655, pruned_loss=0.08694, over 10592.00 frames. ], tot_loss[loss=0.235, simple_loss=0.2868, pruned_loss=0.09157, over 2574236.64 frames. ], batch size: 304, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:47:47,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=274248.3333333333, ans=0.1
2024-06-20 21:47:53,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=274266.6666666667, ans=0.125
2024-06-20 21:48:09,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=274285.0, ans=0.125
2024-06-20 21:48:09,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=274285.0, ans=0.125
2024-06-20 21:48:15,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=274303.3333333333, ans=0.125
2024-06-20 21:48:19,240 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 1.992e+02 2.217e+02 2.442e+02 3.835e+02, threshold=4.433e+02, percent-clipped=0.0
2024-06-20 21:48:24,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=274321.6666666667, ans=0.125
2024-06-20 21:48:27,217 INFO [train.py:1028] (1/2) Epoch 15, batch 8000, loss[loss=0.2039, simple_loss=0.2638, pruned_loss=0.07196, over 12994.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.2878, pruned_loss=0.09182, over 2572575.51 frames. ], batch size: 30, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:48:28,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=12.0
2024-06-20 21:48:29,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.45 vs. limit=6.0
2024-06-20 21:48:32,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=274340.0, ans=0.125
2024-06-20 21:49:01,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=274413.3333333333, ans=0.0
2024-06-20 21:49:07,636 INFO [train.py:1028] (1/2) Epoch 15, batch 8050, loss[loss=0.2344, simple_loss=0.282, pruned_loss=0.09338, over 13218.00 frames. ], tot_loss[loss=0.235, simple_loss=0.2872, pruned_loss=0.09142, over 2572462.18 frames. ], batch size: 83, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:49:24,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=274468.3333333333, ans=10.0
2024-06-20 21:49:25,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=274468.3333333333, ans=0.125
2024-06-20 21:49:37,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=274486.6666666667, ans=0.0
2024-06-20 21:49:37,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.84 vs. limit=15.0
2024-06-20 21:49:38,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=274486.6666666667, ans=0.125
2024-06-20 21:49:38,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274486.6666666667, ans=0.1
2024-06-20 21:49:42,911 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.137e+02 2.396e+02 2.645e+02 4.063e+02, threshold=4.791e+02, percent-clipped=0.0
2024-06-20 21:49:45,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=274505.0, ans=0.05
2024-06-20 21:49:46,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=274505.0, ans=0.0
2024-06-20 21:49:50,756 INFO [train.py:1028] (1/2) Epoch 15, batch 8100, loss[loss=0.2429, simple_loss=0.2918, pruned_loss=0.09703, over 13168.00 frames. ], tot_loss[loss=0.236, simple_loss=0.2881, pruned_loss=0.09191, over 2576248.82 frames. ], batch size: 112, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:49:59,987 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.180e+01
2024-06-20 21:50:06,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=274560.0, ans=0.125
2024-06-20 21:50:10,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274560.0, ans=0.1
2024-06-20 21:50:31,394 INFO [train.py:1028] (1/2) Epoch 15, batch 8150, loss[loss=0.2145, simple_loss=0.2607, pruned_loss=0.08412, over 13064.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.2875, pruned_loss=0.09095, over 2580391.77 frames. ], batch size: 121, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:50:31,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=274615.0, ans=0.125
2024-06-20 21:50:33,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=274615.0, ans=0.2
2024-06-20 21:50:34,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.07 vs. limit=15.0
2024-06-20 21:50:54,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=274651.6666666667, ans=0.0
2024-06-20 21:51:01,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=274670.0, ans=0.1
2024-06-20 21:51:04,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=274670.0, ans=0.1
2024-06-20 21:51:06,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=274688.3333333333, ans=0.125
2024-06-20 21:51:07,144 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.749e+02 2.042e+02 2.184e+02 2.355e+02 3.075e+02, threshold=4.368e+02, percent-clipped=0.0
2024-06-20 21:51:09,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=274688.3333333333, ans=0.025
2024-06-20 21:51:15,219 INFO [train.py:1028] (1/2) Epoch 15, batch 8200, loss[loss=0.228, simple_loss=0.2865, pruned_loss=0.08475, over 13150.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.2879, pruned_loss=0.09125, over 2583862.59 frames. ], batch size: 112, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:51:15,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=274706.6666666667, ans=0.125
2024-06-20 21:51:16,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=274706.6666666667, ans=0.0
2024-06-20 21:51:42,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0
2024-06-20 21:51:45,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.24 vs. limit=6.0
2024-06-20 21:51:46,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=274761.6666666667, ans=0.125
2024-06-20 21:51:46,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=274761.6666666667, ans=0.09899494936611666
2024-06-20 21:51:59,881 INFO [train.py:1028] (1/2) Epoch 15, batch 8250, loss[loss=0.2176, simple_loss=0.2783, pruned_loss=0.07841, over 13311.00 frames. ], tot_loss[loss=0.236, simple_loss=0.2884, pruned_loss=0.09175, over 2583469.69 frames. ], batch size: 52, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:52:08,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=274816.6666666667, ans=0.1
2024-06-20 21:52:28,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.29 vs. limit=15.0
2024-06-20 21:52:29,825 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.012e+02 2.138e+02 2.291e+02 3.037e+02, threshold=4.275e+02, percent-clipped=0.0
2024-06-20 21:52:33,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=274871.6666666667, ans=0.2
2024-06-20 21:52:36,497 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.83 vs.
limit=12.0 2024-06-20 21:52:37,817 INFO [train.py:1028] (1/2) Epoch 15, batch 8300, loss[loss=0.2357, simple_loss=0.2798, pruned_loss=0.09581, over 13048.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2876, pruned_loss=0.09151, over 2580472.82 frames. ], batch size: 102, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:52:42,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=274890.0, ans=0.2 2024-06-20 21:52:45,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=274908.3333333333, ans=10.0 2024-06-20 21:52:47,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=274908.3333333333, ans=0.1 2024-06-20 21:52:48,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.20 vs. limit=15.0 2024-06-20 21:52:55,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=274926.6666666667, ans=0.2 2024-06-20 21:53:06,919 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.53 vs. limit=6.0 2024-06-20 21:53:08,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=274945.0, ans=0.1 2024-06-20 21:53:15,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.44 vs. limit=12.0 2024-06-20 21:53:18,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=274963.3333333333, ans=0.1 2024-06-20 21:53:21,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=274981.6666666667, ans=0.125 2024-06-20 21:53:22,063 INFO [train.py:1028] (1/2) Epoch 15, batch 8350, loss[loss=0.2382, simple_loss=0.2915, pruned_loss=0.09242, over 13142.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.288, pruned_loss=0.09131, over 2581084.01 frames. ], batch size: 112, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:53:23,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=274981.6666666667, ans=0.125 2024-06-20 21:53:32,727 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. 
limit=15.0 2024-06-20 21:53:39,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=275018.3333333333, ans=0.125 2024-06-20 21:53:47,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=275036.6666666667, ans=0.125 2024-06-20 21:53:49,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=275036.6666666667, ans=0.125 2024-06-20 21:53:54,232 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.031e+02 2.182e+02 2.380e+02 2.962e+02, threshold=4.364e+02, percent-clipped=0.0 2024-06-20 21:53:57,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275055.0, ans=0.1 2024-06-20 21:54:01,991 INFO [train.py:1028] (1/2) Epoch 15, batch 8400, loss[loss=0.1872, simple_loss=0.2419, pruned_loss=0.0662, over 12892.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.2876, pruned_loss=0.09109, over 2579300.40 frames. ], batch size: 39, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:54:31,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=275128.3333333333, ans=0.0 2024-06-20 21:54:45,631 INFO [train.py:1028] (1/2) Epoch 15, batch 8450, loss[loss=0.2426, simple_loss=0.2984, pruned_loss=0.09337, over 13189.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.2884, pruned_loss=0.09145, over 2580917.32 frames. ], batch size: 112, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:54:54,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=275183.3333333333, ans=0.0 2024-06-20 21:55:09,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.65 vs. limit=15.0 2024-06-20 21:55:17,668 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.059e+02 2.169e+02 2.348e+02 3.167e+02, threshold=4.338e+02, percent-clipped=0.0 2024-06-20 21:55:29,533 INFO [train.py:1028] (1/2) Epoch 15, batch 8500, loss[loss=0.2208, simple_loss=0.2688, pruned_loss=0.08643, over 12730.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.2895, pruned_loss=0.0921, over 2577972.05 frames. 
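The Whitening lines compare a scalar measure of how far a module's feature covariance is from white against that module's limit (metric=13.65 vs. limit=15.0 in the entry above). One standard such measure is E[lambda^2] / (E[lambda])^2 over the covariance eigenvalues, which equals 1.0 for perfectly white features and grows with the spread of the spectrum. A hedged sketch of that computation; details such as grouping and numerical guards are assumptions, not the exact scaling.py code:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """x: (num_frames, num_channels); returns E[lambda^2] / E[lambda]^2
    of the per-group feature covariance, averaged over groups."""
    n, c = x.shape
    d = c // num_groups
    x = x.reshape(n, num_groups, d).transpose(0, 1)   # (groups, frames, d)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / n                   # per-group covariance
    mean_eig = torch.diagonal(cov, dim1=1, dim2=2).mean(dim=1)   # E[lambda]
    mean_eig_sq = (cov ** 2).sum(dim=(1, 2)) / d                 # E[lambda^2]
    return (mean_eig_sq / mean_eig.clamp(min=1e-8) ** 2).mean()

print(whitening_metric(torch.randn(4000, 384)))  # ~1.0 for white noise
```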
], batch size: 29, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:55:35,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=275256.6666666667, ans=0.125 2024-06-20 21:55:45,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=275293.3333333333, ans=15.0 2024-06-20 21:55:46,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=275293.3333333333, ans=0.125 2024-06-20 21:55:55,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=275311.6666666667, ans=0.125 2024-06-20 21:55:55,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=275311.6666666667, ans=0.0 2024-06-20 21:55:59,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=275311.6666666667, ans=0.2 2024-06-20 21:56:09,964 INFO [train.py:1028] (1/2) Epoch 15, batch 8550, loss[loss=0.2444, simple_loss=0.2992, pruned_loss=0.09479, over 12527.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.2893, pruned_loss=0.09195, over 2575179.99 frames. ], batch size: 22, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:56:24,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=275366.6666666667, ans=0.2 2024-06-20 21:56:42,291 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 2.005e+02 2.149e+02 2.347e+02 2.864e+02, threshold=4.298e+02, percent-clipped=0.0 2024-06-20 21:56:53,997 INFO [train.py:1028] (1/2) Epoch 15, batch 8600, loss[loss=0.2384, simple_loss=0.2853, pruned_loss=0.09578, over 13098.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2898, pruned_loss=0.09198, over 2573285.31 frames. ], batch size: 121, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:56:54,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=275440.0, ans=0.125 2024-06-20 21:56:59,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=275440.0, ans=0.125 2024-06-20 21:57:03,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=275458.3333333333, ans=0.0 2024-06-20 21:57:12,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=275476.6666666667, ans=0.95 2024-06-20 21:57:30,462 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.43 vs. limit=15.0 2024-06-20 21:57:31,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=275513.3333333333, ans=0.2 2024-06-20 21:57:31,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275513.3333333333, ans=0.1 2024-06-20 21:57:34,780 INFO [train.py:1028] (1/2) Epoch 15, batch 8650, loss[loss=0.2094, simple_loss=0.2605, pruned_loss=0.07915, over 13003.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.2895, pruned_loss=0.0918, over 2576209.35 frames. 
], batch size: 102, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:57:43,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=275550.0, ans=0.125 2024-06-20 21:58:03,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=275586.6666666667, ans=0.2 2024-06-20 21:58:09,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=275586.6666666667, ans=0.025 2024-06-20 21:58:10,335 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.011e+02 2.128e+02 2.348e+02 2.972e+02, threshold=4.256e+02, percent-clipped=0.0 2024-06-20 21:58:13,922 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.98 vs. limit=15.0 2024-06-20 21:58:18,212 INFO [train.py:1028] (1/2) Epoch 15, batch 8700, loss[loss=0.2441, simple_loss=0.3045, pruned_loss=0.09186, over 13180.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.2894, pruned_loss=0.0919, over 2573703.79 frames. ], batch size: 59, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:58:46,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=275678.3333333333, ans=0.125 2024-06-20 21:58:58,403 INFO [train.py:1028] (1/2) Epoch 15, batch 8750, loss[loss=0.2339, simple_loss=0.2838, pruned_loss=0.09196, over 13078.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2896, pruned_loss=0.09205, over 2569531.34 frames. ], batch size: 121, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:59:02,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=275715.0, ans=10.0 2024-06-20 21:59:34,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=275788.3333333333, ans=0.0 2024-06-20 21:59:35,492 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.974e+02 2.145e+02 2.289e+02 3.457e+02, threshold=4.290e+02, percent-clipped=0.0 2024-06-20 21:59:37,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275788.3333333333, ans=0.1 2024-06-20 21:59:38,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=275788.3333333333, ans=0.125 2024-06-20 21:59:42,782 INFO [train.py:1028] (1/2) Epoch 15, batch 8800, loss[loss=0.2576, simple_loss=0.3151, pruned_loss=0.1001, over 13217.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.2903, pruned_loss=0.09272, over 2574564.19 frames. 
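Across all of these progress lines the three loss columns satisfy loss = 0.5 * simple_loss + pruned_loss, i.e. a pruned-transducer objective whose simple (linear-joiner) term is weighted by 0.5; the batch-8700 and batch-8800 entries above check out to within rounding. A quick verification against the logged values:

```python
# (loss, simple_loss, pruned_loss) triples copied from the log lines above.
entries = [
    (0.2441, 0.3045, 0.09186),  # epoch 15, batch 8700 (per-batch loss)
    (0.2366, 0.2894, 0.0919),   # epoch 15, batch 8700 (tot_loss)
    (0.2576, 0.3151, 0.1001),   # epoch 15, batch 8800 (per-batch loss)
]
for loss, simple, pruned in entries:
    assert abs(loss - (0.5 * simple + pruned)) < 5e-4, (loss, simple, pruned)
print("all entries consistent with loss = 0.5*simple_loss + pruned_loss")
```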
], batch size: 72, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 21:59:42,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=275806.6666666667, ans=0.0 2024-06-20 21:59:43,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=275806.6666666667, ans=0.0 2024-06-20 22:00:12,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=275861.6666666667, ans=0.0 2024-06-20 22:00:17,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275861.6666666667, ans=0.1 2024-06-20 22:00:22,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275880.0, ans=0.1 2024-06-20 22:00:27,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=275898.3333333333, ans=0.125 2024-06-20 22:00:27,936 INFO [train.py:1028] (1/2) Epoch 15, batch 8850, loss[loss=0.2737, simple_loss=0.315, pruned_loss=0.1162, over 12619.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.2901, pruned_loss=0.09287, over 2562851.39 frames. ], batch size: 202, lr: 3.85e-03, grad_scale: 64.0 2024-06-20 22:00:32,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=275898.3333333333, ans=0.0 2024-06-20 22:00:36,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275916.6666666667, ans=0.1 2024-06-20 22:00:40,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.41 vs. limit=15.0 2024-06-20 22:00:42,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=275916.6666666667, ans=0.2 2024-06-20 22:00:44,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0 2024-06-20 22:01:01,006 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.060e+02 2.235e+02 2.392e+02 3.421e+02, threshold=4.470e+02, percent-clipped=0.0 2024-06-20 22:01:02,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=275971.6666666667, ans=0.125 2024-06-20 22:01:08,510 INFO [train.py:1028] (1/2) Epoch 15, batch 8900, loss[loss=0.2635, simple_loss=0.3134, pruned_loss=0.1068, over 12914.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2917, pruned_loss=0.09359, over 2561402.82 frames. 
], batch size: 33, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:01:14,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=275990.0, ans=0.0 2024-06-20 22:01:37,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=276045.0, ans=0.0 2024-06-20 22:01:39,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=276045.0, ans=0.125 2024-06-20 22:01:44,530 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=15.0 2024-06-20 22:01:47,486 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:01:52,493 INFO [train.py:1028] (1/2) Epoch 15, batch 8950, loss[loss=0.2613, simple_loss=0.3099, pruned_loss=0.1064, over 12509.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.292, pruned_loss=0.09359, over 2559921.93 frames. ], batch size: 202, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:01:57,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=276081.6666666667, ans=0.125 2024-06-20 22:02:04,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=12.0 2024-06-20 22:02:08,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.95 vs. limit=22.5 2024-06-20 22:02:15,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=276118.3333333333, ans=0.2 2024-06-20 22:02:25,916 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 2.033e+02 2.229e+02 2.538e+02 3.662e+02, threshold=4.459e+02, percent-clipped=0.0 2024-06-20 22:02:33,188 INFO [train.py:1028] (1/2) Epoch 15, batch 9000, loss[loss=0.2201, simple_loss=0.2793, pruned_loss=0.08041, over 13300.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2925, pruned_loss=0.09347, over 2565870.47 frames. ], batch size: 46, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:02:33,189 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 22:02:46,187 INFO [train.py:1060] (1/2) Epoch 15, validation: loss=0.1894, simple_loss=0.2539, pruned_loss=0.06241, over 351949.00 frames. 2024-06-20 22:02:46,188 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-20 22:03:03,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=276210.0, ans=0.2 2024-06-20 22:03:06,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=276210.0, ans=0.125 2024-06-20 22:03:11,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=276228.3333333333, ans=0.125 2024-06-20 22:03:13,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.15 vs. limit=22.5 2024-06-20 22:03:25,684 INFO [train.py:1028] (1/2) Epoch 15, batch 9050, loss[loss=0.2302, simple_loss=0.2858, pruned_loss=0.08733, over 12001.00 frames. 
], tot_loss[loss=0.2401, simple_loss=0.2928, pruned_loss=0.09372, over 2566393.84 frames. ], batch size: 18, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:03:27,485 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:03:35,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=276283.3333333333, ans=0.125 2024-06-20 22:03:39,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=276283.3333333333, ans=15.0 2024-06-20 22:03:39,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=276283.3333333333, ans=0.125 2024-06-20 22:03:58,018 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.067e+02 2.184e+02 2.432e+02 3.135e+02, threshold=4.369e+02, percent-clipped=0.0 2024-06-20 22:04:01,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2024-06-20 22:04:04,916 INFO [train.py:1028] (1/2) Epoch 15, batch 9100, loss[loss=0.2255, simple_loss=0.2848, pruned_loss=0.08304, over 13275.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.2936, pruned_loss=0.09391, over 2566494.61 frames. ], batch size: 72, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:04:09,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=276356.6666666667, ans=0.0 2024-06-20 22:04:12,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=276375.0, ans=0.2 2024-06-20 22:04:14,107 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.08 vs. limit=10.0 2024-06-20 22:04:43,085 INFO [train.py:1028] (1/2) Epoch 15, batch 9150, loss[loss=0.2367, simple_loss=0.2922, pruned_loss=0.09058, over 13174.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.293, pruned_loss=0.09343, over 2566955.70 frames. ], batch size: 77, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:04:55,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=276466.6666666667, ans=0.09899494936611666 2024-06-20 22:05:07,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=276485.0, ans=0.025 2024-06-20 22:05:09,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2024-06-20 22:05:13,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=276503.3333333333, ans=0.2 2024-06-20 22:05:19,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=276521.6666666667, ans=0.0 2024-06-20 22:05:20,081 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.057e+02 2.212e+02 2.470e+02 3.489e+02, threshold=4.423e+02, percent-clipped=0.0 2024-06-20 22:05:25,974 INFO [train.py:1028] (1/2) Epoch 15, batch 9200, loss[loss=0.2282, simple_loss=0.2795, pruned_loss=0.08845, over 12972.00 frames. 
], tot_loss[loss=0.239, simple_loss=0.2923, pruned_loss=0.09284, over 2571496.31 frames. ], batch size: 36, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:05:31,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=276540.0, ans=0.0 2024-06-20 22:05:45,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=276576.6666666667, ans=0.0 2024-06-20 22:05:58,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=276613.3333333333, ans=0.025 2024-06-20 22:05:58,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=276613.3333333333, ans=15.0 2024-06-20 22:06:03,607 INFO [train.py:1028] (1/2) Epoch 15, batch 9250, loss[loss=0.2184, simple_loss=0.2764, pruned_loss=0.08025, over 13237.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.2919, pruned_loss=0.09232, over 2573464.26 frames. ], batch size: 67, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:06:04,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.28 vs. limit=22.5 2024-06-20 22:06:05,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=276631.6666666667, ans=0.0 2024-06-20 22:06:05,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=276631.6666666667, ans=0.1 2024-06-20 22:06:06,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=276631.6666666667, ans=0.0 2024-06-20 22:06:20,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=276668.3333333333, ans=0.125 2024-06-20 22:06:26,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=276686.6666666667, ans=0.2 2024-06-20 22:06:27,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=276686.6666666667, ans=0.0 2024-06-20 22:06:33,679 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 1.992e+02 2.150e+02 2.291e+02 3.421e+02, threshold=4.299e+02, percent-clipped=0.0 2024-06-20 22:06:34,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=276705.0, ans=0.0 2024-06-20 22:06:39,330 INFO [train.py:1028] (1/2) Epoch 15, batch 9300, loss[loss=0.2402, simple_loss=0.2887, pruned_loss=0.09588, over 13238.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.2917, pruned_loss=0.09243, over 2571265.88 frames. 
], batch size: 40, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:06:40,397 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:06:45,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=276723.3333333333, ans=0.0 2024-06-20 22:06:45,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=276723.3333333333, ans=0.125 2024-06-20 22:06:48,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=276741.6666666667, ans=0.1 2024-06-20 22:06:54,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=276760.0, ans=0.07 2024-06-20 22:07:02,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=276778.3333333333, ans=0.1 2024-06-20 22:07:16,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=276796.6666666667, ans=0.125 2024-06-20 22:07:19,764 INFO [train.py:1028] (1/2) Epoch 15, batch 9350, loss[loss=0.2354, simple_loss=0.2907, pruned_loss=0.09002, over 12680.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.2919, pruned_loss=0.09275, over 2568460.83 frames. ], batch size: 22, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:07:26,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=276833.3333333333, ans=0.125 2024-06-20 22:07:28,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=276833.3333333333, ans=0.125 2024-06-20 22:07:33,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=276851.6666666667, ans=0.0 2024-06-20 22:07:40,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=276851.6666666667, ans=0.1 2024-06-20 22:07:50,218 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.059e+02 2.200e+02 2.428e+02 5.384e+02, threshold=4.400e+02, percent-clipped=1.0 2024-06-20 22:07:56,881 INFO [train.py:1028] (1/2) Epoch 15, batch 9400, loss[loss=0.2425, simple_loss=0.2928, pruned_loss=0.09614, over 13207.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.2921, pruned_loss=0.09258, over 2567779.58 frames. ], batch size: 52, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:08:21,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=276961.6666666667, ans=0.125 2024-06-20 22:08:22,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=276961.6666666667, ans=0.125 2024-06-20 22:08:30,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=276980.0, ans=0.125 2024-06-20 22:08:33,793 INFO [train.py:1028] (1/2) Epoch 15, batch 9450, loss[loss=0.2411, simple_loss=0.2872, pruned_loss=0.09745, over 12468.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2931, pruned_loss=0.09328, over 2567839.92 frames. 
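Every progress line ends with grad_scale: 64.0, the current loss scale of mixed-precision (fp16) training; it stays at 64.0 throughout this stretch, indicating no overflow has forced the scaler to back off. A minimal sketch of the standard PyTorch loss-scaling pattern that maintains such a value (illustrative; the actual training loop may differ):

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=64.0)  # matches the logged grad_scale

def training_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(batch)            # forward in fp16 where numerically safe
    scaler.scale(loss).backward()      # backward through the scaled loss
    scaler.step(optimizer)             # unscales grads; skips step on inf/nan
    scaler.update()                    # grows or shrinks the scale over time
    return loss.detach()
```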
], batch size: 22, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:08:37,085 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:08:37,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.48 vs. limit=6.0 2024-06-20 22:09:06,397 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.041e+02 2.152e+02 2.361e+02 3.165e+02, threshold=4.304e+02, percent-clipped=0.0 2024-06-20 22:09:13,013 INFO [train.py:1028] (1/2) Epoch 15, batch 9500, loss[loss=0.2207, simple_loss=0.277, pruned_loss=0.08217, over 13264.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.2926, pruned_loss=0.09286, over 2577100.64 frames. ], batch size: 43, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:09:14,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=277090.0, ans=0.0 2024-06-20 22:09:14,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=277090.0, ans=0.0 2024-06-20 22:09:17,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=277090.0, ans=0.125 2024-06-20 22:09:27,229 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:09:28,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=277126.6666666667, ans=0.07 2024-06-20 22:09:30,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.45 vs. limit=10.0 2024-06-20 22:09:40,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=277145.0, ans=0.125 2024-06-20 22:09:50,640 INFO [train.py:1028] (1/2) Epoch 15, batch 9550, loss[loss=0.2359, simple_loss=0.2896, pruned_loss=0.09113, over 12921.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2926, pruned_loss=0.09311, over 2572203.20 frames. ], batch size: 39, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:10:03,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=277200.0, ans=0.05 2024-06-20 22:10:11,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=277236.6666666667, ans=0.025 2024-06-20 22:10:13,545 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.43 vs. limit=15.0 2024-06-20 22:10:13,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.59 vs. 
limit=22.5 2024-06-20 22:10:14,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=277236.6666666667, ans=0.0 2024-06-20 22:10:14,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=277236.6666666667, ans=0.125 2024-06-20 22:10:20,223 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.040e+02 2.170e+02 2.424e+02 3.926e+02, threshold=4.339e+02, percent-clipped=0.0 2024-06-20 22:10:23,455 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.48 vs. limit=12.0 2024-06-20 22:10:26,624 INFO [train.py:1028] (1/2) Epoch 15, batch 9600, loss[loss=0.2459, simple_loss=0.2879, pruned_loss=0.1019, over 10677.00 frames. ], tot_loss[loss=0.239, simple_loss=0.2922, pruned_loss=0.09287, over 2570249.59 frames. ], batch size: 303, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:10:29,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277273.3333333333, ans=0.1 2024-06-20 22:10:36,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=277291.6666666667, ans=0.125 2024-06-20 22:10:59,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277346.6666666667, ans=0.1 2024-06-20 22:11:03,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277346.6666666667, ans=0.1 2024-06-20 22:11:06,114 INFO [train.py:1028] (1/2) Epoch 15, batch 9650, loss[loss=0.2217, simple_loss=0.2721, pruned_loss=0.08568, over 13115.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.292, pruned_loss=0.09323, over 2561638.31 frames. ], batch size: 132, lr: 3.84e-03, grad_scale: 64.0 2024-06-20 22:11:12,671 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.56 vs. limit=15.0 2024-06-20 22:11:15,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=277383.3333333333, ans=0.2 2024-06-20 22:11:27,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=277420.0, ans=0.04949747468305833 2024-06-20 22:11:36,277 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 2.040e+02 2.243e+02 2.511e+02 3.188e+02, threshold=4.485e+02, percent-clipped=0.0 2024-06-20 22:11:44,997 INFO [train.py:1028] (1/2) Epoch 15, batch 9700, loss[loss=0.2417, simple_loss=0.2871, pruned_loss=0.09818, over 13011.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.2915, pruned_loss=0.09319, over 2556595.10 frames. ], batch size: 144, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:12:11,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=277511.6666666667, ans=0.2 2024-06-20 22:12:11,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277511.6666666667, ans=0.1 2024-06-20 22:12:21,575 INFO [train.py:1028] (1/2) Epoch 15, batch 9750, loss[loss=0.2134, simple_loss=0.27, pruned_loss=0.07842, over 13119.00 frames. 
], tot_loss[loss=0.2376, simple_loss=0.2904, pruned_loss=0.09243, over 2553999.00 frames. ], batch size: 132, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:12:41,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277585.0, ans=0.1 2024-06-20 22:12:46,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2024-06-20 22:12:48,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=277603.3333333333, ans=0.125 2024-06-20 22:12:51,604 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 2.042e+02 2.175e+02 2.429e+02 3.761e+02, threshold=4.351e+02, percent-clipped=0.0 2024-06-20 22:12:59,796 INFO [train.py:1028] (1/2) Epoch 15, batch 9800, loss[loss=0.202, simple_loss=0.2537, pruned_loss=0.07514, over 12911.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.2897, pruned_loss=0.09193, over 2547437.40 frames. ], batch size: 39, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:13:06,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=277658.3333333333, ans=0.05 2024-06-20 22:13:08,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=277658.3333333333, ans=0.035 2024-06-20 22:13:09,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=277658.3333333333, ans=0.025 2024-06-20 22:13:12,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=277658.3333333333, ans=0.125 2024-06-20 22:13:22,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=277695.0, ans=0.025 2024-06-20 22:13:31,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277713.3333333333, ans=0.1 2024-06-20 22:13:36,439 INFO [train.py:1028] (1/2) Epoch 15, batch 9850, loss[loss=0.2322, simple_loss=0.2805, pruned_loss=0.09199, over 13079.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.2892, pruned_loss=0.09164, over 2539722.20 frames. ], batch size: 102, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:13:37,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=277731.6666666667, ans=0.0 2024-06-20 22:13:50,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=277750.0, ans=0.1 2024-06-20 22:13:51,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=277768.3333333333, ans=0.0 2024-06-20 22:13:52,455 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.41 vs. 
limit=10.0 2024-06-20 22:13:56,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=277768.3333333333, ans=0.125 2024-06-20 22:13:56,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277768.3333333333, ans=0.1 2024-06-20 22:13:56,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=277768.3333333333, ans=0.0 2024-06-20 22:13:57,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=277768.3333333333, ans=0.0 2024-06-20 22:14:02,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=277786.6666666667, ans=0.2 2024-06-20 22:14:07,232 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.038e+02 2.147e+02 2.275e+02 2.687e+02, threshold=4.295e+02, percent-clipped=0.0 2024-06-20 22:14:13,726 INFO [train.py:1028] (1/2) Epoch 15, batch 9900, loss[loss=0.2383, simple_loss=0.291, pruned_loss=0.0928, over 12977.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.2884, pruned_loss=0.09171, over 2531731.82 frames. ], batch size: 39, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:14:14,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.79 vs. limit=15.0 2024-06-20 22:14:22,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=277841.6666666667, ans=0.0 2024-06-20 22:14:28,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=277841.6666666667, ans=0.125 2024-06-20 22:14:29,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.05 vs. limit=15.0 2024-06-20 22:14:44,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=277896.6666666667, ans=0.1 2024-06-20 22:14:47,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.97 vs. limit=15.0 2024-06-20 22:14:51,292 INFO [train.py:1028] (1/2) Epoch 15, batch 9950, loss[loss=0.2568, simple_loss=0.307, pruned_loss=0.1033, over 12574.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.2876, pruned_loss=0.09203, over 2523196.71 frames. ], batch size: 29, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:14:53,867 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.18 vs. limit=22.5 2024-06-20 22:14:56,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=277915.0, ans=0.0 2024-06-20 22:15:15,066 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.75 vs. 
limit=15.0 2024-06-20 22:15:22,678 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.122e+02 2.241e+02 2.471e+02 3.287e+02, threshold=4.483e+02, percent-clipped=0.0 2024-06-20 22:15:26,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=277988.3333333333, ans=0.0 2024-06-20 22:15:29,379 INFO [train.py:1028] (1/2) Epoch 15, batch 10000, loss[loss=0.2433, simple_loss=0.3057, pruned_loss=0.0904, over 12506.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.2887, pruned_loss=0.09274, over 2485686.73 frames. ], batch size: 22, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:15:40,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=278025.0, ans=0.07 2024-06-20 22:15:52,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=278061.6666666667, ans=0.125 2024-06-20 22:15:53,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=278061.6666666667, ans=0.2 2024-06-20 22:15:55,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=12.0 2024-06-20 22:15:57,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.47 vs. limit=6.0 2024-06-20 22:16:03,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=278080.0, ans=0.125 2024-06-20 22:16:06,605 INFO [train.py:1028] (1/2) Epoch 15, batch 10050, loss[loss=0.2434, simple_loss=0.2971, pruned_loss=0.09487, over 12618.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2888, pruned_loss=0.09342, over 2445367.80 frames. ], batch size: 22, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:16:14,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=278116.6666666667, ans=0.2 2024-06-20 22:16:21,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=278135.0, ans=0.125 2024-06-20 22:16:28,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=278153.3333333333, ans=0.1 2024-06-20 22:16:35,826 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 2.061e+02 2.240e+02 2.510e+02 4.001e+02, threshold=4.479e+02, percent-clipped=0.0 2024-06-20 22:16:42,699 INFO [train.py:1028] (1/2) Epoch 15, batch 10100, loss[loss=0.2032, simple_loss=0.2599, pruned_loss=0.07324, over 10811.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.2881, pruned_loss=0.09262, over 2428856.58 frames. 
], batch size: 16, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:16:47,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=278190.0, ans=0.0 2024-06-20 22:16:50,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=278208.3333333333, ans=0.1 2024-06-20 22:16:50,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=278208.3333333333, ans=0.0 2024-06-20 22:19:19,157 INFO [train.py:1028] (1/2) Epoch 16, batch 0, loss[loss=0.1922, simple_loss=0.2488, pruned_loss=0.06781, over 12864.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2488, pruned_loss=0.06781, over 12864.00 frames. ], batch size: 36, lr: 3.71e-03, grad_scale: 64.0 2024-06-20 22:19:19,158 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 22:19:27,087 INFO [train.py:1060] (1/2) Epoch 16, validation: loss=0.1901, simple_loss=0.255, pruned_loss=0.06255, over 351949.00 frames. 2024-06-20 22:19:27,087 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-20 22:19:27,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=278221.1666666667, ans=0.125 2024-06-20 22:19:35,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=278221.1666666667, ans=0.125 2024-06-20 22:19:37,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=278221.1666666667, ans=0.125 2024-06-20 22:20:02,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=278276.1666666667, ans=0.125 2024-06-20 22:20:11,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=278294.5, ans=0.2 2024-06-20 22:20:14,523 INFO [train.py:1028] (1/2) Epoch 16, batch 50, loss[loss=0.2312, simple_loss=0.2855, pruned_loss=0.08846, over 12660.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2696, pruned_loss=0.08515, over 574200.44 frames. ], batch size: 29, lr: 3.71e-03, grad_scale: 64.0 2024-06-20 22:20:14,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=278312.8333333333, ans=0.125 2024-06-20 22:20:16,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.38 vs. limit=15.0 2024-06-20 22:20:17,246 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.79 vs. limit=10.0 2024-06-20 22:20:30,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.24 vs. 
limit=15.0 2024-06-20 22:20:32,611 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.944e+02 2.090e+02 2.257e+02 3.301e+02, threshold=4.180e+02, percent-clipped=0.0 2024-06-20 22:20:34,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=278349.5, ans=0.125 2024-06-20 22:20:44,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=278386.1666666667, ans=0.125 2024-06-20 22:20:52,082 INFO [train.py:1028] (1/2) Epoch 16, batch 100, loss[loss=0.2029, simple_loss=0.2608, pruned_loss=0.07253, over 13283.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2672, pruned_loss=0.08367, over 1016292.51 frames. ], batch size: 46, lr: 3.71e-03, grad_scale: 64.0 2024-06-20 22:21:03,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=278422.8333333333, ans=0.125 2024-06-20 22:21:04,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=278422.8333333333, ans=0.125 2024-06-20 22:21:05,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=278422.8333333333, ans=0.04949747468305833 2024-06-20 22:21:12,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=278441.1666666667, ans=0.025 2024-06-20 22:21:12,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2024-06-20 22:21:14,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2024-06-20 22:21:29,936 INFO [train.py:1028] (1/2) Epoch 16, batch 150, loss[loss=0.2155, simple_loss=0.2695, pruned_loss=0.08071, over 12704.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2672, pruned_loss=0.08294, over 1365171.85 frames. ], batch size: 29, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:21:35,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=278496.1666666667, ans=0.2 2024-06-20 22:21:46,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=278532.8333333333, ans=0.025 2024-06-20 22:21:48,554 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 1.892e+02 2.012e+02 2.178e+02 2.587e+02, threshold=4.024e+02, percent-clipped=0.0 2024-06-20 22:21:53,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2024-06-20 22:22:01,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=278551.1666666667, ans=0.125 2024-06-20 22:22:03,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.11 vs. 
limit=22.5 2024-06-20 22:22:10,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=278569.5, ans=0.0 2024-06-20 22:22:15,752 INFO [train.py:1028] (1/2) Epoch 16, batch 200, loss[loss=0.2268, simple_loss=0.27, pruned_loss=0.09181, over 12597.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2674, pruned_loss=0.08315, over 1636090.69 frames. ], batch size: 203, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:22:17,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=278587.8333333333, ans=0.125 2024-06-20 22:22:19,429 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0 2024-06-20 22:22:20,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.61 vs. limit=15.0 2024-06-20 22:22:21,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278587.8333333333, ans=0.1 2024-06-20 22:22:34,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=278624.5, ans=0.125 2024-06-20 22:22:37,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.77 vs. limit=15.0 2024-06-20 22:22:42,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.71 vs. limit=15.0 2024-06-20 22:22:55,760 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.27 vs. limit=12.0 2024-06-20 22:22:55,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.68 vs. limit=15.0 2024-06-20 22:22:59,939 INFO [train.py:1028] (1/2) Epoch 16, batch 250, loss[loss=0.2068, simple_loss=0.2544, pruned_loss=0.07963, over 13037.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2673, pruned_loss=0.08298, over 1847069.05 frames. ], batch size: 144, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:23:04,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2024-06-20 22:23:06,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=278679.5, ans=0.125 2024-06-20 22:23:10,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=278697.8333333333, ans=0.0 2024-06-20 22:23:11,921 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.45 vs. 
limit=15.0 2024-06-20 22:23:15,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=278716.1666666667, ans=0.025 2024-06-20 22:23:18,970 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.911e+02 2.040e+02 2.212e+02 2.942e+02, threshold=4.079e+02, percent-clipped=0.0 2024-06-20 22:23:23,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=278734.5, ans=0.125 2024-06-20 22:23:29,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=278734.5, ans=0.0 2024-06-20 22:23:32,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278752.8333333333, ans=0.1 2024-06-20 22:23:34,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=278752.8333333333, ans=0.0 2024-06-20 22:23:38,953 INFO [train.py:1028] (1/2) Epoch 16, batch 300, loss[loss=0.2094, simple_loss=0.2561, pruned_loss=0.08134, over 13208.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2673, pruned_loss=0.08295, over 2009527.87 frames. ], batch size: 112, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:23:45,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=278771.1666666667, ans=0.1 2024-06-20 22:24:10,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=278844.5, ans=0.09899494936611666 2024-06-20 22:24:16,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=278862.8333333333, ans=0.125 2024-06-20 22:24:17,096 INFO [train.py:1028] (1/2) Epoch 16, batch 350, loss[loss=0.2125, simple_loss=0.2621, pruned_loss=0.08148, over 12992.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2666, pruned_loss=0.08248, over 2139311.08 frames. ], batch size: 33, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:24:22,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=278862.8333333333, ans=0.125 2024-06-20 22:24:39,273 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.89 vs. limit=10.0 2024-06-20 22:24:42,192 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.915e+02 2.036e+02 2.237e+02 2.649e+02, threshold=4.072e+02, percent-clipped=0.0 2024-06-20 22:24:49,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=278917.8333333333, ans=0.025 2024-06-20 22:24:58,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=278936.1666666667, ans=0.125 2024-06-20 22:25:00,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=278936.1666666667, ans=0.125 2024-06-20 22:25:02,371 INFO [train.py:1028] (1/2) Epoch 16, batch 400, loss[loss=0.2215, simple_loss=0.2772, pruned_loss=0.08291, over 13258.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2668, pruned_loss=0.08232, over 2239370.12 frames. 
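[Editor's note] The per-batch loss records above consistently satisfy loss = 0.5 * simple_loss + pruned_loss (e.g. for batch 400: 0.5 * 0.2772 + 0.08291 = 0.2215), which is what a pruned-transducer recipe with a simple-loss scale of 0.5 produces. A minimal sketch of the two-stage k2 pruned RNN-T loss behind those two terms is below; the `joiner` argument, the `reduction` choice, and the prune range are illustrative assumptions, not values read from this run.

```python
import k2
import torch

def pruned_transducer_loss(am: torch.Tensor, lm: torch.Tensor,
                           symbols: torch.Tensor, boundary: torch.Tensor,
                           joiner,                     # hypothetical joiner module
                           simple_loss_scale: float = 0.5,
                           prune_range: int = 5):
    # Stage 1: a trivial joiner (projected am + lm) gives a cheap
    # full-lattice loss plus occupation-probability gradients.
    simple_loss, (px_grad, py_grad) = k2.rnnt_loss_simple(
        lm=lm, am=am, symbols=symbols, termination_symbol=0,  # 0 = blank
        boundary=boundary, reduction="sum", return_grad=True,
    )
    # Stage 2: keep a band of `prune_range` symbols per frame and run
    # the real joiner only inside that band.
    ranges = k2.get_rnnt_prune_ranges(
        px_grad=px_grad, py_grad=py_grad, boundary=boundary,
        s_range=prune_range,
    )
    am_pruned, lm_pruned = k2.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)
    logits = joiner(am_pruned, lm_pruned)
    pruned_loss = k2.rnnt_loss_pruned(
        logits=logits, symbols=symbols, ranges=ranges,
        termination_symbol=0, boundary=boundary, reduction="sum",
    )
    # The logged `loss` is the weighted sum of the two terms.
    return simple_loss_scale * simple_loss + pruned_loss
```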
], batch size: 63, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:25:15,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.15 vs. limit=15.0 2024-06-20 22:25:22,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=278991.1666666667, ans=0.025 2024-06-20 22:25:32,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=279027.8333333333, ans=0.04949747468305833 2024-06-20 22:25:34,370 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.95 vs. limit=15.0 2024-06-20 22:25:38,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279027.8333333333, ans=0.1 2024-06-20 22:25:40,974 INFO [train.py:1028] (1/2) Epoch 16, batch 450, loss[loss=0.2065, simple_loss=0.2639, pruned_loss=0.07458, over 13265.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.267, pruned_loss=0.08237, over 2313523.99 frames. ], batch size: 67, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:25:59,652 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.899e+02 2.010e+02 2.162e+02 3.318e+02, threshold=4.019e+02, percent-clipped=0.0 2024-06-20 22:26:20,059 INFO [train.py:1028] (1/2) Epoch 16, batch 500, loss[loss=0.2127, simple_loss=0.2593, pruned_loss=0.08302, over 13111.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2678, pruned_loss=0.08237, over 2375216.64 frames. ], batch size: 121, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:26:32,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=279156.1666666667, ans=0.125 2024-06-20 22:26:49,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=279192.8333333333, ans=0.125 2024-06-20 22:27:00,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=279211.1666666667, ans=0.0 2024-06-20 22:27:02,117 INFO [train.py:1028] (1/2) Epoch 16, batch 550, loss[loss=0.2001, simple_loss=0.2441, pruned_loss=0.07803, over 12965.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2676, pruned_loss=0.08232, over 2420016.06 frames. ], batch size: 158, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:27:05,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=279229.5, ans=0.125 2024-06-20 22:27:16,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=279247.8333333333, ans=0.0 2024-06-20 22:27:21,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=279266.1666666667, ans=0.125 2024-06-20 22:27:23,655 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.945e+02 2.048e+02 2.295e+02 3.139e+02, threshold=4.096e+02, percent-clipped=0.0 2024-06-20 22:27:32,966 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.22 vs. 
limit=15.0 2024-06-20 22:27:37,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=279302.8333333333, ans=0.125 2024-06-20 22:27:40,488 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.58 vs. limit=15.0 2024-06-20 22:27:43,000 INFO [train.py:1028] (1/2) Epoch 16, batch 600, loss[loss=0.196, simple_loss=0.2417, pruned_loss=0.07518, over 13074.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2674, pruned_loss=0.08239, over 2457984.37 frames. ], batch size: 144, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:27:54,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.20 vs. limit=15.0 2024-06-20 22:28:05,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=279376.1666666667, ans=0.0 2024-06-20 22:28:08,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.34 vs. limit=15.0 2024-06-20 22:28:14,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=279394.5, ans=0.125 2024-06-20 22:28:15,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=279394.5, ans=0.125 2024-06-20 22:28:15,408 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.07 vs. limit=15.0 2024-06-20 22:28:15,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279394.5, ans=0.1 2024-06-20 22:28:22,273 INFO [train.py:1028] (1/2) Epoch 16, batch 650, loss[loss=0.1883, simple_loss=0.2408, pruned_loss=0.06787, over 13189.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2676, pruned_loss=0.08239, over 2489579.33 frames. ], batch size: 59, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:28:25,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=279412.8333333333, ans=0.0 2024-06-20 22:28:30,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=279431.1666666667, ans=0.0 2024-06-20 22:28:32,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=279431.1666666667, ans=0.125 2024-06-20 22:28:41,347 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.922e+02 2.044e+02 2.209e+02 3.211e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 22:28:45,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279467.8333333333, ans=0.1 2024-06-20 22:28:52,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=279467.8333333333, ans=0.125 2024-06-20 22:28:58,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=279486.1666666667, ans=0.1 2024-06-20 22:29:00,908 INFO [train.py:1028] (1/2) Epoch 16, batch 700, loss[loss=0.2153, simple_loss=0.2638, pruned_loss=0.08338, over 13296.00 frames. 
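[Editor's note] The `ScheduledFloat: name=..., batch_count=..., ans=...` lines report hyperparameters (dropout probabilities, skip rates, balancer limits) that are annealed as a function of the global batch counter; `ans` is the value in effect at that `batch_count`. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between breakpoints (the real class in icefall's scaling.py carries extra module machinery, and the breakpoints below are illustrative):

```python
class ScheduledFloat:
    """Piecewise-linear float schedule over the global batch count."""

    def __init__(self, *points, default: float = 0.0):
        self.points = sorted(points)   # (batch_count, value) pairs
        self.default = default
        self.batch_count = None        # advanced by the training loop

    def value(self) -> float:
        if self.batch_count is None:
            return self.default
        pts = self.points
        if self.batch_count <= pts[0][0]:
            return pts[0][1]
        if self.batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= self.batch_count <= x1:
                t = (self.batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return self.default  # unreachable given the bounds checks above

# e.g. a skip rate that decays from 0.5 to 0.05 over the first 4000 batches:
ff3_skip_rate = ScheduledFloat((0.0, 0.5), (4000.0, 0.05))
ff3_skip_rate.batch_count = 278349.5
print(ff3_skip_rate.value())   # 0.05: past the last breakpoint, as here
```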
], tot_loss[loss=0.2159, simple_loss=0.2671, pruned_loss=0.08235, over 2512235.01 frames. ], batch size: 46, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:29:03,771 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.50 vs. limit=12.0 2024-06-20 22:29:25,370 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.63 vs. limit=6.0 2024-06-20 22:29:31,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=279559.5, ans=0.125 2024-06-20 22:29:31,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=279559.5, ans=0.125 2024-06-20 22:29:42,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=279577.8333333333, ans=0.0 2024-06-20 22:29:46,216 INFO [train.py:1028] (1/2) Epoch 16, batch 750, loss[loss=0.2035, simple_loss=0.261, pruned_loss=0.07298, over 13297.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.268, pruned_loss=0.08247, over 2527121.10 frames. ], batch size: 63, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:29:47,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279596.1666666667, ans=0.1 2024-06-20 22:30:04,660 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 1.923e+02 2.020e+02 2.162e+02 2.795e+02, threshold=4.041e+02, percent-clipped=0.0 2024-06-20 22:30:08,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.51 vs. limit=15.0 2024-06-20 22:30:13,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.70 vs. limit=22.5 2024-06-20 22:30:22,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=279669.5, ans=0.125 2024-06-20 22:30:25,632 INFO [train.py:1028] (1/2) Epoch 16, batch 800, loss[loss=0.2073, simple_loss=0.2565, pruned_loss=0.07902, over 12872.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2681, pruned_loss=0.08259, over 2539455.06 frames. ], batch size: 36, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:30:36,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=279706.1666666667, ans=0.125 2024-06-20 22:30:43,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279724.5, ans=0.1 2024-06-20 22:30:58,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=279761.1666666667, ans=0.0 2024-06-20 22:31:05,174 INFO [train.py:1028] (1/2) Epoch 16, batch 850, loss[loss=0.2235, simple_loss=0.2654, pruned_loss=0.09078, over 13126.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2679, pruned_loss=0.08241, over 2550321.99 frames. 
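[Editor's note] The `Whitening: name=..., num_groups=..., num_channels=..., metric=X vs. limit=Y` lines come from diagnostic hooks that measure how far a layer's output covariance is from a multiple of the identity: the metric is 1.0 for perfectly whitened features and grows as the eigenvalue spectrum spreads, and the hook pushes activations back when it drifts past `limit`. A rough reconstruction of such a metric follows; the exact formula in scaling.py may differ in detail.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (..., num_channels); channels are split into `num_groups` groups,
    # matching the num_groups/num_channels fields in the log lines.
    x = x.reshape(-1, x.shape[-1])
    n, c = x.shape
    cpg = c // num_groups                               # channels per group
    x = x.reshape(n, num_groups, cpg).transpose(0, 1)   # (groups, n, cpg)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / n                     # per-group covariance
    tr = cov.diagonal(dim1=-2, dim2=-1).sum(-1)         # sum of eigenvalues
    tr_sq = (cov * cov).sum(dim=(-2, -1))               # sum of squared eigenvalues
    # cpg * sum(l^2) / sum(l)^2 >= 1, equality iff cov is a multiple of I.
    return (cpg * tr_sq / tr.clamp(min=1e-20) ** 2).mean().item()

x = torch.randn(10000, 64)                # i.i.d. Gaussian features
print(whitening_metric(x, num_groups=1))  # ~1.0, i.e. already "white"
```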
], batch size: 95, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:31:13,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=279797.8333333333, ans=0.0 2024-06-20 22:31:14,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=279797.8333333333, ans=0.125 2024-06-20 22:31:23,635 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.923e+02 2.128e+02 2.317e+02 2.954e+02, threshold=4.255e+02, percent-clipped=0.0 2024-06-20 22:31:25,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.11 vs. limit=10.0 2024-06-20 22:31:28,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=279834.5, ans=0.0 2024-06-20 22:31:30,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=279834.5, ans=0.125 2024-06-20 22:31:35,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=279852.8333333333, ans=0.125 2024-06-20 22:31:39,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.70 vs. limit=6.0 2024-06-20 22:31:43,607 INFO [train.py:1028] (1/2) Epoch 16, batch 900, loss[loss=0.2325, simple_loss=0.2784, pruned_loss=0.09324, over 12928.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2678, pruned_loss=0.08269, over 2555954.47 frames. ], batch size: 36, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:32:18,246 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.76 vs. limit=22.5 2024-06-20 22:32:22,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=279944.5, ans=0.125 2024-06-20 22:32:23,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=279944.5, ans=0.025 2024-06-20 22:32:28,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=279944.5, ans=0.125 2024-06-20 22:32:30,193 INFO [train.py:1028] (1/2) Epoch 16, batch 950, loss[loss=0.2419, simple_loss=0.293, pruned_loss=0.09544, over 12964.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2678, pruned_loss=0.08243, over 2558725.35 frames. 
], batch size: 39, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:32:30,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=279962.8333333333, ans=0.04949747468305833 2024-06-20 22:32:39,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=279981.1666666667, ans=0.125 2024-06-20 22:32:47,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279999.5, ans=0.1 2024-06-20 22:32:48,422 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.971e+02 2.089e+02 2.285e+02 3.304e+02, threshold=4.178e+02, percent-clipped=0.0 2024-06-20 22:32:51,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=279999.5, ans=0.125 2024-06-20 22:32:56,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2024-06-20 22:33:06,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=280036.1666666667, ans=0.125 2024-06-20 22:33:08,342 INFO [train.py:1028] (1/2) Epoch 16, batch 1000, loss[loss=0.2336, simple_loss=0.2893, pruned_loss=0.089, over 13290.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2679, pruned_loss=0.08266, over 2561282.32 frames. ], batch size: 49, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:33:18,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=280072.8333333333, ans=0.125 2024-06-20 22:33:18,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=280072.8333333333, ans=15.0 2024-06-20 22:33:22,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=280072.8333333333, ans=0.125 2024-06-20 22:33:24,490 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2024-06-20 22:33:33,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=280109.5, ans=0.0 2024-06-20 22:33:41,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=280127.8333333333, ans=0.2 2024-06-20 22:33:47,139 INFO [train.py:1028] (1/2) Epoch 16, batch 1050, loss[loss=0.2076, simple_loss=0.2592, pruned_loss=0.07802, over 13207.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2681, pruned_loss=0.0827, over 2564917.36 frames. 
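[Editor's note] The `lr` field ticks down from 3.70e-03 to 3.69e-03 across these records. Zipformer recipes typically drive this with the Eden scheduler, which multiplies two inverse-fourth-root decay factors, one in the batch counter and one in the fractional epoch. A sketch under that assumption (the constants are illustrative defaults, and recent recipes additionally rescale the effective step count by a reference batch duration, which this sketch omits, so it will not reproduce the logged values exactly):

```python
def eden_lr(base_lr: float, batch: float, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Each factor is ~1 early in training and decays like x**-0.5 asymptotically.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# By epoch 16 / batch ~280k both factors are nearly flat, which is why the
# logged lr moves only in the third significant digit over ~1000 batches.
```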
], batch size: 77, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:33:48,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=280146.1666666667, ans=0.125 2024-06-20 22:33:48,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=280146.1666666667, ans=0.125 2024-06-20 22:33:50,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=280146.1666666667, ans=0.125 2024-06-20 22:33:52,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280146.1666666667, ans=0.1 2024-06-20 22:33:55,966 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.471e+01 2024-06-20 22:33:57,788 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.61 vs. limit=15.0 2024-06-20 22:33:58,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=280164.5, ans=0.125 2024-06-20 22:34:05,392 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 1.925e+02 2.055e+02 2.218e+02 2.787e+02, threshold=4.110e+02, percent-clipped=0.0 2024-06-20 22:34:08,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=280182.8333333333, ans=0.125 2024-06-20 22:34:18,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=280201.1666666667, ans=0.125 2024-06-20 22:34:30,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=280219.5, ans=0.125 2024-06-20 22:34:31,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.96 vs. limit=15.0 2024-06-20 22:34:32,138 INFO [train.py:1028] (1/2) Epoch 16, batch 1100, loss[loss=0.1972, simple_loss=0.2498, pruned_loss=0.07229, over 13211.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2675, pruned_loss=0.08231, over 2569371.25 frames. ], batch size: 52, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:34:32,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.29 vs. limit=15.0 2024-06-20 22:34:41,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=280256.1666666667, ans=0.1 2024-06-20 22:35:03,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280311.1666666667, ans=0.1 2024-06-20 22:35:11,223 INFO [train.py:1028] (1/2) Epoch 16, batch 1150, loss[loss=0.2153, simple_loss=0.2745, pruned_loss=0.07799, over 13256.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2682, pruned_loss=0.08241, over 2571026.97 frames. 
], batch size: 52, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:35:16,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=280329.5, ans=0.125 2024-06-20 22:35:22,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280347.8333333333, ans=0.1 2024-06-20 22:35:25,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=280366.1666666667, ans=0.125 2024-06-20 22:35:29,257 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 1.924e+02 2.047e+02 2.211e+02 3.089e+02, threshold=4.094e+02, percent-clipped=0.0 2024-06-20 22:35:30,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=280366.1666666667, ans=0.125 2024-06-20 22:35:31,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=280366.1666666667, ans=0.1 2024-06-20 22:35:36,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280384.5, ans=0.1 2024-06-20 22:35:37,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.83 vs. limit=12.0 2024-06-20 22:35:42,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=280402.8333333333, ans=0.125 2024-06-20 22:35:44,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=280402.8333333333, ans=0.0 2024-06-20 22:35:45,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=280402.8333333333, ans=0.125 2024-06-20 22:35:47,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=280402.8333333333, ans=0.2 2024-06-20 22:35:49,130 INFO [train.py:1028] (1/2) Epoch 16, batch 1200, loss[loss=0.2082, simple_loss=0.2584, pruned_loss=0.07904, over 13169.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2682, pruned_loss=0.08282, over 2573401.63 frames. ], batch size: 77, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:36:00,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=280439.5, ans=0.0 2024-06-20 22:36:05,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=280457.8333333333, ans=0.125 2024-06-20 22:36:18,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=280494.5, ans=0.125 2024-06-20 22:36:21,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=280494.5, ans=0.0 2024-06-20 22:36:27,339 INFO [train.py:1028] (1/2) Epoch 16, batch 1250, loss[loss=0.2212, simple_loss=0.27, pruned_loss=0.08613, over 13197.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2681, pruned_loss=0.08294, over 2583302.83 frames. 
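[Editor's note] In every `Clipping_scale=2.0, grad-norm quartiles ...` warning the threshold equals exactly 2.0 times the middle value (e.g. 2 x 2.047e+02 = 4.094e+02 just above), so the five numbers are evidently the [0, 25, 50, 75, 100] percentiles of recently observed gradient norms, and gradients are rescaled only when the current norm exceeds clipping_scale times the median (`percent-clipped` stays at 0.0 throughout this stretch). A minimal sketch of that mechanism; the window size and logging cadence are assumptions:

```python
import torch

def clip_to_scaled_median(parameters, history, clipping_scale=2.0,
                          window=1024):
    grads = [p.grad for p in parameters if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads]))
    history.append(norm.item())
    del history[:-window]                         # keep a sliding window
    qs = torch.quantile(torch.tensor(history),
                        torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * qs[2]            # clipping_scale * median
    clipped = norm > threshold
    if clipped:
        for g in grads:
            g.mul_(threshold / norm)              # rescale gradients in place
    return qs.tolist(), threshold.item(), bool(clipped)
```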
], batch size: 112, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:36:48,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=280531.1666666667, ans=0.2 2024-06-20 22:36:53,053 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 1.978e+02 2.161e+02 2.436e+02 3.055e+02, threshold=4.323e+02, percent-clipped=0.0 2024-06-20 22:36:54,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=280549.5, ans=0.05 2024-06-20 22:37:03,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=280567.8333333333, ans=0.125 2024-06-20 22:37:10,535 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.28 vs. limit=22.5 2024-06-20 22:37:11,979 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.81 vs. limit=15.0 2024-06-20 22:37:12,256 INFO [train.py:1028] (1/2) Epoch 16, batch 1300, loss[loss=0.2289, simple_loss=0.2692, pruned_loss=0.09429, over 12748.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2676, pruned_loss=0.08265, over 2582906.10 frames. ], batch size: 176, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:37:12,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=280604.5, ans=0.2 2024-06-20 22:37:34,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=280641.1666666667, ans=0.0 2024-06-20 22:37:37,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=280659.5, ans=0.125 2024-06-20 22:37:51,606 INFO [train.py:1028] (1/2) Epoch 16, batch 1350, loss[loss=0.2134, simple_loss=0.2733, pruned_loss=0.07671, over 13246.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2678, pruned_loss=0.08236, over 2583904.19 frames. ], batch size: 59, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:38:07,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=280732.8333333333, ans=0.2 2024-06-20 22:38:10,155 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.836e+02 2.013e+02 2.171e+02 2.458e+02 3.108e+02, threshold=4.342e+02, percent-clipped=0.0 2024-06-20 22:38:11,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=280732.8333333333, ans=0.07 2024-06-20 22:38:19,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=280751.1666666667, ans=0.0 2024-06-20 22:38:21,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.27 vs. limit=10.0 2024-06-20 22:38:27,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=280769.5, ans=0.125 2024-06-20 22:38:30,765 INFO [train.py:1028] (1/2) Epoch 16, batch 1400, loss[loss=0.2446, simple_loss=0.2932, pruned_loss=0.09797, over 12598.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2682, pruned_loss=0.08293, over 2586679.42 frames. 
], batch size: 25, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:38:33,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=280787.8333333333, ans=0.125 2024-06-20 22:38:35,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=280787.8333333333, ans=0.125 2024-06-20 22:38:36,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=280787.8333333333, ans=0.05 2024-06-20 22:38:40,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.19 vs. limit=15.0 2024-06-20 22:38:43,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=280806.1666666667, ans=0.04949747468305833 2024-06-20 22:39:00,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=280861.1666666667, ans=0.025 2024-06-20 22:39:07,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=280861.1666666667, ans=0.125 2024-06-20 22:39:15,825 INFO [train.py:1028] (1/2) Epoch 16, batch 1450, loss[loss=0.2037, simple_loss=0.2584, pruned_loss=0.07444, over 13112.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2683, pruned_loss=0.0831, over 2586622.96 frames. ], batch size: 121, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:39:23,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=280897.8333333333, ans=0.05 2024-06-20 22:39:27,852 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2024-06-20 22:39:34,131 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 1.951e+02 2.095e+02 2.280e+02 3.323e+02, threshold=4.190e+02, percent-clipped=0.0 2024-06-20 22:39:35,814 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:39:36,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.87 vs. limit=22.5 2024-06-20 22:39:41,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=280934.5, ans=0.125 2024-06-20 22:39:50,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.20 vs. limit=12.0 2024-06-20 22:39:53,905 INFO [train.py:1028] (1/2) Epoch 16, batch 1500, loss[loss=0.2061, simple_loss=0.2601, pruned_loss=0.076, over 13227.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2684, pruned_loss=0.08319, over 2589442.28 frames. ], batch size: 83, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:40:26,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.00 vs. limit=22.5 2024-06-20 22:40:28,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. 
limit=15.0 2024-06-20 22:40:28,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281044.5, ans=0.1 2024-06-20 22:40:32,561 INFO [train.py:1028] (1/2) Epoch 16, batch 1550, loss[loss=0.1998, simple_loss=0.2477, pruned_loss=0.07594, over 12985.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2681, pruned_loss=0.08324, over 2584204.45 frames. ], batch size: 102, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:40:36,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=281062.8333333333, ans=0.125 2024-06-20 22:40:51,153 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.003e+02 2.141e+02 2.380e+02 3.248e+02, threshold=4.282e+02, percent-clipped=0.0 2024-06-20 22:41:00,852 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2024-06-20 22:41:10,924 INFO [train.py:1028] (1/2) Epoch 16, batch 1600, loss[loss=0.2124, simple_loss=0.2686, pruned_loss=0.07813, over 13163.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2678, pruned_loss=0.08286, over 2578992.38 frames. ], batch size: 77, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:41:16,332 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2024-06-20 22:41:21,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=281154.5, ans=0.125 2024-06-20 22:41:38,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=281191.1666666667, ans=0.125 2024-06-20 22:41:39,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.84 vs. limit=22.5 2024-06-20 22:41:56,742 INFO [train.py:1028] (1/2) Epoch 16, batch 1650, loss[loss=0.245, simple_loss=0.2903, pruned_loss=0.09982, over 13165.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.268, pruned_loss=0.08319, over 2574656.75 frames. ], batch size: 95, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:42:06,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281264.5, ans=0.1 2024-06-20 22:42:15,287 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 1.962e+02 2.088e+02 2.289e+02 2.758e+02, threshold=4.175e+02, percent-clipped=0.0 2024-06-20 22:42:15,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=281282.8333333333, ans=0.2 2024-06-20 22:42:27,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. limit=15.0 2024-06-20 22:42:33,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=281319.5, ans=0.125 2024-06-20 22:42:35,824 INFO [train.py:1028] (1/2) Epoch 16, batch 1700, loss[loss=0.2402, simple_loss=0.3055, pruned_loss=0.08743, over 12372.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2686, pruned_loss=0.08293, over 2580498.00 frames. 
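[Editor's note] Each batch record carries two views of the loss: the first bracket, `loss[... over N frames]`, is the current batch alone, while `tot_loss[...]` is a frame-weighted running average over the batches seen so far. The fractional cumulative frame counts (e.g. 2580498.00 here after integer per-batch counts) suggest an exponential forgetting factor rather than a plain sum. A simplified tracker under that assumption; the decay constant is invented for illustration:

```python
class LossTracker:
    """Frame-weighted running average of loss components (sketch)."""

    def __init__(self, decay: float = 0.9995):
        self.decay = decay     # forgetting factor; assumed, not from this run
        self.frames = 0.0
        self.sums = {}

    def update(self, losses: dict, num_frames: float):
        self.frames = self.decay * self.frames + num_frames
        for name, value in losses.items():
            prev = self.decay * self.sums.get(name, 0.0)
            self.sums[name] = prev + value * num_frames

    def averages(self) -> dict:
        return {k: v / self.frames for k, v in self.sums.items()}

tracker = LossTracker()
tracker.update({"loss": 0.2402, "pruned_loss": 0.08743}, num_frames=12372.0)
print(tracker.averages())   # equals the per-batch values after one update
```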
], batch size: 25, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:42:39,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281337.8333333333, ans=0.1 2024-06-20 22:42:43,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.89 vs. limit=15.0 2024-06-20 22:42:48,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. limit=15.0 2024-06-20 22:42:59,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=281392.8333333333, ans=0.0 2024-06-20 22:43:06,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=281411.1666666667, ans=0.1 2024-06-20 22:43:08,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281411.1666666667, ans=0.1 2024-06-20 22:43:09,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=281411.1666666667, ans=0.125 2024-06-20 22:43:10,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=281411.1666666667, ans=0.025 2024-06-20 22:43:14,371 INFO [train.py:1028] (1/2) Epoch 16, batch 1750, loss[loss=0.2047, simple_loss=0.2682, pruned_loss=0.07065, over 12674.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2686, pruned_loss=0.08277, over 2581870.40 frames. ], batch size: 22, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:43:31,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.31 vs. limit=15.0 2024-06-20 22:43:33,186 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.936e+02 2.059e+02 2.202e+02 2.965e+02, threshold=4.119e+02, percent-clipped=0.0 2024-06-20 22:43:33,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=281466.1666666667, ans=0.2 2024-06-20 22:43:36,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=281466.1666666667, ans=22.5 2024-06-20 22:43:54,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=281502.8333333333, ans=0.0 2024-06-20 22:43:57,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.50 vs. limit=6.0 2024-06-20 22:43:59,475 INFO [train.py:1028] (1/2) Epoch 16, batch 1800, loss[loss=0.2047, simple_loss=0.2618, pruned_loss=0.07382, over 13236.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2689, pruned_loss=0.08288, over 2581794.16 frames. ], batch size: 67, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:44:09,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=281539.5, ans=0.0 2024-06-20 22:44:09,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.60 vs. 
limit=15.0 2024-06-20 22:44:27,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=281576.1666666667, ans=0.025 2024-06-20 22:44:28,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=281576.1666666667, ans=0.2 2024-06-20 22:44:31,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281594.5, ans=0.1 2024-06-20 22:44:38,663 INFO [train.py:1028] (1/2) Epoch 16, batch 1850, loss[loss=0.2016, simple_loss=0.2555, pruned_loss=0.07382, over 13220.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2691, pruned_loss=0.08294, over 2583523.65 frames. ], batch size: 83, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:44:44,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=281612.8333333333, ans=0.125 2024-06-20 22:44:49,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=281631.1666666667, ans=0.125 2024-06-20 22:44:57,158 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.958e+02 2.098e+02 2.289e+02 2.903e+02, threshold=4.195e+02, percent-clipped=0.0 2024-06-20 22:45:05,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=281667.8333333333, ans=0.125 2024-06-20 22:45:17,173 INFO [train.py:1028] (1/2) Epoch 16, batch 1900, loss[loss=0.2004, simple_loss=0.2541, pruned_loss=0.07333, over 13133.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2684, pruned_loss=0.08275, over 2586487.83 frames. ], batch size: 95, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:45:17,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=281704.5, ans=0.125 2024-06-20 22:45:19,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.65 vs. limit=15.0 2024-06-20 22:45:20,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281704.5, ans=0.1 2024-06-20 22:45:21,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=281704.5, ans=0.125 2024-06-20 22:45:32,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.00 vs. limit=15.0 2024-06-20 22:45:42,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=281759.5, ans=0.0 2024-06-20 22:45:43,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281759.5, ans=0.1 2024-06-20 22:45:46,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.71 vs. 
limit=10.0 2024-06-20 22:45:47,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=281759.5, ans=0.125 2024-06-20 22:45:55,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=281796.1666666667, ans=0.0 2024-06-20 22:45:55,812 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.95 vs. limit=15.0 2024-06-20 22:45:56,230 INFO [train.py:1028] (1/2) Epoch 16, batch 1950, loss[loss=0.2096, simple_loss=0.2685, pruned_loss=0.07537, over 13286.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2678, pruned_loss=0.08267, over 2592707.72 frames. ], batch size: 52, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:46:07,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=281814.5, ans=0.95 2024-06-20 22:46:18,296 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 1.887e+02 1.993e+02 2.148e+02 2.965e+02, threshold=3.985e+02, percent-clipped=0.0 2024-06-20 22:46:23,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=281851.1666666667, ans=0.0 2024-06-20 22:46:25,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=281851.1666666667, ans=0.125 2024-06-20 22:46:27,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=281851.1666666667, ans=0.1 2024-06-20 22:46:31,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281869.5, ans=0.1 2024-06-20 22:46:32,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=281869.5, ans=0.125 2024-06-20 22:46:36,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=281869.5, ans=0.0 2024-06-20 22:46:38,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.84 vs. limit=6.0 2024-06-20 22:46:38,326 INFO [train.py:1028] (1/2) Epoch 16, batch 2000, loss[loss=0.217, simple_loss=0.2713, pruned_loss=0.08138, over 12599.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2675, pruned_loss=0.08265, over 2588068.23 frames. ], batch size: 22, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:46:40,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=281887.8333333333, ans=0.125 2024-06-20 22:46:45,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=281906.1666666667, ans=0.125 2024-06-20 22:46:57,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=281924.5, ans=0.05 2024-06-20 22:46:57,694 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. 
limit=15.0 2024-06-20 22:47:03,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=281942.8333333333, ans=0.2 2024-06-20 22:47:03,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=281942.8333333333, ans=0.2 2024-06-20 22:47:17,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=281979.5, ans=0.95 2024-06-20 22:47:18,161 INFO [train.py:1028] (1/2) Epoch 16, batch 2050, loss[loss=0.2329, simple_loss=0.2875, pruned_loss=0.08912, over 12524.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2677, pruned_loss=0.08254, over 2583298.03 frames. ], batch size: 29, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:47:27,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281997.8333333333, ans=0.1 2024-06-20 22:47:29,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=281997.8333333333, ans=0.125 2024-06-20 22:47:30,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=281997.8333333333, ans=0.1 2024-06-20 22:47:34,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=282016.1666666667, ans=0.125 2024-06-20 22:47:36,519 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.927e+02 2.045e+02 2.194e+02 2.964e+02, threshold=4.090e+02, percent-clipped=0.0 2024-06-20 22:47:56,414 INFO [train.py:1028] (1/2) Epoch 16, batch 2100, loss[loss=0.2196, simple_loss=0.2829, pruned_loss=0.07819, over 13180.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2686, pruned_loss=0.08248, over 2585757.20 frames. ], batch size: 59, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:48:00,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=282071.1666666667, ans=0.025 2024-06-20 22:48:15,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282107.8333333333, ans=0.1 2024-06-20 22:48:41,644 INFO [train.py:1028] (1/2) Epoch 16, batch 2150, loss[loss=0.2293, simple_loss=0.2888, pruned_loss=0.08494, over 13233.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2691, pruned_loss=0.08269, over 2588190.09 frames. 
], batch size: 52, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:48:42,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=282162.8333333333, ans=0.125 2024-06-20 22:48:44,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=282162.8333333333, ans=0.125 2024-06-20 22:48:55,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=282181.1666666667, ans=0.125 2024-06-20 22:49:00,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=282199.5, ans=0.0 2024-06-20 22:49:01,681 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.980e+02 2.103e+02 2.307e+02 2.793e+02, threshold=4.207e+02, percent-clipped=0.0 2024-06-20 22:49:09,961 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.26 vs. limit=10.0 2024-06-20 22:49:12,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=282217.8333333333, ans=15.0 2024-06-20 22:49:14,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=282236.1666666667, ans=0.0 2024-06-20 22:49:15,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=282236.1666666667, ans=0.1 2024-06-20 22:49:21,662 INFO [train.py:1028] (1/2) Epoch 16, batch 2200, loss[loss=0.2091, simple_loss=0.2619, pruned_loss=0.07817, over 13246.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2694, pruned_loss=0.08294, over 2587782.96 frames. ], batch size: 83, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:49:29,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=282272.8333333333, ans=0.125 2024-06-20 22:49:32,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=282272.8333333333, ans=0.125 2024-06-20 22:49:41,361 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.85 vs. limit=15.0 2024-06-20 22:49:44,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.95 vs. limit=12.0 2024-06-20 22:49:44,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=282309.5, ans=0.0 2024-06-20 22:49:47,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.31 vs. limit=15.0 2024-06-20 22:49:47,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=282309.5, ans=0.125 2024-06-20 22:49:48,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=12.0 2024-06-20 22:50:00,412 INFO [train.py:1028] (1/2) Epoch 16, batch 2250, loss[loss=0.1977, simple_loss=0.2538, pruned_loss=0.0708, over 13243.00 frames. 
], tot_loss[loss=0.2173, simple_loss=0.2689, pruned_loss=0.08284, over 2586346.35 frames. ], batch size: 63, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:50:12,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=15.0 2024-06-20 22:50:18,845 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.940e+02 2.044e+02 2.193e+02 2.971e+02, threshold=4.089e+02, percent-clipped=0.0 2024-06-20 22:50:23,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=282401.1666666667, ans=0.2 2024-06-20 22:50:23,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=15.0 2024-06-20 22:50:25,084 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0 2024-06-20 22:50:27,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=282401.1666666667, ans=0.125 2024-06-20 22:50:29,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=282401.1666666667, ans=0.125 2024-06-20 22:50:35,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. limit=6.0 2024-06-20 22:50:39,137 INFO [train.py:1028] (1/2) Epoch 16, batch 2300, loss[loss=0.195, simple_loss=0.2522, pruned_loss=0.06884, over 13034.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2686, pruned_loss=0.08272, over 2580378.89 frames. ], batch size: 33, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:50:42,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=282437.8333333333, ans=0.125 2024-06-20 22:50:43,394 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:50:45,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2024-06-20 22:50:47,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=282456.1666666667, ans=0.2 2024-06-20 22:50:52,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=282456.1666666667, ans=0.5 2024-06-20 22:50:55,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=282474.5, ans=0.0 2024-06-20 22:51:08,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2024-06-20 22:51:10,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=282492.8333333333, ans=0.125 2024-06-20 22:51:11,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=282492.8333333333, ans=0.125 2024-06-20 22:51:24,607 INFO [train.py:1028] (1/2) Epoch 16, batch 2350, loss[loss=0.2143, simple_loss=0.2632, pruned_loss=0.08267, over 13216.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2689, pruned_loss=0.08305, over 2584367.02 frames. ], batch size: 67, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:51:24,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=282529.5, ans=0.035 2024-06-20 22:51:36,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=282547.8333333333, ans=0.125 2024-06-20 22:51:43,500 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 1.956e+02 2.107e+02 2.315e+02 2.717e+02, threshold=4.214e+02, percent-clipped=0.0 2024-06-20 22:51:45,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=282566.1666666667, ans=0.025 2024-06-20 22:51:50,934 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.03 vs. limit=10.0 2024-06-20 22:51:52,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=282584.5, ans=0.0 2024-06-20 22:52:02,634 INFO [train.py:1028] (1/2) Epoch 16, batch 2400, loss[loss=0.2203, simple_loss=0.2707, pruned_loss=0.08491, over 13286.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2685, pruned_loss=0.08315, over 2587140.58 frames. ], batch size: 46, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:52:11,105 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:52:23,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.85 vs. limit=15.0 2024-06-20 22:52:29,428 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.61 vs. limit=15.0 2024-06-20 22:52:30,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=282676.1666666667, ans=0.0 2024-06-20 22:52:40,542 INFO [train.py:1028] (1/2) Epoch 16, batch 2450, loss[loss=0.2188, simple_loss=0.2669, pruned_loss=0.08542, over 13273.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.268, pruned_loss=0.08341, over 2583751.33 frames. 
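[Editor's note] `grad_scale` doubled from 64.0 to 128.0 around batch 650 and is back at 64.0 by batch 2350: the signature of dynamic fp16 loss scaling, where the scale grows after a run of overflow-free steps and is halved whenever an inf/nan gradient is detected. The same behavior can be reproduced with PyTorch's stock GradScaler (icefall ships its own scaler variant, so the constants here are illustrative only):

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=64.0,        # matches the scale seen early in this stretch
    growth_factor=2.0,      # 64 -> 128 after enough clean steps
    backoff_factor=0.5,     # 128 -> 64 on the first overflow
    growth_interval=2000,   # illustrative cadence, not this run's setting
)

def training_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)   # hypothetical loss function
    scaler.scale(loss).backward()           # backprop the scaled loss
    scaler.step(optimizer)                  # unscales; skips step on overflow
    scaler.update()                         # grow or back off the scale
    return loss.detach()
```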
], batch size: 63, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:52:42,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=282712.8333333333, ans=0.125 2024-06-20 22:52:49,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282731.1666666667, ans=0.1 2024-06-20 22:52:52,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=282731.1666666667, ans=0.125 2024-06-20 22:53:00,128 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.941e+02 2.054e+02 2.219e+02 2.866e+02, threshold=4.109e+02, percent-clipped=0.0 2024-06-20 22:53:19,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=282786.1666666667, ans=0.125 2024-06-20 22:53:20,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=282786.1666666667, ans=0.025 2024-06-20 22:53:25,609 INFO [train.py:1028] (1/2) Epoch 16, batch 2500, loss[loss=0.2118, simple_loss=0.2565, pruned_loss=0.08354, over 13210.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2668, pruned_loss=0.08313, over 2586771.16 frames. ], batch size: 83, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:53:28,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=282804.5, ans=0.1 2024-06-20 22:53:32,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=282822.8333333333, ans=0.07 2024-06-20 22:53:33,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=282822.8333333333, ans=0.125 2024-06-20 22:53:44,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282841.1666666667, ans=0.1 2024-06-20 22:53:46,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=282841.1666666667, ans=0.125 2024-06-20 22:53:49,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=282859.5, ans=0.025 2024-06-20 22:54:04,671 INFO [train.py:1028] (1/2) Epoch 16, batch 2550, loss[loss=0.2184, simple_loss=0.2738, pruned_loss=0.08146, over 12621.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2657, pruned_loss=0.08257, over 2588435.46 frames. ], batch size: 22, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:54:06,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.85 vs. 
limit=22.5 2024-06-20 22:54:21,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=282932.8333333333, ans=0.0 2024-06-20 22:54:23,084 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 1.880e+02 2.034e+02 2.239e+02 2.956e+02, threshold=4.068e+02, percent-clipped=0.0 2024-06-20 22:54:24,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=282932.8333333333, ans=0.125 2024-06-20 22:54:32,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=282951.1666666667, ans=0.125 2024-06-20 22:54:41,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=282969.5, ans=0.2 2024-06-20 22:54:42,534 INFO [train.py:1028] (1/2) Epoch 16, batch 2600, loss[loss=0.1983, simple_loss=0.2503, pruned_loss=0.07313, over 13289.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2643, pruned_loss=0.08216, over 2588062.15 frames. ], batch size: 52, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:54:48,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=282987.8333333333, ans=0.1 2024-06-20 22:54:54,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.45 vs. limit=10.0 2024-06-20 22:55:26,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=283061.1666666667, ans=0.125 2024-06-20 22:55:27,574 INFO [train.py:1028] (1/2) Epoch 16, batch 2650, loss[loss=0.2088, simple_loss=0.2511, pruned_loss=0.08326, over 12995.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2627, pruned_loss=0.08144, over 2589286.33 frames. ], batch size: 144, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:55:40,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=283097.8333333333, ans=0.5 2024-06-20 22:55:44,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=283116.1666666667, ans=0.125 2024-06-20 22:55:46,653 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.909e+02 2.053e+02 2.279e+02 2.896e+02, threshold=4.106e+02, percent-clipped=0.0 2024-06-20 22:56:06,505 INFO [train.py:1028] (1/2) Epoch 16, batch 2700, loss[loss=0.1924, simple_loss=0.243, pruned_loss=0.07086, over 13276.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2613, pruned_loss=0.08126, over 2587006.23 frames. ], batch size: 89, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:56:08,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=283171.1666666667, ans=0.0 2024-06-20 22:56:09,660 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=12.0 2024-06-20 22:56:16,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. 
limit=6.0 2024-06-20 22:56:26,915 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.19 vs. limit=15.0 2024-06-20 22:56:30,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=283226.1666666667, ans=0.0 2024-06-20 22:56:35,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=283226.1666666667, ans=0.0 2024-06-20 22:56:46,050 INFO [train.py:1028] (1/2) Epoch 16, batch 2750, loss[loss=0.1982, simple_loss=0.2468, pruned_loss=0.07479, over 13231.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.26, pruned_loss=0.08063, over 2582585.02 frames. ], batch size: 43, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:57:03,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=283299.5, ans=0.2 2024-06-20 22:57:05,726 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.888e+02 1.989e+02 2.105e+02 3.553e+02, threshold=3.979e+02, percent-clipped=0.0 2024-06-20 22:57:10,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=283317.8333333333, ans=0.0 2024-06-20 22:57:27,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=283336.1666666667, ans=0.125 2024-06-20 22:57:31,813 INFO [train.py:1028] (1/2) Epoch 16, batch 2800, loss[loss=0.2146, simple_loss=0.2522, pruned_loss=0.08846, over 10962.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2597, pruned_loss=0.08063, over 2580039.30 frames. ], batch size: 304, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:57:32,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=283354.5, ans=0.125 2024-06-20 22:57:42,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=283372.8333333333, ans=0.125 2024-06-20 22:57:47,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=283391.1666666667, ans=0.125 2024-06-20 22:57:58,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=15.0 2024-06-20 22:57:59,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=283409.5, ans=0.04949747468305833 2024-06-20 22:58:02,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=283427.8333333333, ans=0.125 2024-06-20 22:58:10,060 INFO [train.py:1028] (1/2) Epoch 16, batch 2850, loss[loss=0.1903, simple_loss=0.2389, pruned_loss=0.0709, over 13269.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2592, pruned_loss=0.08076, over 2577665.55 frames. ], batch size: 49, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:58:14,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.48 vs. 
limit=22.5 2024-06-20 22:58:25,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=283482.8333333333, ans=0.125 2024-06-20 22:58:28,780 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 1.881e+02 2.008e+02 2.172e+02 2.782e+02, threshold=4.016e+02, percent-clipped=0.0 2024-06-20 22:58:47,681 INFO [train.py:1028] (1/2) Epoch 16, batch 2900, loss[loss=0.1813, simple_loss=0.2354, pruned_loss=0.06361, over 13167.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2571, pruned_loss=0.0798, over 2585856.41 frames. ], batch size: 55, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:58:49,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=283537.8333333333, ans=0.0 2024-06-20 22:58:56,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=283556.1666666667, ans=0.0 2024-06-20 22:59:12,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=283592.8333333333, ans=0.2 2024-06-20 22:59:21,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.71 vs. limit=15.0 2024-06-20 22:59:27,294 INFO [train.py:1028] (1/2) Epoch 16, batch 2950, loss[loss=0.1916, simple_loss=0.2441, pruned_loss=0.06955, over 13279.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2564, pruned_loss=0.07937, over 2578752.58 frames. ], batch size: 43, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:59:27,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.29 vs. limit=15.0 2024-06-20 22:59:32,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283629.5, ans=0.1 2024-06-20 22:59:52,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=283666.1666666667, ans=0.2 2024-06-20 22:59:54,353 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.856e+02 1.969e+02 2.113e+02 2.922e+02, threshold=3.938e+02, percent-clipped=0.0 2024-06-20 23:00:00,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=283684.5, ans=0.125 2024-06-20 23:00:02,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=283684.5, ans=0.07 2024-06-20 23:00:06,661 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:00:11,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=283702.8333333333, ans=0.2 2024-06-20 23:00:14,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.28 vs. limit=22.5 2024-06-20 23:00:15,047 INFO [train.py:1028] (1/2) Epoch 16, batch 3000, loss[loss=0.1921, simple_loss=0.2414, pruned_loss=0.07138, over 13241.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2554, pruned_loss=0.0789, over 2578195.15 frames. 
], batch size: 59, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:00:15,048 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 23:00:24,217 INFO [train.py:1060] (1/2) Epoch 16, validation: loss=0.1882, simple_loss=0.2529, pruned_loss=0.06175, over 351949.00 frames. 2024-06-20 23:00:24,218 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-20 23:00:31,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=283739.5, ans=0.125 2024-06-20 23:00:35,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=283739.5, ans=0.1 2024-06-20 23:00:54,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=283776.1666666667, ans=0.0 2024-06-20 23:00:55,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=283776.1666666667, ans=0.125 2024-06-20 23:00:57,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=283794.5, ans=0.2 2024-06-20 23:01:05,830 INFO [train.py:1028] (1/2) Epoch 16, batch 3050, loss[loss=0.1936, simple_loss=0.2498, pruned_loss=0.06871, over 13383.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2553, pruned_loss=0.07917, over 2578061.51 frames. ], batch size: 46, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:01:18,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=283831.1666666667, ans=0.1 2024-06-20 23:01:19,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=283831.1666666667, ans=0.0 2024-06-20 23:01:25,566 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.874e+02 1.952e+02 2.178e+02 2.885e+02, threshold=3.905e+02, percent-clipped=0.0 2024-06-20 23:01:26,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.23 vs. limit=15.0 2024-06-20 23:01:30,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=283867.8333333333, ans=0.125 2024-06-20 23:01:31,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=283867.8333333333, ans=0.1 2024-06-20 23:01:31,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2024-06-20 23:01:39,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=283886.1666666667, ans=0.04949747468305833 2024-06-20 23:01:39,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=283886.1666666667, ans=0.0 2024-06-20 23:01:45,404 INFO [train.py:1028] (1/2) Epoch 16, batch 3100, loss[loss=0.2097, simple_loss=0.2524, pruned_loss=0.08349, over 13009.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2541, pruned_loss=0.07852, over 2579017.37 frames. 
], batch size: 144, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:01:56,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=283922.8333333333, ans=0.0 2024-06-20 23:02:02,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=283941.1666666667, ans=0.2 2024-06-20 23:02:04,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=283941.1666666667, ans=0.09899494936611666 2024-06-20 23:02:15,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=283959.5, ans=0.0 2024-06-20 23:02:17,911 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2024-06-20 23:02:24,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=283977.8333333333, ans=0.125 2024-06-20 23:02:32,509 INFO [train.py:1028] (1/2) Epoch 16, batch 3150, loss[loss=0.1973, simple_loss=0.2386, pruned_loss=0.07801, over 12878.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2528, pruned_loss=0.07788, over 2581946.36 frames. ], batch size: 158, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:02:40,610 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.43 vs. limit=6.0 2024-06-20 23:02:48,778 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=15.0 2024-06-20 23:02:53,194 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.862e+02 1.971e+02 2.114e+02 2.636e+02, threshold=3.942e+02, percent-clipped=0.0 2024-06-20 23:02:54,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=284032.8333333333, ans=0.1 2024-06-20 23:02:59,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=284051.1666666667, ans=0.125 2024-06-20 23:03:12,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=284087.8333333333, ans=0.0 2024-06-20 23:03:13,374 INFO [train.py:1028] (1/2) Epoch 16, batch 3200, loss[loss=0.2042, simple_loss=0.2504, pruned_loss=0.07898, over 13058.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2526, pruned_loss=0.07772, over 2581649.61 frames. ], batch size: 55, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:03:17,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.82 vs. limit=15.0 2024-06-20 23:03:19,413 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.69 vs. 
limit=22.5 2024-06-20 23:03:37,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=284142.8333333333, ans=0.1 2024-06-20 23:03:40,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=284142.8333333333, ans=0.0 2024-06-20 23:03:44,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=284142.8333333333, ans=0.09899494936611666 2024-06-20 23:03:53,510 INFO [train.py:1028] (1/2) Epoch 16, batch 3250, loss[loss=0.1842, simple_loss=0.2404, pruned_loss=0.06402, over 13201.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2524, pruned_loss=0.07769, over 2585305.18 frames. ], batch size: 72, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:04:08,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=284197.8333333333, ans=0.0 2024-06-20 23:04:14,228 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.914e+02 2.049e+02 2.274e+02 2.848e+02, threshold=4.099e+02, percent-clipped=0.0 2024-06-20 23:04:14,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.41 vs. limit=6.0 2024-06-20 23:04:18,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=284234.5, ans=0.0 2024-06-20 23:04:25,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=284252.8333333333, ans=0.2 2024-06-20 23:04:32,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=284252.8333333333, ans=0.125 2024-06-20 23:04:37,845 INFO [train.py:1028] (1/2) Epoch 16, batch 3300, loss[loss=0.2271, simple_loss=0.2709, pruned_loss=0.09161, over 12669.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2516, pruned_loss=0.07726, over 2582103.62 frames. ], batch size: 176, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:04:56,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=284307.8333333333, ans=0.0 2024-06-20 23:04:57,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.67 vs. limit=22.5 2024-06-20 23:04:58,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=284307.8333333333, ans=0.125 2024-06-20 23:04:58,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. 
limit=6.0 2024-06-20 23:05:02,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=284307.8333333333, ans=0.04949747468305833 2024-06-20 23:05:07,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=284326.1666666667, ans=0.125 2024-06-20 23:05:09,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=284326.1666666667, ans=0.0 2024-06-20 23:05:13,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=284344.5, ans=0.0 2024-06-20 23:05:20,884 INFO [train.py:1028] (1/2) Epoch 16, batch 3350, loss[loss=0.2056, simple_loss=0.2519, pruned_loss=0.07968, over 12922.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2511, pruned_loss=0.07724, over 2576516.95 frames. ], batch size: 158, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:05:24,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=284362.8333333333, ans=0.125 2024-06-20 23:05:28,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.03 vs. limit=15.0 2024-06-20 23:05:40,700 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.851e+02 1.984e+02 2.117e+02 2.810e+02, threshold=3.969e+02, percent-clipped=0.0 2024-06-20 23:05:43,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.84 vs. limit=10.0 2024-06-20 23:05:46,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=284417.8333333333, ans=0.125 2024-06-20 23:05:51,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=284417.8333333333, ans=0.125 2024-06-20 23:05:56,915 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:05:57,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=284436.1666666667, ans=0.1 2024-06-20 23:06:00,846 INFO [train.py:1028] (1/2) Epoch 16, batch 3400, loss[loss=0.2192, simple_loss=0.2718, pruned_loss=0.08328, over 12550.00 frames. ], tot_loss[loss=0.203, simple_loss=0.251, pruned_loss=0.07754, over 2575553.27 frames. ], batch size: 22, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:06:15,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=284472.8333333333, ans=0.2 2024-06-20 23:06:23,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=284491.1666666667, ans=0.125 2024-06-20 23:06:39,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=284527.8333333333, ans=0.0 2024-06-20 23:06:41,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. 
limit=22.5 2024-06-20 23:06:41,594 INFO [train.py:1028] (1/2) Epoch 16, batch 3450, loss[loss=0.1955, simple_loss=0.2446, pruned_loss=0.07323, over 12720.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2509, pruned_loss=0.07758, over 2576366.82 frames. ], batch size: 176, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:06:48,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=284546.1666666667, ans=0.0 2024-06-20 23:07:07,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=284582.8333333333, ans=0.1 2024-06-20 23:07:08,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=8.54 vs. limit=12.0 2024-06-20 23:07:08,973 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 1.847e+02 1.964e+02 2.134e+02 2.665e+02, threshold=3.928e+02, percent-clipped=0.0 2024-06-20 23:07:13,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.53 vs. limit=22.5 2024-06-20 23:07:14,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=284601.1666666667, ans=0.09899494936611666 2024-06-20 23:07:15,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=284601.1666666667, ans=0.0 2024-06-20 23:07:20,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=284619.5, ans=0.0 2024-06-20 23:07:22,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.81 vs. limit=15.0 2024-06-20 23:07:26,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=284619.5, ans=0.125 2024-06-20 23:07:29,374 INFO [train.py:1028] (1/2) Epoch 16, batch 3500, loss[loss=0.2121, simple_loss=0.2587, pruned_loss=0.08275, over 12947.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2497, pruned_loss=0.07688, over 2576802.86 frames. ], batch size: 33, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:07:31,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=284637.8333333333, ans=0.0 2024-06-20 23:07:31,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.57 vs. limit=15.0 2024-06-20 23:07:39,398 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.28 vs. limit=22.5 2024-06-20 23:07:46,943 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.60 vs. limit=15.0 2024-06-20 23:08:02,378 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.84 vs. limit=22.5 2024-06-20 23:08:10,071 INFO [train.py:1028] (1/2) Epoch 16, batch 3550, loss[loss=0.1803, simple_loss=0.2235, pruned_loss=0.06859, over 13130.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2489, pruned_loss=0.07631, over 2578474.08 frames. 
], batch size: 95, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:08:17,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.62 vs. limit=15.0 2024-06-20 23:08:27,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=284766.1666666667, ans=0.125 2024-06-20 23:08:27,974 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.842e+02 1.928e+02 2.077e+02 2.629e+02, threshold=3.856e+02, percent-clipped=0.0 2024-06-20 23:08:37,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=284784.5, ans=0.2 2024-06-20 23:08:51,038 INFO [train.py:1028] (1/2) Epoch 16, batch 3600, loss[loss=0.1991, simple_loss=0.2469, pruned_loss=0.07564, over 13238.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2484, pruned_loss=0.0762, over 2582641.57 frames. ], batch size: 49, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:08:55,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=284821.1666666667, ans=0.0 2024-06-20 23:08:56,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=284821.1666666667, ans=0.125 2024-06-20 23:09:10,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=284857.8333333333, ans=0.1 2024-06-20 23:09:12,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=284857.8333333333, ans=0.125 2024-06-20 23:09:39,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=284894.5, ans=0.025 2024-06-20 23:09:46,866 INFO [train.py:1028] (1/2) Epoch 16, batch 3650, loss[loss=0.184, simple_loss=0.2301, pruned_loss=0.069, over 13038.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2482, pruned_loss=0.07589, over 2580822.62 frames. ], batch size: 102, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:10:03,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=284931.1666666667, ans=0.125 2024-06-20 23:10:07,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=284931.1666666667, ans=0.0 2024-06-20 23:10:08,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=284931.1666666667, ans=0.1 2024-06-20 23:10:13,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=284949.5, ans=0.09899494936611666 2024-06-20 23:10:15,617 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.908e+02 2.060e+02 2.298e+02 3.353e+02, threshold=4.120e+02, percent-clipped=0.0 2024-06-20 23:10:24,281 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.09 vs. limit=15.0 2024-06-20 23:10:27,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. 
limit=15.0 2024-06-20 23:10:40,563 INFO [train.py:1028] (1/2) Epoch 16, batch 3700, loss[loss=0.1861, simple_loss=0.237, pruned_loss=0.06761, over 13212.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2475, pruned_loss=0.07568, over 2585313.44 frames. ], batch size: 72, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:11:16,995 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=4.360e-01 2024-06-20 23:11:18,088 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:11:28,342 INFO [train.py:1028] (1/2) Epoch 16, batch 3750, loss[loss=0.2029, simple_loss=0.2568, pruned_loss=0.07453, over 12496.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2476, pruned_loss=0.0757, over 2587782.63 frames. ], batch size: 22, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:11:32,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=285096.1666666667, ans=0.025 2024-06-20 23:11:50,705 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.804e+02 1.922e+02 2.061e+02 3.374e+02, threshold=3.845e+02, percent-clipped=0.0 2024-06-20 23:12:03,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285151.1666666667, ans=0.1 2024-06-20 23:12:10,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285169.5, ans=0.1 2024-06-20 23:12:13,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=285187.8333333333, ans=0.0 2024-06-20 23:12:14,340 INFO [train.py:1028] (1/2) Epoch 16, batch 3800, loss[loss=0.1992, simple_loss=0.2463, pruned_loss=0.07605, over 13195.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2471, pruned_loss=0.07544, over 2585223.80 frames. ], batch size: 83, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:12:39,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=285224.5, ans=0.0 2024-06-20 23:12:40,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=285224.5, ans=0.125 2024-06-20 23:12:41,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=285224.5, ans=0.0 2024-06-20 23:12:48,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=285242.8333333333, ans=0.0 2024-06-20 23:13:01,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=285242.8333333333, ans=0.07 2024-06-20 23:13:09,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=285261.1666666667, ans=0.2 2024-06-20 23:13:12,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=285261.1666666667, ans=0.125 2024-06-20 23:13:14,235 INFO [train.py:1028] (1/2) Epoch 16, batch 3850, loss[loss=0.1849, simple_loss=0.2247, pruned_loss=0.07259, over 13023.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2473, pruned_loss=0.07552, over 2584113.22 frames. 
], batch size: 144, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:13:20,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=285279.5, ans=0.0 2024-06-20 23:13:29,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=285297.8333333333, ans=0.125 2024-06-20 23:13:38,052 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.841e+02 1.938e+02 2.078e+02 2.531e+02, threshold=3.875e+02, percent-clipped=0.0 2024-06-20 23:13:40,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=285316.1666666667, ans=0.07 2024-06-20 23:13:52,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=285352.8333333333, ans=0.0 2024-06-20 23:13:56,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0 2024-06-20 23:14:00,648 INFO [train.py:1028] (1/2) Epoch 16, batch 3900, loss[loss=0.2006, simple_loss=0.2489, pruned_loss=0.07617, over 13246.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2467, pruned_loss=0.07538, over 2587139.84 frames. ], batch size: 83, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:14:02,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=285371.1666666667, ans=0.025 2024-06-20 23:14:05,790 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:14:10,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=285389.5, ans=0.125 2024-06-20 23:14:18,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=285407.8333333333, ans=0.0 2024-06-20 23:14:20,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=285407.8333333333, ans=0.2 2024-06-20 23:14:24,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=285407.8333333333, ans=0.125 2024-06-20 23:14:24,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.19 vs. limit=22.5 2024-06-20 23:14:34,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=285426.1666666667, ans=0.125 2024-06-20 23:14:41,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=285444.5, ans=0.125 2024-06-20 23:14:46,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=285444.5, ans=0.1 2024-06-20 23:14:48,911 INFO [train.py:1028] (1/2) Epoch 16, batch 3950, loss[loss=0.205, simple_loss=0.2451, pruned_loss=0.08245, over 13117.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2463, pruned_loss=0.07511, over 2588205.91 frames. 
], batch size: 132, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:14:53,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=285462.8333333333, ans=0.5 2024-06-20 23:15:16,028 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.861e+02 1.991e+02 2.171e+02 2.653e+02, threshold=3.982e+02, percent-clipped=0.0 2024-06-20 23:15:32,934 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.11 vs. limit=22.5 2024-06-20 23:15:46,867 INFO [train.py:1028] (1/2) Epoch 16, batch 4000, loss[loss=0.2003, simple_loss=0.2478, pruned_loss=0.07643, over 12955.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2457, pruned_loss=0.07504, over 2583431.58 frames. ], batch size: 39, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:15:56,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.39 vs. limit=22.5 2024-06-20 23:16:26,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=285609.5, ans=0.125 2024-06-20 23:16:33,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=285627.8333333333, ans=0.125 2024-06-20 23:16:42,920 INFO [train.py:1028] (1/2) Epoch 16, batch 4050, loss[loss=0.2075, simple_loss=0.2487, pruned_loss=0.0832, over 10843.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.246, pruned_loss=0.07531, over 2580644.82 frames. ], batch size: 303, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:16:54,921 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:17:04,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=285682.8333333333, ans=0.2 2024-06-20 23:17:06,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=285682.8333333333, ans=0.125 2024-06-20 23:17:07,344 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.888e+02 1.987e+02 2.147e+02 2.586e+02, threshold=3.975e+02, percent-clipped=0.0 2024-06-20 23:17:11,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.86 vs. limit=15.0 2024-06-20 23:17:30,632 INFO [train.py:1028] (1/2) Epoch 16, batch 4100, loss[loss=0.21, simple_loss=0.2526, pruned_loss=0.08376, over 13059.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2458, pruned_loss=0.07522, over 2577201.54 frames. 
], batch size: 102, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:17:34,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=285737.8333333333, ans=0.125 2024-06-20 23:17:36,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=285737.8333333333, ans=15.0 2024-06-20 23:17:39,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=285756.1666666667, ans=0.025 2024-06-20 23:17:40,879 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:17:42,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=285756.1666666667, ans=0.0 2024-06-20 23:17:42,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=285756.1666666667, ans=0.125 2024-06-20 23:18:05,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=285792.8333333333, ans=0.0 2024-06-20 23:18:06,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=285792.8333333333, ans=0.0 2024-06-20 23:18:08,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285811.1666666667, ans=0.1 2024-06-20 23:18:17,513 INFO [train.py:1028] (1/2) Epoch 16, batch 4150, loss[loss=0.198, simple_loss=0.2423, pruned_loss=0.07683, over 13116.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.246, pruned_loss=0.07527, over 2576169.41 frames. ], batch size: 55, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:18:47,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=285866.1666666667, ans=0.2 2024-06-20 23:18:48,348 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.854e+02 1.960e+02 2.155e+02 3.018e+02, threshold=3.919e+02, percent-clipped=0.0 2024-06-20 23:19:00,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=285884.5, ans=0.125 2024-06-20 23:19:08,243 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:19:09,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.27 vs. limit=15.0 2024-06-20 23:19:14,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=285902.8333333333, ans=0.125 2024-06-20 23:19:16,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=285921.1666666667, ans=0.0 2024-06-20 23:19:17,812 INFO [train.py:1028] (1/2) Epoch 16, batch 4200, loss[loss=0.2029, simple_loss=0.2465, pruned_loss=0.07962, over 13118.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2456, pruned_loss=0.07523, over 2579382.72 frames. 
], batch size: 103, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:19:21,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=285921.1666666667, ans=0.125 2024-06-20 23:19:32,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=285939.5, ans=0.125 2024-06-20 23:19:35,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=285939.5, ans=0.0 2024-06-20 23:19:38,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=285957.8333333333, ans=0.025 2024-06-20 23:19:50,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=285976.1666666667, ans=0.125 2024-06-20 23:20:05,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=285994.5, ans=0.125 2024-06-20 23:20:12,876 INFO [train.py:1028] (1/2) Epoch 16, batch 4250, loss[loss=0.1961, simple_loss=0.2446, pruned_loss=0.07379, over 13294.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2454, pruned_loss=0.07506, over 2581205.81 frames. ], batch size: 46, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:20:37,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.867e+02 1.996e+02 2.216e+02 2.847e+02, threshold=3.992e+02, percent-clipped=0.0 2024-06-20 23:21:00,988 INFO [train.py:1028] (1/2) Epoch 16, batch 4300, loss[loss=0.1892, simple_loss=0.2436, pruned_loss=0.06743, over 13223.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2449, pruned_loss=0.07479, over 2581300.02 frames. ], batch size: 59, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:21:12,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286122.8333333333, ans=0.1 2024-06-20 23:21:17,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=286141.1666666667, ans=0.2 2024-06-20 23:21:25,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=286141.1666666667, ans=0.0 2024-06-20 23:21:29,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.27 vs. limit=15.0 2024-06-20 23:21:49,698 INFO [train.py:1028] (1/2) Epoch 16, batch 4350, loss[loss=0.1755, simple_loss=0.2287, pruned_loss=0.06113, over 13144.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2447, pruned_loss=0.07484, over 2586376.23 frames. ], batch size: 59, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:22:04,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=286196.1666666667, ans=0.125 2024-06-20 23:22:07,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=286214.5, ans=0.0 2024-06-20 23:22:07,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=286214.5, ans=0.025 2024-06-20 23:22:11,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.21 vs. 
limit=15.0 2024-06-20 23:22:12,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=286214.5, ans=0.05 2024-06-20 23:22:21,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=286232.8333333333, ans=0.0 2024-06-20 23:22:22,707 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.864e+02 1.997e+02 2.170e+02 3.078e+02, threshold=3.995e+02, percent-clipped=0.0 2024-06-20 23:22:29,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=286251.1666666667, ans=0.125 2024-06-20 23:22:34,381 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.33 vs. limit=15.0 2024-06-20 23:22:40,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=286269.5, ans=0.0 2024-06-20 23:22:45,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286287.8333333333, ans=0.1 2024-06-20 23:22:46,587 INFO [train.py:1028] (1/2) Epoch 16, batch 4400, loss[loss=0.2014, simple_loss=0.241, pruned_loss=0.08089, over 13233.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2447, pruned_loss=0.07487, over 2586380.98 frames. ], batch size: 83, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:22:54,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=15.0 2024-06-20 23:22:58,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-20 23:23:13,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=286324.5, ans=0.125 2024-06-20 23:23:16,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=286342.8333333333, ans=0.125 2024-06-20 23:23:19,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=286342.8333333333, ans=0.125 2024-06-20 23:23:25,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=286361.1666666667, ans=0.035 2024-06-20 23:23:28,271 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.319e+00 2024-06-20 23:23:31,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.24 vs. limit=12.0 2024-06-20 23:23:35,985 INFO [train.py:1028] (1/2) Epoch 16, batch 4450, loss[loss=0.1952, simple_loss=0.2498, pruned_loss=0.07029, over 12806.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2455, pruned_loss=0.07519, over 2580621.83 frames. 
], batch size: 33, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:24:00,551 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.863e+02 2.029e+02 2.205e+02 2.881e+02, threshold=4.057e+02, percent-clipped=0.0 2024-06-20 23:24:03,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=286434.5, ans=0.125 2024-06-20 23:24:19,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=286452.8333333333, ans=0.2 2024-06-20 23:24:23,589 INFO [train.py:1028] (1/2) Epoch 16, batch 4500, loss[loss=0.2046, simple_loss=0.2476, pruned_loss=0.0808, over 13266.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2445, pruned_loss=0.07456, over 2585308.45 frames. ], batch size: 89, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:24:52,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=286507.8333333333, ans=0.125 2024-06-20 23:24:53,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=286507.8333333333, ans=0.09899494936611666 2024-06-20 23:24:58,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=286507.8333333333, ans=0.125 2024-06-20 23:25:13,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=286544.5, ans=0.125 2024-06-20 23:25:21,841 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.81 vs. limit=15.0 2024-06-20 23:25:24,246 INFO [train.py:1028] (1/2) Epoch 16, batch 4550, loss[loss=0.184, simple_loss=0.2334, pruned_loss=0.06727, over 13303.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2449, pruned_loss=0.07469, over 2588627.55 frames. ], batch size: 52, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:25:29,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0 2024-06-20 23:25:44,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=286599.5, ans=0.0 2024-06-20 23:25:49,042 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 1.808e+02 1.933e+02 2.089e+02 2.767e+02, threshold=3.867e+02, percent-clipped=0.0 2024-06-20 23:25:49,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286599.5, ans=0.1 2024-06-20 23:26:02,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=286617.8333333333, ans=0.125 2024-06-20 23:26:06,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=286636.1666666667, ans=0.125 2024-06-20 23:26:10,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=286636.1666666667, ans=0.125 2024-06-20 23:26:12,695 INFO [train.py:1028] (1/2) Epoch 16, batch 4600, loss[loss=0.2216, simple_loss=0.2636, pruned_loss=0.08987, over 12536.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2444, pruned_loss=0.0745, over 2583822.49 frames. 
], batch size: 202, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:26:38,976 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2024-06-20 23:26:42,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=286709.5, ans=0.025 2024-06-20 23:26:55,200 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.05 vs. limit=15.0 2024-06-20 23:26:58,953 INFO [train.py:1028] (1/2) Epoch 16, batch 4650, loss[loss=0.1872, simple_loss=0.227, pruned_loss=0.07373, over 13060.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2436, pruned_loss=0.07431, over 2587067.23 frames. ], batch size: 132, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:27:08,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286764.5, ans=0.1 2024-06-20 23:27:22,206 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 1.829e+02 1.958e+02 2.171e+02 3.937e+02, threshold=3.917e+02, percent-clipped=1.0 2024-06-20 23:27:38,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=286801.1666666667, ans=0.0 2024-06-20 23:27:40,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=286801.1666666667, ans=0.125 2024-06-20 23:27:52,297 INFO [train.py:1028] (1/2) Epoch 16, batch 4700, loss[loss=0.2107, simple_loss=0.2578, pruned_loss=0.08183, over 12705.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2432, pruned_loss=0.07406, over 2582034.39 frames. ], batch size: 26, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:27:55,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=286837.8333333333, ans=0.2 2024-06-20 23:28:05,316 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0 2024-06-20 23:28:07,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=286837.8333333333, ans=0.125 2024-06-20 23:28:15,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.42 vs. limit=15.0 2024-06-20 23:28:17,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=286874.5, ans=0.0 2024-06-20 23:28:26,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=286892.8333333333, ans=0.125 2024-06-20 23:28:26,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=286892.8333333333, ans=0.035 2024-06-20 23:28:38,674 INFO [train.py:1028] (1/2) Epoch 16, batch 4750, loss[loss=0.2403, simple_loss=0.2757, pruned_loss=0.1024, over 12502.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2426, pruned_loss=0.07408, over 2578354.77 frames. 
], batch size: 202, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:28:45,969 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.98 vs. limit=15.0 2024-06-20 23:29:04,435 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 1.938e+02 2.108e+02 2.370e+02 3.291e+02, threshold=4.216e+02, percent-clipped=0.0 2024-06-20 23:29:07,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=286984.5, ans=0.125 2024-06-20 23:29:26,555 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.95 vs. limit=10.0 2024-06-20 23:29:27,728 INFO [train.py:1028] (1/2) Epoch 16, batch 4800, loss[loss=0.1813, simple_loss=0.2356, pruned_loss=0.06355, over 13270.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2424, pruned_loss=0.07422, over 2575306.72 frames. ], batch size: 63, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:29:29,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-20 23:29:33,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=287021.1666666667, ans=0.0 2024-06-20 23:29:34,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=287021.1666666667, ans=0.0 2024-06-20 23:29:48,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=287057.8333333333, ans=0.0 2024-06-20 23:30:07,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=287094.5, ans=0.125 2024-06-20 23:30:10,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=287094.5, ans=0.1 2024-06-20 23:30:10,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=287094.5, ans=0.035 2024-06-20 23:30:13,967 INFO [train.py:1028] (1/2) Epoch 16, batch 4850, loss[loss=0.1931, simple_loss=0.2329, pruned_loss=0.07665, over 13234.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2423, pruned_loss=0.07417, over 2573940.68 frames. ], batch size: 89, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:30:35,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=287131.1666666667, ans=0.1 2024-06-20 23:30:40,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=287149.5, ans=0.125 2024-06-20 23:30:43,593 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:30:52,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. 
limit=6.0 2024-06-20 23:30:53,055 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.829e+02 1.967e+02 2.103e+02 2.788e+02, threshold=3.934e+02, percent-clipped=0.0 2024-06-20 23:30:59,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=287167.8333333333, ans=0.125 2024-06-20 23:31:03,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=287167.8333333333, ans=0.125 2024-06-20 23:31:06,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287167.8333333333, ans=0.1 2024-06-20 23:31:19,299 INFO [train.py:1028] (1/2) Epoch 16, batch 4900, loss[loss=0.1782, simple_loss=0.2292, pruned_loss=0.06363, over 13190.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2426, pruned_loss=0.07426, over 2575920.59 frames. ], batch size: 59, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:31:41,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=287241.1666666667, ans=0.0 2024-06-20 23:32:06,114 INFO [train.py:1028] (1/2) Epoch 16, batch 4950, loss[loss=0.1839, simple_loss=0.2219, pruned_loss=0.073, over 11016.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2418, pruned_loss=0.07411, over 2570219.16 frames. ], batch size: 303, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:32:30,384 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.836e+02 1.982e+02 2.125e+02 2.832e+02, threshold=3.964e+02, percent-clipped=0.0 2024-06-20 23:32:47,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=287369.5, ans=0.0 2024-06-20 23:32:53,564 INFO [train.py:1028] (1/2) Epoch 16, batch 5000, loss[loss=0.1843, simple_loss=0.2242, pruned_loss=0.07225, over 13149.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.242, pruned_loss=0.07404, over 2574202.91 frames. ], batch size: 95, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:32:59,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=287387.8333333333, ans=0.125 2024-06-20 23:33:05,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=287406.1666666667, ans=0.125 2024-06-20 23:33:10,118 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:33:51,725 INFO [train.py:1028] (1/2) Epoch 16, batch 5050, loss[loss=0.1805, simple_loss=0.228, pruned_loss=0.06648, over 12977.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2424, pruned_loss=0.07396, over 2573748.92 frames. ], batch size: 36, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:34:16,928 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.882e+02 1.981e+02 2.255e+02 3.441e+02, threshold=3.962e+02, percent-clipped=0.0 2024-06-20 23:34:41,728 INFO [train.py:1028] (1/2) Epoch 16, batch 5100, loss[loss=0.1826, simple_loss=0.2314, pruned_loss=0.06688, over 12878.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2426, pruned_loss=0.07461, over 2570242.61 frames. 
], batch size: 39, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:34:43,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=287571.1666666667, ans=0.0 2024-06-20 23:34:46,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=287571.1666666667, ans=0.1 2024-06-20 23:34:52,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287589.5, ans=0.1 2024-06-20 23:34:52,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=287589.5, ans=0.2 2024-06-20 23:34:54,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2024-06-20 23:34:57,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=287589.5, ans=0.2 2024-06-20 23:34:58,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=12.0 2024-06-20 23:34:59,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=287589.5, ans=0.125 2024-06-20 23:35:00,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=287607.8333333333, ans=0.025 2024-06-20 23:35:01,824 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:35:06,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=287607.8333333333, ans=0.035 2024-06-20 23:35:09,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=287626.1666666667, ans=0.0 2024-06-20 23:35:25,989 INFO [train.py:1028] (1/2) Epoch 16, batch 5150, loss[loss=0.1898, simple_loss=0.2278, pruned_loss=0.07591, over 13124.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2422, pruned_loss=0.07442, over 2572243.86 frames. ], batch size: 132, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:35:40,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287681.1666666667, ans=0.1 2024-06-20 23:35:46,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287699.5, ans=0.1 2024-06-20 23:35:51,604 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.805e+02 1.910e+02 2.086e+02 2.814e+02, threshold=3.821e+02, percent-clipped=0.0 2024-06-20 23:36:13,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=287736.1666666667, ans=0.95 2024-06-20 23:36:28,200 INFO [train.py:1028] (1/2) Epoch 16, batch 5200, loss[loss=0.1996, simple_loss=0.2401, pruned_loss=0.07951, over 13190.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.242, pruned_loss=0.07446, over 2574521.94 frames. 
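The scaling.py ScheduledFloat lines sample per-module schedules at a fractional batch_count; a value like ans=0.125 or ans=0.035 is simply the schedule's current output at that point in training. A sketch of a piecewise-linear schedule in that spirit; the breakpoints below are invented for illustration, and the real schedules live in the model code:

import bisect

def scheduled_float(batch_count, points):
    """points: sorted (batch_count, value) breakpoints; linear in between,
    clamped to the end values outside the range."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    if batch_count <= xs[0]:
        return ys[0]
    if batch_count >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, batch_count) - 1
    t = (batch_count - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

print(scheduled_float(10000.0, [(0.0, 0.1), (20000.0, 0.0)]))   # 0.05, mid-ramp
print(scheduled_float(286434.5, [(0.0, 0.1), (20000.0, 0.0)]))  # 0.0, past the end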
], batch size: 95, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:36:28,466 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:36:45,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287791.1666666667, ans=0.1 2024-06-20 23:37:00,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=287809.5, ans=0.2 2024-06-20 23:37:06,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=287827.8333333333, ans=0.125 2024-06-20 23:37:11,190 INFO [train.py:1028] (1/2) Epoch 16, batch 5250, loss[loss=0.1844, simple_loss=0.2447, pruned_loss=0.06201, over 13225.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2425, pruned_loss=0.07458, over 2570865.23 frames. ], batch size: 52, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:37:31,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=287882.8333333333, ans=0.125 2024-06-20 23:37:33,213 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.860e+02 2.011e+02 2.207e+02 3.279e+02, threshold=4.021e+02, percent-clipped=0.0 2024-06-20 23:37:54,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=287919.5, ans=0.125 2024-06-20 23:37:55,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=287919.5, ans=0.0 2024-06-20 23:37:56,947 INFO [train.py:1028] (1/2) Epoch 16, batch 5300, loss[loss=0.2003, simple_loss=0.2434, pruned_loss=0.07864, over 13019.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2425, pruned_loss=0.07471, over 2566837.99 frames. ], batch size: 144, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:38:12,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=287956.1666666667, ans=0.125 2024-06-20 23:38:14,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287956.1666666667, ans=0.1 2024-06-20 23:38:41,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=288011.1666666667, ans=0.125 2024-06-20 23:38:50,038 INFO [train.py:1028] (1/2) Epoch 16, batch 5350, loss[loss=0.2207, simple_loss=0.2708, pruned_loss=0.0853, over 11015.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2422, pruned_loss=0.07442, over 2572446.81 frames. ], batch size: 16, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:38:56,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=288029.5, ans=0.2 2024-06-20 23:38:59,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.84 vs. limit=15.0 2024-06-20 23:39:22,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. 
limit=15.0 2024-06-20 23:39:24,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=288066.1666666667, ans=0.125 2024-06-20 23:39:25,364 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.852e+02 1.936e+02 2.092e+02 2.983e+02, threshold=3.872e+02, percent-clipped=0.0 2024-06-20 23:39:28,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=288084.5, ans=0.2 2024-06-20 23:39:29,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2024-06-20 23:39:37,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288084.5, ans=0.1 2024-06-20 23:39:45,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.56 vs. limit=22.5 2024-06-20 23:39:47,933 INFO [train.py:1028] (1/2) Epoch 16, batch 5400, loss[loss=0.214, simple_loss=0.2485, pruned_loss=0.08978, over 12216.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2422, pruned_loss=0.07482, over 2565430.63 frames. ], batch size: 240, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:39:53,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288121.1666666667, ans=0.1 2024-06-20 23:40:01,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=288139.5, ans=0.2 2024-06-20 23:40:01,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288139.5, ans=0.1 2024-06-20 23:40:23,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=288176.1666666667, ans=0.125 2024-06-20 23:40:26,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=288194.5, ans=0.2 2024-06-20 23:40:35,583 INFO [train.py:1028] (1/2) Epoch 16, batch 5450, loss[loss=0.2012, simple_loss=0.2504, pruned_loss=0.07602, over 12967.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2425, pruned_loss=0.07456, over 2569306.69 frames. 
], batch size: 26, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:40:36,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=288212.8333333333, ans=0.125 2024-06-20 23:40:37,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=288212.8333333333, ans=0.2 2024-06-20 23:40:41,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=288212.8333333333, ans=0.125 2024-06-20 23:40:50,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=288231.1666666667, ans=0.0 2024-06-20 23:41:00,537 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.839e+02 1.953e+02 2.091e+02 3.274e+02, threshold=3.906e+02, percent-clipped=0.0 2024-06-20 23:41:09,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=288267.8333333333, ans=0.0 2024-06-20 23:41:11,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=288267.8333333333, ans=0.125 2024-06-20 23:41:17,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=288286.1666666667, ans=0.125 2024-06-20 23:41:20,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=288286.1666666667, ans=0.025 2024-06-20 23:41:24,226 INFO [train.py:1028] (1/2) Epoch 16, batch 5500, loss[loss=0.2292, simple_loss=0.262, pruned_loss=0.09819, over 12237.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2424, pruned_loss=0.07461, over 2563792.66 frames. ], batch size: 241, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:41:44,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.57 vs. limit=15.0 2024-06-20 23:41:52,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=288341.1666666667, ans=0.1 2024-06-20 23:42:07,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=288377.8333333333, ans=0.125 2024-06-20 23:42:24,231 INFO [train.py:1028] (1/2) Epoch 16, batch 5550, loss[loss=0.1975, simple_loss=0.2556, pruned_loss=0.06971, over 13295.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2422, pruned_loss=0.07412, over 2567862.27 frames. 
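The "Whitening: ... metric=M vs. limit=L" lines compare a whiteness statistic of some module's activations against a limit. One such statistic is the ratio mean(lambda^2)/mean(lambda)^2 over the eigenvalues of the feature covariance: it is exactly 1.0 when the covariance is a multiple of the identity and grows as the spectrum spreads. The sketch below illustrates that idea and is not claimed to be icefall's exact formula:

import torch

def whitening_metric(x):
    # x: (num_frames, num_channels); returns >= 1.0, with 1.0 meaning
    # perfectly "white" (identity-like covariance) features.
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return float((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20))

white = torch.randn(10000, 192)
print(whitening_metric(white))                                   # close to 1.0
print(whitening_metric(white * torch.linspace(0.1, 3.0, 192)))   # much larger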
], batch size: 43, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:42:43,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=288432.8333333333, ans=0.0 2024-06-20 23:42:48,524 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.833e+02 1.956e+02 2.099e+02 2.857e+02, threshold=3.913e+02, percent-clipped=0.0 2024-06-20 23:43:06,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=288469.5, ans=0.0 2024-06-20 23:43:09,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=288469.5, ans=0.125 2024-06-20 23:43:11,283 INFO [train.py:1028] (1/2) Epoch 16, batch 5600, loss[loss=0.1893, simple_loss=0.2397, pruned_loss=0.06946, over 13240.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2424, pruned_loss=0.0741, over 2570941.10 frames. ], batch size: 89, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:43:13,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.56 vs. limit=15.0 2024-06-20 23:43:21,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=288506.1666666667, ans=0.125 2024-06-20 23:43:23,471 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.14 vs. limit=15.0 2024-06-20 23:43:45,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=288542.8333333333, ans=0.125 2024-06-20 23:43:48,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=288561.1666666667, ans=0.125 2024-06-20 23:43:48,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2024-06-20 23:43:51,011 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:43:58,181 INFO [train.py:1028] (1/2) Epoch 16, batch 5650, loss[loss=0.2033, simple_loss=0.2501, pruned_loss=0.07828, over 12565.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2425, pruned_loss=0.07391, over 2575784.12 frames. ], batch size: 202, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:43:59,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=288579.5, ans=0.0 2024-06-20 23:44:07,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. 
limit=15.0 2024-06-20 23:44:08,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=288597.8333333333, ans=0.125 2024-06-20 23:44:09,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=288597.8333333333, ans=0.125 2024-06-20 23:44:14,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=288597.8333333333, ans=0.0 2024-06-20 23:44:14,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=288597.8333333333, ans=0.125 2024-06-20 23:44:23,338 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.830e+02 1.952e+02 2.157e+02 2.981e+02, threshold=3.905e+02, percent-clipped=0.0 2024-06-20 23:44:25,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=288616.1666666667, ans=0.0 2024-06-20 23:44:39,179 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:44:45,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=288652.8333333333, ans=0.2 2024-06-20 23:44:50,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.01 vs. limit=10.0 2024-06-20 23:44:53,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=288671.1666666667, ans=0.04949747468305833 2024-06-20 23:44:53,688 INFO [train.py:1028] (1/2) Epoch 16, batch 5700, loss[loss=0.2016, simple_loss=0.2517, pruned_loss=0.07574, over 13276.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2421, pruned_loss=0.07383, over 2579262.65 frames. ], batch size: 63, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:44:56,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=288671.1666666667, ans=0.125 2024-06-20 23:45:04,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=288689.5, ans=0.0 2024-06-20 23:45:08,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=288689.5, ans=0.125 2024-06-20 23:45:25,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=288707.8333333333, ans=0.1 2024-06-20 23:45:30,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=288726.1666666667, ans=0.0 2024-06-20 23:45:30,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=288726.1666666667, ans=0.0 2024-06-20 23:45:32,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=288726.1666666667, ans=0.025 2024-06-20 23:45:36,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=288744.5, ans=0.125 2024-06-20 23:45:43,198 INFO [train.py:1028] (1/2) Epoch 16, batch 5750, loss[loss=0.2179, simple_loss=0.2588, pruned_loss=0.08854, over 12837.00 frames. 
], tot_loss[loss=0.1956, simple_loss=0.2429, pruned_loss=0.07418, over 2580502.12 frames. ], batch size: 177, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:45:44,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.14 vs. limit=15.0 2024-06-20 23:45:47,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=288762.8333333333, ans=0.0 2024-06-20 23:45:48,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=288762.8333333333, ans=0.2 2024-06-20 23:45:49,179 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.84 vs. limit=5.0 2024-06-20 23:46:00,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=288799.5, ans=0.05 2024-06-20 23:46:04,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=288799.5, ans=0.05 2024-06-20 23:46:04,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=288799.5, ans=0.125 2024-06-20 23:46:04,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288799.5, ans=0.1 2024-06-20 23:46:07,142 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.855e+02 2.000e+02 2.153e+02 2.672e+02, threshold=4.001e+02, percent-clipped=0.0 2024-06-20 23:46:10,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=288817.8333333333, ans=0.5 2024-06-20 23:46:11,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=288817.8333333333, ans=0.07 2024-06-20 23:46:29,120 INFO [train.py:1028] (1/2) Epoch 16, batch 5800, loss[loss=0.1984, simple_loss=0.2399, pruned_loss=0.07844, over 12824.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2444, pruned_loss=0.07507, over 2580236.99 frames. ], batch size: 177, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:46:34,694 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.67 vs. limit=15.0 2024-06-20 23:46:35,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=288854.5, ans=0.95 2024-06-20 23:46:48,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.73 vs. limit=15.0 2024-06-20 23:47:15,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=288927.8333333333, ans=0.2 2024-06-20 23:47:16,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=288927.8333333333, ans=0.0 2024-06-20 23:47:18,847 INFO [train.py:1028] (1/2) Epoch 16, batch 5850, loss[loss=0.2038, simple_loss=0.2512, pruned_loss=0.07819, over 12575.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2463, pruned_loss=0.07581, over 2578059.25 frames. 
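Between batch 5750 and batch 5800 the printed grad_scale doubles from 32.0 to 64.0, and by batch 6200 further down it is back at 32.0: the signature of dynamic fp16 loss scaling, which grows the scale after a long overflow-free run and halves it on overflow. A toy version of the update rule; the factors and interval are the conventional defaults and assumed here (torch.cuda.amp.GradScaler implements the production version):

# Toy dynamic loss-scale update matching the pattern 32.0 -> 64.0 -> 32.0
# seen in these lines. Factors and interval are assumptions.
def update_scale(scale, found_inf, good_steps,
                 growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    if found_inf:                      # overflow: halve and restart the counter
        return scale * backoff_factor, 0
    good_steps += 1
    if good_steps >= growth_interval:  # long clean run: double the scale
        return scale * growth_factor, 0
    return scale, good_steps

scale, good = 32.0, 0
for step in range(2000):
    scale, good = update_scale(scale, found_inf=False, good_steps=good)
print(scale)  # 64.0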
], batch size: 202, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:47:28,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=288964.5, ans=0.0 2024-06-20 23:47:40,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=288982.8333333333, ans=0.1 2024-06-20 23:47:40,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.31 vs. limit=22.5 2024-06-20 23:47:41,849 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 1.910e+02 2.099e+02 2.263e+02 2.938e+02, threshold=4.198e+02, percent-clipped=0.0 2024-06-20 23:47:56,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=289019.5, ans=0.125 2024-06-20 23:48:00,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=289019.5, ans=0.125 2024-06-20 23:48:02,613 INFO [train.py:1028] (1/2) Epoch 16, batch 5900, loss[loss=0.1919, simple_loss=0.2304, pruned_loss=0.07668, over 13090.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2475, pruned_loss=0.07621, over 2577297.89 frames. ], batch size: 121, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:48:20,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=289074.5, ans=0.0 2024-06-20 23:48:43,447 INFO [train.py:1028] (1/2) Epoch 16, batch 5950, loss[loss=0.19, simple_loss=0.2332, pruned_loss=0.07339, over 13082.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2488, pruned_loss=0.07662, over 2581653.13 frames. ], batch size: 121, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:48:45,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=289129.5, ans=0.125 2024-06-20 23:48:51,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.33 vs. limit=15.0 2024-06-20 23:48:58,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=289147.8333333333, ans=0.125 2024-06-20 23:49:03,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=289166.1666666667, ans=0.0 2024-06-20 23:49:04,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=289166.1666666667, ans=0.125 2024-06-20 23:49:09,095 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.895e+02 2.011e+02 2.184e+02 3.157e+02, threshold=4.023e+02, percent-clipped=0.0 2024-06-20 23:49:14,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=289184.5, ans=0.0 2024-06-20 23:49:32,758 INFO [train.py:1028] (1/2) Epoch 16, batch 6000, loss[loss=0.2613, simple_loss=0.2898, pruned_loss=0.1163, over 12298.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2497, pruned_loss=0.07708, over 2575927.81 frames. 
], batch size: 241, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:49:32,759 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-20 23:49:43,867 INFO [train.py:1060] (1/2) Epoch 16, validation: loss=0.1885, simple_loss=0.2532, pruned_loss=0.0619, over 351949.00 frames. 2024-06-20 23:49:43,868 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-20 23:49:45,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.17 vs. limit=15.0 2024-06-20 23:49:55,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=289239.5, ans=0.2 2024-06-20 23:50:10,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289257.8333333333, ans=0.1 2024-06-20 23:50:20,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2024-06-20 23:50:23,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.57 vs. limit=15.0 2024-06-20 23:50:28,979 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.005e+00 2024-06-20 23:50:30,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=289294.5, ans=0.2 2024-06-20 23:50:31,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=289294.5, ans=0.125 2024-06-20 23:50:39,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=289294.5, ans=0.05 2024-06-20 23:50:40,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=289312.8333333333, ans=0.09899494936611666 2024-06-20 23:50:41,536 INFO [train.py:1028] (1/2) Epoch 16, batch 6050, loss[loss=0.1965, simple_loss=0.2451, pruned_loss=0.0739, over 12929.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2508, pruned_loss=0.07762, over 2577845.09 frames. ], batch size: 39, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:50:41,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=289312.8333333333, ans=0.0 2024-06-20 23:51:01,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=289331.1666666667, ans=0.125 2024-06-20 23:51:05,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=289331.1666666667, ans=0.125 2024-06-20 23:51:09,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=289349.5, ans=0.0 2024-06-20 23:51:12,658 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 1.944e+02 2.087e+02 2.324e+02 3.316e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-20 23:51:26,578 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.48 vs. 
limit=22.5 2024-06-20 23:51:36,271 INFO [train.py:1028] (1/2) Epoch 16, batch 6100, loss[loss=0.2035, simple_loss=0.2427, pruned_loss=0.08217, over 13129.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2515, pruned_loss=0.07784, over 2579514.22 frames. ], batch size: 121, lr: 3.63e-03, grad_scale: 64.0 2024-06-20 23:51:38,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=289404.5, ans=0.1 2024-06-20 23:51:56,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=289441.1666666667, ans=0.125 2024-06-20 23:52:11,252 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.70 vs. limit=22.5 2024-06-20 23:52:18,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=289477.8333333333, ans=0.125 2024-06-20 23:52:26,496 INFO [train.py:1028] (1/2) Epoch 16, batch 6150, loss[loss=0.2268, simple_loss=0.263, pruned_loss=0.09533, over 11013.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2536, pruned_loss=0.07869, over 2579486.94 frames. ], batch size: 304, lr: 3.63e-03, grad_scale: 64.0 2024-06-20 23:52:43,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289514.5, ans=0.1 2024-06-20 23:52:46,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=289532.8333333333, ans=0.0 2024-06-20 23:52:50,721 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.959e+02 2.236e+02 2.570e+02 4.175e+02, threshold=4.473e+02, percent-clipped=0.0 2024-06-20 23:52:51,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2024-06-20 23:52:51,505 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=8.0 2024-06-20 23:53:03,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=289551.1666666667, ans=0.125 2024-06-20 23:53:05,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=289551.1666666667, ans=0.125 2024-06-20 23:53:15,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=289569.5, ans=0.125 2024-06-20 23:53:21,648 INFO [train.py:1028] (1/2) Epoch 16, batch 6200, loss[loss=0.2294, simple_loss=0.2776, pruned_loss=0.09065, over 13248.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2551, pruned_loss=0.07916, over 2576749.63 frames. ], batch size: 89, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:53:27,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=289587.8333333333, ans=0.1 2024-06-20 23:53:29,407 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.53 vs. 
limit=15.0 2024-06-20 23:53:37,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=289606.1666666667, ans=0.0 2024-06-20 23:54:03,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289642.8333333333, ans=0.1 2024-06-20 23:54:05,967 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:54:15,401 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:54:19,059 INFO [train.py:1028] (1/2) Epoch 16, batch 6250, loss[loss=0.2006, simple_loss=0.2492, pruned_loss=0.07605, over 13205.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2564, pruned_loss=0.07973, over 2569764.09 frames. ], batch size: 83, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:54:19,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=289679.5, ans=0.125 2024-06-20 23:54:19,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.50 vs. limit=22.5 2024-06-20 23:54:25,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=289679.5, ans=0.2 2024-06-20 23:54:35,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=289697.8333333333, ans=0.125 2024-06-20 23:54:44,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=289716.1666666667, ans=0.125 2024-06-20 23:54:44,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.30 vs. limit=15.0 2024-06-20 23:54:44,854 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.008e+02 2.208e+02 2.592e+02 3.461e+02, threshold=4.417e+02, percent-clipped=0.0 2024-06-20 23:54:46,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=289716.1666666667, ans=0.0 2024-06-20 23:54:53,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=289734.5, ans=0.2 2024-06-20 23:54:57,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=289752.8333333333, ans=0.2 2024-06-20 23:55:06,206 INFO [train.py:1028] (1/2) Epoch 16, batch 6300, loss[loss=0.1912, simple_loss=0.2478, pruned_loss=0.06725, over 11449.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2581, pruned_loss=0.08023, over 2564874.83 frames. 
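The batch size field swings from 16 ("batch size: 16" continues just below) to over 300 across these entries because batches are packed to a total-duration budget rather than a fixed count, so batch size is roughly the budget divided by the average cut length. A rough sketch, using an illustrative 550 s budget matching this run's max_duration setting:

# Rough sketch only: duration-constrained batching makes batch size vary
# inversely with utterance length.
def approx_batch_size(duration_budget_s, avg_cut_duration_s):
    return max(1, int(duration_budget_s // avg_cut_duration_s))

print(approx_batch_size(550.0, 1.8))   # ~305 short cuts per batch
print(approx_batch_size(550.0, 34.0))  # ~16 long cuts per batch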
], batch size: 16, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:55:06,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=289771.1666666667, ans=0.1 2024-06-20 23:55:14,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=289771.1666666667, ans=0.125 2024-06-20 23:55:14,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=289771.1666666667, ans=0.125 2024-06-20 23:55:15,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=289789.5, ans=0.0 2024-06-20 23:55:21,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=289789.5, ans=0.0 2024-06-20 23:55:22,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289789.5, ans=0.1 2024-06-20 23:55:25,266 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2024-06-20 23:55:53,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=289862.8333333333, ans=0.125 2024-06-20 23:55:54,192 INFO [train.py:1028] (1/2) Epoch 16, batch 6350, loss[loss=0.265, simple_loss=0.3052, pruned_loss=0.1124, over 12575.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2596, pruned_loss=0.08069, over 2574343.62 frames. ], batch size: 202, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:56:10,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2024-06-20 23:56:17,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289899.5, ans=0.1 2024-06-20 23:56:24,069 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.023e+02 2.243e+02 2.516e+02 3.825e+02, threshold=4.486e+02, percent-clipped=0.0 2024-06-20 23:56:25,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=289899.5, ans=0.2 2024-06-20 23:56:37,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=289936.1666666667, ans=0.0 2024-06-20 23:56:42,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.37 vs. limit=10.0 2024-06-20 23:56:43,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=289936.1666666667, ans=0.0 2024-06-20 23:56:43,248 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:56:46,658 INFO [train.py:1028] (1/2) Epoch 16, batch 6400, loss[loss=0.2264, simple_loss=0.2704, pruned_loss=0.09124, over 13222.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2617, pruned_loss=0.08177, over 2576226.97 frames. 
], batch size: 67, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:57:21,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=289991.1666666667, ans=0.0 2024-06-20 23:57:38,636 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.86 vs. limit=22.5 2024-06-20 23:57:41,404 INFO [train.py:1028] (1/2) Epoch 16, batch 6450, loss[loss=0.2415, simple_loss=0.288, pruned_loss=0.09748, over 12550.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2633, pruned_loss=0.0825, over 2582234.53 frames. ], batch size: 202, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:57:53,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=290064.5, ans=0.0 2024-06-20 23:57:54,289 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:58:03,770 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.008e+02 2.198e+02 2.547e+02 3.640e+02, threshold=4.396e+02, percent-clipped=0.0 2024-06-20 23:58:10,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-20 23:58:25,329 INFO [train.py:1028] (1/2) Epoch 16, batch 6500, loss[loss=0.2155, simple_loss=0.2527, pruned_loss=0.08911, over 10905.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2651, pruned_loss=0.08289, over 2585990.23 frames. ], batch size: 303, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:59:18,968 INFO [train.py:1028] (1/2) Epoch 16, batch 6550, loss[loss=0.2079, simple_loss=0.2631, pruned_loss=0.07639, over 12467.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2666, pruned_loss=0.0831, over 2589060.98 frames. ], batch size: 22, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:59:19,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=290229.5, ans=0.07 2024-06-20 23:59:27,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.16 vs. limit=12.0 2024-06-20 23:59:39,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=290266.1666666667, ans=0.0 2024-06-20 23:59:43,665 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 2.054e+02 2.209e+02 2.430e+02 3.221e+02, threshold=4.418e+02, percent-clipped=0.0 2024-06-20 23:59:45,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2024-06-21 00:00:01,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=290302.8333333333, ans=0.2 2024-06-21 00:00:05,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=290302.8333333333, ans=0.2 2024-06-21 00:00:08,709 INFO [train.py:1028] (1/2) Epoch 16, batch 6600, loss[loss=0.2019, simple_loss=0.2531, pruned_loss=0.0753, over 13244.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.267, pruned_loss=0.08328, over 2591201.41 frames. 
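Note that tot_loss in these lines is not the single-batch loss: its frame count hovers near 2.58M and its value drifts slowly, consistent with a decayed, frame-weighted running sum of recent batch statistics. One plausible shape for that bookkeeping; the decay constant is a guess, and icefall's MetricsTracker arithmetic may differ:

# Plausible sketch of a decayed frame-weighted average behind "tot_loss".
class TotLoss:
    def __init__(self, decay=0.99):   # decay value is a guess
        self.decay = decay
        self.loss_frames = 0.0        # decayed sum of loss * frames
        self.frames = 0.0             # decayed sum of frames

    def update(self, batch_loss, batch_frames):
        self.loss_frames = self.decay * self.loss_frames + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_frames / self.frames

tracker = TotLoss()
for _ in range(1000):   # steady state: frames -> batch_frames / (1 - decay)
    avg = tracker.update(0.20, 13000.0)
print(f"{avg:.4f}, over {tracker.frames:.2f} frames")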
], batch size: 72, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:00:08,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=290321.1666666667, ans=0.125 2024-06-21 00:00:30,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2024-06-21 00:00:37,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=290357.8333333333, ans=15.0 2024-06-21 00:00:40,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=290376.1666666667, ans=0.125 2024-06-21 00:00:46,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=290376.1666666667, ans=0.05 2024-06-21 00:00:49,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=290394.5, ans=0.025 2024-06-21 00:00:51,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=290394.5, ans=0.025 2024-06-21 00:00:58,853 INFO [train.py:1028] (1/2) Epoch 16, batch 6650, loss[loss=0.2396, simple_loss=0.2871, pruned_loss=0.09602, over 12913.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2685, pruned_loss=0.08372, over 2584769.02 frames. ], batch size: 158, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:01:12,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=290431.1666666667, ans=0.125 2024-06-21 00:01:21,698 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.75 vs. limit=6.0 2024-06-21 00:01:22,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.64 vs. limit=22.5 2024-06-21 00:01:23,735 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.53 vs. limit=15.0 2024-06-21 00:01:25,306 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0 2024-06-21 00:01:25,740 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.042e+02 2.235e+02 2.498e+02 3.205e+02, threshold=4.470e+02, percent-clipped=0.0 2024-06-21 00:01:30,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=290467.8333333333, ans=0.1 2024-06-21 00:01:31,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=290467.8333333333, ans=0.0 2024-06-21 00:01:48,357 INFO [train.py:1028] (1/2) Epoch 16, batch 6700, loss[loss=0.2343, simple_loss=0.2784, pruned_loss=0.09513, over 12746.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2705, pruned_loss=0.08448, over 2584868.49 frames. 
], batch size: 176, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:01:54,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=290504.5, ans=0.125 2024-06-21 00:02:22,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=290559.5, ans=0.0 2024-06-21 00:02:29,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290559.5, ans=0.1 2024-06-21 00:02:33,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290577.8333333333, ans=0.1 2024-06-21 00:02:38,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=290577.8333333333, ans=0.0 2024-06-21 00:02:43,111 INFO [train.py:1028] (1/2) Epoch 16, batch 6750, loss[loss=0.2592, simple_loss=0.2976, pruned_loss=0.1104, over 12231.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2709, pruned_loss=0.08492, over 2577615.14 frames. ], batch size: 240, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:03:06,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=290614.5, ans=0.2 2024-06-21 00:03:14,139 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 1.950e+02 2.113e+02 2.278e+02 2.957e+02, threshold=4.226e+02, percent-clipped=0.0 2024-06-21 00:03:31,633 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2024-06-21 00:03:36,817 INFO [train.py:1028] (1/2) Epoch 16, batch 6800, loss[loss=0.2013, simple_loss=0.2535, pruned_loss=0.07452, over 13221.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.272, pruned_loss=0.0851, over 2581043.26 frames. ], batch size: 67, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:04:10,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.93 vs. limit=22.5 2024-06-21 00:04:14,130 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:04:22,698 INFO [train.py:1028] (1/2) Epoch 16, batch 6850, loss[loss=0.22, simple_loss=0.2804, pruned_loss=0.0798, over 13275.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2725, pruned_loss=0.08525, over 2584489.17 frames. 
], batch size: 63, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:04:32,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290797.8333333333, ans=0.1 2024-06-21 00:04:33,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=290797.8333333333, ans=0.125 2024-06-21 00:04:40,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=290797.8333333333, ans=0.035 2024-06-21 00:04:48,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290816.1666666667, ans=0.125 2024-06-21 00:04:49,124 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 1.990e+02 2.193e+02 2.477e+02 3.705e+02, threshold=4.387e+02, percent-clipped=0.0 2024-06-21 00:04:54,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=290834.5, ans=0.0 2024-06-21 00:04:55,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=290834.5, ans=0.125 2024-06-21 00:04:56,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=290834.5, ans=0.125 2024-06-21 00:05:00,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=290852.8333333333, ans=0.125 2024-06-21 00:05:17,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=290871.1666666667, ans=0.125 2024-06-21 00:05:18,513 INFO [train.py:1028] (1/2) Epoch 16, batch 6900, loss[loss=0.2228, simple_loss=0.2813, pruned_loss=0.08211, over 13276.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2733, pruned_loss=0.08555, over 2586651.25 frames. ], batch size: 49, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:05:21,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=290871.1666666667, ans=0.125 2024-06-21 00:05:34,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=290889.5, ans=0.0 2024-06-21 00:05:50,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=290926.1666666667, ans=0.1 2024-06-21 00:06:04,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=290944.5, ans=0.0 2024-06-21 00:06:13,579 INFO [train.py:1028] (1/2) Epoch 16, batch 6950, loss[loss=0.2286, simple_loss=0.2917, pruned_loss=0.08281, over 11883.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2734, pruned_loss=0.08546, over 2579522.94 frames. ], batch size: 17, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:06:14,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=290962.8333333333, ans=0.0 2024-06-21 00:06:25,888 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.73 vs. 
limit=22.5 2024-06-21 00:06:28,460 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.50 vs. limit=15.0 2024-06-21 00:06:30,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=290981.1666666667, ans=0.125 2024-06-21 00:06:36,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.48 vs. limit=22.5 2024-06-21 00:06:39,235 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.996e+02 2.159e+02 2.274e+02 2.965e+02, threshold=4.318e+02, percent-clipped=0.0 2024-06-21 00:06:45,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=291017.8333333333, ans=0.125 2024-06-21 00:06:47,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=291017.8333333333, ans=0.04949747468305833 2024-06-21 00:06:57,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=291036.1666666667, ans=0.07 2024-06-21 00:07:00,101 INFO [train.py:1028] (1/2) Epoch 16, batch 7000, loss[loss=0.2269, simple_loss=0.2793, pruned_loss=0.08727, over 12934.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2733, pruned_loss=0.08541, over 2574649.00 frames. ], batch size: 158, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:07:30,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.31 vs. limit=22.5 2024-06-21 00:07:39,261 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.44 vs. limit=15.0 2024-06-21 00:07:47,807 INFO [train.py:1028] (1/2) Epoch 16, batch 7050, loss[loss=0.2453, simple_loss=0.2973, pruned_loss=0.09663, over 12734.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2747, pruned_loss=0.08579, over 2581741.14 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:08:20,428 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.068e+02 2.289e+02 2.568e+02 3.625e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 00:08:20,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2024-06-21 00:08:25,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=291201.1666666667, ans=0.125 2024-06-21 00:08:30,521 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:08:35,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=291219.5, ans=0.025 2024-06-21 00:08:36,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=291219.5, ans=0.0 2024-06-21 00:08:43,304 INFO [train.py:1028] (1/2) Epoch 16, batch 7100, loss[loss=0.2198, simple_loss=0.274, pruned_loss=0.08279, over 13131.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2756, pruned_loss=0.08656, over 2574352.36 frames. 
], batch size: 112, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:08:44,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=291237.8333333333, ans=0.125 2024-06-21 00:08:48,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=291237.8333333333, ans=0.125 2024-06-21 00:09:01,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=291256.1666666667, ans=0.125 2024-06-21 00:09:34,678 INFO [train.py:1028] (1/2) Epoch 16, batch 7150, loss[loss=0.2651, simple_loss=0.3056, pruned_loss=0.1123, over 12501.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2762, pruned_loss=0.0863, over 2572383.10 frames. ], batch size: 202, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:09:39,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=291329.5, ans=0.125 2024-06-21 00:09:43,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2024-06-21 00:09:44,371 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.89 vs. limit=6.0 2024-06-21 00:09:57,696 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2024-06-21 00:10:00,855 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 1.990e+02 2.134e+02 2.356e+02 3.295e+02, threshold=4.268e+02, percent-clipped=0.0 2024-06-21 00:10:02,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2024-06-21 00:10:04,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.45 vs. limit=22.5 2024-06-21 00:10:05,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=291384.5, ans=0.125 2024-06-21 00:10:06,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.63 vs. limit=15.0 2024-06-21 00:10:12,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=291384.5, ans=0.125 2024-06-21 00:10:22,864 INFO [train.py:1028] (1/2) Epoch 16, batch 7200, loss[loss=0.2367, simple_loss=0.2859, pruned_loss=0.09374, over 13180.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2771, pruned_loss=0.08657, over 2578221.47 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:10:38,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=291457.8333333333, ans=0.125 2024-06-21 00:10:48,636 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.17 vs. limit=12.0 2024-06-21 00:10:59,160 INFO [train.py:1028] (1/2) Epoch 16, batch 7250, loss[loss=0.2085, simple_loss=0.2616, pruned_loss=0.07775, over 12887.00 frames. 
], tot_loss[loss=0.225, simple_loss=0.2773, pruned_loss=0.08638, over 2579871.41 frames. ], batch size: 36, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:11:22,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=291549.5, ans=0.125 2024-06-21 00:11:29,719 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 1.997e+02 2.143e+02 2.395e+02 3.371e+02, threshold=4.287e+02, percent-clipped=0.0 2024-06-21 00:11:35,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=291567.8333333333, ans=0.2 2024-06-21 00:11:38,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=291567.8333333333, ans=0.125 2024-06-21 00:11:46,775 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:11:47,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=291586.1666666667, ans=0.125 2024-06-21 00:11:58,746 INFO [train.py:1028] (1/2) Epoch 16, batch 7300, loss[loss=0.2245, simple_loss=0.2757, pruned_loss=0.08661, over 12918.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2785, pruned_loss=0.08696, over 2579895.10 frames. ], batch size: 36, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:12:02,745 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2024-06-21 00:12:18,915 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:12:32,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=291659.5, ans=0.125 2024-06-21 00:12:35,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.46 vs. limit=6.0 2024-06-21 00:12:40,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=291677.8333333333, ans=0.1 2024-06-21 00:12:45,699 INFO [train.py:1028] (1/2) Epoch 16, batch 7350, loss[loss=0.2357, simple_loss=0.2915, pruned_loss=0.08994, over 13306.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2792, pruned_loss=0.08715, over 2581136.76 frames. ], batch size: 46, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:12:45,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=291696.1666666667, ans=0.025 2024-06-21 00:13:12,583 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 1.960e+02 2.120e+02 2.309e+02 2.996e+02, threshold=4.241e+02, percent-clipped=0.0 2024-06-21 00:13:13,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.39 vs. 
limit=22.5 2024-06-21 00:13:26,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=291769.5, ans=0.0 2024-06-21 00:13:27,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=291769.5, ans=0.2 2024-06-21 00:13:34,315 INFO [train.py:1028] (1/2) Epoch 16, batch 7400, loss[loss=0.2463, simple_loss=0.3046, pruned_loss=0.09403, over 13261.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2794, pruned_loss=0.08701, over 2585499.19 frames. ], batch size: 63, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:13:59,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=291824.5, ans=0.125 2024-06-21 00:14:01,896 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.48 vs. limit=15.0 2024-06-21 00:14:05,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291842.8333333333, ans=0.1 2024-06-21 00:14:17,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=291842.8333333333, ans=0.2 2024-06-21 00:14:30,390 INFO [train.py:1028] (1/2) Epoch 16, batch 7450, loss[loss=0.2053, simple_loss=0.2555, pruned_loss=0.07756, over 12646.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2792, pruned_loss=0.08671, over 2577753.00 frames. ], batch size: 29, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:14:48,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=291897.8333333333, ans=0.125 2024-06-21 00:14:55,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=291916.1666666667, ans=0.025 2024-06-21 00:14:59,709 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.009e+02 2.161e+02 2.336e+02 2.937e+02, threshold=4.323e+02, percent-clipped=0.0 2024-06-21 00:15:21,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=291952.8333333333, ans=0.0 2024-06-21 00:15:27,195 INFO [train.py:1028] (1/2) Epoch 16, batch 7500, loss[loss=0.2533, simple_loss=0.2959, pruned_loss=0.1053, over 10635.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2804, pruned_loss=0.08753, over 2576408.83 frames. 
], batch size: 303, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:15:33,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=291971.1666666667, ans=0.2 2024-06-21 00:15:49,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=292007.8333333333, ans=0.025 2024-06-21 00:15:49,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=292007.8333333333, ans=0.1 2024-06-21 00:15:55,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=292026.1666666667, ans=0.035 2024-06-21 00:16:06,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=292044.5, ans=0.0 2024-06-21 00:16:15,129 INFO [train.py:1028] (1/2) Epoch 16, batch 7550, loss[loss=0.2217, simple_loss=0.2702, pruned_loss=0.0866, over 12938.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2814, pruned_loss=0.08828, over 2575801.42 frames. ], batch size: 158, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:16:21,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.30 vs. limit=6.0 2024-06-21 00:16:42,347 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.045e+02 2.229e+02 2.488e+02 3.356e+02, threshold=4.459e+02, percent-clipped=0.0 2024-06-21 00:16:53,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.57 vs. limit=15.0 2024-06-21 00:17:05,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=292136.1666666667, ans=0.125 2024-06-21 00:17:09,794 INFO [train.py:1028] (1/2) Epoch 16, batch 7600, loss[loss=0.2324, simple_loss=0.2889, pruned_loss=0.08796, over 13237.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2815, pruned_loss=0.08816, over 2574385.93 frames. ], batch size: 83, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:17:13,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=292154.5, ans=0.0 2024-06-21 00:17:14,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=292154.5, ans=0.0 2024-06-21 00:17:14,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.95 vs. limit=15.0 2024-06-21 00:17:17,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=292172.8333333333, ans=0.125 2024-06-21 00:17:18,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=292172.8333333333, ans=0.015 2024-06-21 00:17:22,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=292172.8333333333, ans=0.125 2024-06-21 00:18:04,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=292227.8333333333, ans=0.125 2024-06-21 00:18:06,423 INFO [train.py:1028] (1/2) Epoch 16, batch 7650, loss[loss=0.2388, simple_loss=0.2853, pruned_loss=0.09614, over 12994.00 frames. 
], tot_loss[loss=0.2296, simple_loss=0.2822, pruned_loss=0.08851, over 2570940.94 frames. ], batch size: 33, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:18:09,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=292246.1666666667, ans=0.07 2024-06-21 00:18:22,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=292264.5, ans=0.125 2024-06-21 00:18:25,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=292264.5, ans=0.0 2024-06-21 00:18:33,476 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.031e+02 2.163e+02 2.370e+02 8.674e+02, threshold=4.327e+02, percent-clipped=1.0 2024-06-21 00:18:45,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=292301.1666666667, ans=0.0 2024-06-21 00:18:46,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=292319.5, ans=0.0 2024-06-21 00:18:49,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=292319.5, ans=0.0 2024-06-21 00:18:55,144 INFO [train.py:1028] (1/2) Epoch 16, batch 7700, loss[loss=0.2173, simple_loss=0.2779, pruned_loss=0.07833, over 13283.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2826, pruned_loss=0.08856, over 2568676.02 frames. ], batch size: 63, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:19:09,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=292356.1666666667, ans=0.0 2024-06-21 00:19:12,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=292374.5, ans=0.025 2024-06-21 00:19:23,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.28 vs. limit=8.0 2024-06-21 00:19:26,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292392.8333333333, ans=0.1 2024-06-21 00:19:27,466 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.28 vs. limit=15.0 2024-06-21 00:19:41,959 INFO [train.py:1028] (1/2) Epoch 16, batch 7750, loss[loss=0.2316, simple_loss=0.2833, pruned_loss=0.08998, over 13212.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2829, pruned_loss=0.08883, over 2572765.16 frames. ], batch size: 72, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:19:57,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.79 vs. 
limit=22.5 2024-06-21 00:20:03,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=292447.8333333333, ans=0.0 2024-06-21 00:20:15,019 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.111e+02 2.280e+02 2.502e+02 3.219e+02, threshold=4.560e+02, percent-clipped=0.0 2024-06-21 00:20:30,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=292502.8333333333, ans=0.2 2024-06-21 00:20:32,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.72 vs. limit=15.0 2024-06-21 00:20:33,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=292502.8333333333, ans=0.125 2024-06-21 00:20:36,013 INFO [train.py:1028] (1/2) Epoch 16, batch 7800, loss[loss=0.241, simple_loss=0.2908, pruned_loss=0.09556, over 13114.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2832, pruned_loss=0.08904, over 2577333.15 frames. ], batch size: 95, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:20:54,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=292539.5, ans=0.0 2024-06-21 00:21:10,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.60 vs. limit=15.0 2024-06-21 00:21:13,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=292576.1666666667, ans=0.125 2024-06-21 00:21:24,865 INFO [train.py:1028] (1/2) Epoch 16, batch 7850, loss[loss=0.2176, simple_loss=0.2803, pruned_loss=0.07743, over 10993.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2839, pruned_loss=0.08945, over 2571671.43 frames. ], batch size: 16, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:21:34,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=292631.1666666667, ans=0.09899494936611666 2024-06-21 00:21:39,304 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2024-06-21 00:21:41,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=292631.1666666667, ans=0.1 2024-06-21 00:21:50,068 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.016e+02 2.161e+02 2.474e+02 3.343e+02, threshold=4.322e+02, percent-clipped=0.0 2024-06-21 00:21:51,175 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:21:51,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=292649.5, ans=0.125 2024-06-21 00:21:57,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.12 vs. 
limit=6.0 2024-06-21 00:21:59,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=292667.8333333333, ans=0.0 2024-06-21 00:22:11,618 INFO [train.py:1028] (1/2) Epoch 16, batch 7900, loss[loss=0.2118, simple_loss=0.2661, pruned_loss=0.07873, over 13195.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2845, pruned_loss=0.08971, over 2571838.13 frames. ], batch size: 77, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:22:47,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=292759.5, ans=0.1 2024-06-21 00:23:05,495 INFO [train.py:1028] (1/2) Epoch 16, batch 7950, loss[loss=0.2213, simple_loss=0.2639, pruned_loss=0.08939, over 10594.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2843, pruned_loss=0.08948, over 2575184.82 frames. ], batch size: 303, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:23:13,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.76 vs. limit=15.0 2024-06-21 00:23:22,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=292814.5, ans=0.125 2024-06-21 00:23:29,931 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.063e+02 2.247e+02 2.453e+02 3.551e+02, threshold=4.494e+02, percent-clipped=0.0 2024-06-21 00:23:43,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=292851.1666666667, ans=0.125 2024-06-21 00:23:51,276 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.52 vs. limit=12.0 2024-06-21 00:23:55,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=292869.5, ans=0.0 2024-06-21 00:23:59,894 INFO [train.py:1028] (1/2) Epoch 16, batch 8000, loss[loss=0.2277, simple_loss=0.2819, pruned_loss=0.08675, over 12683.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.2856, pruned_loss=0.09003, over 2572263.96 frames. ], batch size: 29, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:24:16,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=292906.1666666667, ans=0.1 2024-06-21 00:24:17,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=15.0 2024-06-21 00:24:38,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=292961.1666666667, ans=0.125 2024-06-21 00:24:46,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2024-06-21 00:24:48,185 INFO [train.py:1028] (1/2) Epoch 16, batch 8050, loss[loss=0.2306, simple_loss=0.2856, pruned_loss=0.08777, over 13235.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2846, pruned_loss=0.08955, over 2570906.11 frames. 
], batch size: 83, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:24:59,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=292997.8333333333, ans=0.0 2024-06-21 00:25:04,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=292997.8333333333, ans=0.05 2024-06-21 00:25:12,713 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.034e+02 2.253e+02 2.433e+02 3.066e+02, threshold=4.506e+02, percent-clipped=0.0 2024-06-21 00:25:17,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=293034.5, ans=0.0 2024-06-21 00:25:20,101 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.91 vs. limit=10.0 2024-06-21 00:25:20,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=293034.5, ans=0.125 2024-06-21 00:25:22,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.66 vs. limit=12.0 2024-06-21 00:25:32,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=293052.8333333333, ans=0.125 2024-06-21 00:25:35,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=293052.8333333333, ans=0.125 2024-06-21 00:25:36,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.94 vs. limit=15.0 2024-06-21 00:25:41,859 INFO [train.py:1028] (1/2) Epoch 16, batch 8100, loss[loss=0.2289, simple_loss=0.2813, pruned_loss=0.08825, over 13143.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2848, pruned_loss=0.08963, over 2575336.28 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:26:25,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=293144.5, ans=0.035 2024-06-21 00:26:25,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=293144.5, ans=0.125 2024-06-21 00:26:26,300 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.87 vs. limit=22.5 2024-06-21 00:26:34,794 INFO [train.py:1028] (1/2) Epoch 16, batch 8150, loss[loss=0.2305, simple_loss=0.2812, pruned_loss=0.08993, over 13064.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2852, pruned_loss=0.0894, over 2579663.58 frames. 
], batch size: 121, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:26:42,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=293162.8333333333, ans=0.125 2024-06-21 00:26:48,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=293181.1666666667, ans=0.2 2024-06-21 00:26:49,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=293181.1666666667, ans=0.125 2024-06-21 00:27:00,890 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.070e+02 2.211e+02 2.375e+02 3.623e+02, threshold=4.421e+02, percent-clipped=0.0 2024-06-21 00:27:22,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=293254.5, ans=0.125 2024-06-21 00:27:23,490 INFO [train.py:1028] (1/2) Epoch 16, batch 8200, loss[loss=0.2353, simple_loss=0.2855, pruned_loss=0.09252, over 13122.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.2851, pruned_loss=0.08925, over 2583011.62 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:27:32,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2024-06-21 00:27:34,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=293272.8333333333, ans=0.125 2024-06-21 00:27:53,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.11 vs. limit=10.0 2024-06-21 00:28:03,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=293327.8333333333, ans=0.04949747468305833 2024-06-21 00:28:13,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=293327.8333333333, ans=0.0 2024-06-21 00:28:14,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=293327.8333333333, ans=0.025 2024-06-21 00:28:17,999 INFO [train.py:1028] (1/2) Epoch 16, batch 8250, loss[loss=0.2312, simple_loss=0.2957, pruned_loss=0.08332, over 13250.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.2859, pruned_loss=0.08964, over 2583429.12 frames. ], batch size: 52, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:28:50,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=293382.8333333333, ans=0.04949747468305833 2024-06-21 00:28:50,887 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.131e+02 2.293e+02 2.526e+02 3.565e+02, threshold=4.587e+02, percent-clipped=0.0 2024-06-21 00:28:52,408 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.16 vs. 
limit=15.0 2024-06-21 00:28:57,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=293401.1666666667, ans=0.0 2024-06-21 00:29:09,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293419.5, ans=0.1 2024-06-21 00:29:12,455 INFO [train.py:1028] (1/2) Epoch 16, batch 8300, loss[loss=0.2186, simple_loss=0.2703, pruned_loss=0.08342, over 13004.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2851, pruned_loss=0.0891, over 2580690.70 frames. ], batch size: 102, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:29:18,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=293437.8333333333, ans=0.0 2024-06-21 00:29:22,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=293456.1666666667, ans=0.0 2024-06-21 00:29:27,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=293456.1666666667, ans=0.125 2024-06-21 00:29:31,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=293474.5, ans=0.2 2024-06-21 00:29:42,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=293474.5, ans=0.125 2024-06-21 00:30:04,119 INFO [train.py:1028] (1/2) Epoch 16, batch 8350, loss[loss=0.2288, simple_loss=0.2816, pruned_loss=0.08805, over 13204.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2849, pruned_loss=0.08886, over 2580954.10 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:30:11,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=293529.5, ans=0.2 2024-06-21 00:30:18,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=293547.8333333333, ans=0.0 2024-06-21 00:30:18,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=293547.8333333333, ans=0.0 2024-06-21 00:30:30,527 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.134e+02 2.298e+02 2.669e+02 4.226e+02, threshold=4.596e+02, percent-clipped=0.0 2024-06-21 00:30:39,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.88 vs. limit=22.5 2024-06-21 00:30:53,677 INFO [train.py:1028] (1/2) Epoch 16, batch 8400, loss[loss=0.2225, simple_loss=0.2722, pruned_loss=0.08634, over 12936.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2854, pruned_loss=0.08926, over 2576118.47 frames. ], batch size: 39, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:30:54,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=293621.1666666667, ans=0.125 2024-06-21 00:31:02,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.92 vs. 
limit=15.0 2024-06-21 00:31:03,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=293639.5, ans=0.1 2024-06-21 00:31:13,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=293657.8333333333, ans=0.0 2024-06-21 00:31:38,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=293694.5, ans=0.125 2024-06-21 00:31:45,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.15 vs. limit=15.0 2024-06-21 00:31:46,267 INFO [train.py:1028] (1/2) Epoch 16, batch 8450, loss[loss=0.2523, simple_loss=0.3086, pruned_loss=0.09798, over 13147.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2862, pruned_loss=0.08957, over 2578191.36 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:31:58,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=293731.1666666667, ans=0.0 2024-06-21 00:32:02,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=293731.1666666667, ans=0.2 2024-06-21 00:32:10,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=293749.5, ans=0.0 2024-06-21 00:32:11,864 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.091e+02 2.234e+02 2.390e+02 3.677e+02, threshold=4.469e+02, percent-clipped=0.0 2024-06-21 00:32:36,854 INFO [train.py:1028] (1/2) Epoch 16, batch 8500, loss[loss=0.224, simple_loss=0.2737, pruned_loss=0.0871, over 12610.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2871, pruned_loss=0.08992, over 2577313.36 frames. ], batch size: 29, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:32:37,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=293804.5, ans=0.2 2024-06-21 00:32:37,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=293804.5, ans=0.125 2024-06-21 00:32:37,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=293804.5, ans=0.05 2024-06-21 00:32:43,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=293804.5, ans=0.5 2024-06-21 00:32:47,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=293822.8333333333, ans=0.125 2024-06-21 00:32:47,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293822.8333333333, ans=0.1 2024-06-21 00:33:09,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2024-06-21 00:33:15,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293859.5, ans=0.1 2024-06-21 00:33:25,746 INFO [train.py:1028] (1/2) Epoch 16, batch 8550, loss[loss=0.2157, simple_loss=0.2816, pruned_loss=0.07492, over 12726.00 frames. 
], tot_loss[loss=0.2331, simple_loss=0.2867, pruned_loss=0.08973, over 2575598.32 frames. ], batch size: 22, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:33:38,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-21 00:33:45,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.13 vs. limit=6.0 2024-06-21 00:33:48,661 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.121e+02 2.254e+02 2.627e+02 3.742e+02, threshold=4.508e+02, percent-clipped=0.0 2024-06-21 00:34:02,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=293969.5, ans=0.125 2024-06-21 00:34:11,501 INFO [train.py:1028] (1/2) Epoch 16, batch 8600, loss[loss=0.2158, simple_loss=0.2675, pruned_loss=0.082, over 13133.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.2876, pruned_loss=0.09007, over 2573683.53 frames. ], batch size: 121, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:34:39,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=294024.5, ans=0.125 2024-06-21 00:34:53,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=294042.8333333333, ans=10.0 2024-06-21 00:35:07,559 INFO [train.py:1028] (1/2) Epoch 16, batch 8650, loss[loss=0.218, simple_loss=0.2712, pruned_loss=0.08245, over 13020.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2873, pruned_loss=0.08986, over 2576156.21 frames. ], batch size: 102, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:35:11,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=294079.5, ans=0.125 2024-06-21 00:35:12,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=294079.5, ans=0.0 2024-06-21 00:35:14,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=294097.8333333333, ans=0.0 2024-06-21 00:35:21,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=294097.8333333333, ans=0.0 2024-06-21 00:35:29,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=294116.1666666667, ans=0.125 2024-06-21 00:35:32,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=294116.1666666667, ans=0.04949747468305833 2024-06-21 00:35:36,259 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.057e+02 2.180e+02 2.452e+02 3.730e+02, threshold=4.360e+02, percent-clipped=0.0 2024-06-21 00:35:37,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=294116.1666666667, ans=0.125 2024-06-21 00:35:44,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=294134.5, ans=0.125 2024-06-21 00:35:46,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=294134.5, ans=0.125 2024-06-21 00:35:57,327 INFO 
[train.py:1028] (1/2) Epoch 16, batch 8700, loss[loss=0.2423, simple_loss=0.3024, pruned_loss=0.09112, over 13169.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.2873, pruned_loss=0.09024, over 2573219.23 frames. ], batch size: 59, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:36:12,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2024-06-21 00:36:14,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=294189.5, ans=0.125 2024-06-21 00:36:15,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=294207.8333333333, ans=0.2 2024-06-21 00:36:41,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=294244.5, ans=0.025 2024-06-21 00:36:44,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=294244.5, ans=0.0 2024-06-21 00:36:46,059 INFO [train.py:1028] (1/2) Epoch 16, batch 8750, loss[loss=0.2398, simple_loss=0.2846, pruned_loss=0.09745, over 13114.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2871, pruned_loss=0.09012, over 2568380.17 frames. ], batch size: 121, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:37:16,326 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.002e+02 2.159e+02 2.431e+02 2.997e+02, threshold=4.318e+02, percent-clipped=0.0 2024-06-21 00:37:19,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.14 vs. limit=15.0 2024-06-21 00:37:29,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=294336.1666666667, ans=0.125 2024-06-21 00:37:40,168 INFO [train.py:1028] (1/2) Epoch 16, batch 8800, loss[loss=0.2313, simple_loss=0.287, pruned_loss=0.08786, over 13277.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2875, pruned_loss=0.09022, over 2574282.00 frames. ], batch size: 72, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:37:42,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.57 vs. limit=15.0 2024-06-21 00:38:01,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=294391.1666666667, ans=0.125 2024-06-21 00:38:04,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=294391.1666666667, ans=0.0 2024-06-21 00:38:16,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=294409.5, ans=0.2 2024-06-21 00:38:18,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=294409.5, ans=0.09899494936611666 2024-06-21 00:38:20,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294409.5, ans=0.1 2024-06-21 00:38:23,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.21 vs. 
limit=15.0 2024-06-21 00:38:32,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=294427.8333333333, ans=0.125 2024-06-21 00:38:36,388 INFO [train.py:1028] (1/2) Epoch 16, batch 8850, loss[loss=0.2577, simple_loss=0.3034, pruned_loss=0.106, over 12582.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2874, pruned_loss=0.0906, over 2562351.19 frames. ], batch size: 202, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:38:39,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=294446.1666666667, ans=0.0 2024-06-21 00:38:44,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=15.0 2024-06-21 00:38:48,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=294464.5, ans=0.0 2024-06-21 00:38:56,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=294482.8333333333, ans=0.125 2024-06-21 00:39:00,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=294482.8333333333, ans=0.0 2024-06-21 00:39:03,409 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 2.060e+02 2.205e+02 2.342e+02 3.169e+02, threshold=4.411e+02, percent-clipped=0.0 2024-06-21 00:39:16,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=294519.5, ans=0.125 2024-06-21 00:39:25,489 INFO [train.py:1028] (1/2) Epoch 16, batch 8900, loss[loss=0.2121, simple_loss=0.2729, pruned_loss=0.07565, over 12900.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.2881, pruned_loss=0.09076, over 2561238.38 frames. ], batch size: 33, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:39:25,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=294537.8333333333, ans=0.0 2024-06-21 00:39:40,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.03 vs. limit=15.0 2024-06-21 00:39:53,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=294574.5, ans=0.0 2024-06-21 00:39:58,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=294592.8333333333, ans=0.0 2024-06-21 00:40:20,942 INFO [train.py:1028] (1/2) Epoch 16, batch 8950, loss[loss=0.2684, simple_loss=0.3146, pruned_loss=0.1111, over 12458.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.2889, pruned_loss=0.09066, over 2560580.06 frames. ], batch size: 202, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:40:28,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=294629.5, ans=0.0 2024-06-21 00:40:29,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.48 vs. 
limit=22.5 2024-06-21 00:40:30,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=294647.8333333333, ans=0.125 2024-06-21 00:40:48,067 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.060e+02 2.227e+02 2.403e+02 3.562e+02, threshold=4.453e+02, percent-clipped=0.0 2024-06-21 00:40:58,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.16 vs. limit=15.0 2024-06-21 00:41:06,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=294702.8333333333, ans=0.09899494936611666 2024-06-21 00:41:10,586 INFO [train.py:1028] (1/2) Epoch 16, batch 9000, loss[loss=0.2175, simple_loss=0.2708, pruned_loss=0.08214, over 13340.00 frames. ], tot_loss[loss=0.235, simple_loss=0.289, pruned_loss=0.09052, over 2566301.19 frames. ], batch size: 46, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:41:10,586 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 00:41:17,953 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.5687, 2.3058, 2.5418, 2.1693], device='cuda:1') 2024-06-21 00:41:23,698 INFO [train.py:1060] (1/2) Epoch 16, validation: loss=0.1882, simple_loss=0.2528, pruned_loss=0.06174, over 351949.00 frames. 2024-06-21 00:41:23,699 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 00:41:43,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=294757.8333333333, ans=0.0 2024-06-21 00:41:55,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=294776.1666666667, ans=0.125 2024-06-21 00:42:10,638 INFO [train.py:1028] (1/2) Epoch 16, batch 9050, loss[loss=0.2275, simple_loss=0.2889, pruned_loss=0.08308, over 12368.00 frames. ], tot_loss[loss=0.236, simple_loss=0.2901, pruned_loss=0.09094, over 2567222.45 frames. ], batch size: 19, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:42:32,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=294849.5, ans=0.125 2024-06-21 00:42:33,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=294849.5, ans=0.125 2024-06-21 00:42:36,206 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.067e+02 2.207e+02 2.370e+02 2.974e+02, threshold=4.414e+02, percent-clipped=0.0 2024-06-21 00:42:42,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=294867.8333333333, ans=0.0 2024-06-21 00:42:45,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=294867.8333333333, ans=0.2 2024-06-21 00:42:47,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=294867.8333333333, ans=0.125 2024-06-21 00:42:48,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. 
limit=15.0 2024-06-21 00:42:58,970 INFO [train.py:1028] (1/2) Epoch 16, batch 9100, loss[loss=0.2449, simple_loss=0.3078, pruned_loss=0.09101, over 13250.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2896, pruned_loss=0.09069, over 2568470.74 frames. ], batch size: 72, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:43:02,891 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.12 vs. limit=15.0 2024-06-21 00:43:15,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=294922.8333333333, ans=15.0 2024-06-21 00:43:24,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=294941.1666666667, ans=0.0 2024-06-21 00:43:30,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=294959.5, ans=0.125 2024-06-21 00:43:33,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.09 vs. limit=15.0 2024-06-21 00:43:42,256 INFO [train.py:1028] (1/2) Epoch 16, batch 9150, loss[loss=0.224, simple_loss=0.2847, pruned_loss=0.08167, over 13173.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.2891, pruned_loss=0.09057, over 2570193.62 frames. ], batch size: 77, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:43:44,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2024-06-21 00:43:44,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=294996.1666666667, ans=0.125 2024-06-21 00:43:55,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=295014.5, ans=0.1 2024-06-21 00:44:02,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=295032.8333333333, ans=0.125 2024-06-21 00:44:07,835 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.083e+02 2.233e+02 2.458e+02 3.018e+02, threshold=4.467e+02, percent-clipped=0.0 2024-06-21 00:44:09,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=295032.8333333333, ans=0.0 2024-06-21 00:44:09,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.39 vs. limit=22.5 2024-06-21 00:44:14,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=295051.1666666667, ans=0.125 2024-06-21 00:44:18,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=295051.1666666667, ans=0.2 2024-06-21 00:44:18,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=295051.1666666667, ans=0.2 2024-06-21 00:44:28,707 INFO [train.py:1028] (1/2) Epoch 16, batch 9200, loss[loss=0.2291, simple_loss=0.2921, pruned_loss=0.08308, over 12952.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2888, pruned_loss=0.08995, over 2573270.47 frames. 
], batch size: 36, lr: 3.60e-03, grad_scale: 64.0 2024-06-21 00:45:01,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=295142.8333333333, ans=0.07 2024-06-21 00:45:16,713 INFO [train.py:1028] (1/2) Epoch 16, batch 9250, loss[loss=0.2407, simple_loss=0.2904, pruned_loss=0.0955, over 13201.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.2892, pruned_loss=0.08991, over 2574336.08 frames. ], batch size: 67, lr: 3.60e-03, grad_scale: 32.0 2024-06-21 00:45:28,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=295197.8333333333, ans=0.0 2024-06-21 00:45:29,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=295197.8333333333, ans=0.125 2024-06-21 00:45:35,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=295216.1666666667, ans=0.125 2024-06-21 00:45:40,565 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.038e+02 2.172e+02 2.341e+02 3.671e+02, threshold=4.344e+02, percent-clipped=0.0 2024-06-21 00:45:55,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=295252.8333333333, ans=0.125 2024-06-21 00:46:01,050 INFO [train.py:1028] (1/2) Epoch 16, batch 9300, loss[loss=0.2245, simple_loss=0.2777, pruned_loss=0.08567, over 12945.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.2894, pruned_loss=0.08985, over 2570916.29 frames. ], batch size: 39, lr: 3.60e-03, grad_scale: 32.0 2024-06-21 00:46:25,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=295307.8333333333, ans=0.125 2024-06-21 00:46:37,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=295344.5, ans=0.125 2024-06-21 00:46:41,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=295344.5, ans=0.0 2024-06-21 00:46:46,398 INFO [train.py:1028] (1/2) Epoch 16, batch 9350, loss[loss=0.2425, simple_loss=0.3034, pruned_loss=0.09077, over 12798.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.29, pruned_loss=0.09025, over 2568632.89 frames. ], batch size: 22, lr: 3.60e-03, grad_scale: 32.0 2024-06-21 00:46:49,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=295362.8333333333, ans=0.1 2024-06-21 00:46:56,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=295381.1666666667, ans=0.0 2024-06-21 00:47:02,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=295399.5, ans=0.5 2024-06-21 00:47:03,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=295399.5, ans=0.1 2024-06-21 00:47:10,197 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.054e+02 2.151e+02 2.286e+02 2.863e+02, threshold=4.301e+02, percent-clipped=0.0 2024-06-21 00:47:15,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.04 vs. 
limit=22.5 2024-06-21 00:47:18,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=295417.8333333333, ans=0.04949747468305833 2024-06-21 00:47:30,049 INFO [train.py:1028] (1/2) Epoch 16, batch 9400, loss[loss=0.2298, simple_loss=0.2911, pruned_loss=0.08423, over 13264.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2899, pruned_loss=0.09032, over 2567726.31 frames. ], batch size: 52, lr: 3.60e-03, grad_scale: 32.0 2024-06-21 00:47:32,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=295454.5, ans=0.125 2024-06-21 00:47:51,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=295491.1666666667, ans=0.025 2024-06-21 00:47:58,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=295509.5, ans=0.2 2024-06-21 00:48:14,157 INFO [train.py:1028] (1/2) Epoch 16, batch 9450, loss[loss=0.232, simple_loss=0.2851, pruned_loss=0.08946, over 12643.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.291, pruned_loss=0.0909, over 2568636.57 frames. ], batch size: 22, lr: 3.60e-03, grad_scale: 32.0 2024-06-21 00:48:16,566 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:48:29,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.60 vs. limit=22.5 2024-06-21 00:48:37,789 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.031e+02 2.129e+02 2.339e+02 3.092e+02, threshold=4.257e+02, percent-clipped=0.0 2024-06-21 00:48:49,516 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:48:57,190 INFO [train.py:1028] (1/2) Epoch 16, batch 9500, loss[loss=0.2286, simple_loss=0.2818, pruned_loss=0.0877, over 13280.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.29, pruned_loss=0.09038, over 2577440.15 frames. ], batch size: 43, lr: 3.60e-03, grad_scale: 32.0 2024-06-21 00:49:05,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=295656.1666666667, ans=10.0 2024-06-21 00:49:15,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=295656.1666666667, ans=0.2 2024-06-21 00:49:16,983 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=7.64 vs. limit=12.0 2024-06-21 00:49:32,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=295692.8333333333, ans=0.0 2024-06-21 00:49:36,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=295711.1666666667, ans=0.0 2024-06-21 00:49:45,403 INFO [train.py:1028] (1/2) Epoch 16, batch 9550, loss[loss=0.229, simple_loss=0.2914, pruned_loss=0.08333, over 12873.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.2902, pruned_loss=0.09071, over 2571962.45 frames. 
], batch size: 39, lr: 3.60e-03, grad_scale: 32.0 2024-06-21 00:49:49,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=295729.5, ans=0.0 2024-06-21 00:50:00,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.55 vs. limit=15.0 2024-06-21 00:50:01,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=295766.1666666667, ans=0.125 2024-06-21 00:50:08,392 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.091e+02 2.304e+02 2.615e+02 3.563e+02, threshold=4.608e+02, percent-clipped=0.0 2024-06-21 00:50:09,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=295784.5, ans=0.0 2024-06-21 00:50:11,196 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:50:17,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.82 vs. limit=15.0 2024-06-21 00:50:23,784 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:50:27,183 INFO [train.py:1028] (1/2) Epoch 16, batch 9600, loss[loss=0.2506, simple_loss=0.2922, pruned_loss=0.1045, over 10367.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.29, pruned_loss=0.09074, over 2570570.36 frames. ], batch size: 303, lr: 3.60e-03, grad_scale: 32.0 2024-06-21 00:50:31,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=295821.1666666667, ans=0.2 2024-06-21 00:50:42,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=295839.5, ans=0.05 2024-06-21 00:50:52,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=295857.8333333333, ans=0.2 2024-06-21 00:50:56,343 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2024-06-21 00:50:58,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=295876.1666666667, ans=0.0 2024-06-21 00:50:59,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=295876.1666666667, ans=0.0 2024-06-21 00:51:00,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=295876.1666666667, ans=0.125 2024-06-21 00:51:08,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=295894.5, ans=0.125 2024-06-21 00:51:13,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.83 vs. limit=15.0 2024-06-21 00:51:13,483 INFO [train.py:1028] (1/2) Epoch 16, batch 9650, loss[loss=0.2198, simple_loss=0.2724, pruned_loss=0.0836, over 13111.00 frames. ], tot_loss[loss=0.236, simple_loss=0.2898, pruned_loss=0.09108, over 2561642.53 frames. 
], batch size: 132, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:51:23,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=295931.1666666667, ans=0.125 2024-06-21 00:51:36,880 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.111e+02 2.278e+02 2.540e+02 3.381e+02, threshold=4.555e+02, percent-clipped=0.0 2024-06-21 00:51:39,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=295967.8333333333, ans=0.0 2024-06-21 00:51:41,495 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.22 vs. limit=15.0 2024-06-21 00:51:50,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=295986.1666666667, ans=0.125 2024-06-21 00:51:53,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=295986.1666666667, ans=0.025 2024-06-21 00:51:56,236 INFO [train.py:1028] (1/2) Epoch 16, batch 9700, loss[loss=0.2451, simple_loss=0.2947, pruned_loss=0.09776, over 13067.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.29, pruned_loss=0.09125, over 2555141.34 frames. ], batch size: 144, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:52:20,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=296041.1666666667, ans=0.0 2024-06-21 00:52:26,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=296059.5, ans=0.0 2024-06-21 00:52:28,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=296059.5, ans=0.0 2024-06-21 00:52:28,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=296059.5, ans=0.0 2024-06-21 00:52:32,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=296059.5, ans=10.0 2024-06-21 00:52:39,808 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.83 vs. limit=15.0 2024-06-21 00:52:41,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=296077.8333333333, ans=0.125 2024-06-21 00:52:43,314 INFO [train.py:1028] (1/2) Epoch 16, batch 9750, loss[loss=0.2467, simple_loss=0.2954, pruned_loss=0.09897, over 13077.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.2886, pruned_loss=0.09025, over 2551627.08 frames. 
], batch size: 132, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:52:43,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=296096.1666666667, ans=0.125 2024-06-21 00:52:52,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=296114.5, ans=0.025 2024-06-21 00:53:01,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=296132.8333333333, ans=0.125 2024-06-21 00:53:05,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=296132.8333333333, ans=0.2 2024-06-21 00:53:06,425 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.052e+02 2.160e+02 2.428e+02 3.667e+02, threshold=4.320e+02, percent-clipped=0.0 2024-06-21 00:53:13,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296151.1666666667, ans=0.1 2024-06-21 00:53:23,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=296169.5, ans=0.0 2024-06-21 00:53:24,553 INFO [train.py:1028] (1/2) Epoch 16, batch 9800, loss[loss=0.2185, simple_loss=0.2717, pruned_loss=0.08268, over 12861.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2877, pruned_loss=0.08968, over 2545697.67 frames. ], batch size: 39, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:53:29,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296187.8333333333, ans=0.1 2024-06-21 00:53:35,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=296206.1666666667, ans=0.0 2024-06-21 00:53:36,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.97 vs. limit=22.5 2024-06-21 00:53:44,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2024-06-21 00:53:46,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=296224.5, ans=0.0 2024-06-21 00:53:50,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=296242.8333333333, ans=0.0 2024-06-21 00:54:07,020 INFO [train.py:1028] (1/2) Epoch 16, batch 9850, loss[loss=0.2451, simple_loss=0.2958, pruned_loss=0.09725, over 13031.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.2875, pruned_loss=0.08954, over 2537068.78 frames. 
], batch size: 102, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:54:08,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=296279.5, ans=0.1 2024-06-21 00:54:21,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=296297.8333333333, ans=0.125 2024-06-21 00:54:31,711 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.050e+02 2.180e+02 2.433e+02 3.497e+02, threshold=4.359e+02, percent-clipped=0.0 2024-06-21 00:54:32,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=296334.5, ans=0.015 2024-06-21 00:54:33,639 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.951e+01 2024-06-21 00:54:41,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=296352.8333333333, ans=0.125 2024-06-21 00:54:45,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296352.8333333333, ans=0.1 2024-06-21 00:54:50,355 INFO [train.py:1028] (1/2) Epoch 16, batch 9900, loss[loss=0.216, simple_loss=0.2681, pruned_loss=0.08194, over 12898.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2872, pruned_loss=0.09, over 2529854.59 frames. ], batch size: 39, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:54:50,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=296371.1666666667, ans=0.0 2024-06-21 00:54:58,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=296389.5, ans=0.125 2024-06-21 00:55:08,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=296407.8333333333, ans=0.07 2024-06-21 00:55:22,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=296444.5, ans=0.125 2024-06-21 00:55:31,503 INFO [train.py:1028] (1/2) Epoch 16, batch 9950, loss[loss=0.2516, simple_loss=0.2991, pruned_loss=0.1021, over 12554.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.2862, pruned_loss=0.09017, over 2522395.82 frames. ], batch size: 29, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:55:37,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=296462.8333333333, ans=0.125 2024-06-21 00:55:43,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=296481.1666666667, ans=0.125 2024-06-21 00:55:50,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=296499.5, ans=0.2 2024-06-21 00:55:53,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=296499.5, ans=0.0 2024-06-21 00:55:56,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.17 vs. 
limit=22.5 2024-06-21 00:55:57,112 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.028e+02 2.184e+02 2.354e+02 3.237e+02, threshold=4.367e+02, percent-clipped=0.0 2024-06-21 00:56:16,300 INFO [train.py:1028] (1/2) Epoch 16, batch 10000, loss[loss=0.2096, simple_loss=0.2807, pruned_loss=0.06918, over 12482.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2866, pruned_loss=0.09073, over 2486364.27 frames. ], batch size: 22, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:56:27,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.66 vs. limit=12.0 2024-06-21 00:56:30,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=296572.8333333333, ans=0.0 2024-06-21 00:56:47,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=296609.5, ans=0.025 2024-06-21 00:56:47,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296609.5, ans=0.1 2024-06-21 00:57:00,197 INFO [train.py:1028] (1/2) Epoch 16, batch 10050, loss[loss=0.2571, simple_loss=0.3092, pruned_loss=0.1025, over 12396.00 frames. ], tot_loss[loss=0.235, simple_loss=0.2869, pruned_loss=0.0915, over 2443580.48 frames. ], batch size: 22, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:57:01,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=296646.1666666667, ans=0.0 2024-06-21 00:57:14,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=296664.5, ans=0.125 2024-06-21 00:57:18,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.96 vs. limit=10.0 2024-06-21 00:57:20,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=296682.8333333333, ans=0.1 2024-06-21 00:57:23,600 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.106e+02 2.262e+02 2.542e+02 4.440e+02, threshold=4.524e+02, percent-clipped=1.0 2024-06-21 00:57:23,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=296682.8333333333, ans=0.0 2024-06-21 00:57:30,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=296701.1666666667, ans=0.0 2024-06-21 00:57:32,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=8.55 vs. limit=12.0 2024-06-21 00:57:40,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=296719.5, ans=0.035 2024-06-21 00:57:44,041 INFO [train.py:1028] (1/2) Epoch 16, batch 10100, loss[loss=0.2087, simple_loss=0.2617, pruned_loss=0.07787, over 11120.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.2857, pruned_loss=0.09007, over 2424008.81 frames. 
], batch size: 16, lr: 3.59e-03, grad_scale: 32.0 2024-06-21 00:57:45,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=296737.8333333333, ans=0.125 2024-06-21 00:57:46,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=296737.8333333333, ans=0.0 2024-06-21 01:00:56,107 INFO [train.py:1028] (1/2) Epoch 17, batch 0, loss[loss=0.2014, simple_loss=0.2563, pruned_loss=0.07326, over 12930.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2563, pruned_loss=0.07326, over 12930.00 frames. ], batch size: 36, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:00:56,108 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 01:01:02,180 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([6.7348, 5.8715, 6.3093, 6.0947], device='cuda:1') 2024-06-21 01:01:04,980 INFO [train.py:1060] (1/2) Epoch 17, validation: loss=0.1896, simple_loss=0.255, pruned_loss=0.06204, over 351949.00 frames. 2024-06-21 01:01:04,981 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 01:01:40,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296825.8333333333, ans=0.1 2024-06-21 01:01:53,373 INFO [train.py:1028] (1/2) Epoch 17, batch 50, loss[loss=0.1998, simple_loss=0.2567, pruned_loss=0.0714, over 12557.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2673, pruned_loss=0.08158, over 574192.72 frames. ], batch size: 29, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:02:01,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 1.942e+02 2.093e+02 2.276e+02 2.812e+02, threshold=4.187e+02, percent-clipped=0.0 2024-06-21 01:02:03,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=296880.8333333333, ans=0.2 2024-06-21 01:02:14,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296899.1666666667, ans=0.1 2024-06-21 01:02:19,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.65 vs. limit=22.5 2024-06-21 01:02:21,652 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=8.0 2024-06-21 01:02:22,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=296917.5, ans=0.125 2024-06-21 01:02:27,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=296935.8333333333, ans=0.125 2024-06-21 01:02:29,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=296935.8333333333, ans=0.0 2024-06-21 01:02:36,394 INFO [train.py:1028] (1/2) Epoch 17, batch 100, loss[loss=0.2189, simple_loss=0.2756, pruned_loss=0.08109, over 13329.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2655, pruned_loss=0.08109, over 1016846.86 frames. 
], batch size: 46, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:02:44,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=296972.5, ans=0.0 2024-06-21 01:02:44,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=296972.5, ans=0.125 2024-06-21 01:03:04,788 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:03:10,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=297009.1666666667, ans=0.0 2024-06-21 01:03:29,056 INFO [train.py:1028] (1/2) Epoch 17, batch 150, loss[loss=0.2131, simple_loss=0.2658, pruned_loss=0.08022, over 12543.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2662, pruned_loss=0.08096, over 1364837.12 frames. ], batch size: 29, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:03:38,114 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.949e+02 2.067e+02 2.180e+02 2.825e+02, threshold=4.135e+02, percent-clipped=0.0 2024-06-21 01:03:44,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.33 vs. limit=15.0 2024-06-21 01:03:54,564 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.02 vs. limit=22.5 2024-06-21 01:04:05,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=15.0 2024-06-21 01:04:07,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=297119.1666666667, ans=0.125 2024-06-21 01:04:08,068 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.64 vs. limit=22.5 2024-06-21 01:04:11,342 INFO [train.py:1028] (1/2) Epoch 17, batch 200, loss[loss=0.2403, simple_loss=0.2817, pruned_loss=0.09943, over 12598.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2665, pruned_loss=0.08135, over 1634185.31 frames. ], batch size: 202, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:04:15,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=297137.5, ans=0.125 2024-06-21 01:04:19,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297137.5, ans=0.1 2024-06-21 01:04:22,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2024-06-21 01:04:23,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=297155.8333333333, ans=10.0 2024-06-21 01:04:35,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.28 vs. 
limit=15.0 2024-06-21 01:04:41,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=297174.1666666667, ans=0.125 2024-06-21 01:04:44,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297192.5, ans=0.1 2024-06-21 01:04:56,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=297210.8333333333, ans=0.2 2024-06-21 01:04:59,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=297210.8333333333, ans=0.125 2024-06-21 01:05:02,493 INFO [train.py:1028] (1/2) Epoch 17, batch 250, loss[loss=0.2118, simple_loss=0.2498, pruned_loss=0.08694, over 13041.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2665, pruned_loss=0.08173, over 1845224.36 frames. ], batch size: 144, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:05:11,891 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 2.034e+02 2.201e+02 2.360e+02 3.228e+02, threshold=4.403e+02, percent-clipped=0.0 2024-06-21 01:05:17,151 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.38 vs. limit=15.0 2024-06-21 01:05:17,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=297247.5, ans=0.05 2024-06-21 01:05:29,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297284.1666666667, ans=0.1 2024-06-21 01:05:30,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.77 vs. limit=15.0 2024-06-21 01:05:44,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=297302.5, ans=0.125 2024-06-21 01:05:44,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=297302.5, ans=0.125 2024-06-21 01:05:46,195 INFO [train.py:1028] (1/2) Epoch 17, batch 300, loss[loss=0.2268, simple_loss=0.2669, pruned_loss=0.09333, over 13189.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2658, pruned_loss=0.0812, over 2008788.49 frames. ], batch size: 112, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:06:07,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=297357.5, ans=0.125 2024-06-21 01:06:36,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=297394.1666666667, ans=0.125 2024-06-21 01:06:38,552 INFO [train.py:1028] (1/2) Epoch 17, batch 350, loss[loss=0.2128, simple_loss=0.2736, pruned_loss=0.07599, over 12947.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2653, pruned_loss=0.08081, over 2138429.72 frames. 
], batch size: 33, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:06:38,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=297412.5, ans=0.0 2024-06-21 01:06:45,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=297412.5, ans=0.125 2024-06-21 01:06:47,640 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 1.967e+02 2.096e+02 2.286e+02 3.034e+02, threshold=4.193e+02, percent-clipped=0.0 2024-06-21 01:07:10,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=297467.5, ans=0.025 2024-06-21 01:07:15,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=297485.8333333333, ans=0.1 2024-06-21 01:07:16,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=297485.8333333333, ans=10.0 2024-06-21 01:07:17,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=297485.8333333333, ans=0.125 2024-06-21 01:07:29,651 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=22.5 2024-06-21 01:07:30,078 INFO [train.py:1028] (1/2) Epoch 17, batch 400, loss[loss=0.2066, simple_loss=0.2691, pruned_loss=0.07203, over 13299.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.265, pruned_loss=0.0802, over 2239619.62 frames. ], batch size: 63, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:07:30,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=297504.1666666667, ans=0.2 2024-06-21 01:07:30,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=297504.1666666667, ans=0.2 2024-06-21 01:07:30,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.90 vs. limit=15.0 2024-06-21 01:07:57,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=297559.1666666667, ans=0.125 2024-06-21 01:08:10,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=297577.5, ans=0.125 2024-06-21 01:08:16,001 INFO [train.py:1028] (1/2) Epoch 17, batch 450, loss[loss=0.2247, simple_loss=0.2812, pruned_loss=0.08412, over 13215.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2652, pruned_loss=0.08016, over 2315215.21 frames. 
], batch size: 67, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:08:24,591 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.955e+02 2.055e+02 2.174e+02 2.775e+02, threshold=4.109e+02, percent-clipped=0.0 2024-06-21 01:08:51,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=297669.1666666667, ans=0.025 2024-06-21 01:08:52,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=297669.1666666667, ans=15.0 2024-06-21 01:08:57,547 INFO [train.py:1028] (1/2) Epoch 17, batch 500, loss[loss=0.1982, simple_loss=0.2493, pruned_loss=0.07355, over 13106.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2655, pruned_loss=0.08004, over 2377000.65 frames. ], batch size: 121, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:08:58,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=297687.5, ans=0.025 2024-06-21 01:09:01,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=297687.5, ans=0.2 2024-06-21 01:09:21,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=297724.1666666667, ans=0.125 2024-06-21 01:09:33,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297760.8333333333, ans=0.1 2024-06-21 01:09:41,846 INFO [train.py:1028] (1/2) Epoch 17, batch 550, loss[loss=0.2011, simple_loss=0.2476, pruned_loss=0.07729, over 12953.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2656, pruned_loss=0.08022, over 2421270.25 frames. ], batch size: 158, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:09:45,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=297779.1666666667, ans=0.0 2024-06-21 01:09:50,764 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.928e+02 2.028e+02 2.198e+02 3.160e+02, threshold=4.057e+02, percent-clipped=0.0 2024-06-21 01:09:59,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297815.8333333333, ans=0.1 2024-06-21 01:10:17,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.98 vs. limit=6.0 2024-06-21 01:10:20,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=297852.5, ans=0.125 2024-06-21 01:10:23,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=297852.5, ans=0.0 2024-06-21 01:10:25,815 INFO [train.py:1028] (1/2) Epoch 17, batch 600, loss[loss=0.2048, simple_loss=0.25, pruned_loss=0.07979, over 13032.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2656, pruned_loss=0.08019, over 2459054.89 frames. 
], batch size: 144, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:10:58,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=297944.1666666667, ans=0.125 2024-06-21 01:11:00,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=297944.1666666667, ans=0.125 2024-06-21 01:11:04,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0 2024-06-21 01:11:07,366 INFO [train.py:1028] (1/2) Epoch 17, batch 650, loss[loss=0.2141, simple_loss=0.2659, pruned_loss=0.08118, over 13219.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2651, pruned_loss=0.07964, over 2489859.31 frames. ], batch size: 59, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:11:17,236 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.020e+02 2.118e+02 2.265e+02 2.937e+02, threshold=4.235e+02, percent-clipped=0.0 2024-06-21 01:11:31,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=297999.1666666667, ans=0.1 2024-06-21 01:11:35,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=298017.5, ans=0.125 2024-06-21 01:11:47,736 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:11:52,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298054.1666666667, ans=0.1 2024-06-21 01:11:52,901 INFO [train.py:1028] (1/2) Epoch 17, batch 700, loss[loss=0.1976, simple_loss=0.2548, pruned_loss=0.07024, over 13186.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2649, pruned_loss=0.07974, over 2512696.77 frames. ], batch size: 46, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:12:13,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=298072.5, ans=0.0 2024-06-21 01:12:24,257 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.947e-02 2024-06-21 01:12:25,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=298109.1666666667, ans=0.09899494936611666 2024-06-21 01:12:40,397 INFO [train.py:1028] (1/2) Epoch 17, batch 750, loss[loss=0.2235, simple_loss=0.2773, pruned_loss=0.0848, over 13278.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2648, pruned_loss=0.07949, over 2527382.45 frames. ], batch size: 63, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:12:40,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=298145.8333333333, ans=0.125 2024-06-21 01:12:41,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=298145.8333333333, ans=0.125 2024-06-21 01:12:49,367 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 1.954e+02 2.080e+02 2.263e+02 2.989e+02, threshold=4.161e+02, percent-clipped=0.0 2024-06-21 01:13:14,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.06 vs. 
limit=15.0 2024-06-21 01:13:17,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=298200.8333333333, ans=0.5 2024-06-21 01:13:20,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=298200.8333333333, ans=0.0 2024-06-21 01:13:32,255 INFO [train.py:1028] (1/2) Epoch 17, batch 800, loss[loss=0.1994, simple_loss=0.2584, pruned_loss=0.07023, over 12925.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2645, pruned_loss=0.07919, over 2541679.83 frames. ], batch size: 36, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:13:37,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=298237.5, ans=0.2 2024-06-21 01:13:41,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=298255.8333333333, ans=0.125 2024-06-21 01:13:45,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=298255.8333333333, ans=0.125 2024-06-21 01:14:10,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=298310.8333333333, ans=0.0 2024-06-21 01:14:15,468 INFO [train.py:1028] (1/2) Epoch 17, batch 850, loss[loss=0.1981, simple_loss=0.2552, pruned_loss=0.07044, over 13106.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2643, pruned_loss=0.07928, over 2552451.49 frames. ], batch size: 95, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:14:17,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=298329.1666666667, ans=0.0 2024-06-21 01:14:22,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=298347.5, ans=0.0 2024-06-21 01:14:23,051 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 1.966e+02 2.081e+02 2.235e+02 2.672e+02, threshold=4.162e+02, percent-clipped=0.0 2024-06-21 01:14:32,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=298365.8333333333, ans=0.125 2024-06-21 01:14:34,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=298365.8333333333, ans=0.05 2024-06-21 01:14:37,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=298365.8333333333, ans=0.0 2024-06-21 01:14:41,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=298384.1666666667, ans=0.125 2024-06-21 01:14:42,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=298384.1666666667, ans=0.0 2024-06-21 01:14:56,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=298402.5, ans=0.125 2024-06-21 01:15:02,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=298402.5, ans=0.2 2024-06-21 01:15:03,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.67 vs. 
limit=22.5 2024-06-21 01:15:04,988 INFO [train.py:1028] (1/2) Epoch 17, batch 900, loss[loss=0.2153, simple_loss=0.2634, pruned_loss=0.08362, over 13032.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2644, pruned_loss=0.0796, over 2557537.24 frames. ], batch size: 36, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:15:11,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=298420.8333333333, ans=0.125 2024-06-21 01:15:16,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=298439.1666666667, ans=0.0 2024-06-21 01:15:16,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.75 vs. limit=22.5 2024-06-21 01:15:23,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=298457.5, ans=0.125 2024-06-21 01:15:24,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=298457.5, ans=0.1 2024-06-21 01:15:25,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=298457.5, ans=0.1 2024-06-21 01:15:26,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=298457.5, ans=0.95 2024-06-21 01:15:56,854 INFO [train.py:1028] (1/2) Epoch 17, batch 950, loss[loss=0.2107, simple_loss=0.2721, pruned_loss=0.07461, over 12852.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2642, pruned_loss=0.07948, over 2561091.83 frames. ], batch size: 39, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:15:59,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=298512.5, ans=0.0 2024-06-21 01:16:01,146 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=15.0 2024-06-21 01:16:05,981 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.908e+02 2.011e+02 2.199e+02 3.008e+02, threshold=4.023e+02, percent-clipped=0.0 2024-06-21 01:16:09,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=298530.8333333333, ans=0.125 2024-06-21 01:16:32,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=298567.5, ans=0.025 2024-06-21 01:16:36,383 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.43 vs. limit=22.5 2024-06-21 01:16:43,724 INFO [train.py:1028] (1/2) Epoch 17, batch 1000, loss[loss=0.2309, simple_loss=0.2867, pruned_loss=0.08755, over 13050.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2646, pruned_loss=0.07998, over 2562795.58 frames. 
], batch size: 48, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:16:47,279 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:16:50,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=298604.1666666667, ans=0.125 2024-06-21 01:16:52,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=298622.5, ans=0.125 2024-06-21 01:16:54,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=298622.5, ans=0.125 2024-06-21 01:17:07,211 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.54 vs. limit=15.0 2024-06-21 01:17:16,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=298659.1666666667, ans=0.125 2024-06-21 01:17:26,223 INFO [train.py:1028] (1/2) Epoch 17, batch 1050, loss[loss=0.1939, simple_loss=0.2546, pruned_loss=0.06656, over 13223.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2657, pruned_loss=0.08023, over 2564058.54 frames. ], batch size: 77, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:17:31,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=298695.8333333333, ans=0.0 2024-06-21 01:17:37,445 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 1.973e+02 2.061e+02 2.394e+02 3.486e+02, threshold=4.123e+02, percent-clipped=0.0 2024-06-21 01:17:58,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=298750.8333333333, ans=0.125 2024-06-21 01:17:58,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.09 vs. limit=15.0 2024-06-21 01:17:59,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=298750.8333333333, ans=22.5 2024-06-21 01:18:00,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.72 vs. limit=15.0 2024-06-21 01:18:08,697 INFO [train.py:1028] (1/2) Epoch 17, batch 1100, loss[loss=0.1992, simple_loss=0.2553, pruned_loss=0.07151, over 13317.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2654, pruned_loss=0.08034, over 2569127.53 frames. ], batch size: 52, lr: 3.47e-03, grad_scale: 32.0 2024-06-21 01:18:13,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=298787.5, ans=0.1 2024-06-21 01:18:16,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.75 vs. 
limit=10.0 2024-06-21 01:18:18,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=298805.8333333333, ans=0.125 2024-06-21 01:18:27,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=298824.1666666667, ans=0.125 2024-06-21 01:18:28,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298824.1666666667, ans=0.1 2024-06-21 01:18:36,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=298842.5, ans=0.0 2024-06-21 01:18:39,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=298842.5, ans=0.025 2024-06-21 01:18:53,921 INFO [train.py:1028] (1/2) Epoch 17, batch 1150, loss[loss=0.2187, simple_loss=0.272, pruned_loss=0.08267, over 13245.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2655, pruned_loss=0.08044, over 2570011.45 frames. ], batch size: 52, lr: 3.47e-03, grad_scale: 64.0 2024-06-21 01:18:58,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=298879.1666666667, ans=0.025 2024-06-21 01:19:01,706 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2024-06-21 01:19:02,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=298897.5, ans=0.2 2024-06-21 01:19:02,841 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.979e+02 2.080e+02 2.298e+02 3.092e+02, threshold=4.160e+02, percent-clipped=0.0 2024-06-21 01:19:11,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=298915.8333333333, ans=0.025 2024-06-21 01:19:34,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=298952.5, ans=0.125 2024-06-21 01:19:38,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2024-06-21 01:19:39,473 INFO [train.py:1028] (1/2) Epoch 17, batch 1200, loss[loss=0.2081, simple_loss=0.2682, pruned_loss=0.074, over 13210.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2656, pruned_loss=0.08063, over 2572817.24 frames. ], batch size: 77, lr: 3.47e-03, grad_scale: 64.0 2024-06-21 01:20:16,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0 2024-06-21 01:20:19,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=299025.8333333333, ans=0.125 2024-06-21 01:20:29,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.28 vs. limit=8.0 2024-06-21 01:20:31,197 INFO [train.py:1028] (1/2) Epoch 17, batch 1250, loss[loss=0.2045, simple_loss=0.2479, pruned_loss=0.08061, over 13155.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2644, pruned_loss=0.07992, over 2581761.04 frames. 
], batch size: 112, lr: 3.47e-03, grad_scale: 64.0 2024-06-21 01:20:32,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=299062.5, ans=0.0 2024-06-21 01:20:38,709 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 1.924e+02 2.117e+02 2.280e+02 3.067e+02, threshold=4.233e+02, percent-clipped=0.0 2024-06-21 01:20:39,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=299080.8333333333, ans=0.125 2024-06-21 01:20:40,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=299080.8333333333, ans=0.125 2024-06-21 01:20:48,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=299099.1666666667, ans=0.2 2024-06-21 01:21:14,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=299135.8333333333, ans=0.0 2024-06-21 01:21:20,281 INFO [train.py:1028] (1/2) Epoch 17, batch 1300, loss[loss=0.2137, simple_loss=0.2677, pruned_loss=0.07982, over 12736.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2649, pruned_loss=0.08024, over 2582009.01 frames. ], batch size: 176, lr: 3.47e-03, grad_scale: 64.0 2024-06-21 01:21:24,606 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:21:25,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=299154.1666666667, ans=0.0 2024-06-21 01:21:41,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=299190.8333333333, ans=0.125 2024-06-21 01:21:44,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.47 vs. limit=15.0 2024-06-21 01:22:04,641 INFO [train.py:1028] (1/2) Epoch 17, batch 1350, loss[loss=0.2191, simple_loss=0.2822, pruned_loss=0.078, over 13223.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2647, pruned_loss=0.08, over 2583432.30 frames. ], batch size: 59, lr: 3.47e-03, grad_scale: 64.0 2024-06-21 01:22:12,947 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 1.951e+02 2.065e+02 2.211e+02 2.588e+02, threshold=4.131e+02, percent-clipped=0.0 2024-06-21 01:22:35,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2024-06-21 01:22:40,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.55 vs. limit=15.0 2024-06-21 01:22:40,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=299300.8333333333, ans=0.125 2024-06-21 01:22:46,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=299319.1666666667, ans=0.04949747468305833 2024-06-21 01:22:48,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.63 vs. 
limit=15.0 2024-06-21 01:22:50,969 INFO [train.py:1028] (1/2) Epoch 17, batch 1400, loss[loss=0.2239, simple_loss=0.2816, pruned_loss=0.0831, over 12275.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2651, pruned_loss=0.08024, over 2584518.08 frames. ], batch size: 25, lr: 3.47e-03, grad_scale: 64.0 2024-06-21 01:22:58,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=299337.5, ans=0.125 2024-06-21 01:22:59,870 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.60 vs. limit=15.0 2024-06-21 01:23:30,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=12.0 2024-06-21 01:23:34,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=299410.8333333333, ans=0.1 2024-06-21 01:23:39,024 INFO [train.py:1028] (1/2) Epoch 17, batch 1450, loss[loss=0.1925, simple_loss=0.2388, pruned_loss=0.07309, over 13116.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2655, pruned_loss=0.08041, over 2584785.26 frames. ], batch size: 121, lr: 3.47e-03, grad_scale: 64.0 2024-06-21 01:23:41,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=299429.1666666667, ans=0.0 2024-06-21 01:23:48,520 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 1.943e+02 2.028e+02 2.165e+02 3.264e+02, threshold=4.056e+02, percent-clipped=0.0 2024-06-21 01:24:09,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=299465.8333333333, ans=0.0 2024-06-21 01:24:10,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=299465.8333333333, ans=0.125 2024-06-21 01:24:17,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=299484.1666666667, ans=0.125 2024-06-21 01:24:19,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299484.1666666667, ans=0.1 2024-06-21 01:24:30,175 INFO [train.py:1028] (1/2) Epoch 17, batch 1500, loss[loss=0.1961, simple_loss=0.2477, pruned_loss=0.07228, over 13187.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.265, pruned_loss=0.08032, over 2588151.31 frames. ], batch size: 83, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:24:30,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. limit=6.0 2024-06-21 01:24:40,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.42 vs. limit=12.0 2024-06-21 01:24:41,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.73 vs. limit=15.0 2024-06-21 01:24:45,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=299539.1666666667, ans=0.125 2024-06-21 01:24:47,435 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.87 vs. 
limit=22.5 2024-06-21 01:24:52,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=299557.5, ans=0.125 2024-06-21 01:25:03,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=299594.1666666667, ans=0.125 2024-06-21 01:25:05,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=299594.1666666667, ans=0.125 2024-06-21 01:25:05,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=299594.1666666667, ans=0.125 2024-06-21 01:25:11,266 INFO [train.py:1028] (1/2) Epoch 17, batch 1550, loss[loss=0.2281, simple_loss=0.277, pruned_loss=0.08958, over 13056.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2655, pruned_loss=0.08058, over 2583117.77 frames. ], batch size: 102, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:25:12,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=299612.5, ans=0.125 2024-06-21 01:25:12,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=299612.5, ans=0.1 2024-06-21 01:25:16,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=299612.5, ans=0.125 2024-06-21 01:25:17,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2024-06-21 01:25:18,958 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 1.934e+02 2.071e+02 2.258e+02 2.984e+02, threshold=4.143e+02, percent-clipped=0.0 2024-06-21 01:25:22,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=299630.8333333333, ans=0.125 2024-06-21 01:25:22,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299630.8333333333, ans=0.1 2024-06-21 01:25:30,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299649.1666666667, ans=0.1 2024-06-21 01:25:42,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=15.0 2024-06-21 01:25:53,948 INFO [train.py:1028] (1/2) Epoch 17, batch 1600, loss[loss=0.2059, simple_loss=0.2577, pruned_loss=0.07706, over 13149.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2657, pruned_loss=0.08055, over 2579176.27 frames. ], batch size: 77, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:25:54,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=299704.1666666667, ans=0.125 2024-06-21 01:25:59,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.60 vs. limit=15.0 2024-06-21 01:26:26,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.28 vs. 
limit=15.0 2024-06-21 01:26:27,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=299759.1666666667, ans=0.125 2024-06-21 01:26:32,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=299777.5, ans=0.0 2024-06-21 01:26:38,302 INFO [train.py:1028] (1/2) Epoch 17, batch 1650, loss[loss=0.2164, simple_loss=0.2617, pruned_loss=0.08556, over 13218.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2659, pruned_loss=0.08084, over 2574841.00 frames. ], batch size: 95, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:26:39,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=299795.8333333333, ans=0.07 2024-06-21 01:26:44,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=299795.8333333333, ans=0.2 2024-06-21 01:26:47,047 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 1.984e+02 2.112e+02 2.272e+02 2.758e+02, threshold=4.224e+02, percent-clipped=0.0 2024-06-21 01:26:57,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=299832.5, ans=0.05 2024-06-21 01:27:26,076 INFO [train.py:1028] (1/2) Epoch 17, batch 1700, loss[loss=0.2176, simple_loss=0.2794, pruned_loss=0.07784, over 12817.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2665, pruned_loss=0.08095, over 2579735.43 frames. ], batch size: 26, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:27:32,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.59 vs. limit=15.0 2024-06-21 01:27:35,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299905.8333333333, ans=0.1 2024-06-21 01:27:41,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299905.8333333333, ans=0.1 2024-06-21 01:27:45,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=299924.1666666667, ans=0.125 2024-06-21 01:27:48,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=299924.1666666667, ans=0.04949747468305833 2024-06-21 01:27:59,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=299942.5, ans=0.0 2024-06-21 01:28:03,004 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:28:10,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=299960.8333333333, ans=0.125 2024-06-21 01:28:14,622 INFO [train.py:1028] (1/2) Epoch 17, batch 1750, loss[loss=0.2031, simple_loss=0.2661, pruned_loss=0.07008, over 12453.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2668, pruned_loss=0.08092, over 2580819.45 frames. 
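The `ScheduledFloat` entries that dominate this log (skip rates, balancer probabilities, dropout p) are per-module hyperparameters that vary with the global `batch_count`. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; `PiecewiseLinearSchedule` and its breakpoints are illustrative names and numbers, not icefall's actual `ScheduledFloat` API:

```python
from bisect import bisect_right

class PiecewiseLinearSchedule:
    """A float hyperparameter keyed on the global batch count, linearly
    interpolated between sorted (batch_count, value) breakpoints."""
    def __init__(self, *points):
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a skip rate decaying from 0.5 to 0.0 over the first 20k batches
# (breakpoints assumed): by batch_count ~299k it has long since reached
# 0.0, matching the ans=0.0 values logged for *_skip_rate above.
ff3_skip_rate = PiecewiseLinearSchedule((0.0, 0.5), (20000.0, 0.0))
print(ff3_skip_rate(299154.17))  # -> 0.0
```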
], batch size: 22, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:28:21,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=299997.5, ans=0.125 2024-06-21 01:28:22,430 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.956e+02 2.039e+02 2.211e+02 2.782e+02, threshold=4.079e+02, percent-clipped=0.0 2024-06-21 01:28:42,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=300015.8333333333, ans=0.125 2024-06-21 01:28:50,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=300034.1666666667, ans=0.04949747468305833 2024-06-21 01:28:51,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=300034.1666666667, ans=0.0 2024-06-21 01:29:05,032 INFO [train.py:1028] (1/2) Epoch 17, batch 1800, loss[loss=0.2147, simple_loss=0.2685, pruned_loss=0.08049, over 13224.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2672, pruned_loss=0.08118, over 2581208.41 frames. ], batch size: 67, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:29:05,698 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.73 vs. limit=15.0 2024-06-21 01:29:49,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=300144.1666666667, ans=0.1 2024-06-21 01:29:57,098 INFO [train.py:1028] (1/2) Epoch 17, batch 1850, loss[loss=0.2156, simple_loss=0.2678, pruned_loss=0.08171, over 13203.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2669, pruned_loss=0.0809, over 2582985.63 frames. ], batch size: 83, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:30:06,671 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.968e+02 2.072e+02 2.207e+02 3.153e+02, threshold=4.145e+02, percent-clipped=0.0 2024-06-21 01:30:18,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=300199.1666666667, ans=0.1 2024-06-21 01:30:34,598 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.37 vs. limit=22.5 2024-06-21 01:30:38,628 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.58 vs. limit=6.0 2024-06-21 01:30:43,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=300254.1666666667, ans=0.0 2024-06-21 01:30:44,528 INFO [train.py:1028] (1/2) Epoch 17, batch 1900, loss[loss=0.2053, simple_loss=0.2573, pruned_loss=0.0766, over 13164.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2661, pruned_loss=0.08092, over 2585178.50 frames. 
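The `Whitening:` lines compare a per-module statistic against a limit (`metric=6.73 vs. limit=15.0` and similar above). One plausible reading, sketched here as an assumption rather than icefall's exact formula: the metric is the eigenvalue-dispersion ratio E[λ²]/E[λ]² of the per-group activation covariance, which equals 1.0 for perfectly "white" features and grows as a few directions dominate; exceeding the limit is what produces the log line (and, during training, a corrective gradient). `whitening_metric` is an invented name:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Eigenvalue-dispersion ratio E[lam^2] / E[lam]^2 of the covariance
    of x (frames x channels), per channel group, averaged over groups.
    Equals 1.0 iff the covariance is a multiple of the identity."""
    n, c = x.shape
    cpg = c // num_groups
    xg = x.reshape(n, num_groups, cpg).permute(1, 0, 2)    # (g, n, cpg)
    cov = xg.transpose(1, 2) @ xg / n                      # (g, cpg, cpg)
    mean_lam = cov.diagonal(dim1=1, dim2=2).mean()         # E[lambda]
    mean_lam_sq = (cov ** 2).sum(dim=(1, 2)).mean() / cpg  # E[lambda^2]
    return (mean_lam_sq / (mean_lam ** 2 + 1e-20)).item()

x = torch.randn(2000, 384)   # near-white activations
print(whitening_metric(x))   # ~1.2, comfortably under limit=15.0
```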
], batch size: 95, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:30:57,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=300272.5, ans=0.0 2024-06-21 01:31:02,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300272.5, ans=0.1 2024-06-21 01:31:05,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=300290.8333333333, ans=0.125 2024-06-21 01:31:19,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=300309.1666666667, ans=0.0 2024-06-21 01:31:33,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=300327.5, ans=0.0 2024-06-21 01:31:35,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.20 vs. limit=10.0 2024-06-21 01:31:37,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=300345.8333333333, ans=0.0 2024-06-21 01:31:38,429 INFO [train.py:1028] (1/2) Epoch 17, batch 1950, loss[loss=0.2122, simple_loss=0.2761, pruned_loss=0.07421, over 13249.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2656, pruned_loss=0.08091, over 2591218.30 frames. ], batch size: 52, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:31:41,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=300345.8333333333, ans=0.0 2024-06-21 01:31:48,014 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 1.949e+02 2.046e+02 2.167e+02 2.912e+02, threshold=4.092e+02, percent-clipped=0.0 2024-06-21 01:32:16,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=300400.8333333333, ans=0.1 2024-06-21 01:32:19,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=300419.1666666667, ans=0.2 2024-06-21 01:32:29,919 INFO [train.py:1028] (1/2) Epoch 17, batch 2000, loss[loss=0.2179, simple_loss=0.2785, pruned_loss=0.07861, over 12490.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2656, pruned_loss=0.08104, over 2586917.03 frames. ], batch size: 22, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:32:39,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=300455.8333333333, ans=0.2 2024-06-21 01:32:39,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=300455.8333333333, ans=0.025 2024-06-21 01:32:42,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=300455.8333333333, ans=0.125 2024-06-21 01:32:51,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300474.1666666667, ans=0.1 2024-06-21 01:32:56,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=300474.1666666667, ans=0.2 2024-06-21 01:33:16,664 INFO [train.py:1028] (1/2) Epoch 17, batch 2050, loss[loss=0.2125, simple_loss=0.2655, pruned_loss=0.07978, over 12636.00 frames. 
], tot_loss[loss=0.2143, simple_loss=0.266, pruned_loss=0.08127, over 2583276.94 frames. ], batch size: 29, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:33:18,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=300529.1666666667, ans=0.2 2024-06-21 01:33:25,909 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 1.992e+02 2.131e+02 2.419e+02 3.758e+02, threshold=4.262e+02, percent-clipped=0.0 2024-06-21 01:33:33,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=300547.5, ans=0.0 2024-06-21 01:33:36,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=300565.8333333333, ans=10.0 2024-06-21 01:33:37,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.49 vs. limit=15.0 2024-06-21 01:33:42,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=300584.1666666667, ans=0.0 2024-06-21 01:33:46,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=300584.1666666667, ans=0.2 2024-06-21 01:33:52,988 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.21 vs. limit=15.0 2024-06-21 01:33:57,828 INFO [train.py:1028] (1/2) Epoch 17, batch 2100, loss[loss=0.2109, simple_loss=0.2717, pruned_loss=0.07506, over 13197.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2662, pruned_loss=0.08099, over 2585794.98 frames. ], batch size: 59, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:33:58,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=300620.8333333333, ans=0.125 2024-06-21 01:34:07,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=300639.1666666667, ans=0.0 2024-06-21 01:34:49,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=300712.5, ans=0.125 2024-06-21 01:34:50,557 INFO [train.py:1028] (1/2) Epoch 17, batch 2150, loss[loss=0.2012, simple_loss=0.2571, pruned_loss=0.07272, over 13204.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2663, pruned_loss=0.08087, over 2589103.08 frames. 
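Each `Clipping_scale=2.0, grad-norm quartiles ...` warning lists the min/25%/median/75%/max of recent gradient norms plus the active clipping threshold. In the warning just above, threshold=4.262e+02 is exactly 2.0 × the 2.131e+02 median, so the threshold appears to track clipping_scale × median. A sketch of that diagnostic; `GradNormTracker` is an invented name, and in icefall this bookkeeping lives inside the optimizer:

```python
import torch
from collections import deque

class GradNormTracker:
    """Track recent global gradient norms and report quartiles plus a
    clipping threshold of clipping_scale * median, as in the log."""
    def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)

    def update(self, params) -> float:
        norm = torch.norm(
            torch.stack([p.grad.norm() for p in params if p.grad is not None])
        ).item()
        self.norms.append(norm)
        t = torch.tensor(list(self.norms))
        q = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()
        pct = (t > threshold).float().mean().item() * 100.0
        print(f"grad-norm quartiles {q.tolist()}, "
              f"threshold={threshold:.3e}, percent-clipped={pct:.1f}")
        return threshold  # caller rescales gradients above this norm

p = torch.nn.Parameter(torch.randn(10))
p.grad = torch.randn(10)
GradNormTracker().update([p])
```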
], batch size: 52, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:34:59,800 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.986e+02 2.133e+02 2.320e+02 2.856e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 01:35:16,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=300749.1666666667, ans=0.2 2024-06-21 01:35:16,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=300749.1666666667, ans=0.07 2024-06-21 01:35:27,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=300785.8333333333, ans=0.125 2024-06-21 01:35:34,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=300785.8333333333, ans=0.025 2024-06-21 01:35:36,619 INFO [train.py:1028] (1/2) Epoch 17, batch 2200, loss[loss=0.2216, simple_loss=0.2737, pruned_loss=0.08479, over 13246.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2668, pruned_loss=0.08117, over 2589048.65 frames. ], batch size: 83, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:35:40,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=300804.1666666667, ans=0.125 2024-06-21 01:35:47,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=300822.5, ans=0.0 2024-06-21 01:35:48,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.51 vs. limit=15.0 2024-06-21 01:36:01,987 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=12.0 2024-06-21 01:36:23,717 INFO [train.py:1028] (1/2) Epoch 17, batch 2250, loss[loss=0.2131, simple_loss=0.2695, pruned_loss=0.07831, over 13256.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2662, pruned_loss=0.08084, over 2588881.66 frames. ], batch size: 63, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:36:30,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=300895.8333333333, ans=0.0 2024-06-21 01:36:33,065 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 1.974e+02 2.116e+02 2.311e+02 3.847e+02, threshold=4.232e+02, percent-clipped=0.0 2024-06-21 01:36:35,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=300914.1666666667, ans=0.0 2024-06-21 01:36:47,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=300932.5, ans=0.125 2024-06-21 01:36:48,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=300932.5, ans=0.0 2024-06-21 01:36:54,812 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2024-06-21 01:37:01,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=300969.1666666667, ans=0.125 2024-06-21 01:37:13,505 INFO [train.py:1028] (1/2) Epoch 17, batch 2300, loss[loss=0.1983, simple_loss=0.2539, pruned_loss=0.07138, over 12867.00 frames. 
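In every batch summary, `loss[..., over N frames.]` reports the current batch while `tot_loss[..., over M frames.]` aggregates recent batches weighted by frame count. The fractional aggregate frame counts (e.g. 2589103.08) suggest exponentially-forgetting accumulation rather than a plain running sum; the decay constant below is purely an assumption, and `RunningLoss` an invented name:

```python
class RunningLoss:
    """Frames-weighted running average of per-batch losses with
    exponential forgetting (decay value assumed, not from the log)."""
    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames

    def update(self, loss: float, frames: float) -> float:
        self.loss_sum = self.loss_sum * self.decay + loss * frames
        self.frames = self.frames * self.decay + frames
        return self.loss_sum / self.frames

tracker = RunningLoss()
print(tracker.update(0.1983, 12867.0))  # first batch: equals the batch loss
```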
], tot_loss[loss=0.2137, simple_loss=0.2659, pruned_loss=0.08072, over 2582426.77 frames. ], batch size: 33, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:37:13,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=300987.5, ans=0.125 2024-06-21 01:37:23,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2024-06-21 01:37:23,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=301005.8333333333, ans=0.125 2024-06-21 01:37:25,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=301005.8333333333, ans=10.0 2024-06-21 01:37:35,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.53 vs. limit=15.0 2024-06-21 01:37:41,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=301042.5, ans=0.0 2024-06-21 01:37:42,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=301042.5, ans=0.025 2024-06-21 01:37:50,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.06 vs. limit=10.0 2024-06-21 01:38:02,710 INFO [train.py:1028] (1/2) Epoch 17, batch 2350, loss[loss=0.2134, simple_loss=0.2659, pruned_loss=0.0804, over 13237.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2659, pruned_loss=0.08078, over 2585826.42 frames. ], batch size: 67, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:38:02,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=301079.1666666667, ans=0.2 2024-06-21 01:38:11,991 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.937e+02 2.077e+02 2.304e+02 2.921e+02, threshold=4.154e+02, percent-clipped=0.0 2024-06-21 01:38:12,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=12.0 2024-06-21 01:38:22,177 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:38:24,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0 2024-06-21 01:38:39,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.53 vs. limit=22.5 2024-06-21 01:38:45,442 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:38:49,427 INFO [train.py:1028] (1/2) Epoch 17, batch 2400, loss[loss=0.2172, simple_loss=0.2687, pruned_loss=0.08286, over 13368.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.266, pruned_loss=0.08116, over 2588563.06 frames. 
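The headline `loss` in these summaries is consistent with simple_loss_scale × simple_loss + pruned_loss, with simple_loss_scale = 0.5 from this run's configuration. Checking against the batch-2400 aggregate just above:

```python
# tot_loss at epoch 17, batch 2400: loss=0.2142, simple_loss=0.266,
# pruned_loss=0.08116. With simple_loss_scale = 0.5:
print(0.5 * 0.266 + 0.08116)  # 0.21416 -> logged as loss=0.2142
```

The small residue is display rounding.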
], batch size: 46, lr: 3.46e-03, grad_scale: 64.0 2024-06-21 01:38:53,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=301170.8333333333, ans=0.125 2024-06-21 01:38:56,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=301170.8333333333, ans=0.125 2024-06-21 01:39:00,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=301189.1666666667, ans=10.0 2024-06-21 01:39:00,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=301189.1666666667, ans=0.125 2024-06-21 01:39:22,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=301225.8333333333, ans=0.125 2024-06-21 01:39:24,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=301225.8333333333, ans=0.2 2024-06-21 01:39:25,357 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.73 vs. limit=22.5 2024-06-21 01:39:32,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=301244.1666666667, ans=0.0 2024-06-21 01:39:35,494 INFO [train.py:1028] (1/2) Epoch 17, batch 2450, loss[loss=0.2133, simple_loss=0.2625, pruned_loss=0.08209, over 13279.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2651, pruned_loss=0.08117, over 2583653.38 frames. ], batch size: 63, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:39:44,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=301262.5, ans=0.125 2024-06-21 01:39:47,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=301262.5, ans=0.04949747468305833 2024-06-21 01:39:50,229 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 1.950e+02 2.067e+02 2.233e+02 3.113e+02, threshold=4.134e+02, percent-clipped=0.0 2024-06-21 01:39:57,588 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.35 vs. limit=10.0 2024-06-21 01:40:00,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=301299.1666666667, ans=0.125 2024-06-21 01:40:11,130 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:40:25,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=301335.8333333333, ans=0.125 2024-06-21 01:40:26,798 INFO [train.py:1028] (1/2) Epoch 17, batch 2500, loss[loss=0.2034, simple_loss=0.2604, pruned_loss=0.07316, over 13224.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2639, pruned_loss=0.08076, over 2587134.92 frames. ], batch size: 83, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:40:33,672 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.29 vs. 
limit=15.0 2024-06-21 01:40:38,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=301354.1666666667, ans=0.125 2024-06-21 01:40:39,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=15.0 2024-06-21 01:40:51,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301390.8333333333, ans=0.1 2024-06-21 01:40:58,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=301390.8333333333, ans=0.0 2024-06-21 01:41:00,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=301409.1666666667, ans=0.0 2024-06-21 01:41:08,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=301409.1666666667, ans=0.125 2024-06-21 01:41:20,327 INFO [train.py:1028] (1/2) Epoch 17, batch 2550, loss[loss=0.2224, simple_loss=0.2792, pruned_loss=0.08286, over 12589.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2626, pruned_loss=0.08024, over 2586057.64 frames. ], batch size: 22, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:41:26,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=301445.8333333333, ans=0.125 2024-06-21 01:41:28,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=301445.8333333333, ans=0.125 2024-06-21 01:41:30,218 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.955e+02 2.086e+02 2.265e+02 2.886e+02, threshold=4.173e+02, percent-clipped=0.0 2024-06-21 01:41:33,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=301464.1666666667, ans=0.1 2024-06-21 01:41:34,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=301464.1666666667, ans=0.0 2024-06-21 01:41:36,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=301464.1666666667, ans=0.125 2024-06-21 01:41:43,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=301482.5, ans=0.0 2024-06-21 01:41:44,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=301482.5, ans=0.2 2024-06-21 01:41:47,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0 2024-06-21 01:42:05,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=301537.5, ans=0.0 2024-06-21 01:42:06,251 INFO [train.py:1028] (1/2) Epoch 17, batch 2600, loss[loss=0.2232, simple_loss=0.2776, pruned_loss=0.08441, over 13245.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2614, pruned_loss=0.07973, over 2585246.49 frames. 
], batch size: 52, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:42:07,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=301537.5, ans=0.125 2024-06-21 01:42:26,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.94 vs. limit=15.0 2024-06-21 01:42:28,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=301574.1666666667, ans=0.125 2024-06-21 01:42:35,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301592.5, ans=0.1 2024-06-21 01:42:37,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=301592.5, ans=0.125 2024-06-21 01:42:45,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=301610.8333333333, ans=0.1 2024-06-21 01:42:48,847 INFO [train.py:1028] (1/2) Epoch 17, batch 2650, loss[loss=0.2011, simple_loss=0.2437, pruned_loss=0.07926, over 13029.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.26, pruned_loss=0.07927, over 2585506.76 frames. ], batch size: 144, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:42:52,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=301629.1666666667, ans=0.2 2024-06-21 01:42:56,642 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.917e+02 2.101e+02 2.280e+02 3.150e+02, threshold=4.202e+02, percent-clipped=0.0 2024-06-21 01:43:13,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=301665.8333333333, ans=0.1 2024-06-21 01:43:32,248 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.488e-01 2024-06-21 01:43:35,143 INFO [train.py:1028] (1/2) Epoch 17, batch 2700, loss[loss=0.2066, simple_loss=0.2522, pruned_loss=0.08053, over 13278.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2583, pruned_loss=0.07875, over 2584492.74 frames. ], batch size: 89, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:43:35,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=301720.8333333333, ans=0.025 2024-06-21 01:43:46,168 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. 
limit=15.0 2024-06-21 01:43:48,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=301739.1666666667, ans=0.0 2024-06-21 01:43:53,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=301757.5, ans=0.0 2024-06-21 01:43:59,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=301775.8333333333, ans=0.125 2024-06-21 01:44:08,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301794.1666666667, ans=0.1 2024-06-21 01:44:12,117 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.09 vs. limit=15.0 2024-06-21 01:44:13,813 INFO [train.py:1028] (1/2) Epoch 17, batch 2750, loss[loss=0.1959, simple_loss=0.2488, pruned_loss=0.07154, over 13274.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2571, pruned_loss=0.07793, over 2580405.23 frames. ], batch size: 43, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:44:16,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=301812.5, ans=0.1 2024-06-21 01:44:19,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.85 vs. limit=15.0 2024-06-21 01:44:20,950 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 1.922e+02 2.080e+02 2.322e+02 3.208e+02, threshold=4.160e+02, percent-clipped=0.0 2024-06-21 01:44:30,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=301849.1666666667, ans=0.125 2024-06-21 01:44:43,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=301867.5, ans=0.0 2024-06-21 01:45:00,704 INFO [train.py:1028] (1/2) Epoch 17, batch 2800, loss[loss=0.2295, simple_loss=0.2713, pruned_loss=0.09384, over 10855.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2569, pruned_loss=0.078, over 2577570.29 frames. ], batch size: 304, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:45:31,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=14.23 vs. limit=15.0 2024-06-21 01:45:40,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.40 vs. limit=22.5 2024-06-21 01:45:44,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=301977.5, ans=0.025 2024-06-21 01:45:51,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.86 vs. limit=22.5 2024-06-21 01:45:53,449 INFO [train.py:1028] (1/2) Epoch 17, batch 2850, loss[loss=0.2043, simple_loss=0.254, pruned_loss=0.07723, over 13313.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2566, pruned_loss=0.0778, over 2575525.46 frames. 
], batch size: 49, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:45:57,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=301995.8333333333, ans=0.025 2024-06-21 01:46:01,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301995.8333333333, ans=0.1 2024-06-21 01:46:02,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=302014.1666666667, ans=0.125 2024-06-21 01:46:02,605 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 1.935e+02 2.086e+02 2.271e+02 3.501e+02, threshold=4.172e+02, percent-clipped=0.0 2024-06-21 01:46:04,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=302014.1666666667, ans=0.125 2024-06-21 01:46:05,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=302014.1666666667, ans=0.0 2024-06-21 01:46:20,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=302032.5, ans=0.2 2024-06-21 01:46:25,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2024-06-21 01:46:33,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.38 vs. limit=22.5 2024-06-21 01:46:40,444 INFO [train.py:1028] (1/2) Epoch 17, batch 2900, loss[loss=0.216, simple_loss=0.2641, pruned_loss=0.08398, over 13173.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2549, pruned_loss=0.07711, over 2583586.10 frames. ], batch size: 55, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:46:41,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=302087.5, ans=0.0 2024-06-21 01:46:49,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=302105.8333333333, ans=0.125 2024-06-21 01:46:52,928 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.13 vs. limit=22.5 2024-06-21 01:47:17,595 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.17 vs. limit=15.0 2024-06-21 01:47:23,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=302160.8333333333, ans=0.125 2024-06-21 01:47:25,812 INFO [train.py:1028] (1/2) Epoch 17, batch 2950, loss[loss=0.2079, simple_loss=0.2687, pruned_loss=0.07357, over 13239.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2544, pruned_loss=0.07664, over 2577729.78 frames. 
], batch size: 43, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:47:42,034 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.853e+02 1.970e+02 2.093e+02 3.192e+02, threshold=3.940e+02, percent-clipped=0.0 2024-06-21 01:47:43,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=302197.5, ans=0.2 2024-06-21 01:47:43,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.95 vs. limit=22.5 2024-06-21 01:47:48,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=302197.5, ans=0.0 2024-06-21 01:47:57,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=302215.8333333333, ans=0.0 2024-06-21 01:48:16,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=302252.5, ans=0.2 2024-06-21 01:48:25,573 INFO [train.py:1028] (1/2) Epoch 17, batch 3000, loss[loss=0.2022, simple_loss=0.2527, pruned_loss=0.07588, over 13212.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2533, pruned_loss=0.07633, over 2575864.42 frames. ], batch size: 59, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:48:25,574 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 01:48:35,895 INFO [train.py:1060] (1/2) Epoch 17, validation: loss=0.1874, simple_loss=0.2525, pruned_loss=0.0611, over 351949.00 frames. 2024-06-21 01:48:35,897 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 01:48:47,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=302289.1666666667, ans=0.0 2024-06-21 01:48:52,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=302289.1666666667, ans=10.0 2024-06-21 01:49:20,213 INFO [train.py:1028] (1/2) Epoch 17, batch 3050, loss[loss=0.1839, simple_loss=0.2374, pruned_loss=0.06522, over 13323.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2528, pruned_loss=0.07653, over 2576880.72 frames. 
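The batch-3000 block above shows the periodic validation pass (every 3,000 batches in this run's configuration) followed by the peak-memory report. Validation loss (0.1874) sitting below the training tot_loss (~0.203) is plausible here since validation batches are likely not augmented. The memory line reads CUDA's allocator high-water mark; a sketch of that report, with the function name invented:

```python
import torch

def log_peak_memory(device: torch.device) -> None:
    # High-water mark of the CUDA caching allocator, in MB, as in the
    # "Maximum memory allocated so far is 17821MB" lines.
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"Maximum memory allocated so far is {mb}MB")
```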
], batch size: 46, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:49:22,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=302362.5, ans=0.0 2024-06-21 01:49:24,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302362.5, ans=0.1 2024-06-21 01:49:28,885 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.947e+02 2.089e+02 2.325e+02 3.780e+02, threshold=4.179e+02, percent-clipped=0.0 2024-06-21 01:49:40,164 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=4.988e-02 2024-06-21 01:49:48,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=302417.5, ans=0.2 2024-06-21 01:49:49,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=302417.5, ans=0.02 2024-06-21 01:49:53,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=302417.5, ans=0.0 2024-06-21 01:50:05,391 INFO [train.py:1028] (1/2) Epoch 17, batch 3100, loss[loss=0.1951, simple_loss=0.242, pruned_loss=0.07414, over 13028.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2517, pruned_loss=0.07571, over 2578383.67 frames. ], batch size: 144, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:50:14,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=302472.5, ans=0.125 2024-06-21 01:50:52,628 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. limit=6.0 2024-06-21 01:51:01,714 INFO [train.py:1028] (1/2) Epoch 17, batch 3150, loss[loss=0.2049, simple_loss=0.2502, pruned_loss=0.07976, over 12964.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2503, pruned_loss=0.07509, over 2581289.55 frames. ], batch size: 158, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:51:04,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=302545.8333333333, ans=0.07 2024-06-21 01:51:05,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=302545.8333333333, ans=0.0 2024-06-21 01:51:06,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.45 vs. limit=5.0 2024-06-21 01:51:11,439 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.861e+02 1.969e+02 2.146e+02 2.870e+02, threshold=3.938e+02, percent-clipped=0.0 2024-06-21 01:51:12,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=302564.1666666667, ans=0.0 2024-06-21 01:51:13,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=302564.1666666667, ans=0.125 2024-06-21 01:51:15,299 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.58 vs. limit=15.0 2024-06-21 01:51:18,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.74 vs. 
limit=15.0 2024-06-21 01:51:37,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=302600.8333333333, ans=0.125 2024-06-21 01:51:49,672 INFO [train.py:1028] (1/2) Epoch 17, batch 3200, loss[loss=0.1868, simple_loss=0.2402, pruned_loss=0.06667, over 13138.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2499, pruned_loss=0.07496, over 2581821.95 frames. ], batch size: 55, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:52:08,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=302674.1666666667, ans=0.0 2024-06-21 01:52:10,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=302674.1666666667, ans=15.0 2024-06-21 01:52:22,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=302692.5, ans=0.2 2024-06-21 01:52:36,433 INFO [train.py:1028] (1/2) Epoch 17, batch 3250, loss[loss=0.1829, simple_loss=0.2375, pruned_loss=0.06411, over 13221.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2496, pruned_loss=0.07525, over 2586681.53 frames. ], batch size: 72, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:52:39,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=302729.1666666667, ans=0.2 2024-06-21 01:52:44,807 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.912e+02 2.016e+02 2.135e+02 2.872e+02, threshold=4.031e+02, percent-clipped=0.0 2024-06-21 01:52:49,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=302747.5, ans=0.125 2024-06-21 01:53:01,274 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0 2024-06-21 01:53:09,990 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.04 vs. limit=15.0 2024-06-21 01:53:14,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2024-06-21 01:53:27,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=302802.5, ans=0.0 2024-06-21 01:53:30,798 INFO [train.py:1028] (1/2) Epoch 17, batch 3300, loss[loss=0.2072, simple_loss=0.2463, pruned_loss=0.0841, over 12780.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2495, pruned_loss=0.07539, over 2583173.66 frames. ], batch size: 176, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:53:32,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=302820.8333333333, ans=0.125 2024-06-21 01:53:32,293 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.212e+01 2024-06-21 01:53:45,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=302839.1666666667, ans=0.05 2024-06-21 01:53:48,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.39 vs. 
limit=15.0 2024-06-21 01:53:48,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=15.0 2024-06-21 01:53:57,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=302857.5, ans=0.0 2024-06-21 01:54:02,276 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.81 vs. limit=22.5 2024-06-21 01:54:07,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=302875.8333333333, ans=0.125 2024-06-21 01:54:21,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=302894.1666666667, ans=0.0 2024-06-21 01:54:23,710 INFO [train.py:1028] (1/2) Epoch 17, batch 3350, loss[loss=0.2017, simple_loss=0.2433, pruned_loss=0.08004, over 12969.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2497, pruned_loss=0.0757, over 2578140.58 frames. ], batch size: 158, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:54:26,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.17 vs. limit=15.0 2024-06-21 01:54:30,452 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:54:31,624 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.993e+02 2.110e+02 2.278e+02 2.978e+02, threshold=4.220e+02, percent-clipped=0.0 2024-06-21 01:54:34,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302930.8333333333, ans=0.1 2024-06-21 01:54:37,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=302930.8333333333, ans=0.2 2024-06-21 01:54:39,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=302949.1666666667, ans=0.125 2024-06-21 01:54:50,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=302967.5, ans=0.0 2024-06-21 01:54:54,894 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:54:56,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=302985.8333333333, ans=0.025 2024-06-21 01:55:00,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=302985.8333333333, ans=0.0 2024-06-21 01:55:02,316 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=15.0 2024-06-21 01:55:04,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=302985.8333333333, ans=0.0 2024-06-21 01:55:07,412 INFO [train.py:1028] (1/2) Epoch 17, batch 3400, loss[loss=0.1938, simple_loss=0.2488, pruned_loss=0.0694, over 12718.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2496, pruned_loss=0.07606, over 2575936.74 frames. 
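`grad_scale` in the batch summaries doubles from 64.0 to 128.0 at batch 3150 above: the signature of dynamic fp16 loss scaling, where the scale grows after a run of overflow-free steps and is halved on overflow. The mechanism with PyTorch's stock `GradScaler` (this run may wrap its own scaler; `growth_interval` below is illustrative, not taken from the log):

```python
import torch

# Dynamic fp16 loss scaling: the scale doubles after `growth_interval`
# consecutive overflow-free optimizer steps (64.0 -> 128.0 in the log)
# and halves when a step overflows.
scaler = torch.cuda.amp.GradScaler(
    init_scale=64.0, growth_factor=2.0, backoff_factor=0.5,
    growth_interval=2000,
)
# Typical step:
#   with torch.cuda.amp.autocast():
#       loss = model(batch)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()
print(scaler.get_scale())  # 64.0 now; 128.0 after enough clean steps
```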
], batch size: 22, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:55:10,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303004.1666666667, ans=0.1 2024-06-21 01:55:11,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=303004.1666666667, ans=0.0 2024-06-21 01:55:13,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. limit=10.0 2024-06-21 01:55:21,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303022.5, ans=0.1 2024-06-21 01:55:25,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.63 vs. limit=10.0 2024-06-21 01:55:27,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=303040.8333333333, ans=0.125 2024-06-21 01:55:28,599 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.79 vs. limit=15.0 2024-06-21 01:55:37,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=303059.1666666667, ans=0.125 2024-06-21 01:55:37,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=303059.1666666667, ans=22.5 2024-06-21 01:55:38,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=303059.1666666667, ans=0.125 2024-06-21 01:55:42,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=303059.1666666667, ans=0.125 2024-06-21 01:55:44,804 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2024-06-21 01:55:53,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2024-06-21 01:56:02,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303077.5, ans=0.1 2024-06-21 01:56:04,710 INFO [train.py:1028] (1/2) Epoch 17, batch 3450, loss[loss=0.2193, simple_loss=0.2572, pruned_loss=0.09071, over 12686.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2487, pruned_loss=0.07566, over 2576822.95 frames. ], batch size: 176, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:56:12,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.98 vs. 
limit=15.0 2024-06-21 01:56:14,059 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.930e+02 2.049e+02 2.266e+02 2.726e+02, threshold=4.097e+02, percent-clipped=0.0 2024-06-21 01:56:42,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=303150.8333333333, ans=0.125 2024-06-21 01:56:52,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.07 vs. limit=15.0 2024-06-21 01:56:56,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=303169.1666666667, ans=0.125 2024-06-21 01:56:59,281 INFO [train.py:1028] (1/2) Epoch 17, batch 3500, loss[loss=0.177, simple_loss=0.2235, pruned_loss=0.06526, over 13018.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2473, pruned_loss=0.07491, over 2576358.64 frames. ], batch size: 33, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:57:08,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=303205.8333333333, ans=0.0 2024-06-21 01:57:14,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303205.8333333333, ans=0.1 2024-06-21 01:57:40,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=303260.8333333333, ans=0.125 2024-06-21 01:57:42,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=303260.8333333333, ans=0.125 2024-06-21 01:57:45,672 INFO [train.py:1028] (1/2) Epoch 17, batch 3550, loss[loss=0.1976, simple_loss=0.2439, pruned_loss=0.07561, over 13141.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2463, pruned_loss=0.07435, over 2577509.84 frames. ], batch size: 95, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:57:51,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=303279.1666666667, ans=0.125 2024-06-21 01:57:54,625 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.883e+02 1.984e+02 2.125e+02 3.280e+02, threshold=3.967e+02, percent-clipped=0.0 2024-06-21 01:57:58,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=303297.5, ans=0.05 2024-06-21 01:57:59,011 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:58:08,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=303315.8333333333, ans=0.125 2024-06-21 01:58:30,582 INFO [train.py:1028] (1/2) Epoch 17, batch 3600, loss[loss=0.2061, simple_loss=0.2527, pruned_loss=0.07971, over 13242.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2457, pruned_loss=0.07408, over 2581463.87 frames. ], batch size: 49, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:58:33,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=303370.8333333333, ans=0.0 2024-06-21 01:58:39,442 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.87 vs. 
limit=10.0 2024-06-21 01:59:11,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303444.1666666667, ans=0.1 2024-06-21 01:59:21,194 INFO [train.py:1028] (1/2) Epoch 17, batch 3650, loss[loss=0.1998, simple_loss=0.2402, pruned_loss=0.07972, over 13013.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.246, pruned_loss=0.07415, over 2579488.60 frames. ], batch size: 102, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:59:21,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303462.5, ans=0.1 2024-06-21 01:59:24,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=303462.5, ans=0.125 2024-06-21 01:59:36,733 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 1.919e+02 2.031e+02 2.156e+02 2.684e+02, threshold=4.062e+02, percent-clipped=0.0 2024-06-21 01:59:41,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=303480.8333333333, ans=0.125 2024-06-21 01:59:49,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.33 vs. limit=15.0 2024-06-21 02:00:09,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=303535.8333333333, ans=0.125 2024-06-21 02:00:14,228 INFO [train.py:1028] (1/2) Epoch 17, batch 3700, loss[loss=0.1969, simple_loss=0.2478, pruned_loss=0.073, over 13216.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2457, pruned_loss=0.0741, over 2585069.13 frames. ], batch size: 72, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:00:17,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2024-06-21 02:00:44,160 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.42 vs. limit=10.0 2024-06-21 02:01:01,481 INFO [train.py:1028] (1/2) Epoch 17, batch 3750, loss[loss=0.1917, simple_loss=0.2415, pruned_loss=0.07092, over 12584.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2447, pruned_loss=0.07353, over 2585545.06 frames. ], batch size: 22, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:01:11,379 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 1.908e+02 2.025e+02 2.256e+02 3.117e+02, threshold=4.049e+02, percent-clipped=0.0 2024-06-21 02:01:19,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.14 vs. limit=15.0 2024-06-21 02:01:31,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=303700.8333333333, ans=0.0 2024-06-21 02:01:34,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.55 vs. limit=10.0 2024-06-21 02:01:48,434 INFO [train.py:1028] (1/2) Epoch 17, batch 3800, loss[loss=0.2017, simple_loss=0.2437, pruned_loss=0.0799, over 13143.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2456, pruned_loss=0.07385, over 2583470.28 frames. 
], batch size: 83, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:01:48,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=303737.5, ans=0.125 2024-06-21 02:01:50,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=303737.5, ans=0.0 2024-06-21 02:01:53,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=303737.5, ans=0.1 2024-06-21 02:01:55,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=303737.5, ans=0.125 2024-06-21 02:02:20,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.76 vs. limit=15.0 2024-06-21 02:02:34,180 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.03 vs. limit=22.5 2024-06-21 02:02:45,959 INFO [train.py:1028] (1/2) Epoch 17, batch 3850, loss[loss=0.199, simple_loss=0.2459, pruned_loss=0.07607, over 13047.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2449, pruned_loss=0.07344, over 2582348.45 frames. ], batch size: 144, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:02:47,434 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.21 vs. limit=10.0 2024-06-21 02:02:55,733 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 1.912e+02 2.110e+02 2.317e+02 3.462e+02, threshold=4.219e+02, percent-clipped=0.0 2024-06-21 02:03:04,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=303847.5, ans=0.125 2024-06-21 02:03:08,838 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2024-06-21 02:03:09,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.04 vs. limit=15.0 2024-06-21 02:03:15,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=303884.1666666667, ans=0.2 2024-06-21 02:03:22,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=303884.1666666667, ans=0.0 2024-06-21 02:03:30,502 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.65 vs. limit=15.0 2024-06-21 02:03:32,945 INFO [train.py:1028] (1/2) Epoch 17, batch 3900, loss[loss=0.1944, simple_loss=0.241, pruned_loss=0.07396, over 13217.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2448, pruned_loss=0.07336, over 2585189.63 frames. 
], batch size: 83, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:03:40,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=303920.8333333333, ans=0.2 2024-06-21 02:04:00,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=303975.8333333333, ans=0.0 2024-06-21 02:04:03,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=303975.8333333333, ans=0.2 2024-06-21 02:04:06,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=303994.1666666667, ans=0.125 2024-06-21 02:04:08,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2024-06-21 02:04:09,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2024-06-21 02:04:12,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=303994.1666666667, ans=0.125 2024-06-21 02:04:13,350 INFO [train.py:1028] (1/2) Epoch 17, batch 3950, loss[loss=0.1966, simple_loss=0.2408, pruned_loss=0.07617, over 13118.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.244, pruned_loss=0.07279, over 2588185.91 frames. ], batch size: 132, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:04:20,332 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.853e+02 2.013e+02 2.170e+02 3.320e+02, threshold=4.026e+02, percent-clipped=0.0 2024-06-21 02:04:27,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=304049.1666666667, ans=0.0 2024-06-21 02:04:31,474 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.14 vs. limit=6.0 2024-06-21 02:04:41,238 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.28 vs. limit=22.5 2024-06-21 02:04:43,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=304085.8333333333, ans=0.0 2024-06-21 02:04:54,005 INFO [train.py:1028] (1/2) Epoch 17, batch 4000, loss[loss=0.2087, simple_loss=0.2692, pruned_loss=0.07413, over 12946.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2436, pruned_loss=0.07267, over 2583559.92 frames. ], batch size: 39, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:05:06,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=304122.5, ans=0.025 2024-06-21 02:05:15,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=304140.8333333333, ans=0.0 2024-06-21 02:05:16,632 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.84 vs. 
limit=22.5 2024-06-21 02:05:24,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=304159.1666666667, ans=0.1 2024-06-21 02:05:38,159 INFO [train.py:1028] (1/2) Epoch 17, batch 4050, loss[loss=0.2223, simple_loss=0.2571, pruned_loss=0.09381, over 11045.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2438, pruned_loss=0.07297, over 2580559.15 frames. ], batch size: 304, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:05:38,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.87 vs. limit=12.0 2024-06-21 02:05:46,604 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 1.911e+02 2.030e+02 2.140e+02 2.792e+02, threshold=4.061e+02, percent-clipped=0.0 2024-06-21 02:05:57,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=304232.5, ans=0.125 2024-06-21 02:06:18,232 INFO [train.py:1028] (1/2) Epoch 17, batch 4100, loss[loss=0.1959, simple_loss=0.2453, pruned_loss=0.0732, over 13152.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2439, pruned_loss=0.07317, over 2577956.60 frames. ], batch size: 103, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:06:23,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=304287.5, ans=0.125 2024-06-21 02:06:28,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=304305.8333333333, ans=0.035 2024-06-21 02:06:29,421 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:06:33,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.34 vs. limit=15.0 2024-06-21 02:06:37,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=304324.1666666667, ans=0.07 2024-06-21 02:06:46,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=304342.5, ans=0.025 2024-06-21 02:06:58,689 INFO [train.py:1028] (1/2) Epoch 17, batch 4150, loss[loss=0.1957, simple_loss=0.242, pruned_loss=0.07472, over 13132.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2435, pruned_loss=0.07322, over 2576730.60 frames. 
], batch size: 55, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:06:59,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=304379.1666666667, ans=0.025 2024-06-21 02:07:00,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=304379.1666666667, ans=0.0 2024-06-21 02:07:01,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=304379.1666666667, ans=0.125 2024-06-21 02:07:11,449 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 1.859e+02 2.024e+02 2.252e+02 2.858e+02, threshold=4.048e+02, percent-clipped=0.0 2024-06-21 02:07:18,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=304415.8333333333, ans=0.0 2024-06-21 02:07:18,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.66 vs. limit=22.5 2024-06-21 02:07:23,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=304415.8333333333, ans=0.125 2024-06-21 02:07:25,937 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.00 vs. limit=15.0 2024-06-21 02:07:35,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2024-06-21 02:07:40,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=304452.5, ans=0.125 2024-06-21 02:07:46,372 INFO [train.py:1028] (1/2) Epoch 17, batch 4200, loss[loss=0.2173, simple_loss=0.2569, pruned_loss=0.08883, over 13190.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2433, pruned_loss=0.07333, over 2580367.23 frames. ], batch size: 103, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:07:48,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=304470.8333333333, ans=0.1 2024-06-21 02:07:53,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=304489.1666666667, ans=0.2 2024-06-21 02:08:04,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.71 vs. limit=15.0 2024-06-21 02:08:06,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=304507.5, ans=0.0 2024-06-21 02:08:07,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=304507.5, ans=15.0 2024-06-21 02:08:09,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=304525.8333333333, ans=0.125 2024-06-21 02:08:17,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=304544.1666666667, ans=0.125 2024-06-21 02:08:26,079 INFO [train.py:1028] (1/2) Epoch 17, batch 4250, loss[loss=0.1925, simple_loss=0.2435, pruned_loss=0.07072, over 13317.00 frames. 
], tot_loss[loss=0.1949, simple_loss=0.2434, pruned_loss=0.07325, over 2583056.28 frames. ], batch size: 46, lr: 3.44e-03, grad_scale: 32.0 2024-06-21 02:08:31,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=304562.5, ans=0.0 2024-06-21 02:08:32,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=304562.5, ans=0.125 2024-06-21 02:08:35,242 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.909e+02 2.016e+02 2.185e+02 3.151e+02, threshold=4.031e+02, percent-clipped=0.0 2024-06-21 02:08:38,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=304580.8333333333, ans=0.125 2024-06-21 02:08:44,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=304599.1666666667, ans=0.2 2024-06-21 02:09:04,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=304635.8333333333, ans=0.015 2024-06-21 02:09:05,481 INFO [train.py:1028] (1/2) Epoch 17, batch 4300, loss[loss=0.1952, simple_loss=0.2456, pruned_loss=0.07239, over 13206.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2428, pruned_loss=0.07295, over 2583586.53 frames. ], batch size: 59, lr: 3.44e-03, grad_scale: 32.0 2024-06-21 02:09:08,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=304654.1666666667, ans=0.025 2024-06-21 02:09:12,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=304672.5, ans=0.125 2024-06-21 02:09:24,521 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:09:30,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=304690.8333333333, ans=0.0 2024-06-21 02:09:42,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=304727.5, ans=0.2 2024-06-21 02:09:47,821 INFO [train.py:1028] (1/2) Epoch 17, batch 4350, loss[loss=0.1914, simple_loss=0.2487, pruned_loss=0.06709, over 13184.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2426, pruned_loss=0.07278, over 2587794.21 frames. ], batch size: 59, lr: 3.44e-03, grad_scale: 32.0 2024-06-21 02:09:53,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.17 vs. limit=22.5 2024-06-21 02:10:01,022 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.834e+02 1.960e+02 2.109e+02 2.921e+02, threshold=3.920e+02, percent-clipped=0.0 2024-06-21 02:10:17,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.83 vs. limit=22.5 2024-06-21 02:10:19,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.41 vs. 
limit=15.0 2024-06-21 02:10:20,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=304800.8333333333, ans=0.0 2024-06-21 02:10:26,601 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.93 vs. limit=15.0 2024-06-21 02:10:31,546 INFO [train.py:1028] (1/2) Epoch 17, batch 4400, loss[loss=0.1754, simple_loss=0.22, pruned_loss=0.06546, over 13251.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2421, pruned_loss=0.07252, over 2588156.02 frames. ], batch size: 83, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:10:31,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=304837.5, ans=0.025 2024-06-21 02:10:33,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=304837.5, ans=0.125 2024-06-21 02:10:34,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304837.5, ans=0.1 2024-06-21 02:10:35,883 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=15.0 2024-06-21 02:10:43,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=304855.8333333333, ans=0.0 2024-06-21 02:10:43,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=304855.8333333333, ans=0.125 2024-06-21 02:10:45,779 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.15 vs. limit=22.5 2024-06-21 02:10:46,009 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:10:52,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=304874.1666666667, ans=0.0 2024-06-21 02:10:53,508 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0 2024-06-21 02:11:04,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=304910.8333333333, ans=0.0 2024-06-21 02:11:05,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=304910.8333333333, ans=0.125 2024-06-21 02:11:10,967 INFO [train.py:1028] (1/2) Epoch 17, batch 4450, loss[loss=0.1765, simple_loss=0.2402, pruned_loss=0.05642, over 12787.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2431, pruned_loss=0.07332, over 2582387.80 frames. 
], batch size: 33, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:11:14,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=304929.1666666667, ans=0.0 2024-06-21 02:11:17,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=304947.5, ans=0.125 2024-06-21 02:11:19,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=304947.5, ans=0.125 2024-06-21 02:11:20,171 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.925e+02 2.122e+02 2.408e+02 3.035e+02, threshold=4.243e+02, percent-clipped=0.0 2024-06-21 02:11:20,804 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.59 vs. limit=22.5 2024-06-21 02:11:37,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=304984.1666666667, ans=0.125 2024-06-21 02:11:40,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=304984.1666666667, ans=0.125 2024-06-21 02:11:42,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=305002.5, ans=0.125 2024-06-21 02:11:47,035 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.581e-03 2024-06-21 02:11:50,021 INFO [train.py:1028] (1/2) Epoch 17, batch 4500, loss[loss=0.1876, simple_loss=0.232, pruned_loss=0.07161, over 13221.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2422, pruned_loss=0.07296, over 2586776.68 frames. ], batch size: 89, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:11:58,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=305020.8333333333, ans=0.025 2024-06-21 02:12:02,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305039.1666666667, ans=0.1 2024-06-21 02:12:03,284 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2024-06-21 02:12:07,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.25 vs. limit=15.0 2024-06-21 02:12:07,358 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.91 vs. limit=15.0 2024-06-21 02:12:09,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=305057.5, ans=0.125 2024-06-21 02:12:09,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=305057.5, ans=0.035 2024-06-21 02:12:31,369 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.84 vs. 
limit=15.0 2024-06-21 02:12:31,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=305094.1666666667, ans=0.0 2024-06-21 02:12:35,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=305094.1666666667, ans=0.125 2024-06-21 02:12:37,365 INFO [train.py:1028] (1/2) Epoch 17, batch 4550, loss[loss=0.1806, simple_loss=0.2282, pruned_loss=0.06649, over 13226.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2416, pruned_loss=0.07249, over 2590661.70 frames. ], batch size: 52, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:12:42,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=305112.5, ans=0.0 2024-06-21 02:12:46,554 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.819e+02 1.917e+02 2.043e+02 2.609e+02, threshold=3.833e+02, percent-clipped=0.0 2024-06-21 02:12:49,324 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:13:01,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=305167.5, ans=0.0 2024-06-21 02:13:06,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=305167.5, ans=0.125 2024-06-21 02:13:10,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=305185.8333333333, ans=15.0 2024-06-21 02:13:15,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=305185.8333333333, ans=0.125 2024-06-21 02:13:16,546 INFO [train.py:1028] (1/2) Epoch 17, batch 4600, loss[loss=0.2348, simple_loss=0.26, pruned_loss=0.1048, over 12574.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2416, pruned_loss=0.07251, over 2586790.05 frames. ], batch size: 202, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:13:19,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2024-06-21 02:13:22,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=305204.1666666667, ans=0.0 2024-06-21 02:13:22,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.98 vs. limit=22.5 2024-06-21 02:13:25,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=305222.5, ans=0.025 2024-06-21 02:13:43,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=305259.1666666667, ans=0.0 2024-06-21 02:13:46,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=305277.5, ans=10.0 2024-06-21 02:13:55,377 INFO [train.py:1028] (1/2) Epoch 17, batch 4650, loss[loss=0.1988, simple_loss=0.2418, pruned_loss=0.07786, over 13101.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2415, pruned_loss=0.07261, over 2589342.14 frames. 
], batch size: 132, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:14:04,701 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 1.932e+02 2.073e+02 2.235e+02 3.046e+02, threshold=4.147e+02, percent-clipped=0.0 2024-06-21 02:14:10,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=305332.5, ans=0.07 2024-06-21 02:14:23,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=305350.8333333333, ans=0.0 2024-06-21 02:14:28,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.43 vs. limit=22.5 2024-06-21 02:14:30,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=305369.1666666667, ans=0.125 2024-06-21 02:14:34,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=305369.1666666667, ans=0.0 2024-06-21 02:14:35,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=305369.1666666667, ans=0.07 2024-06-21 02:14:36,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=305369.1666666667, ans=0.0 2024-06-21 02:14:39,164 INFO [train.py:1028] (1/2) Epoch 17, batch 4700, loss[loss=0.1685, simple_loss=0.2227, pruned_loss=0.05718, over 12358.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2421, pruned_loss=0.07292, over 2585437.44 frames. ], batch size: 25, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:14:41,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=305387.5, ans=0.125 2024-06-21 02:14:43,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2024-06-21 02:15:23,074 INFO [train.py:1028] (1/2) Epoch 17, batch 4750, loss[loss=0.2082, simple_loss=0.2561, pruned_loss=0.0802, over 12470.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2415, pruned_loss=0.07297, over 2581508.49 frames. ], batch size: 202, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:15:33,092 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.869e+02 2.004e+02 2.200e+02 2.630e+02, threshold=4.007e+02, percent-clipped=0.0 2024-06-21 02:15:34,999 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.29 vs. limit=22.5 2024-06-21 02:15:43,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=305515.8333333333, ans=0.125 2024-06-21 02:15:44,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=305515.8333333333, ans=0.04949747468305833 2024-06-21 02:15:48,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=305534.1666666667, ans=0.0 2024-06-21 02:15:52,991 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.92 vs. 
limit=15.0 2024-06-21 02:15:56,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=305552.5, ans=0.2 2024-06-21 02:16:03,038 INFO [train.py:1028] (1/2) Epoch 17, batch 4800, loss[loss=0.1994, simple_loss=0.2494, pruned_loss=0.07466, over 13288.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2411, pruned_loss=0.0725, over 2577576.43 frames. ], batch size: 63, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:16:04,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=305570.8333333333, ans=0.0 2024-06-21 02:16:26,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=305625.8333333333, ans=0.2 2024-06-21 02:16:28,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=305625.8333333333, ans=0.04949747468305833 2024-06-21 02:16:39,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=305644.1666666667, ans=0.0 2024-06-21 02:16:45,536 INFO [train.py:1028] (1/2) Epoch 17, batch 4850, loss[loss=0.174, simple_loss=0.224, pruned_loss=0.06197, over 13239.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2406, pruned_loss=0.07213, over 2575220.34 frames. ], batch size: 89, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:16:53,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=305680.8333333333, ans=0.125 2024-06-21 02:16:55,521 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.817e+02 1.926e+02 2.083e+02 2.693e+02, threshold=3.853e+02, percent-clipped=0.0 2024-06-21 02:17:03,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=305699.1666666667, ans=0.0 2024-06-21 02:17:03,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=305699.1666666667, ans=0.125 2024-06-21 02:17:24,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=305735.8333333333, ans=0.125 2024-06-21 02:17:27,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=305735.8333333333, ans=0.125 2024-06-21 02:17:32,060 INFO [train.py:1028] (1/2) Epoch 17, batch 4900, loss[loss=0.2134, simple_loss=0.2685, pruned_loss=0.07911, over 13204.00 frames. ], tot_loss[loss=0.193, simple_loss=0.241, pruned_loss=0.07248, over 2576282.09 frames. 
], batch size: 59, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:17:38,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=305754.1666666667, ans=0.2 2024-06-21 02:17:44,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=305772.5, ans=0.025 2024-06-21 02:17:44,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=305772.5, ans=0.09899494936611666 2024-06-21 02:17:53,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=305790.8333333333, ans=0.125 2024-06-21 02:17:53,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305790.8333333333, ans=0.1 2024-06-21 02:18:00,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=305809.1666666667, ans=0.2 2024-06-21 02:18:03,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=305809.1666666667, ans=0.0 2024-06-21 02:18:12,468 INFO [train.py:1028] (1/2) Epoch 17, batch 4950, loss[loss=0.2116, simple_loss=0.2449, pruned_loss=0.08918, over 11125.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2406, pruned_loss=0.07248, over 2570198.39 frames. ], batch size: 304, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:18:13,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=305845.8333333333, ans=0.025 2024-06-21 02:18:15,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=305845.8333333333, ans=0.0 2024-06-21 02:18:22,133 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.855e+02 1.973e+02 2.141e+02 2.935e+02, threshold=3.945e+02, percent-clipped=0.0 2024-06-21 02:18:29,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2024-06-21 02:18:38,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305900.8333333333, ans=0.1 2024-06-21 02:18:39,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=305900.8333333333, ans=0.125 2024-06-21 02:18:52,376 INFO [train.py:1028] (1/2) Epoch 17, batch 5000, loss[loss=0.203, simple_loss=0.2455, pruned_loss=0.08026, over 13141.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2408, pruned_loss=0.07248, over 2573655.80 frames. 
], batch size: 95, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:18:53,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=305937.5, ans=0.0 2024-06-21 02:19:01,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=305955.8333333333, ans=0.125 2024-06-21 02:19:23,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=305992.5, ans=0.125 2024-06-21 02:19:29,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=15.0 2024-06-21 02:19:40,926 INFO [train.py:1028] (1/2) Epoch 17, batch 5050, loss[loss=0.2033, simple_loss=0.2606, pruned_loss=0.07301, over 12910.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2413, pruned_loss=0.07255, over 2572771.25 frames. ], batch size: 36, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:19:43,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=306029.1666666667, ans=0.0 2024-06-21 02:19:50,384 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.876e+02 1.996e+02 2.240e+02 2.941e+02, threshold=3.993e+02, percent-clipped=0.0 2024-06-21 02:20:10,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2024-06-21 02:20:18,918 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:20:20,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=306102.5, ans=0.0 2024-06-21 02:20:21,926 INFO [train.py:1028] (1/2) Epoch 17, batch 5100, loss[loss=0.1922, simple_loss=0.2473, pruned_loss=0.06854, over 12931.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.241, pruned_loss=0.0726, over 2568619.17 frames. ], batch size: 39, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:20:24,076 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.53 vs. limit=12.0 2024-06-21 02:20:33,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=306139.1666666667, ans=0.125 2024-06-21 02:20:34,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=306139.1666666667, ans=0.125 2024-06-21 02:20:37,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306157.5, ans=0.1 2024-06-21 02:20:51,066 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.05 vs. limit=15.0 2024-06-21 02:20:51,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306175.8333333333, ans=0.125 2024-06-21 02:21:01,061 INFO [train.py:1028] (1/2) Epoch 17, batch 5150, loss[loss=0.1948, simple_loss=0.2308, pruned_loss=0.07941, over 13092.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2409, pruned_loss=0.07281, over 2571467.63 frames. 
], batch size: 132, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:21:08,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=306230.8333333333, ans=0.0 2024-06-21 02:21:10,232 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.934e+02 2.140e+02 2.309e+02 4.282e+02, threshold=4.279e+02, percent-clipped=1.0 2024-06-21 02:21:12,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=306230.8333333333, ans=0.5 2024-06-21 02:21:30,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306267.5, ans=0.1 2024-06-21 02:21:38,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=306285.8333333333, ans=0.125 2024-06-21 02:21:43,254 INFO [train.py:1028] (1/2) Epoch 17, batch 5200, loss[loss=0.1872, simple_loss=0.232, pruned_loss=0.07119, over 13168.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2405, pruned_loss=0.07248, over 2575275.69 frames. ], batch size: 95, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:21:51,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306304.1666666667, ans=0.1 2024-06-21 02:21:58,687 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.198e+01 2024-06-21 02:21:59,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=306322.5, ans=0.125 2024-06-21 02:22:09,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=306340.8333333333, ans=0.0 2024-06-21 02:22:24,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=306377.5, ans=0.2 2024-06-21 02:22:27,107 INFO [train.py:1028] (1/2) Epoch 17, batch 5250, loss[loss=0.1937, simple_loss=0.2443, pruned_loss=0.0715, over 13248.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2404, pruned_loss=0.07239, over 2570885.91 frames. ], batch size: 52, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:22:35,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=306414.1666666667, ans=0.0 2024-06-21 02:22:36,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=306414.1666666667, ans=0.0 2024-06-21 02:22:36,590 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.58 vs. limit=10.0 2024-06-21 02:22:36,734 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 1.994e+02 2.125e+02 2.430e+02 3.341e+02, threshold=4.249e+02, percent-clipped=0.0 2024-06-21 02:22:44,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.60 vs. limit=15.0 2024-06-21 02:22:47,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=306432.5, ans=0.125 2024-06-21 02:22:56,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.09 vs. 
limit=12.0 2024-06-21 02:23:01,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=306469.1666666667, ans=0.2 2024-06-21 02:23:07,507 INFO [train.py:1028] (1/2) Epoch 17, batch 5300, loss[loss=0.193, simple_loss=0.2383, pruned_loss=0.07386, over 13008.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2403, pruned_loss=0.07242, over 2566737.45 frames. ], batch size: 144, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:23:13,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306487.5, ans=0.1 2024-06-21 02:23:25,804 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=15.0 2024-06-21 02:23:26,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=306524.1666666667, ans=0.025 2024-06-21 02:23:28,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=306524.1666666667, ans=0.04949747468305833 2024-06-21 02:23:35,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=306542.5, ans=0.0 2024-06-21 02:23:35,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=306542.5, ans=0.125 2024-06-21 02:23:39,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=306560.8333333333, ans=0.2 2024-06-21 02:23:51,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=306579.1666666667, ans=0.0 2024-06-21 02:23:51,972 INFO [train.py:1028] (1/2) Epoch 17, batch 5350, loss[loss=0.1858, simple_loss=0.2456, pruned_loss=0.06295, over 11635.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2398, pruned_loss=0.07239, over 2573842.11 frames. ], batch size: 16, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:24:01,244 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.903e+02 2.064e+02 2.231e+02 2.904e+02, threshold=4.127e+02, percent-clipped=0.0 2024-06-21 02:24:22,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=306634.1666666667, ans=0.0 2024-06-21 02:24:23,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=306634.1666666667, ans=0.125 2024-06-21 02:24:34,273 INFO [train.py:1028] (1/2) Epoch 17, batch 5400, loss[loss=0.2138, simple_loss=0.25, pruned_loss=0.08881, over 12221.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2401, pruned_loss=0.07276, over 2566863.34 frames. ], batch size: 240, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:24:46,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.49 vs. 
limit=15.0 2024-06-21 02:24:56,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=306707.5, ans=0.125 2024-06-21 02:25:04,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=306725.8333333333, ans=0.125 2024-06-21 02:25:05,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306725.8333333333, ans=0.1 2024-06-21 02:25:09,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.26 vs. limit=15.0 2024-06-21 02:25:12,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.74 vs. limit=15.0 2024-06-21 02:25:14,998 INFO [train.py:1028] (1/2) Epoch 17, batch 5450, loss[loss=0.1988, simple_loss=0.2548, pruned_loss=0.07143, over 12755.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.241, pruned_loss=0.07298, over 2571553.17 frames. ], batch size: 26, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:25:19,585 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.06 vs. limit=10.0 2024-06-21 02:25:21,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=306762.5, ans=0.125 2024-06-21 02:25:24,556 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.899e+02 2.055e+02 2.242e+02 3.252e+02, threshold=4.111e+02, percent-clipped=0.0 2024-06-21 02:25:26,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=306780.8333333333, ans=0.125 2024-06-21 02:25:27,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306780.8333333333, ans=0.125 2024-06-21 02:25:27,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=306780.8333333333, ans=0.125 2024-06-21 02:25:30,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=306799.1666666667, ans=0.0 2024-06-21 02:25:39,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=306817.5, ans=0.125 2024-06-21 02:25:39,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=306817.5, ans=0.2 2024-06-21 02:25:45,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.95 vs. limit=22.5 2024-06-21 02:25:50,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=306835.8333333333, ans=0.09899494936611666 2024-06-21 02:25:54,843 INFO [train.py:1028] (1/2) Epoch 17, batch 5500, loss[loss=0.2175, simple_loss=0.252, pruned_loss=0.09148, over 12181.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.241, pruned_loss=0.07311, over 2564752.05 frames. 
], batch size: 240, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:26:01,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=306872.5, ans=0.125 2024-06-21 02:26:11,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=306872.5, ans=15.0 2024-06-21 02:26:33,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=306927.5, ans=10.0 2024-06-21 02:26:37,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.50 vs. limit=15.0 2024-06-21 02:26:42,025 INFO [train.py:1028] (1/2) Epoch 17, batch 5550, loss[loss=0.1788, simple_loss=0.2299, pruned_loss=0.06384, over 13309.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2407, pruned_loss=0.07276, over 2568064.48 frames. ], batch size: 43, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:26:46,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=306945.8333333333, ans=0.125 2024-06-21 02:26:51,815 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.868e+02 1.995e+02 2.166e+02 3.302e+02, threshold=3.990e+02, percent-clipped=0.0 2024-06-21 02:26:52,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=306964.1666666667, ans=0.125 2024-06-21 02:27:03,983 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.96 vs. limit=15.0 2024-06-21 02:27:04,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=306982.5, ans=0.125 2024-06-21 02:27:21,681 INFO [train.py:1028] (1/2) Epoch 17, batch 5600, loss[loss=0.1848, simple_loss=0.2316, pruned_loss=0.06898, over 13263.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.24, pruned_loss=0.07224, over 2570003.92 frames. ], batch size: 89, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:27:28,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=307037.5, ans=0.125 2024-06-21 02:27:38,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=307074.1666666667, ans=0.125 2024-06-21 02:27:52,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307092.5, ans=0.1 2024-06-21 02:27:54,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=307110.8333333333, ans=0.2 2024-06-21 02:28:01,612 INFO [train.py:1028] (1/2) Epoch 17, batch 5650, loss[loss=0.2079, simple_loss=0.254, pruned_loss=0.08094, over 12585.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2399, pruned_loss=0.07211, over 2575357.09 frames. 
], batch size: 202, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:28:05,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=307129.1666666667, ans=0.125 2024-06-21 02:28:11,414 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.678e+02 1.916e+02 2.067e+02 2.261e+02 3.879e+02, threshold=4.134e+02, percent-clipped=0.0 2024-06-21 02:28:18,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.12 vs. limit=22.5 2024-06-21 02:28:27,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307184.1666666667, ans=0.1 2024-06-21 02:28:29,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2024-06-21 02:28:43,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=307202.5, ans=0.125 2024-06-21 02:28:44,839 INFO [train.py:1028] (1/2) Epoch 17, batch 5700, loss[loss=0.193, simple_loss=0.2469, pruned_loss=0.06961, over 13273.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2398, pruned_loss=0.07222, over 2578615.95 frames. ], batch size: 63, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:29:04,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=15.0 2024-06-21 02:29:14,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=307275.8333333333, ans=15.0 2024-06-21 02:29:16,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=307275.8333333333, ans=0.0 2024-06-21 02:29:28,292 INFO [train.py:1028] (1/2) Epoch 17, batch 5750, loss[loss=0.2157, simple_loss=0.2478, pruned_loss=0.09183, over 12892.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2406, pruned_loss=0.0725, over 2579188.82 frames. ], batch size: 177, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:29:29,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=307312.5, ans=0.0 2024-06-21 02:29:37,943 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.898e+02 2.007e+02 2.225e+02 3.342e+02, threshold=4.014e+02, percent-clipped=0.0 2024-06-21 02:29:42,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=307330.8333333333, ans=0.125 2024-06-21 02:29:47,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2024-06-21 02:29:56,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=307367.5, ans=0.0 2024-06-21 02:30:03,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=307385.8333333333, ans=0.125 2024-06-21 02:30:07,523 INFO [train.py:1028] (1/2) Epoch 17, batch 5800, loss[loss=0.1938, simple_loss=0.2455, pruned_loss=0.07101, over 12750.00 frames. 
], tot_loss[loss=0.1938, simple_loss=0.2414, pruned_loss=0.07308, over 2577790.85 frames. ], batch size: 176, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:30:09,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=307404.1666666667, ans=0.0 2024-06-21 02:30:19,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=307422.5, ans=0.125 2024-06-21 02:30:33,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=307459.1666666667, ans=0.5 2024-06-21 02:30:48,646 INFO [train.py:1028] (1/2) Epoch 17, batch 5850, loss[loss=0.2157, simple_loss=0.2589, pruned_loss=0.08622, over 12457.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2434, pruned_loss=0.07397, over 2575634.12 frames. ], batch size: 202, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:30:56,665 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.28 vs. limit=6.0 2024-06-21 02:30:59,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.38 vs. limit=22.5 2024-06-21 02:31:02,688 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 1.955e+02 2.132e+02 2.291e+02 3.003e+02, threshold=4.264e+02, percent-clipped=0.0 2024-06-21 02:31:04,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.32 vs. limit=10.0 2024-06-21 02:31:06,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=307514.1666666667, ans=0.125 2024-06-21 02:31:06,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=307514.1666666667, ans=0.125 2024-06-21 02:31:08,384 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.24 vs. limit=15.0 2024-06-21 02:31:27,161 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.59 vs. limit=12.0 2024-06-21 02:31:37,995 INFO [train.py:1028] (1/2) Epoch 17, batch 5900, loss[loss=0.191, simple_loss=0.2375, pruned_loss=0.07222, over 13123.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2453, pruned_loss=0.07449, over 2575578.13 frames. ], batch size: 121, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:31:54,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=307624.1666666667, ans=0.0 2024-06-21 02:31:57,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=307624.1666666667, ans=0.0 2024-06-21 02:32:10,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.64 vs. 
limit=6.0
2024-06-21 02:32:12,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307660.8333333333, ans=0.1
2024-06-21 02:32:15,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.39 vs. limit=22.5
2024-06-21 02:32:18,772 INFO [train.py:1028] (1/2) Epoch 17, batch 5950, loss[loss=0.182, simple_loss=0.2284, pruned_loss=0.06781, over 13120.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2467, pruned_loss=0.07506, over 2580378.71 frames. ], batch size: 121, lr: 3.42e-03, grad_scale: 32.0
2024-06-21 02:32:22,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=307679.1666666667, ans=0.125
2024-06-21 02:32:22,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=307679.1666666667, ans=0.0
2024-06-21 02:32:27,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=307697.5, ans=0.05
2024-06-21 02:32:28,133 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 1.957e+02 2.125e+02 2.371e+02 3.118e+02, threshold=4.249e+02, percent-clipped=0.0
2024-06-21 02:32:35,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=307715.8333333333, ans=0.0
2024-06-21 02:32:40,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=307715.8333333333, ans=0.025
2024-06-21 02:32:43,040 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 02:32:45,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.51 vs. limit=22.5
2024-06-21 02:32:54,768 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 02:32:58,741 INFO [train.py:1028] (1/2) Epoch 17, batch 6000, loss[loss=0.2482, simple_loss=0.2764, pruned_loss=0.11, over 12195.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.248, pruned_loss=0.07549, over 2573170.00 frames. ], batch size: 240, lr: 3.42e-03, grad_scale: 32.0
2024-06-21 02:32:58,742 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 02:33:06,164 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([2.8109, 2.5182, 2.6045, 2.3935, 2.2266, 2.5639, 2.3897, 2.3170], device='cuda:1')
2024-06-21 02:33:07,630 INFO [train.py:1060] (1/2) Epoch 17, validation: loss=0.188, simple_loss=0.253, pruned_loss=0.06152, over 351949.00 frames.
2024-06-21 02:33:07,630 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 02:33:11,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.51 vs. limit=15.0
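The train.py:1051/1060/1061 entries above are the periodic validation pass: training pauses at a fixed interval (here batch 6000), per-frame losses are accumulated over the full dev set (351949.00 frames), and peak GPU memory is reported. A minimal sketch of such a pass follows; validate, compute_loss, and dev_dl are assumed names for illustration, not the actual icefall implementation, and compute_loss is assumed to return frame-summed losses plus the frame count.

    import torch

    def validate(model, dev_dl, compute_loss, device):
        # Accumulate frame-summed losses over the whole dev set, then
        # normalize, mirroring "validation: loss=..., over 351949.00 frames."
        model.eval()
        sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
        tot_frames = 0.0
        with torch.no_grad():
            for batch in dev_dl:
                loss, simple, pruned, num_frames = compute_loss(model, batch, device)
                sums["loss"] += loss.item()
                sums["simple_loss"] += simple.item()
                sums["pruned_loss"] += pruned.item()
                tot_frames += num_frames
        model.train()
        return {k: v / tot_frames for k, v in sums.items()}, tot_frames

Normalizing by the total frame count rather than the batch count is what keeps the logged validation loss comparable from one validation pass to the next even though dev batches vary in size.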
2024-06-21 02:33:18,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=307789.1666666667, ans=0.1
2024-06-21 02:33:21,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=307789.1666666667, ans=0.125
2024-06-21 02:33:24,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=307807.5, ans=0.125
2024-06-21 02:33:52,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=307844.1666666667, ans=0.125
2024-06-21 02:33:56,808 INFO [train.py:1028] (1/2) Epoch 17, batch 6050, loss[loss=0.1996, simple_loss=0.2505, pruned_loss=0.07431, over 13005.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2498, pruned_loss=0.07606, over 2576207.14 frames. ], batch size: 39, lr: 3.42e-03, grad_scale: 32.0
2024-06-21 02:34:00,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=307862.5, ans=0.125
2024-06-21 02:34:06,397 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 1.954e+02 2.082e+02 2.313e+02 3.295e+02, threshold=4.164e+02, percent-clipped=0.0
2024-06-21 02:34:07,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=307880.8333333333, ans=0.125
2024-06-21 02:34:22,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=307917.5, ans=0.1
2024-06-21 02:34:27,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.06 vs. limit=15.0
2024-06-21 02:34:28,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=307935.8333333333, ans=0.2
2024-06-21 02:34:36,963 INFO [train.py:1028] (1/2) Epoch 17, batch 6100, loss[loss=0.1894, simple_loss=0.2272, pruned_loss=0.07583, over 13086.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2506, pruned_loss=0.07633, over 2579559.31 frames. ], batch size: 121, lr: 3.42e-03, grad_scale: 32.0
2024-06-21 02:34:38,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=307954.1666666667, ans=0.5
2024-06-21 02:35:05,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=307990.8333333333, ans=0.125
2024-06-21 02:35:07,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=308009.1666666667, ans=0.2
2024-06-21 02:35:12,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=308009.1666666667, ans=0.2
2024-06-21 02:35:12,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.20 vs. limit=15.0
2024-06-21 02:35:23,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=308045.8333333333, ans=0.0
2024-06-21 02:35:23,592 INFO [train.py:1028] (1/2) Epoch 17, batch 6150, loss[loss=0.2246, simple_loss=0.2604, pruned_loss=0.09441, over 10707.00 frames.
], tot_loss[loss=0.2035, simple_loss=0.2526, pruned_loss=0.07721, over 2578463.85 frames. ], batch size: 304, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:35:26,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308045.8333333333, ans=0.1 2024-06-21 02:35:33,180 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 2.006e+02 2.190e+02 2.618e+02 4.126e+02, threshold=4.380e+02, percent-clipped=0.0 2024-06-21 02:35:35,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=308064.1666666667, ans=0.0 2024-06-21 02:35:37,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=308064.1666666667, ans=0.0 2024-06-21 02:35:41,252 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=22.5 2024-06-21 02:35:45,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=308082.5, ans=0.125 2024-06-21 02:35:50,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=308100.8333333333, ans=0.0 2024-06-21 02:35:50,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=308100.8333333333, ans=0.125 2024-06-21 02:35:57,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=308100.8333333333, ans=0.125 2024-06-21 02:35:57,134 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:35:58,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=308119.1666666667, ans=0.0 2024-06-21 02:36:01,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=308119.1666666667, ans=0.125 2024-06-21 02:36:03,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308119.1666666667, ans=0.1 2024-06-21 02:36:07,198 INFO [train.py:1028] (1/2) Epoch 17, batch 6200, loss[loss=0.2383, simple_loss=0.2906, pruned_loss=0.09297, over 13306.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2543, pruned_loss=0.07776, over 2575560.08 frames. ], batch size: 89, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:36:37,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=308192.5, ans=0.5 2024-06-21 02:36:41,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=308192.5, ans=0.05 2024-06-21 02:36:43,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=308210.8333333333, ans=0.025 2024-06-21 02:36:44,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=308210.8333333333, ans=0.125 2024-06-21 02:36:51,775 INFO [train.py:1028] (1/2) Epoch 17, batch 6250, loss[loss=0.2201, simple_loss=0.2651, pruned_loss=0.08758, over 13211.00 frames. 
], tot_loss[loss=0.2059, simple_loss=0.2551, pruned_loss=0.07828, over 2569279.96 frames. ], batch size: 83, lr: 3.42e-03, grad_scale: 64.0 2024-06-21 02:37:01,168 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 1.960e+02 2.071e+02 2.319e+02 2.862e+02, threshold=4.142e+02, percent-clipped=0.0 2024-06-21 02:37:15,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=308284.1666666667, ans=0.125 2024-06-21 02:37:17,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=308284.1666666667, ans=0.125 2024-06-21 02:37:19,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=308284.1666666667, ans=0.125 2024-06-21 02:37:24,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=308302.5, ans=0.0 2024-06-21 02:37:30,442 INFO [train.py:1028] (1/2) Epoch 17, batch 6300, loss[loss=0.2086, simple_loss=0.2641, pruned_loss=0.07656, over 11133.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2562, pruned_loss=0.07848, over 2564377.22 frames. ], batch size: 16, lr: 3.42e-03, grad_scale: 64.0 2024-06-21 02:37:30,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=308320.8333333333, ans=0.125 2024-06-21 02:37:34,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=308320.8333333333, ans=0.0 2024-06-21 02:38:00,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=308375.8333333333, ans=0.0 2024-06-21 02:38:01,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2024-06-21 02:38:02,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=308394.1666666667, ans=0.125 2024-06-21 02:38:03,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=308394.1666666667, ans=0.0 2024-06-21 02:38:10,100 INFO [train.py:1028] (1/2) Epoch 17, batch 6350, loss[loss=0.2531, simple_loss=0.2872, pruned_loss=0.1095, over 12588.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2578, pruned_loss=0.07886, over 2574294.92 frames. ], batch size: 202, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:38:17,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=308430.8333333333, ans=0.125 2024-06-21 02:38:19,298 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 1.994e+02 2.158e+02 2.308e+02 3.507e+02, threshold=4.316e+02, percent-clipped=0.0 2024-06-21 02:38:47,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.55 vs. limit=15.0 2024-06-21 02:38:56,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=308504.1666666667, ans=0.125 2024-06-21 02:38:57,127 INFO [train.py:1028] (1/2) Epoch 17, batch 6400, loss[loss=0.1952, simple_loss=0.2513, pruned_loss=0.06954, over 13246.00 frames. 
], tot_loss[loss=0.2101, simple_loss=0.2602, pruned_loss=0.07998, over 2576359.07 frames. ], batch size: 67, lr: 3.41e-03, grad_scale: 64.0
2024-06-21 02:39:02,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=308504.1666666667, ans=0.125
2024-06-21 02:39:04,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=308504.1666666667, ans=0.0
2024-06-21 02:39:09,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=308522.5, ans=0.025
2024-06-21 02:39:13,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308540.8333333333, ans=0.1
2024-06-21 02:39:22,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=308559.1666666667, ans=0.0
2024-06-21 02:39:30,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=308577.5, ans=0.125
2024-06-21 02:39:37,316 INFO [train.py:1028] (1/2) Epoch 17, batch 6450, loss[loss=0.2365, simple_loss=0.2862, pruned_loss=0.09343, over 12472.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2619, pruned_loss=0.08064, over 2583165.01 frames. ], batch size: 202, lr: 3.41e-03, grad_scale: 64.0
2024-06-21 02:39:37,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=308595.8333333333, ans=0.0
2024-06-21 02:39:46,726 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.033e+02 2.254e+02 2.536e+02 3.302e+02, threshold=4.508e+02, percent-clipped=0.0
2024-06-21 02:39:53,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.86 vs. limit=22.5
2024-06-21 02:40:13,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=308669.1666666667, ans=0.0
2024-06-21 02:40:14,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=308669.1666666667, ans=0.125
2024-06-21 02:40:16,260 INFO [train.py:1028] (1/2) Epoch 17, batch 6500, loss[loss=0.2398, simple_loss=0.2739, pruned_loss=0.1028, over 10891.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2634, pruned_loss=0.08079, over 2586227.25 frames. ], batch size: 304, lr: 3.41e-03, grad_scale: 64.0
2024-06-21 02:40:25,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=308705.8333333333, ans=0.1
2024-06-21 02:40:33,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308724.1666666667, ans=0.1
2024-06-21 02:40:38,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=308724.1666666667, ans=0.125
2024-06-21 02:40:41,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=308742.5, ans=0.125
2024-06-21 02:40:48,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.63 vs. limit=22.5
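The grad_scale field in the tot_loss entries is worth watching: it doubled from 32.0 to 64.0 at batch 6250 above, and it drops back to 32.0 around batch 7450 further down, the signature of dynamic loss scaling for fp16 training. A sketch of the mechanism using PyTorch's stock torch.cuda.amp.GradScaler follows; the helper names are assumptions, and the recipe's own scaler may differ in detail, but the grow-on-success/halve-on-overflow behavior is the same idea.

    import torch

    def train_steps(model, optimizer, train_dl, compute_loss):
        # Dynamic loss scaling: the scale doubles after a run of
        # overflow-free steps (32 -> 64, as at batch 6250) and is halved
        # when a step produces inf/nan gradients (64 -> 32, as later on).
        scaler = torch.cuda.amp.GradScaler(
            init_scale=32.0,       # assumed starting point, matching the log
            growth_factor=2.0,
            backoff_factor=0.5,
            growth_interval=2000,  # example value, not taken from the recipe
        )
        for batch in train_dl:
            optimizer.zero_grad()
            with torch.cuda.amp.autocast(dtype=torch.float16):
                loss = compute_loss(model, batch)
            scaler.scale(loss).backward()  # backward through the scaled loss
            scaler.step(optimizer)         # unscales grads; skips step on overflow
            scaler.update()                # grows or backs off the scale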
2024-06-21 02:40:56,181 INFO [train.py:1028] (1/2) Epoch 17, batch 6550, loss[loss=0.1894, simple_loss=0.2442, pruned_loss=0.0673, over 12634.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.264, pruned_loss=0.08059, over 2589359.92 frames. ], batch size: 22, lr: 3.41e-03, grad_scale: 64.0
2024-06-21 02:40:56,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=308779.1666666667, ans=0.125
2024-06-21 02:41:09,428 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 2.016e+02 2.193e+02 2.421e+02 2.919e+02, threshold=4.386e+02, percent-clipped=0.0
2024-06-21 02:41:09,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=308797.5, ans=0.025
2024-06-21 02:41:13,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=308797.5, ans=0.125
2024-06-21 02:41:13,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=308797.5, ans=0.2
2024-06-21 02:41:26,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=308815.8333333333, ans=0.2
2024-06-21 02:41:29,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.83 vs. limit=12.0
2024-06-21 02:41:32,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0
2024-06-21 02:41:37,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=308852.5, ans=0.0
2024-06-21 02:41:43,226 INFO [train.py:1028] (1/2) Epoch 17, batch 6600, loss[loss=0.2022, simple_loss=0.2575, pruned_loss=0.07342, over 13274.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2638, pruned_loss=0.08029, over 2591304.40 frames. ], batch size: 72, lr: 3.41e-03, grad_scale: 64.0
2024-06-21 02:41:45,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308870.8333333333, ans=0.1
2024-06-21 02:42:06,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0
2024-06-21 02:42:07,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=308925.8333333333, ans=0.125
2024-06-21 02:42:09,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=15.0
2024-06-21 02:42:14,108 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.19 vs. limit=22.5
2024-06-21 02:42:18,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=308944.1666666667, ans=0.0
2024-06-21 02:42:22,357 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0
2024-06-21 02:42:22,526 INFO [train.py:1028] (1/2) Epoch 17, batch 6650, loss[loss=0.2217, simple_loss=0.2741, pruned_loss=0.08469, over 12944.00 frames.
], tot_loss[loss=0.2137, simple_loss=0.2654, pruned_loss=0.08098, over 2585623.37 frames. ], batch size: 158, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:42:22,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=308962.5, ans=0.0 2024-06-21 02:42:32,556 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.053e+02 2.192e+02 2.343e+02 3.136e+02, threshold=4.383e+02, percent-clipped=0.0 2024-06-21 02:42:38,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=12.0 2024-06-21 02:42:44,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=15.0 2024-06-21 02:42:55,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=309035.8333333333, ans=0.125 2024-06-21 02:42:59,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=309035.8333333333, ans=0.125 2024-06-21 02:43:02,828 INFO [train.py:1028] (1/2) Epoch 17, batch 6700, loss[loss=0.2285, simple_loss=0.2779, pruned_loss=0.08955, over 12751.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2674, pruned_loss=0.08209, over 2585217.45 frames. ], batch size: 176, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:43:03,390 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=15.0 2024-06-21 02:43:08,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=309054.1666666667, ans=10.0 2024-06-21 02:43:19,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=309090.8333333333, ans=0.025 2024-06-21 02:43:21,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=309090.8333333333, ans=0.125 2024-06-21 02:43:29,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.06 vs. limit=10.0 2024-06-21 02:43:39,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=309109.1666666667, ans=10.0 2024-06-21 02:43:50,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=309127.5, ans=0.125 2024-06-21 02:43:51,898 INFO [train.py:1028] (1/2) Epoch 17, batch 6750, loss[loss=0.2575, simple_loss=0.3021, pruned_loss=0.1065, over 12274.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2681, pruned_loss=0.08288, over 2578525.64 frames. 
], batch size: 241, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:43:52,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=309145.8333333333, ans=0.0 2024-06-21 02:43:58,676 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:43:58,699 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:44:00,848 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.167e+02 2.337e+02 2.607e+02 3.712e+02, threshold=4.674e+02, percent-clipped=0.0 2024-06-21 02:44:07,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.71 vs. limit=15.0 2024-06-21 02:44:17,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2024-06-21 02:44:18,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=309200.8333333333, ans=0.2 2024-06-21 02:44:26,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=309219.1666666667, ans=0.0 2024-06-21 02:44:31,643 INFO [train.py:1028] (1/2) Epoch 17, batch 6800, loss[loss=0.2175, simple_loss=0.269, pruned_loss=0.08293, over 13195.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2693, pruned_loss=0.08307, over 2580359.42 frames. ], batch size: 67, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:44:34,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=309237.5, ans=0.025 2024-06-21 02:44:37,007 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.857e-01 2024-06-21 02:44:38,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=309255.8333333333, ans=0.0 2024-06-21 02:44:47,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=309274.1666666667, ans=0.0 2024-06-21 02:45:02,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=309310.8333333333, ans=0.125 2024-06-21 02:45:10,865 INFO [train.py:1028] (1/2) Epoch 17, batch 6850, loss[loss=0.2367, simple_loss=0.3003, pruned_loss=0.08658, over 13243.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2704, pruned_loss=0.0832, over 2583845.76 frames. 
], batch size: 63, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:45:12,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=309329.1666666667, ans=0.125 2024-06-21 02:45:20,179 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.084e+02 2.229e+02 2.468e+02 3.335e+02, threshold=4.459e+02, percent-clipped=0.0 2024-06-21 02:45:23,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=309347.5, ans=0.0 2024-06-21 02:45:25,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=309347.5, ans=0.0 2024-06-21 02:45:40,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=309384.1666666667, ans=0.125 2024-06-21 02:45:40,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=309384.1666666667, ans=0.2 2024-06-21 02:45:45,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=309402.5, ans=0.1 2024-06-21 02:45:50,036 INFO [train.py:1028] (1/2) Epoch 17, batch 6900, loss[loss=0.2256, simple_loss=0.2767, pruned_loss=0.08728, over 13302.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2718, pruned_loss=0.084, over 2586093.00 frames. ], batch size: 49, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:45:50,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=309420.8333333333, ans=0.2 2024-06-21 02:45:54,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=309420.8333333333, ans=0.125 2024-06-21 02:45:55,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=309420.8333333333, ans=0.0 2024-06-21 02:46:02,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=309439.1666666667, ans=0.0 2024-06-21 02:46:22,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=309475.8333333333, ans=0.125 2024-06-21 02:46:23,774 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2024-06-21 02:46:26,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=309475.8333333333, ans=0.05 2024-06-21 02:46:38,060 INFO [train.py:1028] (1/2) Epoch 17, batch 6950, loss[loss=0.1913, simple_loss=0.2485, pruned_loss=0.06699, over 11059.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2718, pruned_loss=0.08364, over 2580715.44 frames. 
], batch size: 16, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:46:42,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=309512.5, ans=0.0 2024-06-21 02:46:43,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=309512.5, ans=10.0 2024-06-21 02:46:46,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=309530.8333333333, ans=10.0 2024-06-21 02:46:47,305 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.090e+02 2.208e+02 2.461e+02 3.231e+02, threshold=4.415e+02, percent-clipped=0.0 2024-06-21 02:46:53,375 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:47:02,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=309567.5, ans=0.0 2024-06-21 02:47:04,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=309567.5, ans=0.0 2024-06-21 02:47:09,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=309585.8333333333, ans=0.125 2024-06-21 02:47:11,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309585.8333333333, ans=0.1 2024-06-21 02:47:12,314 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.67 vs. limit=15.0 2024-06-21 02:47:17,281 INFO [train.py:1028] (1/2) Epoch 17, batch 7000, loss[loss=0.2301, simple_loss=0.2769, pruned_loss=0.09162, over 12908.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2717, pruned_loss=0.08361, over 2576950.50 frames. ], batch size: 158, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:47:34,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=309640.8333333333, ans=0.125 2024-06-21 02:47:48,480 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.13 vs. limit=15.0 2024-06-21 02:47:58,346 INFO [train.py:1028] (1/2) Epoch 17, batch 7050, loss[loss=0.2354, simple_loss=0.2825, pruned_loss=0.09415, over 12820.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2722, pruned_loss=0.08335, over 2583524.47 frames. 
], batch size: 176, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:48:07,847 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 2.154e+02 2.339e+02 2.609e+02 3.493e+02, threshold=4.678e+02, percent-clipped=0.0 2024-06-21 02:48:08,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=309714.1666666667, ans=0.125 2024-06-21 02:48:08,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=309714.1666666667, ans=0.0 2024-06-21 02:48:14,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=309732.5, ans=0.05 2024-06-21 02:48:24,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=309750.8333333333, ans=0.125 2024-06-21 02:48:25,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=309750.8333333333, ans=0.125 2024-06-21 02:48:25,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=309750.8333333333, ans=0.035 2024-06-21 02:48:26,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=309750.8333333333, ans=12.0 2024-06-21 02:48:33,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309769.1666666667, ans=0.1 2024-06-21 02:48:37,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=309769.1666666667, ans=0.2 2024-06-21 02:48:44,563 INFO [train.py:1028] (1/2) Epoch 17, batch 7100, loss[loss=0.2451, simple_loss=0.2977, pruned_loss=0.09629, over 13191.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2723, pruned_loss=0.0836, over 2575592.62 frames. ], batch size: 112, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:48:49,231 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.26 vs. limit=15.0 2024-06-21 02:49:22,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=309860.8333333333, ans=0.0 2024-06-21 02:49:24,658 INFO [train.py:1028] (1/2) Epoch 17, batch 7150, loss[loss=0.2563, simple_loss=0.3032, pruned_loss=0.1047, over 12495.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2731, pruned_loss=0.08393, over 2574798.76 frames. 
], batch size: 202, lr: 3.41e-03, grad_scale: 64.0
2024-06-21 02:49:27,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=309879.1666666667, ans=0.125
2024-06-21 02:49:31,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=309897.5, ans=0.125
2024-06-21 02:49:34,110 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.058e+02 2.196e+02 2.401e+02 3.381e+02, threshold=4.392e+02, percent-clipped=0.0
2024-06-21 02:49:42,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=309915.8333333333, ans=0.125
2024-06-21 02:50:03,891 INFO [train.py:1028] (1/2) Epoch 17, batch 7200, loss[loss=0.2308, simple_loss=0.2849, pruned_loss=0.08835, over 13177.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2743, pruned_loss=0.08409, over 2579412.59 frames. ], batch size: 112, lr: 3.41e-03, grad_scale: 64.0
2024-06-21 02:50:04,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=309970.8333333333, ans=0.025
2024-06-21 02:50:08,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=309970.8333333333, ans=0.1
2024-06-21 02:50:12,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=309989.1666666667, ans=0.125
2024-06-21 02:50:14,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=309989.1666666667, ans=0.125
2024-06-21 02:50:16,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=309989.1666666667, ans=0.1
2024-06-21 02:50:21,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.93 vs. limit=10.0
2024-06-21 02:50:42,205 INFO [train.py:1028] (1/2) Epoch 17, batch 7250, loss[loss=0.1895, simple_loss=0.2514, pruned_loss=0.06378, over 12960.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2754, pruned_loss=0.08441, over 2579877.18 frames. ], batch size: 36, lr: 3.41e-03, grad_scale: 64.0
2024-06-21 02:50:48,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=310062.5, ans=0.125
2024-06-21 02:50:51,419 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.039e+02 2.176e+02 2.441e+02 2.972e+02, threshold=4.353e+02, percent-clipped=0.0
2024-06-21 02:50:52,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=310080.8333333333, ans=0.125
2024-06-21 02:51:11,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=310099.1666666667, ans=0.0
2024-06-21 02:51:14,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=310117.5, ans=0.0
2024-06-21 02:51:32,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=15.0
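The scaling.py:1023 Whitening lines, like the one just above (metric=9.74 vs. limit=15.0), compare a whiteness statistic of a module's activations against a limit, with a corrective term applied only when the metric exceeds it. One plausible form of such a statistic is sketched below: it equals 1.0 when the per-group channel covariance is isotropic ("white") and grows as the eigenvalue spectrum spreads out. This is an illustration of the kind of quantity being logged, not the recipe's code; the exact expression in scaling.py may differ.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels).  Returns n * ||C||_F^2 / trace(C)^2
        # per group, which is >= 1 by Cauchy-Schwarz with equality iff the
        # group covariance C is a multiple of the identity.
        num_frames, num_channels = x.shape
        k = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, k).permute(1, 0, 2)  # (groups, frames, k)
        x = x - x.mean(dim=1, keepdim=True)
        cov = x.transpose(1, 2) @ x / num_frames                   # (groups, k, k)
        tr = cov.diagonal(dim1=1, dim2=2).sum(dim=1)               # trace per group
        fro2 = (cov * cov).sum(dim=(1, 2))                         # squared Frobenius norm
        return (k * fro2 / (tr * tr + 1e-20)).mean()

    x = torch.randn(1000, 384)   # white input: metric comes out close to 1.0,
    print(whitening_metric(x))   # far below limits like the 15.0 logged above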
2024-06-21 02:51:34,400 INFO [train.py:1028] (1/2) Epoch 17, batch 7300, loss[loss=0.2315, simple_loss=0.2845, pruned_loss=0.08927, over 12973.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2762, pruned_loss=0.08481, over 2580287.54 frames. ], batch size: 36, lr: 3.40e-03, grad_scale: 64.0
2024-06-21 02:51:39,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=310154.1666666667, ans=0.125
2024-06-21 02:51:43,159 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.29 vs. limit=10.0
2024-06-21 02:51:44,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=310172.5, ans=0.2
2024-06-21 02:51:52,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=310190.8333333333, ans=0.0
2024-06-21 02:51:55,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0
2024-06-21 02:52:00,861 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.58 vs. limit=22.5
2024-06-21 02:52:02,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=310209.1666666667, ans=0.0
2024-06-21 02:52:05,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=310209.1666666667, ans=0.125
2024-06-21 02:52:08,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=310227.5, ans=0.125
2024-06-21 02:52:09,586 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0
2024-06-21 02:52:14,670 INFO [train.py:1028] (1/2) Epoch 17, batch 7350, loss[loss=0.2315, simple_loss=0.2836, pruned_loss=0.08968, over 13356.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2762, pruned_loss=0.08456, over 2581357.51 frames. ], batch size: 46, lr: 3.40e-03, grad_scale: 64.0
2024-06-21 02:52:21,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=310245.8333333333, ans=0.125
2024-06-21 02:52:23,841 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.069e+02 2.292e+02 2.446e+02 3.199e+02, threshold=4.584e+02, percent-clipped=0.0
2024-06-21 02:52:31,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=310282.5, ans=0.0
2024-06-21 02:52:34,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=310282.5, ans=0.125
2024-06-21 02:52:42,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.23 vs. limit=6.0
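The optim.py:487 warnings report the quartiles (min/Q1/median/Q3/max) of recently observed gradient norms together with the clipping threshold, and in every entry in this section the threshold is exactly Clipping_scale (2.0) times the reported median, e.g. 4.584e+02 = 2.0 * 2.292e+02 just above. A sketch of median-based clipping in that spirit follows; the class name, window size, and bookkeeping are assumptions for illustration, not icefall's optim.py.

    from collections import deque
    import torch

    class MedianClipper:
        # Clip to clipping_scale * (running median of recent grad norms),
        # matching the relation visible in the logged warnings above.
        def __init__(self, clipping_scale: float = 2.0, window: int = 1024):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)   # assumed history length
            self.n_steps = 0
            self.n_clipped = 0                  # feeds a percent-clipped figure

        def clip_(self, params):
            grads = [p.grad for p in params if p.grad is not None]
            norm = torch.sqrt(sum((g * g).sum() for g in grads)).item()
            self.norms.append(norm)
            hist = torch.tensor(list(self.norms))
            quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * quartiles[2].item()  # 2.0 * median
            self.n_steps += 1
            if norm > threshold:
                self.n_clipped += 1
                for g in grads:
                    g.mul_(threshold / norm)
            return quartiles, threshold

With a threshold at twice the median, only unusually large gradients get scaled down, which is consistent with percent-clipped staying at 0.0 throughout this stretch of training.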
2024-06-21 02:52:45,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=310319.1666666667, ans=0.2
2024-06-21 02:52:51,979 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.73 vs. limit=22.5
2024-06-21 02:52:54,402 INFO [train.py:1028] (1/2) Epoch 17, batch 7400, loss[loss=0.24, simple_loss=0.3012, pruned_loss=0.08937, over 13282.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2769, pruned_loss=0.08479, over 2587678.10 frames. ], batch size: 63, lr: 3.40e-03, grad_scale: 64.0
2024-06-21 02:53:01,278 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.84 vs. limit=15.0
2024-06-21 02:53:03,934 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.95 vs. limit=12.0
2024-06-21 02:53:16,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=310374.1666666667, ans=0.125
2024-06-21 02:53:19,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=310392.5, ans=0.125
2024-06-21 02:53:20,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=310392.5, ans=0.1
2024-06-21 02:53:23,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=310392.5, ans=0.125
2024-06-21 02:53:29,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=310410.8333333333, ans=0.1
2024-06-21 02:53:31,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=310410.8333333333, ans=0.0
2024-06-21 02:53:31,445 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.10 vs. limit=15.0
2024-06-21 02:53:33,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.07 vs. limit=15.0
2024-06-21 02:53:34,234 INFO [train.py:1028] (1/2) Epoch 17, batch 7450, loss[loss=0.1867, simple_loss=0.2454, pruned_loss=0.06395, over 12628.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2768, pruned_loss=0.08466, over 2580803.27 frames. ], batch size: 29, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 02:53:52,695 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.071e+02 2.236e+02 2.556e+02 4.142e+02, threshold=4.472e+02, percent-clipped=0.0
2024-06-21 02:54:01,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=310465.8333333333, ans=0.2
2024-06-21 02:54:11,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=310484.1666666667, ans=0.5
2024-06-21 02:54:19,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=310502.5, ans=0.2
2024-06-21 02:54:23,178 INFO [train.py:1028] (1/2) Epoch 17, batch 7500, loss[loss=0.2416, simple_loss=0.2802, pruned_loss=0.1015, over 10772.00 frames.
], tot_loss[loss=0.2245, simple_loss=0.2781, pruned_loss=0.08539, over 2578791.09 frames. ], batch size: 303, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:54:23,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.71 vs. limit=15.0 2024-06-21 02:54:33,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=310539.1666666667, ans=0.125 2024-06-21 02:54:36,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=310539.1666666667, ans=0.125 2024-06-21 02:54:39,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=310557.5, ans=0.2 2024-06-21 02:54:52,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=310575.8333333333, ans=0.125 2024-06-21 02:54:58,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=310594.1666666667, ans=0.125 2024-06-21 02:54:59,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=310594.1666666667, ans=0.125 2024-06-21 02:55:02,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=310612.5, ans=0.125 2024-06-21 02:55:02,860 INFO [train.py:1028] (1/2) Epoch 17, batch 7550, loss[loss=0.2184, simple_loss=0.2674, pruned_loss=0.08471, over 12916.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2789, pruned_loss=0.08611, over 2577830.16 frames. ], batch size: 158, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:55:12,833 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.154e+02 2.395e+02 2.681e+02 4.237e+02, threshold=4.790e+02, percent-clipped=0.0 2024-06-21 02:55:16,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=310630.8333333333, ans=0.0 2024-06-21 02:55:22,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=310649.1666666667, ans=0.125 2024-06-21 02:55:30,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=310667.5, ans=0.125 2024-06-21 02:55:33,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=310685.8333333333, ans=0.0 2024-06-21 02:55:42,336 INFO [train.py:1028] (1/2) Epoch 17, batch 7600, loss[loss=0.2341, simple_loss=0.2828, pruned_loss=0.09271, over 13210.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2797, pruned_loss=0.08641, over 2577307.04 frames. ], batch size: 83, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:55:56,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=310722.5, ans=0.125 2024-06-21 02:56:30,874 INFO [train.py:1028] (1/2) Epoch 17, batch 7650, loss[loss=0.1786, simple_loss=0.2377, pruned_loss=0.05978, over 12952.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2796, pruned_loss=0.08631, over 2573503.02 frames. 
], batch size: 33, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:56:40,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=310814.1666666667, ans=0.0 2024-06-21 02:56:41,476 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.115e+02 2.289e+02 2.506e+02 3.514e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 02:56:41,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0 2024-06-21 02:56:42,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=310814.1666666667, ans=0.2 2024-06-21 02:56:45,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=310814.1666666667, ans=0.125 2024-06-21 02:56:47,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=310832.5, ans=0.2 2024-06-21 02:56:59,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=310850.8333333333, ans=0.0 2024-06-21 02:57:02,343 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.03 vs. limit=10.0 2024-06-21 02:57:10,703 INFO [train.py:1028] (1/2) Epoch 17, batch 7700, loss[loss=0.2311, simple_loss=0.2917, pruned_loss=0.0852, over 13242.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2802, pruned_loss=0.0867, over 2569867.70 frames. ], batch size: 63, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:57:13,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=310887.5, ans=0.07 2024-06-21 02:57:17,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=310905.8333333333, ans=0.125 2024-06-21 02:57:18,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=310905.8333333333, ans=0.0 2024-06-21 02:57:26,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=310924.1666666667, ans=0.1 2024-06-21 02:57:47,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2024-06-21 02:57:48,818 INFO [train.py:1028] (1/2) Epoch 17, batch 7750, loss[loss=0.2104, simple_loss=0.267, pruned_loss=0.07693, over 13274.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2805, pruned_loss=0.08689, over 2573638.98 frames. ], batch size: 72, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:57:59,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.29 vs. limit=15.0 2024-06-21 02:57:59,131 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.069e+02 2.191e+02 2.394e+02 3.000e+02, threshold=4.381e+02, percent-clipped=0.0 2024-06-21 02:57:59,896 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.84 vs. 
limit=10.0
2024-06-21 02:58:00,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=310997.5, ans=0.2
2024-06-21 02:58:13,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=6.78 vs. limit=12.0
2024-06-21 02:58:22,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=311034.1666666667, ans=0.125
2024-06-21 02:58:36,153 INFO [train.py:1028] (1/2) Epoch 17, batch 7800, loss[loss=0.2315, simple_loss=0.2864, pruned_loss=0.08834, over 13187.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2819, pruned_loss=0.08748, over 2579179.10 frames. ], batch size: 95, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 02:58:43,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=311089.1666666667, ans=0.0
2024-06-21 02:59:01,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=12.0
2024-06-21 02:59:05,912 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0
2024-06-21 02:59:09,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311144.1666666667, ans=0.1
2024-06-21 02:59:10,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=311144.1666666667, ans=0.125
2024-06-21 02:59:13,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0
2024-06-21 02:59:14,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=311144.1666666667, ans=0.125
2024-06-21 02:59:17,237 INFO [train.py:1028] (1/2) Epoch 17, batch 7850, loss[loss=0.1683, simple_loss=0.2182, pruned_loss=0.05926, over 11187.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2818, pruned_loss=0.08738, over 2572885.04 frames. ], batch size: 16, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 02:59:27,333 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.118e+02 2.274e+02 2.526e+02 3.590e+02, threshold=4.547e+02, percent-clipped=0.0
2024-06-21 02:59:28,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=311180.8333333333, ans=0.0
2024-06-21 02:59:41,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.37 vs. limit=22.5
2024-06-21 02:59:43,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=311217.5, ans=0.025
2024-06-21 02:59:43,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=311217.5, ans=0.125
2024-06-21 02:59:51,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.26 vs. limit=10.0
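Each scaling.py:214 ScheduledFloat entry reports the current value (ans) of a named hyperparameter, dropout probabilities, skip rates, bypass scale floors and the like, as a function of batch_count, so these quantities anneal as training progresses. A piecewise-linear schedule keyed on batch count, as sketched below, is one natural reading of such entries; the class body and the example breakpoints are illustrative assumptions, not the code in scaling.py.

    import bisect

    class ScheduledFloat:
        # A float hyperparameter interpolated piecewise-linearly in
        # batch_count, clamped to the endpoint values outside the range.
        def __init__(self, *points):
            self.xs = [x for x, _ in points]
            self.ys = [y for _, y in points]

        def value(self, batch_count: float) -> float:
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # e.g. a dropout that decays to a floor of 0.1 (made-up breakpoints):
    dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p.value(311144.17))   # -> 0.1, matching ans=0.1 in the
                                        # dropout_p entries logged above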
2024-06-21 02:59:56,336 INFO [train.py:1028] (1/2) Epoch 17, batch 7900, loss[loss=0.2179, simple_loss=0.28, pruned_loss=0.07786, over 13170.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2824, pruned_loss=0.08786, over 2571598.05 frames. ], batch size: 77, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 03:00:00,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=15.0
2024-06-21 03:00:02,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=311254.1666666667, ans=0.125
2024-06-21 03:00:03,340 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.73 vs. limit=22.5
2024-06-21 03:00:10,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311272.5, ans=0.1
2024-06-21 03:00:19,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=311309.1666666667, ans=0.025
2024-06-21 03:00:20,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=311309.1666666667, ans=0.125
2024-06-21 03:00:20,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=311309.1666666667, ans=0.125
2024-06-21 03:00:40,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=311327.5, ans=0.0
2024-06-21 03:00:45,073 INFO [train.py:1028] (1/2) Epoch 17, batch 7950, loss[loss=0.243, simple_loss=0.2828, pruned_loss=0.1016, over 10740.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2825, pruned_loss=0.08784, over 2574406.52 frames. ], batch size: 304, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 03:00:55,929 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.094e+02 2.268e+02 2.488e+02 3.230e+02, threshold=4.536e+02, percent-clipped=0.0
2024-06-21 03:01:05,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=311382.5, ans=0.125
2024-06-21 03:01:07,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311382.5, ans=0.1
2024-06-21 03:01:10,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=311400.8333333333, ans=15.0
2024-06-21 03:01:11,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.42 vs. limit=22.5
2024-06-21 03:01:17,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=311419.1666666667, ans=0.2
2024-06-21 03:01:24,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=311419.1666666667, ans=0.07
2024-06-21 03:01:26,385 INFO [train.py:1028] (1/2) Epoch 17, batch 8000, loss[loss=0.1969, simple_loss=0.2596, pruned_loss=0.06712, over 12628.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2833, pruned_loss=0.08808, over 2571376.13 frames. ], batch size: 29, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 03:01:29,597 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 03:01:33,252 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0
2024-06-21 03:01:40,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.39 vs. limit=22.5
2024-06-21 03:01:40,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.97 vs. limit=10.0
2024-06-21 03:02:07,181 INFO [train.py:1028] (1/2) Epoch 17, batch 8050, loss[loss=0.2444, simple_loss=0.3036, pruned_loss=0.09262, over 13225.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2829, pruned_loss=0.08781, over 2571246.81 frames. ], batch size: 83, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 03:02:15,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=311547.5, ans=0.2
2024-06-21 03:02:17,185 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.127e+02 2.265e+02 2.528e+02 3.662e+02, threshold=4.531e+02, percent-clipped=0.0
2024-06-21 03:02:24,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=311565.8333333333, ans=0.0
2024-06-21 03:02:25,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=311565.8333333333, ans=0.0
2024-06-21 03:02:34,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=311584.1666666667, ans=0.0
2024-06-21 03:02:36,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=311584.1666666667, ans=0.2
2024-06-21 03:02:39,999 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 03:02:43,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=311602.5, ans=0.0
2024-06-21 03:02:45,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=311620.8333333333, ans=0.125
2024-06-21 03:02:45,785 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.65 vs. limit=15.0
2024-06-21 03:02:46,103 INFO [train.py:1028] (1/2) Epoch 17, batch 8100, loss[loss=0.2129, simple_loss=0.2688, pruned_loss=0.07848, over 13155.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.284, pruned_loss=0.08826, over 2576097.72 frames. ], batch size: 112, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 03:02:55,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=311639.1666666667, ans=0.0
2024-06-21 03:02:55,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=311639.1666666667, ans=0.5
2024-06-21 03:02:55,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.18 vs. limit=10.0
2024-06-21 03:02:59,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=311639.1666666667, ans=0.025
2024-06-21 03:03:21,502 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.57 vs. limit=6.0
2024-06-21 03:03:24,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0
2024-06-21 03:03:28,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=311694.1666666667, ans=0.025
2024-06-21 03:03:34,655 INFO [train.py:1028] (1/2) Epoch 17, batch 8150, loss[loss=0.2207, simple_loss=0.2691, pruned_loss=0.08616, over 13106.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.283, pruned_loss=0.08741, over 2579442.35 frames. ], batch size: 121, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 03:03:45,514 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.090e+02 2.194e+02 2.430e+02 2.940e+02, threshold=4.389e+02, percent-clipped=0.0
2024-06-21 03:03:48,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=311730.8333333333, ans=0.0
2024-06-21 03:03:50,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=311749.1666666667, ans=0.125
2024-06-21 03:03:59,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=311767.5, ans=0.1
2024-06-21 03:04:01,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.22 vs. limit=15.0
2024-06-21 03:04:05,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=311767.5, ans=0.125
2024-06-21 03:04:14,992 INFO [train.py:1028] (1/2) Epoch 17, batch 8200, loss[loss=0.2299, simple_loss=0.2865, pruned_loss=0.08666, over 13148.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2833, pruned_loss=0.0874, over 2582968.44 frames. ], batch size: 112, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 03:04:34,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=311840.8333333333, ans=0.125
2024-06-21 03:04:43,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=311859.1666666667, ans=0.125
2024-06-21 03:04:48,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311877.5, ans=0.1
2024-06-21 03:04:54,349 INFO [train.py:1028] (1/2) Epoch 17, batch 8250, loss[loss=0.2159, simple_loss=0.2779, pruned_loss=0.07689, over 13297.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.284, pruned_loss=0.08778, over 2583187.10 frames. ], batch size: 52, lr: 3.40e-03, grad_scale: 32.0
2024-06-21 03:04:55,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311895.8333333333, ans=0.1
2024-06-21 03:04:57,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=311895.8333333333, ans=0.5
2024-06-21 03:05:04,411 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.107e+02 2.277e+02 2.503e+02 3.038e+02, threshold=4.554e+02, percent-clipped=0.0
2024-06-21 03:05:08,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=311914.1666666667, ans=0.0
2024-06-21 03:05:09,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=311932.5, ans=0.125
2024-06-21 03:05:31,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311969.1666666667, ans=0.1
2024-06-21 03:05:40,988 INFO [train.py:1028] (1/2) Epoch 17, batch 8300, loss[loss=0.2169, simple_loss=0.2738, pruned_loss=0.07997, over 12973.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2827, pruned_loss=0.08723, over 2580705.48 frames. ], batch size: 102, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:05:45,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=311987.5, ans=0.2
2024-06-21 03:05:52,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=312005.8333333333, ans=0.0
2024-06-21 03:05:53,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=312005.8333333333, ans=0.1
2024-06-21 03:06:06,837 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0
2024-06-21 03:06:20,585 INFO [train.py:1028] (1/2) Epoch 17, batch 8350, loss[loss=0.26, simple_loss=0.2992, pruned_loss=0.1104, over 13194.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2833, pruned_loss=0.08731, over 2580485.89 frames. ], batch size: 112, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:06:21,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=312079.1666666667, ans=0.0
2024-06-21 03:06:29,901 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0
2024-06-21 03:06:30,868 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.163e+02 2.342e+02 2.629e+02 3.742e+02, threshold=4.683e+02, percent-clipped=0.0
2024-06-21 03:06:34,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=312097.5, ans=0.1
2024-06-21 03:06:40,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=312115.8333333333, ans=0.1
2024-06-21 03:06:46,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=312134.1666666667, ans=0.125
2024-06-21 03:07:04,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0
2024-06-21 03:07:06,116 INFO [train.py:1028] (1/2) Epoch 17, batch 8400, loss[loss=0.2196, simple_loss=0.2736, pruned_loss=0.08277, over 12965.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2832, pruned_loss=0.08742, over 2577167.89 frames. ], batch size: 39, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:07:16,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=312189.1666666667, ans=0.05
2024-06-21 03:07:29,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=312207.5, ans=0.125
2024-06-21 03:07:36,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=312225.8333333333, ans=0.125
2024-06-21 03:07:44,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=312225.8333333333, ans=0.125
2024-06-21 03:07:45,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=312225.8333333333, ans=0.125
2024-06-21 03:07:58,621 INFO [train.py:1028] (1/2) Epoch 17, batch 8450, loss[loss=0.2333, simple_loss=0.2877, pruned_loss=0.08948, over 13119.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2838, pruned_loss=0.08755, over 2579334.69 frames. ], batch size: 112, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:08:00,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=312262.5, ans=0.0
2024-06-21 03:08:05,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=312262.5, ans=0.125
2024-06-21 03:08:15,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=312280.8333333333, ans=0.1
2024-06-21 03:08:18,297 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.170e+02 2.340e+02 2.535e+02 3.087e+02, threshold=4.681e+02, percent-clipped=0.0
2024-06-21 03:08:32,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.85 vs. limit=15.0
2024-06-21 03:08:40,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0
2024-06-21 03:08:57,675 INFO [train.py:1028] (1/2) Epoch 17, batch 8500, loss[loss=0.2116, simple_loss=0.2771, pruned_loss=0.07308, over 12548.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2849, pruned_loss=0.08799, over 2577264.47 frames. ], batch size: 29, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:09:00,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=312354.1666666667, ans=0.025
2024-06-21 03:09:17,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=312390.8333333333, ans=0.125
2024-06-21 03:09:19,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.00 vs. limit=15.0
2024-06-21 03:09:28,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=12.0
2024-06-21 03:09:34,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=312409.1666666667, ans=0.2
2024-06-21 03:09:48,089 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=22.5
2024-06-21 03:09:50,946 INFO [train.py:1028] (1/2) Epoch 17, batch 8550, loss[loss=0.2375, simple_loss=0.2873, pruned_loss=0.09383, over 12476.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2848, pruned_loss=0.08782, over 2575713.29 frames. ], batch size: 22, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:09:53,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=312445.8333333333, ans=0.125
2024-06-21 03:09:56,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=312445.8333333333, ans=0.025
2024-06-21 03:09:57,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=312445.8333333333, ans=0.125
2024-06-21 03:09:57,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.92 vs. limit=12.0
2024-06-21 03:10:01,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=312464.1666666667, ans=0.025
2024-06-21 03:10:01,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=312464.1666666667, ans=0.2
2024-06-21 03:10:03,696 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.096e+02 2.193e+02 2.434e+02 3.488e+02, threshold=4.386e+02, percent-clipped=0.0
2024-06-21 03:10:33,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=312519.1666666667, ans=0.125
2024-06-21 03:10:38,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=312519.1666666667, ans=0.125
2024-06-21 03:10:41,969 INFO [train.py:1028] (1/2) Epoch 17, batch 8600, loss[loss=0.2373, simple_loss=0.2891, pruned_loss=0.09276, over 13066.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2856, pruned_loss=0.08788, over 2573692.92 frames. ], batch size: 121, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:10:42,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0
2024-06-21 03:10:43,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=312537.5, ans=0.0
2024-06-21 03:10:56,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=312555.8333333333, ans=0.025
2024-06-21 03:11:24,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=312592.5, ans=0.125
2024-06-21 03:11:47,701 INFO [train.py:1028] (1/2) Epoch 17, batch 8650, loss[loss=0.2387, simple_loss=0.2843, pruned_loss=0.09658, over 13043.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2865, pruned_loss=0.08806, over 2576614.76 frames. ], batch size: 102, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:11:58,350 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.110e+02 2.234e+02 2.522e+02 3.672e+02, threshold=4.469e+02, percent-clipped=0.0
2024-06-21 03:12:30,687 INFO [train.py:1028] (1/2) Epoch 17, batch 8700, loss[loss=0.2343, simple_loss=0.2969, pruned_loss=0.08591, over 13230.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.2867, pruned_loss=0.08849, over 2572651.33 frames. ], batch size: 59, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:12:47,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=312739.1666666667, ans=0.025
2024-06-21 03:13:10,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=312775.8333333333, ans=0.125
2024-06-21 03:13:14,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=312794.1666666667, ans=0.125
2024-06-21 03:13:17,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=312794.1666666667, ans=0.0
2024-06-21 03:13:19,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=14.37 vs. limit=15.0
2024-06-21 03:13:22,732 INFO [train.py:1028] (1/2) Epoch 17, batch 8750, loss[loss=0.2075, simple_loss=0.2644, pruned_loss=0.07533, over 13110.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2854, pruned_loss=0.08794, over 2569038.43 frames. ], batch size: 121, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:13:36,038 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.131e+02 2.293e+02 2.540e+02 3.300e+02, threshold=4.586e+02, percent-clipped=0.0
2024-06-21 03:14:02,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.83 vs. limit=15.0
2024-06-21 03:14:05,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=312885.8333333333, ans=0.125
2024-06-21 03:14:21,335 INFO [train.py:1028] (1/2) Epoch 17, batch 8800, loss[loss=0.2287, simple_loss=0.2858, pruned_loss=0.08584, over 13277.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.286, pruned_loss=0.08807, over 2574881.44 frames. ], batch size: 72, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:14:23,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=312904.1666666667, ans=0.125
2024-06-21 03:14:36,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=312904.1666666667, ans=0.125
2024-06-21 03:15:05,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=312959.1666666667, ans=0.0
2024-06-21 03:15:08,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.49 vs. limit=10.0
2024-06-21 03:15:20,752 INFO [train.py:1028] (1/2) Epoch 17, batch 8850, loss[loss=0.2458, simple_loss=0.2973, pruned_loss=0.09712, over 12426.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2865, pruned_loss=0.08847, over 2563296.64 frames. ], batch size: 202, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:15:27,979 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.85 vs. limit=22.5
2024-06-21 03:15:33,314 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.228e+02 2.381e+02 2.691e+02 3.706e+02, threshold=4.762e+02, percent-clipped=0.0
2024-06-21 03:15:46,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=313050.8333333333, ans=0.125
2024-06-21 03:15:58,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313069.1666666667, ans=0.1
2024-06-21 03:15:59,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=313069.1666666667, ans=0.2
2024-06-21 03:16:03,987 INFO [train.py:1028] (1/2) Epoch 17, batch 8900, loss[loss=0.212, simple_loss=0.2702, pruned_loss=0.07694, over 12980.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2868, pruned_loss=0.08883, over 2561304.39 frames. ], batch size: 33, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:16:21,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=313105.8333333333, ans=0.0
2024-06-21 03:16:23,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=313124.1666666667, ans=0.125
2024-06-21 03:16:54,686 INFO [train.py:1028] (1/2) Epoch 17, batch 8950, loss[loss=0.2366, simple_loss=0.296, pruned_loss=0.0886, over 12622.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2869, pruned_loss=0.08863, over 2561076.53 frames. ], batch size: 202, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:17:01,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.89 vs. limit=22.5
2024-06-21 03:17:08,185 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.099e+02 2.235e+02 2.416e+02 3.104e+02, threshold=4.471e+02, percent-clipped=0.0
2024-06-21 03:17:09,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=15.0
2024-06-21 03:17:25,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=313215.8333333333, ans=0.2
2024-06-21 03:17:36,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=313215.8333333333, ans=0.125
2024-06-21 03:17:38,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=313215.8333333333, ans=0.0
2024-06-21 03:17:46,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313234.1666666667, ans=0.1
2024-06-21 03:17:49,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=313234.1666666667, ans=0.0
2024-06-21 03:17:50,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=313252.5, ans=0.0
2024-06-21 03:18:01,792 INFO [train.py:1028] (1/2) Epoch 17, batch 9000, loss[loss=0.2189, simple_loss=0.2772, pruned_loss=0.08026, over 13293.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2874, pruned_loss=0.08857, over 2567557.22 frames. ], batch size: 46, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:18:01,793 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 03:18:14,037 INFO [train.py:1060] (1/2) Epoch 17, validation: loss=0.1873, simple_loss=0.2522, pruned_loss=0.06125, over 351949.00 frames.
2024-06-21 03:18:14,039 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 03:18:18,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.41 vs. limit=15.0
2024-06-21 03:18:20,240 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.516e-03
2024-06-21 03:18:27,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=313289.1666666667, ans=0.125
2024-06-21 03:18:30,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=313289.1666666667, ans=0.125
2024-06-21 03:18:31,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=313289.1666666667, ans=0.125
2024-06-21 03:18:37,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=313307.5, ans=0.025
2024-06-21 03:18:37,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=313307.5, ans=0.0
2024-06-21 03:18:38,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=313307.5, ans=0.125
2024-06-21 03:18:43,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=313307.5, ans=0.0
2024-06-21 03:19:07,580 INFO [train.py:1028] (1/2) Epoch 17, batch 9050, loss[loss=0.1942, simple_loss=0.2485, pruned_loss=0.06994, over 11894.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2872, pruned_loss=0.08842, over 2566866.02 frames. ], batch size: 17, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:19:20,375 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.075e+02 2.198e+02 2.447e+02 3.562e+02, threshold=4.396e+02, percent-clipped=0.0
2024-06-21 03:19:41,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=313417.5, ans=10.0
2024-06-21 03:19:54,568 INFO [train.py:1028] (1/2) Epoch 17, batch 9100, loss[loss=0.2123, simple_loss=0.2731, pruned_loss=0.07579, over 13277.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2868, pruned_loss=0.08815, over 2567121.42 frames. ], batch size: 72, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:19:57,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=313454.1666666667, ans=0.125
2024-06-21 03:20:09,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=313472.5, ans=0.125
2024-06-21 03:20:14,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=313490.8333333333, ans=0.0
2024-06-21 03:20:28,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.66 vs. limit=22.5
2024-06-21 03:20:29,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=313509.1666666667, ans=0.125
2024-06-21 03:20:36,612 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.39 vs. limit=10.0
2024-06-21 03:20:45,268 INFO [train.py:1028] (1/2) Epoch 17, batch 9150, loss[loss=0.2101, simple_loss=0.27, pruned_loss=0.07508, over 13171.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2872, pruned_loss=0.08855, over 2568396.10 frames. ], batch size: 77, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:20:58,434 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.106e+02 2.233e+02 2.420e+02 3.001e+02, threshold=4.466e+02, percent-clipped=0.0
2024-06-21 03:20:58,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=313564.1666666667, ans=0.0
2024-06-21 03:21:02,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=313564.1666666667, ans=0.0
2024-06-21 03:21:08,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=313582.5, ans=0.125
2024-06-21 03:21:13,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=313600.8333333333, ans=0.0
2024-06-21 03:21:16,076 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.15 vs. limit=10.0
2024-06-21 03:21:17,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313600.8333333333, ans=0.1
2024-06-21 03:21:18,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0
2024-06-21 03:21:22,302 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=15.0
2024-06-21 03:21:31,247 INFO [train.py:1028] (1/2) Epoch 17, batch 9200, loss[loss=0.2041, simple_loss=0.2622, pruned_loss=0.07296, over 12994.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2871, pruned_loss=0.08788, over 2571325.58 frames. ], batch size: 36, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:21:31,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=313637.5, ans=0.125
2024-06-21 03:21:45,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=313655.8333333333, ans=0.04949747468305833
2024-06-21 03:21:52,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0
2024-06-21 03:21:54,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=313674.1666666667, ans=0.1
2024-06-21 03:21:54,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=313674.1666666667, ans=0.125
2024-06-21 03:21:57,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=313674.1666666667, ans=0.0
2024-06-21 03:22:04,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=12.0
2024-06-21 03:22:06,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=313692.5, ans=0.125
2024-06-21 03:22:19,684 INFO [train.py:1028] (1/2) Epoch 17, batch 9250, loss[loss=0.2343, simple_loss=0.2983, pruned_loss=0.08516, over 13220.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2868, pruned_loss=0.08775, over 2573047.26 frames. ], batch size: 67, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:22:32,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=313747.5, ans=0.2
2024-06-21 03:22:32,887 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.071e+02 2.199e+02 2.336e+02 3.323e+02, threshold=4.399e+02, percent-clipped=0.0
2024-06-21 03:22:47,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=313765.8333333333, ans=0.025
2024-06-21 03:23:15,680 INFO [train.py:1028] (1/2) Epoch 17, batch 9300, loss[loss=0.209, simple_loss=0.2642, pruned_loss=0.07691, over 12954.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2867, pruned_loss=0.08767, over 2569371.29 frames. ], batch size: 39, lr: 3.39e-03, grad_scale: 32.0
2024-06-21 03:23:17,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=313820.8333333333, ans=0.125
2024-06-21 03:23:18,253 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.04 vs. limit=22.5
2024-06-21 03:23:52,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=313894.1666666667, ans=0.2
2024-06-21 03:23:53,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=313894.1666666667, ans=0.0
2024-06-21 03:23:57,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=313894.1666666667, ans=0.0
2024-06-21 03:24:00,872 INFO [train.py:1028] (1/2) Epoch 17, batch 9350, loss[loss=0.2276, simple_loss=0.2852, pruned_loss=0.08504, over 12659.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.287, pruned_loss=0.08818, over 2566627.62 frames. ], batch size: 22, lr: 3.38e-03, grad_scale: 32.0
2024-06-21 03:24:14,170 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.155e+02 2.381e+02 2.659e+02 4.074e+02, threshold=4.762e+02, percent-clipped=0.0
2024-06-21 03:24:19,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=13.84 vs. limit=15.0
2024-06-21 03:24:24,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=313949.1666666667, ans=0.1
2024-06-21 03:24:38,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=313967.5, ans=0.0
2024-06-21 03:24:40,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=313985.8333333333, ans=0.125
2024-06-21 03:24:41,569 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 03:24:49,176 INFO [train.py:1028] (1/2) Epoch 17, batch 9400, loss[loss=0.2324, simple_loss=0.2948, pruned_loss=0.08499, over 13290.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2872, pruned_loss=0.08807, over 2567214.40 frames. ], batch size: 52, lr: 3.38e-03, grad_scale: 32.0
2024-06-21 03:25:09,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0
2024-06-21 03:25:17,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=314059.1666666667, ans=0.125
2024-06-21 03:25:34,501 INFO [train.py:1028] (1/2) Epoch 17, batch 9450, loss[loss=0.2423, simple_loss=0.3024, pruned_loss=0.09115, over 12479.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.288, pruned_loss=0.08857, over 2567557.73 frames. ], batch size: 22, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:25:34,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=314095.8333333333, ans=0.0
2024-06-21 03:25:37,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=314095.8333333333, ans=0.125
2024-06-21 03:25:42,074 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.240e+01
2024-06-21 03:25:46,512 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.054e+02 2.206e+02 2.385e+02 3.032e+02, threshold=4.411e+02, percent-clipped=0.0
2024-06-21 03:25:53,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=314132.5, ans=0.025
2024-06-21 03:25:57,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=314132.5, ans=0.2
2024-06-21 03:26:01,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.06 vs. limit=15.0
2024-06-21 03:26:01,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=314150.8333333333, ans=0.125
2024-06-21 03:26:02,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=314150.8333333333, ans=0.125
2024-06-21 03:26:07,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=314150.8333333333, ans=0.04949747468305833
2024-06-21 03:26:08,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=314169.1666666667, ans=0.125
2024-06-21 03:26:09,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=314169.1666666667, ans=0.125
2024-06-21 03:26:17,275 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.76 vs. limit=15.0
2024-06-21 03:26:18,411 INFO [train.py:1028] (1/2) Epoch 17, batch 9500, loss[loss=0.228, simple_loss=0.2884, pruned_loss=0.08379, over 13257.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2875, pruned_loss=0.08808, over 2575975.37 frames. ], batch size: 43, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:26:18,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0
2024-06-21 03:26:32,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=314205.8333333333, ans=0.125
2024-06-21 03:26:49,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=15.0
2024-06-21 03:26:53,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=314260.8333333333, ans=0.125
2024-06-21 03:26:54,090 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.17 vs. limit=15.0
2024-06-21 03:27:01,533 INFO [train.py:1028] (1/2) Epoch 17, batch 9550, loss[loss=0.2151, simple_loss=0.2728, pruned_loss=0.07865, over 12952.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2871, pruned_loss=0.08772, over 2571092.52 frames. ], batch size: 39, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:27:03,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=314279.1666666667, ans=0.0
2024-06-21 03:27:13,308 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.068e+02 2.209e+02 2.456e+02 2.946e+02, threshold=4.418e+02, percent-clipped=0.0
2024-06-21 03:27:13,883 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.17 vs. limit=12.0
2024-06-21 03:27:51,374 INFO [train.py:1028] (1/2) Epoch 17, batch 9600, loss[loss=0.2244, simple_loss=0.2696, pruned_loss=0.08964, over 10682.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2872, pruned_loss=0.08798, over 2570511.64 frames. ], batch size: 303, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:28:00,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=314389.1666666667, ans=0.04949747468305833
2024-06-21 03:28:02,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=314389.1666666667, ans=0.0
2024-06-21 03:28:05,629 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.30 vs. limit=10.0
2024-06-21 03:28:12,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=12.0
2024-06-21 03:28:19,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=314425.8333333333, ans=0.125
2024-06-21 03:28:38,760 INFO [train.py:1028] (1/2) Epoch 17, batch 9650, loss[loss=0.2523, simple_loss=0.2952, pruned_loss=0.1047, over 13091.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2879, pruned_loss=0.08904, over 2563521.79 frames. ], batch size: 132, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:28:48,323 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.119e+02 2.324e+02 2.560e+02 3.634e+02, threshold=4.647e+02, percent-clipped=0.0
2024-06-21 03:28:55,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=314499.1666666667, ans=0.0
2024-06-21 03:28:55,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=314499.1666666667, ans=0.125
2024-06-21 03:29:05,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=314517.5, ans=0.0
2024-06-21 03:29:16,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.38 vs. limit=10.0
2024-06-21 03:29:23,261 INFO [train.py:1028] (1/2) Epoch 17, batch 9700, loss[loss=0.2337, simple_loss=0.286, pruned_loss=0.09068, over 13037.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2871, pruned_loss=0.08875, over 2559090.93 frames. ], batch size: 144, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:29:31,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=314572.5, ans=0.0
2024-06-21 03:29:37,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=314572.5, ans=0.125
2024-06-21 03:29:55,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=314609.1666666667, ans=0.0
2024-06-21 03:30:09,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=314627.5, ans=0.125
2024-06-21 03:30:12,795 INFO [train.py:1028] (1/2) Epoch 17, batch 9750, loss[loss=0.2429, simple_loss=0.2912, pruned_loss=0.09726, over 13093.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2859, pruned_loss=0.08804, over 2555123.75 frames. ], batch size: 132, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:30:15,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.09 vs. limit=10.0
2024-06-21 03:30:17,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.19 vs. limit=22.5
2024-06-21 03:30:24,129 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.066e+02 2.274e+02 2.497e+02 3.042e+02, threshold=4.547e+02, percent-clipped=0.0
2024-06-21 03:30:25,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=314664.1666666667, ans=0.0
2024-06-21 03:30:35,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=314682.5, ans=0.125
2024-06-21 03:30:36,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.60 vs. limit=15.0
2024-06-21 03:30:40,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=314700.8333333333, ans=0.025
2024-06-21 03:30:48,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=314719.1666666667, ans=0.125
2024-06-21 03:30:49,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=314719.1666666667, ans=0.0
2024-06-21 03:30:52,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.74 vs. limit=15.0
2024-06-21 03:30:53,037 INFO [train.py:1028] (1/2) Epoch 17, batch 9800, loss[loss=0.22, simple_loss=0.2819, pruned_loss=0.07903, over 12988.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2854, pruned_loss=0.08773, over 2547496.06 frames. ], batch size: 39, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:30:55,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=314737.5, ans=0.125
2024-06-21 03:31:04,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314755.8333333333, ans=0.125
2024-06-21 03:31:10,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=314774.1666666667, ans=0.125
2024-06-21 03:31:11,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=314774.1666666667, ans=0.125
2024-06-21 03:31:14,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=314774.1666666667, ans=0.125
2024-06-21 03:31:25,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.19 vs. limit=6.0
2024-06-21 03:31:31,864 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 03:31:38,901 INFO [train.py:1028] (1/2) Epoch 17, batch 9850, loss[loss=0.2092, simple_loss=0.259, pruned_loss=0.07975, over 13053.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2844, pruned_loss=0.08747, over 2539400.84 frames. ], batch size: 102, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:31:42,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=314829.1666666667, ans=15.0
2024-06-21 03:31:50,936 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.098e+02 2.218e+02 2.413e+02 3.321e+02, threshold=4.437e+02, percent-clipped=0.0
2024-06-21 03:31:59,030 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.54 vs. limit=15.0
2024-06-21 03:32:24,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=314902.5, ans=0.0
2024-06-21 03:32:26,099 INFO [train.py:1028] (1/2) Epoch 17, batch 9900, loss[loss=0.2218, simple_loss=0.2806, pruned_loss=0.08148, over 12947.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.283, pruned_loss=0.0869, over 2530895.33 frames. ], batch size: 39, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:32:44,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=314957.5, ans=0.0
2024-06-21 03:32:49,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=314957.5, ans=0.125
2024-06-21 03:33:05,110 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.98 vs. limit=15.0
2024-06-21 03:33:11,963 INFO [train.py:1028] (1/2) Epoch 17, batch 9950, loss[loss=0.2137, simple_loss=0.2807, pruned_loss=0.07333, over 12970.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2819, pruned_loss=0.08679, over 2523904.42 frames. ], batch size: 30, lr: 3.38e-03, grad_scale: 64.0
2024-06-21 03:33:24,112 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.068e+02 2.291e+02 2.525e+02 3.594e+02, threshold=4.582e+02, percent-clipped=0.0
2024-06-21 03:33:25,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.14 vs. limit=15.0
2024-06-21 03:33:37,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=315049.1666666667, ans=0.125
2024-06-21 03:33:42,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=315067.5, ans=0.2
2024-06-21 03:33:48,605 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.25 vs. limit=15.0
2024-06-21 03:33:51,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=315085.8333333333, ans=0.125
2024-06-21 03:33:55,735 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.11 vs. limit=22.5
2024-06-21 03:33:58,150 INFO [train.py:1028] (1/2) Epoch 17, batch 10000, loss[loss=0.2313, simple_loss=0.2888, pruned_loss=0.0869, over 12669.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2823, pruned_loss=0.08743, over 2486060.28 frames. ], batch size: 22, lr: 3.38e-03, grad_scale: 32.0
2024-06-21 03:34:22,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=315140.8333333333, ans=0.125
2024-06-21 03:34:43,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315195.8333333333, ans=0.1
2024-06-21 03:34:44,476 INFO [train.py:1028] (1/2) Epoch 17, batch 10050, loss[loss=0.2351, simple_loss=0.2957, pruned_loss=0.08723, over 12558.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.283, pruned_loss=0.08816, over 2443877.40 frames. ], batch size: 22, lr: 3.38e-03, grad_scale: 32.0
2024-06-21 03:34:49,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=315195.8333333333, ans=0.125
2024-06-21 03:34:55,520 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.178e+02 2.340e+02 2.506e+02 4.052e+02, threshold=4.680e+02, percent-clipped=0.0
2024-06-21 03:35:03,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=315232.5, ans=0.1
2024-06-21 03:35:03,049 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 03:35:04,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=315232.5, ans=0.025
2024-06-21 03:35:09,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315250.8333333333, ans=0.1
2024-06-21 03:35:12,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=315250.8333333333, ans=0.125
2024-06-21 03:35:22,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=315269.1666666667, ans=0.0
2024-06-21 03:35:26,259 INFO [train.py:1028] (1/2) Epoch 17, batch 10100, loss[loss=0.2105, simple_loss=0.2738, pruned_loss=0.07365, over 11599.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2824, pruned_loss=0.08765, over 2426771.21 frames. ], batch size: 17, lr: 3.38e-03, grad_scale: 32.0
2024-06-21 03:38:50,504 INFO [train.py:1028] (1/2) Epoch 18, batch 0, loss[loss=0.186, simple_loss=0.2436, pruned_loss=0.06426, over 12893.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2436, pruned_loss=0.06426, over 12893.00 frames. ], batch size: 36, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:38:50,505 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 03:38:58,571 INFO [train.py:1060] (1/2) Epoch 18, validation: loss=0.1887, simple_loss=0.2537, pruned_loss=0.0619, over 351949.00 frames.
2024-06-21 03:38:58,572 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 03:39:03,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=315318.6666666667, ans=0.125
2024-06-21 03:39:25,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=315355.3333333333, ans=0.0
2024-06-21 03:39:31,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=315355.3333333333, ans=0.2
2024-06-21 03:39:42,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=315373.6666666667, ans=0.2
2024-06-21 03:39:49,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.54 vs. limit=12.0
2024-06-21 03:39:54,028 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.973e+02 2.112e+02 2.294e+02 2.900e+02, threshold=4.224e+02, percent-clipped=0.0
2024-06-21 03:39:57,290 INFO [train.py:1028] (1/2) Epoch 18, batch 50, loss[loss=0.2207, simple_loss=0.2737, pruned_loss=0.08383, over 12594.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.266, pruned_loss=0.08136, over 574344.64 frames. ], batch size: 29, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:39:57,290 INFO [train.py:1028] (1/2) Epoch 18, batch 50, loss[loss=0.2207, simple_loss=0.2737, pruned_loss=0.08383, over 12594.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.266, pruned_loss=0.08136, over 574344.64 frames. ], batch size: 29, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:39:58,936 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.43 vs. limit=15.0
2024-06-21 03:40:15,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=315428.6666666667, ans=0.125
2024-06-21 03:40:43,453 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.89 vs. limit=15.0
2024-06-21 03:40:43,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=315483.6666666667, ans=0.025
2024-06-21 03:40:44,348 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.34 vs. limit=10.0
2024-06-21 03:40:49,973 INFO [train.py:1028] (1/2) Epoch 18, batch 100, loss[loss=0.1933, simple_loss=0.2523, pruned_loss=0.06712, over 13335.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.263, pruned_loss=0.07994, over 1016863.15 frames. ], batch size: 46, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:40:50,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=315502.0, ans=0.1
2024-06-21 03:41:07,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=315538.6666666667, ans=0.125
2024-06-21 03:41:09,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.03 vs. limit=15.0
2024-06-21 03:41:36,870 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.69 vs. limit=15.0
2024-06-21 03:41:37,846 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.10 vs. limit=15.0
2024-06-21 03:41:40,140 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 2.014e+02 2.173e+02 2.471e+02 3.470e+02, threshold=4.347e+02, percent-clipped=0.0
2024-06-21 03:41:40,715 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.09 vs. limit=10.0
2024-06-21 03:41:41,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=315575.3333333333, ans=0.2
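
NOTE: The scaling.py:1023 "Whitening" lines report how far a module's channel covariance is from isotropic; a penalty is applied only when the metric exceeds its limit. A rough sketch of a metric with the right behavior, assumed from these logs rather than copied from the Whiten module: for covariance C over d channels, metric = d * tr(C @ C) / tr(C)**2, which equals 1.0 for a perfectly white covariance and grows as variance concentrates in fewer directions.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations from one module.
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]
    d = cov.shape[0]
    return (d * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()
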
2024-06-21 03:41:43,421 INFO [train.py:1028] (1/2) Epoch 18, batch 150, loss[loss=0.2155, simple_loss=0.2774, pruned_loss=0.07678, over 12711.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2633, pruned_loss=0.07912, over 1364632.65 frames. ], batch size: 29, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:41:51,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=315612.0, ans=0.0
2024-06-21 03:41:54,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=315612.0, ans=0.0
2024-06-21 03:41:55,735 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.43 vs. limit=15.0
2024-06-21 03:42:05,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=315630.3333333333, ans=0.0
2024-06-21 03:42:08,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=315630.3333333333, ans=0.125
2024-06-21 03:42:15,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315648.6666666667, ans=0.1
2024-06-21 03:42:34,085 INFO [train.py:1028] (1/2) Epoch 18, batch 200, loss[loss=0.2445, simple_loss=0.2883, pruned_loss=0.1003, over 12507.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.264, pruned_loss=0.0797, over 1633733.07 frames. ], batch size: 202, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:42:43,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=315703.6666666667, ans=0.125
2024-06-21 03:42:55,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=315722.0, ans=0.05
2024-06-21 03:43:02,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315740.3333333333, ans=0.1
2024-06-21 03:43:07,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=315740.3333333333, ans=0.125
2024-06-21 03:43:08,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=315740.3333333333, ans=0.125
2024-06-21 03:43:18,739 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 1.985e+02 2.121e+02 2.296e+02 2.873e+02, threshold=4.243e+02, percent-clipped=0.0
2024-06-21 03:43:21,981 INFO [train.py:1028] (1/2) Epoch 18, batch 250, loss[loss=0.2026, simple_loss=0.2489, pruned_loss=0.07815, over 12969.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2633, pruned_loss=0.07943, over 1845400.25 frames. ], batch size: 144, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:43:31,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=315777.0, ans=0.2
2024-06-21 03:43:47,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=315795.3333333333, ans=0.0
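
NOTE: "batch size" in these records swings between 17 and 202 because batches are packed to a fixed total-duration budget rather than a fixed number of utterances, so a batch of short cuts holds many more of them. A toy version of that duration-budget packing (the real DynamicBucketingSampler additionally groups cuts of similar duration into buckets first); the 550-second budget here is illustrative.

def pack_by_duration(durations, max_duration: float = 550.0):
    # Greedily fill each batch until adding one more cut
    # would exceed the duration budget.
    batch, total = [], 0.0
    for d in durations:
        if batch and total + d > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(d)
        total += d
    if batch:
        yield batch
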
2024-06-21 03:44:21,762 INFO [train.py:1028] (1/2) Epoch 18, batch 300, loss[loss=0.2291, simple_loss=0.2736, pruned_loss=0.09228, over 13164.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2636, pruned_loss=0.07973, over 2008230.35 frames. ], batch size: 112, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:45:11,333 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.010e+02 2.150e+02 2.373e+02 3.036e+02, threshold=4.299e+02, percent-clipped=0.0
2024-06-21 03:45:11,884 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=15.0
2024-06-21 03:45:14,539 INFO [train.py:1028] (1/2) Epoch 18, batch 350, loss[loss=0.2113, simple_loss=0.2728, pruned_loss=0.07492, over 12982.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2631, pruned_loss=0.07926, over 2138718.47 frames. ], batch size: 33, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:45:48,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=316015.3333333333, ans=0.2
2024-06-21 03:45:56,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=316033.6666666667, ans=15.0
2024-06-21 03:46:04,956 INFO [train.py:1028] (1/2) Epoch 18, batch 400, loss[loss=0.196, simple_loss=0.2548, pruned_loss=0.06859, over 13312.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2629, pruned_loss=0.0789, over 2240259.63 frames. ], batch size: 63, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:46:46,850 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.978e+02 2.083e+02 2.198e+02 3.635e+02, threshold=4.166e+02, percent-clipped=0.0
2024-06-21 03:46:48,204 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 03:46:58,386 INFO [train.py:1028] (1/2) Epoch 18, batch 450, loss[loss=0.197, simple_loss=0.2515, pruned_loss=0.07126, over 13183.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2629, pruned_loss=0.07876, over 2314176.18 frames. ], batch size: 67, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:46:59,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.36 vs. limit=15.0
2024-06-21 03:47:40,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=316198.6666666667, ans=0.2
2024-06-21 03:47:54,683 INFO [train.py:1028] (1/2) Epoch 18, batch 500, loss[loss=0.2125, simple_loss=0.2557, pruned_loss=0.08465, over 13138.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2631, pruned_loss=0.07864, over 2376133.39 frames. ], batch size: 121, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:48:01,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=316235.3333333333, ans=0.125
2024-06-21 03:48:03,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=316253.6666666667, ans=0.125
2024-06-21 03:48:13,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=316253.6666666667, ans=0.05
2024-06-21 03:48:13,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316253.6666666667, ans=0.1
2024-06-21 03:48:36,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=316308.6666666667, ans=0.0
2024-06-21 03:48:38,213 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 1.997e+02 2.144e+02 2.377e+02 3.208e+02, threshold=4.288e+02, percent-clipped=0.0
2024-06-21 03:48:38,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=316308.6666666667, ans=0.0
2024-06-21 03:48:41,201 INFO [train.py:1028] (1/2) Epoch 18, batch 550, loss[loss=0.1944, simple_loss=0.243, pruned_loss=0.07295, over 12950.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2624, pruned_loss=0.07821, over 2420320.96 frames. ], batch size: 158, lr: 3.28e-03, grad_scale: 32.0
2024-06-21 03:48:59,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=316345.3333333333, ans=0.125
2024-06-21 03:49:04,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=316363.6666666667, ans=0.05
2024-06-21 03:49:05,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=316363.6666666667, ans=0.125
2024-06-21 03:49:08,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=316363.6666666667, ans=0.125
2024-06-21 03:49:11,075 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 03:49:23,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=316400.3333333333, ans=0.025
2024-06-21 03:49:30,135 INFO [train.py:1028] (1/2) Epoch 18, batch 600, loss[loss=0.2117, simple_loss=0.2584, pruned_loss=0.08255, over 12991.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2625, pruned_loss=0.07783, over 2458290.24 frames. ], batch size: 144, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:49:32,083 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.29 vs. limit=6.0
2024-06-21 03:49:33,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=316418.6666666667, ans=0.125
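
NOTE: The scaling.py:214 lines print ScheduledFloat values: regularization hyperparameters (dropout_p, skip rates, balancer probabilities) that are functions of batch_count rather than constants, typically relaxing as training progresses. A minimal re-implementation of the idea, with made-up breakpoints:

import bisect

class ScheduledFloatSketch:
    """Piecewise-linear schedule over batch_count, clamped at the ends."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count.
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))  # hypothetical schedule
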
2024-06-21 03:49:44,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5
2024-06-21 03:49:55,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=316455.3333333333, ans=0.0
2024-06-21 03:50:07,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=316473.6666666667, ans=0.125
2024-06-21 03:50:19,528 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 1.935e+02 2.075e+02 2.295e+02 3.429e+02, threshold=4.150e+02, percent-clipped=0.0
2024-06-21 03:50:22,022 INFO [train.py:1028] (1/2) Epoch 18, batch 650, loss[loss=0.2133, simple_loss=0.2694, pruned_loss=0.07858, over 13209.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2629, pruned_loss=0.07753, over 2490518.67 frames. ], batch size: 59, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:50:29,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=316528.6666666667, ans=0.0
2024-06-21 03:51:03,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=316565.3333333333, ans=0.125
2024-06-21 03:51:15,201 INFO [train.py:1028] (1/2) Epoch 18, batch 700, loss[loss=0.2103, simple_loss=0.2613, pruned_loss=0.07969, over 13254.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2623, pruned_loss=0.07761, over 2512687.08 frames. ], batch size: 46, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:51:42,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=316638.6666666667, ans=0.1
2024-06-21 03:51:47,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=316657.0, ans=0.025
2024-06-21 03:52:00,678 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.943e+02 2.100e+02 2.255e+02 2.773e+02, threshold=4.199e+02, percent-clipped=0.0
2024-06-21 03:52:03,309 INFO [train.py:1028] (1/2) Epoch 18, batch 750, loss[loss=0.2092, simple_loss=0.2662, pruned_loss=0.07611, over 13270.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2627, pruned_loss=0.07745, over 2527734.69 frames. ], batch size: 63, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:52:14,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316712.0, ans=0.1
2024-06-21 03:52:22,937 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.79 vs. limit=15.0
2024-06-21 03:52:23,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=316730.3333333333, ans=0.2
2024-06-21 03:52:24,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.41 vs. limit=15.0
2024-06-21 03:52:32,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=316767.0, ans=0.125
2024-06-21 03:52:32,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=316767.0, ans=0.125
2024-06-21 03:52:40,270 INFO [train.py:1028] (1/2) Epoch 18, batch 800, loss[loss=0.2015, simple_loss=0.2618, pruned_loss=0.07062, over 12889.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2624, pruned_loss=0.07759, over 2540228.78 frames. ], batch size: 36, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:52:41,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=316785.3333333333, ans=0.0
2024-06-21 03:53:06,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=316822.0, ans=0.125
2024-06-21 03:53:30,011 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.965e+02 2.084e+02 2.227e+02 2.823e+02, threshold=4.168e+02, percent-clipped=0.0
2024-06-21 03:53:33,393 INFO [train.py:1028] (1/2) Epoch 18, batch 850, loss[loss=0.1935, simple_loss=0.2509, pruned_loss=0.06798, over 13139.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2623, pruned_loss=0.07735, over 2550876.23 frames. ], batch size: 95, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:53:34,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=316877.0, ans=0.125
2024-06-21 03:53:34,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=316877.0, ans=0.0
2024-06-21 03:53:43,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5
2024-06-21 03:54:03,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=316913.6666666667, ans=0.09899494936611666
2024-06-21 03:54:11,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=316932.0, ans=0.125
2024-06-21 03:54:17,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=316950.3333333333, ans=0.0
2024-06-21 03:54:25,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=316950.3333333333, ans=0.1
2024-06-21 03:54:26,109 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.27 vs. limit=15.0
2024-06-21 03:54:28,694 INFO [train.py:1028] (1/2) Epoch 18, batch 900, loss[loss=0.1939, simple_loss=0.251, pruned_loss=0.06836, over 12936.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2618, pruned_loss=0.07738, over 2556184.69 frames. ], batch size: 36, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:54:28,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=316968.6666666667, ans=0.0
2024-06-21 03:54:36,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=316968.6666666667, ans=0.125
2024-06-21 03:55:06,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=317023.6666666667, ans=0.1
2024-06-21 03:55:15,350 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.009e+02 2.079e+02 2.246e+02 2.911e+02, threshold=4.159e+02, percent-clipped=0.0
2024-06-21 03:55:18,314 INFO [train.py:1028] (1/2) Epoch 18, batch 950, loss[loss=0.205, simple_loss=0.267, pruned_loss=0.07144, over 12995.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2623, pruned_loss=0.07782, over 2559496.39 frames. ], batch size: 39, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:55:22,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=317060.3333333333, ans=0.025
2024-06-21 03:55:31,188 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0
2024-06-21 03:55:49,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=317115.3333333333, ans=0.0
2024-06-21 03:55:50,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=317115.3333333333, ans=0.035
2024-06-21 03:55:51,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=317115.3333333333, ans=0.2
2024-06-21 03:55:56,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=317115.3333333333, ans=0.2
2024-06-21 03:56:13,233 INFO [train.py:1028] (1/2) Epoch 18, batch 1000, loss[loss=0.2295, simple_loss=0.2832, pruned_loss=0.08793, over 13299.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2625, pruned_loss=0.07828, over 2560651.46 frames. ], batch size: 49, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:56:17,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=317152.0, ans=0.125
2024-06-21 03:56:33,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=317170.3333333333, ans=0.0
2024-06-21 03:56:35,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=317188.6666666667, ans=0.125
2024-06-21 03:56:36,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=317188.6666666667, ans=0.0
2024-06-21 03:56:53,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=317207.0, ans=0.125
2024-06-21 03:57:05,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=317225.3333333333, ans=0.1
2024-06-21 03:57:08,965 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.027e+02 2.140e+02 2.367e+02 3.053e+02, threshold=4.280e+02, percent-clipped=0.0
2024-06-21 03:57:10,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=317225.3333333333, ans=0.2
2024-06-21 03:57:10,692 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
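
NOTE: The "tot_loss[... over N frames]" figures are not whole-epoch averages: the fractional, slowly saturating frame counts (574344.64, 1016863.15, ... levelling off near 2.6M) suggest a decayed running sum of per-batch loss and frame totals, printed as sum/frames. A guess at that bookkeeping; the decay constant is chosen only to reproduce saturation around 2.6M frames at roughly 13k frames per batch.

class RunningLoss:
    def __init__(self, decay: float = 0.995):
        self.decay = decay  # per-batch decay of the accumulated sums
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
        self.frames = self.frames * self.decay + batch_frames

    @property
    def value(self) -> float:
        # What the log would print as tot_loss.
        return self.loss_sum / self.frames
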
2024-06-21 03:57:12,719 INFO [train.py:1028] (1/2) Epoch 18, batch 1050, loss[loss=0.2034, simple_loss=0.2566, pruned_loss=0.07508, over 13178.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2632, pruned_loss=0.07846, over 2563081.67 frames. ], batch size: 77, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:57:16,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=317243.6666666667, ans=0.2
2024-06-21 03:57:16,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=317243.6666666667, ans=10.0
2024-06-21 03:57:56,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0
2024-06-21 03:57:59,579 INFO [train.py:1028] (1/2) Epoch 18, batch 1100, loss[loss=0.2269, simple_loss=0.2825, pruned_loss=0.08565, over 13252.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2638, pruned_loss=0.07851, over 2568489.72 frames. ], batch size: 52, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:58:00,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=317335.3333333333, ans=0.2
2024-06-21 03:58:06,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=317335.3333333333, ans=0.0
2024-06-21 03:58:14,885 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=15.0
2024-06-21 03:58:15,632 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=22.5
2024-06-21 03:58:33,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=317390.3333333333, ans=0.125
2024-06-21 03:58:44,199 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 1.958e+02 2.069e+02 2.214e+02 2.943e+02, threshold=4.138e+02, percent-clipped=0.0
2024-06-21 03:58:47,398 INFO [train.py:1028] (1/2) Epoch 18, batch 1150, loss[loss=0.1973, simple_loss=0.2527, pruned_loss=0.07101, over 13304.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2636, pruned_loss=0.07858, over 2570204.22 frames. ], batch size: 52, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:58:49,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0
2024-06-21 03:59:08,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=317445.3333333333, ans=0.2
2024-06-21 03:59:11,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=317445.3333333333, ans=0.125
2024-06-21 03:59:30,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0
2024-06-21 03:59:39,325 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.28 vs. limit=15.0
2024-06-21 03:59:55,681 INFO [train.py:1028] (1/2) Epoch 18, batch 1200, loss[loss=0.2208, simple_loss=0.2759, pruned_loss=0.08291, over 13117.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2639, pruned_loss=0.07919, over 2572368.28 frames. ], batch size: 77, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 03:59:58,645 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.06 vs. limit=15.0
2024-06-21 04:00:06,882 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.04 vs. limit=15.0
2024-06-21 04:00:12,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=317537.0, ans=0.5
2024-06-21 04:00:15,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=317555.3333333333, ans=0.0
2024-06-21 04:00:23,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=317555.3333333333, ans=0.025
2024-06-21 04:00:25,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=317573.6666666667, ans=0.0
2024-06-21 04:00:25,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.83 vs. limit=22.5
2024-06-21 04:00:32,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=317573.6666666667, ans=0.125
2024-06-21 04:00:35,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=317592.0, ans=0.125
2024-06-21 04:00:35,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=317592.0, ans=0.125
2024-06-21 04:00:41,799 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 1.994e+02 2.072e+02 2.262e+02 2.890e+02, threshold=4.144e+02, percent-clipped=0.0
2024-06-21 04:00:44,566 INFO [train.py:1028] (1/2) Epoch 18, batch 1250, loss[loss=0.21, simple_loss=0.2676, pruned_loss=0.07622, over 13190.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.263, pruned_loss=0.07856, over 2582112.70 frames. ], batch size: 112, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 04:00:59,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=317628.6666666667, ans=0.025
2024-06-21 04:01:33,982 INFO [train.py:1028] (1/2) Epoch 18, batch 1300, loss[loss=0.2298, simple_loss=0.2788, pruned_loss=0.09036, over 12813.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2636, pruned_loss=0.07884, over 2582562.76 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 04:01:53,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=317738.6666666667, ans=0.125
2024-06-21 04:01:59,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=317757.0, ans=0.125
2024-06-21 04:02:17,541 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 1.974e+02 2.072e+02 2.201e+02 3.078e+02, threshold=4.143e+02, percent-clipped=0.0
2024-06-21 04:02:20,476 INFO [train.py:1028] (1/2) Epoch 18, batch 1350, loss[loss=0.2067, simple_loss=0.2656, pruned_loss=0.07392, over 13230.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2636, pruned_loss=0.07865, over 2584853.69 frames. ], batch size: 59, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 04:02:32,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=317793.6666666667, ans=0.125
2024-06-21 04:02:51,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.74 vs. limit=15.0
2024-06-21 04:03:27,227 INFO [train.py:1028] (1/2) Epoch 18, batch 1400, loss[loss=0.216, simple_loss=0.2723, pruned_loss=0.07982, over 12549.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2639, pruned_loss=0.07881, over 2586256.03 frames. ], batch size: 25, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 04:03:32,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=317885.3333333333, ans=0.0
2024-06-21 04:03:34,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=317885.3333333333, ans=0.0
2024-06-21 04:03:46,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.84 vs. limit=15.0
2024-06-21 04:04:03,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=317958.6666666667, ans=0.2
2024-06-21 04:04:07,886 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.979e+02 2.093e+02 2.234e+02 2.831e+02, threshold=4.186e+02, percent-clipped=0.0
2024-06-21 04:04:11,002 INFO [train.py:1028] (1/2) Epoch 18, batch 1450, loss[loss=0.196, simple_loss=0.2463, pruned_loss=0.07282, over 13091.00 frames. ], tot_loss[loss=0.211, simple_loss=0.264, pruned_loss=0.079, over 2586186.10 frames. ], batch size: 121, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 04:04:12,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=317977.0, ans=0.125
2024-06-21 04:04:29,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=317995.3333333333, ans=0.0
2024-06-21 04:04:29,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=317995.3333333333, ans=0.125
2024-06-21 04:04:57,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318050.3333333333, ans=0.1
2024-06-21 04:05:00,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=318050.3333333333, ans=0.125
2024-06-21 04:05:01,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=318068.6666666667, ans=0.125
2024-06-21 04:05:02,491 INFO [train.py:1028] (1/2) Epoch 18, batch 1500, loss[loss=0.204, simple_loss=0.2588, pruned_loss=0.0746, over 13270.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2645, pruned_loss=0.07921, over 2589173.01 frames. ], batch size: 83, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 04:05:04,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.27 vs. limit=10.0
2024-06-21 04:05:04,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.04 vs. limit=15.0
2024-06-21 04:05:06,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=318068.6666666667, ans=0.0
2024-06-21 04:05:10,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=318087.0, ans=0.09899494936611666
2024-06-21 04:05:13,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=318087.0, ans=0.125
2024-06-21 04:05:14,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=318087.0, ans=0.125
2024-06-21 04:05:16,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=318087.0, ans=0.125
2024-06-21 04:05:41,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=318123.6666666667, ans=0.0
2024-06-21 04:05:53,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=318142.0, ans=0.1
2024-06-21 04:05:53,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=318142.0, ans=0.1
2024-06-21 04:05:55,277 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 1.979e+02 2.158e+02 2.473e+02 3.670e+02, threshold=4.315e+02, percent-clipped=0.0
2024-06-21 04:05:57,658 INFO [train.py:1028] (1/2) Epoch 18, batch 1550, loss[loss=0.2171, simple_loss=0.2606, pruned_loss=0.08677, over 13013.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2651, pruned_loss=0.07957, over 2584206.68 frames. ], batch size: 102, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 04:06:10,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=318160.3333333333, ans=0.125
2024-06-21 04:06:11,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=318160.3333333333, ans=0.2
2024-06-21 04:06:17,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318178.6666666667, ans=0.1
2024-06-21 04:06:21,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=318178.6666666667, ans=0.0
2024-06-21 04:06:25,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=318197.0, ans=0.0
2024-06-21 04:06:26,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=318197.0, ans=0.125
2024-06-21 04:06:34,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.79 vs. limit=15.0
2024-06-21 04:06:39,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=318215.3333333333, ans=0.0
2024-06-21 04:06:45,958 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0
2024-06-21 04:06:55,346 INFO [train.py:1028] (1/2) Epoch 18, batch 1600, loss[loss=0.2155, simple_loss=0.2706, pruned_loss=0.08023, over 13159.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2647, pruned_loss=0.07943, over 2579782.43 frames. ], batch size: 77, lr: 3.27e-03, grad_scale: 32.0
2024-06-21 04:06:59,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=318252.0, ans=0.1
2024-06-21 04:07:09,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=318270.3333333333, ans=0.0
2024-06-21 04:07:23,088 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 04:07:39,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=318325.3333333333, ans=0.0
2024-06-21 04:07:41,433 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 1.961e+02 2.083e+02 2.311e+02 3.038e+02, threshold=4.165e+02, percent-clipped=0.0
2024-06-21 04:07:44,039 INFO [train.py:1028] (1/2) Epoch 18, batch 1650, loss[loss=0.208, simple_loss=0.2617, pruned_loss=0.07722, over 13114.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2651, pruned_loss=0.07981, over 2575350.09 frames. ], batch size: 95, lr: 3.26e-03, grad_scale: 32.0
2024-06-21 04:07:45,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=318343.6666666667, ans=0.125
2024-06-21 04:07:48,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=318343.6666666667, ans=0.1
2024-06-21 04:07:56,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=318362.0, ans=0.0
2024-06-21 04:08:15,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=318398.6666666667, ans=0.125
2024-06-21 04:08:15,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=318398.6666666667, ans=0.125
2024-06-21 04:08:42,623 INFO [train.py:1028] (1/2) Epoch 18, batch 1700, loss[loss=0.2192, simple_loss=0.2814, pruned_loss=0.07847, over 12841.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2643, pruned_loss=0.07912, over 2581000.52 frames. ], batch size: 26, lr: 3.26e-03, grad_scale: 32.0
2024-06-21 04:09:04,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=318472.0, ans=0.07
2024-06-21 04:09:16,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.03 vs. limit=15.0
2024-06-21 04:09:31,512 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.963e+02 2.125e+02 2.422e+02 3.069e+02, threshold=4.250e+02, percent-clipped=0.0
2024-06-21 04:09:35,004 INFO [train.py:1028] (1/2) Epoch 18, batch 1750, loss[loss=0.2257, simple_loss=0.285, pruned_loss=0.08325, over 12503.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2649, pruned_loss=0.07928, over 2581918.69 frames. ], batch size: 22, lr: 3.26e-03, grad_scale: 32.0
2024-06-21 04:09:37,281 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 04:09:47,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.93 vs. limit=22.5
2024-06-21 04:09:55,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=318563.6666666667, ans=0.125
2024-06-21 04:09:58,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=318563.6666666667, ans=0.0
2024-06-21 04:10:00,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=318563.6666666667, ans=0.0
2024-06-21 04:10:12,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=318582.0, ans=0.2
2024-06-21 04:10:17,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=318600.3333333333, ans=0.125
2024-06-21 04:10:24,417 INFO [train.py:1028] (1/2) Epoch 18, batch 1800, loss[loss=0.2134, simple_loss=0.2658, pruned_loss=0.08048, over 13167.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2646, pruned_loss=0.07925, over 2582077.72 frames. ], batch size: 67, lr: 3.26e-03, grad_scale: 32.0
2024-06-21 04:10:27,327 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=15.0
2024-06-21 04:10:31,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=318618.6666666667, ans=0.125
2024-06-21 04:10:40,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=318637.0, ans=0.125
2024-06-21 04:10:44,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=318655.3333333333, ans=0.125
2024-06-21 04:10:52,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.33 vs. limit=22.5
2024-06-21 04:10:59,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=318673.6666666667, ans=0.0
2024-06-21 04:11:04,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=318673.6666666667, ans=0.0
2024-06-21 04:11:04,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=318673.6666666667, ans=0.125
2024-06-21 04:11:12,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=318692.0, ans=0.0
2024-06-21 04:11:13,269 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.986e+02 2.122e+02 2.342e+02 2.807e+02, threshold=4.244e+02, percent-clipped=0.0
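
NOTE: "grad_scale" in the train.py records tracks the dynamic fp16 loss scale: it is halved when a step overflows and grown back after a run of clean steps, which matches it dropping from 64.0 to 32.0 around epoch 17, batch 10000 and recovering to 64.0 at batch 1850 just below. A standard PyTorch AMP skeleton with the same behavior (the recipe wraps this in its own training loop):

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=64.0, growth_interval=2000)

def train_step(model, optimizer, loss_fn, features, targets):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = loss_fn(model(features), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skipped internally if gradients overflowed
    scaler.update()         # halves the scale on overflow, grows it later
    return loss.detach(), scaler.get_scale()
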
2024-06-21 04:11:16,300 INFO [train.py:1028] (1/2) Epoch 18, batch 1850, loss[loss=0.1895, simple_loss=0.2482, pruned_loss=0.06539, over 13259.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2647, pruned_loss=0.07933, over 2583589.08 frames. ], batch size: 83, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:11:42,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318747.0, ans=0.1
2024-06-21 04:11:56,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.64 vs. limit=22.5
2024-06-21 04:11:58,069 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.59 vs. limit=15.0
2024-06-21 04:11:59,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=318783.6666666667, ans=0.0
2024-06-21 04:12:03,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=318783.6666666667, ans=0.125
2024-06-21 04:12:08,104 INFO [train.py:1028] (1/2) Epoch 18, batch 1900, loss[loss=0.2128, simple_loss=0.2614, pruned_loss=0.08214, over 13119.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2642, pruned_loss=0.07929, over 2585859.97 frames. ], batch size: 95, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:12:43,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318838.6666666667, ans=0.1
2024-06-21 04:12:47,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=318857.0, ans=0.0
2024-06-21 04:13:02,779 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.945e+02 2.054e+02 2.248e+02 2.808e+02, threshold=4.108e+02, percent-clipped=0.0
2024-06-21 04:13:04,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=318875.3333333333, ans=0.02
2024-06-21 04:13:06,537 INFO [train.py:1028] (1/2) Epoch 18, batch 1950, loss[loss=0.2024, simple_loss=0.2631, pruned_loss=0.07087, over 13244.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2636, pruned_loss=0.07926, over 2591410.64 frames. ], batch size: 52, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:13:10,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=318893.6666666667, ans=0.125
2024-06-21 04:13:28,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=318930.3333333333, ans=0.125
2024-06-21 04:13:31,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=318930.3333333333, ans=0.125
2024-06-21 04:13:32,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.37 vs. limit=15.0
2024-06-21 04:13:35,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=318948.6666666667, ans=15.0
2024-06-21 04:13:43,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=318967.0, ans=0.125
2024-06-21 04:13:46,135 INFO [train.py:1028] (1/2) Epoch 18, batch 2000, loss[loss=0.2409, simple_loss=0.2985, pruned_loss=0.09163, over 12692.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2632, pruned_loss=0.079, over 2587745.01 frames. ], batch size: 22, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:14:05,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=319022.0, ans=0.125
2024-06-21 04:14:12,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=319040.3333333333, ans=10.0
2024-06-21 04:14:23,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=319040.3333333333, ans=0.1
2024-06-21 04:14:25,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=319058.6666666667, ans=0.05
2024-06-21 04:14:32,451 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.083e+02 2.254e+02 2.464e+02 3.221e+02, threshold=4.509e+02, percent-clipped=0.0
2024-06-21 04:14:35,540 INFO [train.py:1028] (1/2) Epoch 18, batch 2050, loss[loss=0.2147, simple_loss=0.2688, pruned_loss=0.08029, over 12619.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2637, pruned_loss=0.07938, over 2583782.09 frames. ], batch size: 29, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:14:59,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=319113.6666666667, ans=0.2
2024-06-21 04:15:02,237 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0
2024-06-21 04:15:31,705 INFO [train.py:1028] (1/2) Epoch 18, batch 2100, loss[loss=0.2187, simple_loss=0.2795, pruned_loss=0.07894, over 13166.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2642, pruned_loss=0.07942, over 2586226.62 frames. ], batch size: 59, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:15:35,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=319168.6666666667, ans=0.2
2024-06-21 04:15:35,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=319168.6666666667, ans=0.025
2024-06-21 04:15:40,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=319168.6666666667, ans=0.025
2024-06-21 04:15:42,080 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.02 vs. limit=22.5
2024-06-21 04:15:50,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=319187.0, ans=0.025
2024-06-21 04:15:52,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.44 vs. limit=22.5
2024-06-21 04:15:56,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=319205.3333333333, ans=0.125
2024-06-21 04:16:17,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=319242.0, ans=0.125
2024-06-21 04:16:20,942 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.007e+02 2.126e+02 2.379e+02 3.400e+02, threshold=4.252e+02, percent-clipped=0.0
2024-06-21 04:16:22,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=319242.0, ans=0.125
2024-06-21 04:16:24,136 INFO [train.py:1028] (1/2) Epoch 18, batch 2150, loss[loss=0.2265, simple_loss=0.2839, pruned_loss=0.08452, over 13278.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2641, pruned_loss=0.07948, over 2589510.73 frames. ], batch size: 52, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:16:27,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=319260.3333333333, ans=0.0
2024-06-21 04:16:37,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.28 vs. limit=15.0
2024-06-21 04:16:38,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=319278.6666666667, ans=0.2
2024-06-21 04:16:40,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=319278.6666666667, ans=0.1
2024-06-21 04:16:43,376 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.09 vs. limit=15.0
2024-06-21 04:16:47,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=319297.0, ans=0.125
2024-06-21 04:16:50,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=319297.0, ans=0.0
2024-06-21 04:16:50,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=319297.0, ans=0.1
2024-06-21 04:16:57,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=319315.3333333333, ans=0.2
2024-06-21 04:17:01,029 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0
2024-06-21 04:17:14,688 INFO [train.py:1028] (1/2) Epoch 18, batch 2200, loss[loss=0.2024, simple_loss=0.2512, pruned_loss=0.07676, over 13257.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2644, pruned_loss=0.07965, over 2589615.39 frames. ], batch size: 83, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:17:18,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=319352.0, ans=0.035
2024-06-21 04:17:56,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=319425.3333333333, ans=0.125
2024-06-21 04:18:04,361 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 1.992e+02 2.153e+02 2.453e+02 3.553e+02, threshold=4.305e+02, percent-clipped=0.0
2024-06-21 04:18:07,243 INFO [train.py:1028] (1/2) Epoch 18, batch 2250, loss[loss=0.1971, simple_loss=0.2536, pruned_loss=0.07031, over 13270.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2641, pruned_loss=0.07918, over 2588207.08 frames. ], batch size: 63, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:18:21,578 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.74 vs. limit=15.0
2024-06-21 04:18:22,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=319462.0, ans=0.2
2024-06-21 04:18:36,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=319480.3333333333, ans=0.125
2024-06-21 04:18:40,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.56 vs. limit=15.0
2024-06-21 04:18:50,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=319517.0, ans=0.2
2024-06-21 04:19:00,766 INFO [train.py:1028] (1/2) Epoch 18, batch 2300, loss[loss=0.2011, simple_loss=0.2578, pruned_loss=0.07219, over 12911.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2644, pruned_loss=0.07914, over 2582366.17 frames. ], batch size: 33, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:19:04,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=319535.3333333333, ans=0.025
2024-06-21 04:19:39,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=319608.6666666667, ans=0.0
2024-06-21 04:19:47,602 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 1.976e+02 2.143e+02 2.352e+02 3.273e+02, threshold=4.286e+02, percent-clipped=0.0
2024-06-21 04:19:50,312 INFO [train.py:1028] (1/2) Epoch 18, batch 2350, loss[loss=0.2074, simple_loss=0.2622, pruned_loss=0.07628, over 13203.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2647, pruned_loss=0.07956, over 2584129.98 frames. ], batch size: 67, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:19:52,820 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.13 vs. limit=15.0
2024-06-21 04:20:04,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=319645.3333333333, ans=0.0
2024-06-21 04:20:22,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.11 vs. limit=15.0
2024-06-21 04:20:41,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=319700.3333333333, ans=0.025
2024-06-21 04:20:41,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.86 vs. limit=10.0
2024-06-21 04:20:49,571 INFO [train.py:1028] (1/2) Epoch 18, batch 2400, loss[loss=0.2178, simple_loss=0.2738, pruned_loss=0.08093, over 13267.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2632, pruned_loss=0.07888, over 2586825.57 frames. ], batch size: 46, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:20:51,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=319718.6666666667, ans=0.125
2024-06-21 04:20:56,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=319718.6666666667, ans=0.95
2024-06-21 04:21:12,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.19 vs. limit=22.5
2024-06-21 04:21:39,795 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 1.990e+02 2.134e+02 2.325e+02 3.270e+02, threshold=4.269e+02, percent-clipped=0.0
2024-06-21 04:21:42,745 INFO [train.py:1028] (1/2) Epoch 18, batch 2450, loss[loss=0.206, simple_loss=0.2555, pruned_loss=0.07824, over 13269.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2627, pruned_loss=0.07918, over 2583026.74 frames. ], batch size: 63, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:22:05,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=319847.0, ans=0.2
2024-06-21 04:22:09,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=319847.0, ans=0.125
2024-06-21 04:22:13,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=319865.3333333333, ans=0.05
2024-06-21 04:22:25,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=319883.6666666667, ans=0.0
2024-06-21 04:22:26,795 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.82 vs. limit=15.0
2024-06-21 04:22:34,113 INFO [train.py:1028] (1/2) Epoch 18, batch 2500, loss[loss=0.2014, simple_loss=0.2501, pruned_loss=0.07629, over 13197.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2619, pruned_loss=0.0787, over 2586267.43 frames. ], batch size: 83, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:22:37,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=319902.0, ans=0.0
2024-06-21 04:22:54,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=319938.6666666667, ans=10.0
2024-06-21 04:22:55,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=319938.6666666667, ans=0.125
2024-06-21 04:22:57,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5
2024-06-21 04:22:57,109 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.81 vs. limit=15.0
2024-06-21 04:22:58,278 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.13 vs. limit=10.0
2024-06-21 04:23:13,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.23 vs. limit=22.5
2024-06-21 04:23:20,980 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 1.935e+02 2.017e+02 2.201e+02 2.844e+02, threshold=4.033e+02, percent-clipped=0.0
2024-06-21 04:23:23,266 INFO [train.py:1028] (1/2) Epoch 18, batch 2550, loss[loss=0.1886, simple_loss=0.243, pruned_loss=0.06709, over 12493.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2606, pruned_loss=0.07838, over 2588318.05 frames. ], batch size: 22, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:23:23,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.33 vs. limit=6.0
2024-06-21 04:23:24,636 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=12.0
2024-06-21 04:23:26,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=319993.6666666667, ans=0.125
2024-06-21 04:23:29,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=319993.6666666667, ans=0.125
2024-06-21 04:24:18,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=15.0
2024-06-21 04:24:18,886 INFO [train.py:1028] (1/2) Epoch 18, batch 2600, loss[loss=0.1851, simple_loss=0.2466, pruned_loss=0.06174, over 13272.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2595, pruned_loss=0.07825, over 2588600.48 frames. ], batch size: 52, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:24:23,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=320085.3333333333, ans=0.125
2024-06-21 04:24:26,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=320085.3333333333, ans=0.125
2024-06-21 04:24:32,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=320103.6666666667, ans=0.125
2024-06-21 04:24:37,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=320122.0, ans=0.0
2024-06-21 04:25:05,971 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.956e+02 2.099e+02 2.313e+02 2.691e+02, threshold=4.199e+02, percent-clipped=0.0
2024-06-21 04:25:08,906 INFO [train.py:1028] (1/2) Epoch 18, batch 2650, loss[loss=0.2111, simple_loss=0.254, pruned_loss=0.08408, over 13064.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2583, pruned_loss=0.07791, over 2588192.86 frames. ], batch size: 144, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:25:57,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=320250.3333333333, ans=0.125
2024-06-21 04:26:00,625 INFO [train.py:1028] (1/2) Epoch 18, batch 2700, loss[loss=0.2081, simple_loss=0.2464, pruned_loss=0.08491, over 13253.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.257, pruned_loss=0.07772, over 2585480.75 frames. ], batch size: 89, lr: 3.26e-03, grad_scale: 64.0
2024-06-21 04:26:34,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=320305.3333333333, ans=0.125
2024-06-21 04:26:40,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=320323.6666666667, ans=0.125
2024-06-21 04:26:53,879 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.30 vs. limit=15.0
2024-06-21 04:26:54,933 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.921e+02 2.032e+02 2.140e+02 2.489e+02, threshold=4.065e+02, percent-clipped=0.0
2024-06-21 04:26:57,548 INFO [train.py:1028] (1/2) Epoch 18, batch 2750, loss[loss=0.2201, simple_loss=0.269, pruned_loss=0.08562, over 13256.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2551, pruned_loss=0.07656, over 2582588.54 frames. ], batch size: 43, lr: 3.25e-03, grad_scale: 64.0
2024-06-21 04:27:14,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=320378.6666666667, ans=0.025
2024-06-21 04:27:21,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.25 vs. limit=15.0
2024-06-21 04:27:27,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=320397.0, ans=0.025
2024-06-21 04:27:46,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=320433.6666666667, ans=0.0
2024-06-21 04:27:51,695 INFO [train.py:1028] (1/2) Epoch 18, batch 2800, loss[loss=0.208, simple_loss=0.2482, pruned_loss=0.08389, over 10907.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.255, pruned_loss=0.07681, over 2580418.38 frames. ], batch size: 304, lr: 3.25e-03, grad_scale: 64.0
2024-06-21 04:28:02,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=320470.3333333333, ans=0.125
2024-06-21 04:28:14,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=320488.6666666667, ans=0.025
2024-06-21 04:28:15,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs.
limit=6.0 2024-06-21 04:28:16,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=320488.6666666667, ans=0.125 2024-06-21 04:28:28,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=320507.0, ans=0.125 2024-06-21 04:28:38,113 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 1.969e+02 2.138e+02 2.322e+02 3.413e+02, threshold=4.276e+02, percent-clipped=0.0 2024-06-21 04:28:41,294 INFO [train.py:1028] (1/2) Epoch 18, batch 2850, loss[loss=0.1812, simple_loss=0.2422, pruned_loss=0.06011, over 13029.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2543, pruned_loss=0.07655, over 2577361.50 frames. ], batch size: 48, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:28:44,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=320543.6666666667, ans=0.0 2024-06-21 04:29:11,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=320580.3333333333, ans=0.0 2024-06-21 04:29:11,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=320580.3333333333, ans=0.2 2024-06-21 04:29:14,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=320598.6666666667, ans=0.125 2024-06-21 04:29:18,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320598.6666666667, ans=0.1 2024-06-21 04:29:30,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=320617.0, ans=0.2 2024-06-21 04:29:34,117 INFO [train.py:1028] (1/2) Epoch 18, batch 2900, loss[loss=0.2148, simple_loss=0.2644, pruned_loss=0.08262, over 13156.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2533, pruned_loss=0.0761, over 2584729.22 frames. ], batch size: 55, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:29:51,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=320653.6666666667, ans=0.125 2024-06-21 04:30:03,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=320672.0, ans=0.0 2024-06-21 04:30:03,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=320672.0, ans=0.125 2024-06-21 04:30:04,310 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.08 vs. limit=22.5 2024-06-21 04:30:12,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=320690.3333333333, ans=0.0 2024-06-21 04:30:25,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=320708.6666666667, ans=0.125 2024-06-21 04:30:30,168 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.961e+02 2.108e+02 2.347e+02 3.699e+02, threshold=4.217e+02, percent-clipped=0.0 2024-06-21 04:30:33,220 INFO [train.py:1028] (1/2) Epoch 18, batch 2950, loss[loss=0.1966, simple_loss=0.2516, pruned_loss=0.0708, over 13319.00 frames. 
], tot_loss[loss=0.2026, simple_loss=0.2532, pruned_loss=0.07603, over 2580317.52 frames. ], batch size: 43, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:30:52,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=320745.3333333333, ans=0.0 2024-06-21 04:30:54,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=320763.6666666667, ans=0.125 2024-06-21 04:31:00,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=320763.6666666667, ans=0.125 2024-06-21 04:31:16,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=320800.3333333333, ans=0.125 2024-06-21 04:31:25,159 INFO [train.py:1028] (1/2) Epoch 18, batch 3000, loss[loss=0.2104, simple_loss=0.2661, pruned_loss=0.0773, over 13231.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2525, pruned_loss=0.07567, over 2577991.31 frames. ], batch size: 59, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:31:25,160 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 04:31:36,611 INFO [train.py:1060] (1/2) Epoch 18, validation: loss=0.1863, simple_loss=0.2512, pruned_loss=0.06076, over 351949.00 frames. 2024-06-21 04:31:36,612 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 04:31:50,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=320837.0, ans=0.04949747468305833 2024-06-21 04:31:56,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=320855.3333333333, ans=0.125 2024-06-21 04:32:30,134 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.710e+02 1.946e+02 2.076e+02 2.279e+02 3.062e+02, threshold=4.152e+02, percent-clipped=0.0 2024-06-21 04:32:32,065 INFO [train.py:1028] (1/2) Epoch 18, batch 3050, loss[loss=0.2025, simple_loss=0.2524, pruned_loss=0.07629, over 13248.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2516, pruned_loss=0.07557, over 2576891.35 frames. ], batch size: 46, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:32:53,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=15.0 2024-06-21 04:32:56,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=320947.0, ans=0.125 2024-06-21 04:33:12,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=320965.3333333333, ans=0.2 2024-06-21 04:33:27,778 INFO [train.py:1028] (1/2) Epoch 18, batch 3100, loss[loss=0.1936, simple_loss=0.2448, pruned_loss=0.07118, over 13045.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2506, pruned_loss=0.07482, over 2578391.29 frames. ], batch size: 144, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:34:03,292 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.05 vs. 
limit=22.5 2024-06-21 04:34:04,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=321057.0, ans=0.2 2024-06-21 04:34:16,285 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 1.953e+02 2.151e+02 2.324e+02 3.004e+02, threshold=4.303e+02, percent-clipped=0.0 2024-06-21 04:34:18,871 INFO [train.py:1028] (1/2) Epoch 18, batch 3150, loss[loss=0.1893, simple_loss=0.2358, pruned_loss=0.07136, over 12959.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2498, pruned_loss=0.07467, over 2581428.27 frames. ], batch size: 158, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:35:15,579 INFO [train.py:1028] (1/2) Epoch 18, batch 3200, loss[loss=0.2146, simple_loss=0.266, pruned_loss=0.08157, over 13156.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2499, pruned_loss=0.07448, over 2582241.69 frames. ], batch size: 55, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:35:16,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=321185.3333333333, ans=0.0 2024-06-21 04:35:19,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.15 vs. limit=22.5 2024-06-21 04:35:29,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=321203.6666666667, ans=0.125 2024-06-21 04:35:34,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.34 vs. limit=15.0 2024-06-21 04:35:46,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=321240.3333333333, ans=0.2 2024-06-21 04:35:46,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=321240.3333333333, ans=0.0 2024-06-21 04:35:54,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321240.3333333333, ans=0.1 2024-06-21 04:36:00,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=321258.6666666667, ans=0.0 2024-06-21 04:36:04,376 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 1.939e+02 2.093e+02 2.285e+02 2.778e+02, threshold=4.186e+02, percent-clipped=0.0 2024-06-21 04:36:06,878 INFO [train.py:1028] (1/2) Epoch 18, batch 3250, loss[loss=0.1932, simple_loss=0.2479, pruned_loss=0.06927, over 13237.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2491, pruned_loss=0.07445, over 2586898.25 frames. 
], batch size: 72, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:36:19,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=321277.0, ans=0.125 2024-06-21 04:36:19,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=321277.0, ans=0.125 2024-06-21 04:36:19,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=321277.0, ans=0.125 2024-06-21 04:36:23,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=321277.0, ans=0.0 2024-06-21 04:36:26,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=321295.3333333333, ans=0.0 2024-06-21 04:36:32,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=321295.3333333333, ans=0.2 2024-06-21 04:36:33,806 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:37:02,962 INFO [train.py:1028] (1/2) Epoch 18, batch 3300, loss[loss=0.1952, simple_loss=0.249, pruned_loss=0.0707, over 12844.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2484, pruned_loss=0.07382, over 2583393.43 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:37:05,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=321368.6666666667, ans=0.0 2024-06-21 04:37:06,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=321368.6666666667, ans=0.025 2024-06-21 04:37:07,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=15.0 2024-06-21 04:37:14,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=321387.0, ans=0.025 2024-06-21 04:37:19,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.24 vs. limit=22.5 2024-06-21 04:37:29,215 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.55 vs. limit=15.0 2024-06-21 04:37:38,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321423.6666666667, ans=0.1 2024-06-21 04:37:50,357 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 1.927e+02 2.055e+02 2.184e+02 2.821e+02, threshold=4.110e+02, percent-clipped=0.0 2024-06-21 04:37:52,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=321460.3333333333, ans=0.2 2024-06-21 04:37:53,520 INFO [train.py:1028] (1/2) Epoch 18, batch 3350, loss[loss=0.1909, simple_loss=0.2393, pruned_loss=0.07131, over 12930.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2482, pruned_loss=0.07403, over 2577694.19 frames. ], batch size: 158, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:37:56,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.62 vs. 
limit=15.0 2024-06-21 04:38:06,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321460.3333333333, ans=0.1 2024-06-21 04:38:12,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=321478.6666666667, ans=0.0 2024-06-21 04:38:12,835 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2024-06-21 04:38:18,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=321478.6666666667, ans=0.125 2024-06-21 04:38:19,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321497.0, ans=0.1 2024-06-21 04:38:41,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=321533.6666666667, ans=0.0 2024-06-21 04:38:44,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=321533.6666666667, ans=0.125 2024-06-21 04:38:49,717 INFO [train.py:1028] (1/2) Epoch 18, batch 3400, loss[loss=0.214, simple_loss=0.2665, pruned_loss=0.08081, over 12456.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.248, pruned_loss=0.07425, over 2576157.24 frames. ], batch size: 22, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:39:04,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=321552.0, ans=0.125 2024-06-21 04:39:07,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=321570.3333333333, ans=0.0 2024-06-21 04:39:14,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=321570.3333333333, ans=0.125 2024-06-21 04:39:16,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=321588.6666666667, ans=0.0 2024-06-21 04:39:19,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.94 vs. limit=15.0 2024-06-21 04:39:30,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=321607.0, ans=0.125 2024-06-21 04:39:47,681 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 1.929e+02 2.081e+02 2.268e+02 3.200e+02, threshold=4.162e+02, percent-clipped=0.0 2024-06-21 04:39:49,426 INFO [train.py:1028] (1/2) Epoch 18, batch 3450, loss[loss=0.2078, simple_loss=0.2544, pruned_loss=0.08063, over 12786.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2476, pruned_loss=0.07406, over 2577507.74 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:39:51,039 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. 
limit=15.0 2024-06-21 04:39:51,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=321643.6666666667, ans=0.125 2024-06-21 04:39:52,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=321643.6666666667, ans=0.2 2024-06-21 04:40:02,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=321662.0, ans=0.125 2024-06-21 04:40:10,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=321680.3333333333, ans=0.5 2024-06-21 04:40:14,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=321680.3333333333, ans=0.025 2024-06-21 04:40:27,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=321717.0, ans=0.05 2024-06-21 04:40:30,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=321717.0, ans=0.0 2024-06-21 04:40:38,867 INFO [train.py:1028] (1/2) Epoch 18, batch 3500, loss[loss=0.2041, simple_loss=0.255, pruned_loss=0.07665, over 12824.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2472, pruned_loss=0.07372, over 2576862.72 frames. ], batch size: 33, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:40:40,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=15.0 2024-06-21 04:40:46,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321735.3333333333, ans=0.1 2024-06-21 04:40:48,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=321753.6666666667, ans=0.1 2024-06-21 04:40:55,896 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0 2024-06-21 04:41:02,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=321772.0, ans=0.07 2024-06-21 04:41:20,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.53 vs. limit=15.0 2024-06-21 04:41:30,655 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:41:35,475 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.876e+02 1.977e+02 2.125e+02 3.648e+02, threshold=3.955e+02, percent-clipped=0.0 2024-06-21 04:41:36,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=15.0 2024-06-21 04:41:38,084 INFO [train.py:1028] (1/2) Epoch 18, batch 3550, loss[loss=0.1836, simple_loss=0.2334, pruned_loss=0.06691, over 13170.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2465, pruned_loss=0.0731, over 2577207.46 frames. 
], batch size: 95, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:41:39,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321827.0, ans=0.1 2024-06-21 04:41:48,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=321845.3333333333, ans=0.125 2024-06-21 04:42:06,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=321882.0, ans=0.0 2024-06-21 04:42:08,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=321882.0, ans=0.07 2024-06-21 04:42:13,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=321882.0, ans=0.2 2024-06-21 04:42:34,913 INFO [train.py:1028] (1/2) Epoch 18, batch 3600, loss[loss=0.1856, simple_loss=0.2449, pruned_loss=0.06318, over 13301.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2457, pruned_loss=0.07302, over 2580996.09 frames. ], batch size: 49, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:42:40,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.46 vs. limit=22.5 2024-06-21 04:42:41,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=321918.6666666667, ans=0.0 2024-06-21 04:42:47,483 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2024-06-21 04:43:20,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.21 vs. limit=22.5 2024-06-21 04:43:23,208 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.914e+02 2.081e+02 2.227e+02 3.119e+02, threshold=4.162e+02, percent-clipped=0.0 2024-06-21 04:43:23,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321992.0, ans=0.1 2024-06-21 04:43:25,642 INFO [train.py:1028] (1/2) Epoch 18, batch 3650, loss[loss=0.2018, simple_loss=0.2483, pruned_loss=0.07765, over 13004.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2455, pruned_loss=0.07265, over 2578030.46 frames. ], batch size: 102, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:43:34,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322028.6666666667, ans=0.1 2024-06-21 04:43:45,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=322047.0, ans=0.2 2024-06-21 04:43:51,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.99 vs. 
limit=15.0 2024-06-21 04:43:53,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322047.0, ans=0.1 2024-06-21 04:43:56,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=322065.3333333333, ans=0.125 2024-06-21 04:44:07,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=322083.6666666667, ans=0.125 2024-06-21 04:44:08,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=322083.6666666667, ans=0.1 2024-06-21 04:44:15,853 INFO [train.py:1028] (1/2) Epoch 18, batch 3700, loss[loss=0.183, simple_loss=0.2372, pruned_loss=0.06441, over 13266.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2448, pruned_loss=0.07249, over 2583801.86 frames. ], batch size: 72, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:44:29,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=322102.0, ans=0.125 2024-06-21 04:44:55,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=322157.0, ans=0.125 2024-06-21 04:45:11,398 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 1.941e+02 2.050e+02 2.205e+02 2.895e+02, threshold=4.099e+02, percent-clipped=0.0 2024-06-21 04:45:13,315 INFO [train.py:1028] (1/2) Epoch 18, batch 3750, loss[loss=0.2147, simple_loss=0.2601, pruned_loss=0.08465, over 12535.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.244, pruned_loss=0.07227, over 2585980.94 frames. ], batch size: 22, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:45:14,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=322193.6666666667, ans=0.0 2024-06-21 04:45:17,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=322193.6666666667, ans=0.2 2024-06-21 04:45:21,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=322212.0, ans=0.125 2024-06-21 04:45:25,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=322212.0, ans=0.0 2024-06-21 04:45:40,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=322230.3333333333, ans=0.125 2024-06-21 04:45:47,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=322248.6666666667, ans=6.0 2024-06-21 04:45:49,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=322248.6666666667, ans=0.0 2024-06-21 04:45:51,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=322248.6666666667, ans=10.0 2024-06-21 04:46:03,962 INFO [train.py:1028] (1/2) Epoch 18, batch 3800, loss[loss=0.182, simple_loss=0.2326, pruned_loss=0.06564, over 13257.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2435, pruned_loss=0.07203, over 2585645.53 frames. 
], batch size: 83, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:46:05,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=322285.3333333333, ans=0.125 2024-06-21 04:46:08,146 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.92 vs. limit=22.5 2024-06-21 04:46:10,123 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.93 vs. limit=15.0 2024-06-21 04:46:21,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.94 vs. limit=15.0 2024-06-21 04:46:53,829 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 1.875e+02 1.986e+02 2.144e+02 2.561e+02, threshold=3.973e+02, percent-clipped=0.0 2024-06-21 04:46:55,826 INFO [train.py:1028] (1/2) Epoch 18, batch 3850, loss[loss=0.1802, simple_loss=0.229, pruned_loss=0.06576, over 12992.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2429, pruned_loss=0.07159, over 2585494.66 frames. ], batch size: 144, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:47:15,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=322413.6666666667, ans=0.1 2024-06-21 04:47:47,128 INFO [train.py:1028] (1/2) Epoch 18, batch 3900, loss[loss=0.2136, simple_loss=0.2599, pruned_loss=0.08367, over 13212.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2433, pruned_loss=0.07198, over 2588647.06 frames. ], batch size: 83, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:47:50,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322468.6666666667, ans=0.1 2024-06-21 04:48:03,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=8.0 2024-06-21 04:48:08,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=12.0 2024-06-21 04:48:12,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=322505.3333333333, ans=0.1 2024-06-21 04:48:16,103 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2024-06-21 04:48:32,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=322542.0, ans=0.04949747468305833 2024-06-21 04:48:34,015 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.935e+02 2.070e+02 2.369e+02 3.349e+02, threshold=4.140e+02, percent-clipped=0.0 2024-06-21 04:48:35,881 INFO [train.py:1028] (1/2) Epoch 18, batch 3950, loss[loss=0.1879, simple_loss=0.2357, pruned_loss=0.06999, over 13092.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2424, pruned_loss=0.07156, over 2591005.88 frames. 
], batch size: 132, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:48:44,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=322560.3333333333, ans=0.2 2024-06-21 04:48:54,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.25 vs. limit=12.0 2024-06-21 04:48:55,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=322578.6666666667, ans=0.125 2024-06-21 04:49:06,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=322597.0, ans=0.125 2024-06-21 04:49:08,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=322597.0, ans=0.125 2024-06-21 04:49:27,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=322633.6666666667, ans=0.125 2024-06-21 04:49:30,383 INFO [train.py:1028] (1/2) Epoch 18, batch 4000, loss[loss=0.2087, simple_loss=0.2559, pruned_loss=0.08077, over 12943.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2424, pruned_loss=0.07193, over 2585859.36 frames. ], batch size: 39, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:50:00,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=322688.6666666667, ans=0.125 2024-06-21 04:50:06,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=322707.0, ans=0.0 2024-06-21 04:50:11,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=322707.0, ans=0.0 2024-06-21 04:50:11,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=322707.0, ans=0.0 2024-06-21 04:50:21,406 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:50:25,829 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.910e+02 2.057e+02 2.320e+02 3.640e+02, threshold=4.114e+02, percent-clipped=0.0 2024-06-21 04:50:28,262 INFO [train.py:1028] (1/2) Epoch 18, batch 4050, loss[loss=0.203, simple_loss=0.2412, pruned_loss=0.0824, over 10975.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2422, pruned_loss=0.07186, over 2582892.96 frames. ], batch size: 304, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:50:29,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=322743.6666666667, ans=0.125 2024-06-21 04:50:46,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.72 vs. 
limit=15.0 2024-06-21 04:50:47,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=322762.0, ans=0.025 2024-06-21 04:50:58,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=322780.3333333333, ans=0.125 2024-06-21 04:51:00,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=322780.3333333333, ans=0.125 2024-06-21 04:51:09,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.07 vs. limit=15.0 2024-06-21 04:51:10,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=322798.6666666667, ans=0.0 2024-06-21 04:51:19,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=322817.0, ans=0.125 2024-06-21 04:51:26,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=322817.0, ans=0.0 2024-06-21 04:51:27,832 INFO [train.py:1028] (1/2) Epoch 18, batch 4100, loss[loss=0.2075, simple_loss=0.2447, pruned_loss=0.08512, over 13011.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2421, pruned_loss=0.07201, over 2578698.56 frames. ], batch size: 102, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:51:32,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=322835.3333333333, ans=10.0 2024-06-21 04:51:36,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=322835.3333333333, ans=0.125 2024-06-21 04:52:11,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=322890.3333333333, ans=0.125 2024-06-21 04:52:23,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=322908.6666666667, ans=0.125 2024-06-21 04:52:24,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=322908.6666666667, ans=0.025 2024-06-21 04:52:26,976 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.888e+02 2.036e+02 2.230e+02 2.989e+02, threshold=4.073e+02, percent-clipped=0.0 2024-06-21 04:52:29,164 INFO [train.py:1028] (1/2) Epoch 18, batch 4150, loss[loss=0.2106, simple_loss=0.2607, pruned_loss=0.08029, over 13203.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2424, pruned_loss=0.07201, over 2578616.57 frames. ], batch size: 55, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:52:58,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=322963.6666666667, ans=0.0 2024-06-21 04:53:02,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.03 vs. limit=6.0 2024-06-21 04:53:10,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=323000.3333333333, ans=0.025 2024-06-21 04:53:21,112 INFO [train.py:1028] (1/2) Epoch 18, batch 4200, loss[loss=0.2094, simple_loss=0.2521, pruned_loss=0.08335, over 13100.00 frames. 
], tot_loss[loss=0.1926, simple_loss=0.2419, pruned_loss=0.07167, over 2580457.14 frames. ], batch size: 103, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:53:28,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=323018.6666666667, ans=0.015 2024-06-21 04:53:34,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=323037.0, ans=0.125 2024-06-21 04:53:45,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=323055.3333333333, ans=0.125 2024-06-21 04:53:46,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=323055.3333333333, ans=0.025 2024-06-21 04:54:07,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=323092.0, ans=0.2 2024-06-21 04:54:13,189 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 1.922e+02 2.011e+02 2.222e+02 2.925e+02, threshold=4.023e+02, percent-clipped=0.0 2024-06-21 04:54:14,922 INFO [train.py:1028] (1/2) Epoch 18, batch 4250, loss[loss=0.165, simple_loss=0.223, pruned_loss=0.05356, over 13291.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2418, pruned_loss=0.07157, over 2583377.70 frames. ], batch size: 46, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:54:19,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=323110.3333333333, ans=0.5 2024-06-21 04:54:25,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=323128.6666666667, ans=0.1 2024-06-21 04:54:30,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=323128.6666666667, ans=15.0 2024-06-21 04:54:56,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=323183.6666666667, ans=0.125 2024-06-21 04:54:57,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=15.0 2024-06-21 04:55:08,626 INFO [train.py:1028] (1/2) Epoch 18, batch 4300, loss[loss=0.1941, simple_loss=0.2475, pruned_loss=0.07037, over 13215.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2412, pruned_loss=0.07124, over 2583123.14 frames. ], batch size: 59, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:55:22,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=323202.0, ans=0.2 2024-06-21 04:55:43,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=323238.6666666667, ans=0.0 2024-06-21 04:55:58,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=323275.3333333333, ans=0.2 2024-06-21 04:56:00,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 1.879e+02 2.012e+02 2.167e+02 2.997e+02, threshold=4.024e+02, percent-clipped=0.0 2024-06-21 04:56:01,918 INFO [train.py:1028] (1/2) Epoch 18, batch 4350, loss[loss=0.1833, simple_loss=0.2336, pruned_loss=0.06651, over 13150.00 frames. 
], tot_loss[loss=0.1917, simple_loss=0.2409, pruned_loss=0.07124, over 2586972.09 frames. ], batch size: 59, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:56:02,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=323293.6666666667, ans=0.2 2024-06-21 04:56:12,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2024-06-21 04:56:20,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=323312.0, ans=0.125 2024-06-21 04:56:28,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323330.3333333333, ans=0.1 2024-06-21 04:56:31,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=323348.6666666667, ans=0.125 2024-06-21 04:56:43,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=323367.0, ans=0.125 2024-06-21 04:56:53,818 INFO [train.py:1028] (1/2) Epoch 18, batch 4400, loss[loss=0.1885, simple_loss=0.2306, pruned_loss=0.07321, over 13183.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.24, pruned_loss=0.07104, over 2586829.10 frames. ], batch size: 83, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:56:58,978 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.46 vs. limit=6.0 2024-06-21 04:57:36,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=323440.3333333333, ans=0.0 2024-06-21 04:57:42,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=323440.3333333333, ans=0.0 2024-06-21 04:57:51,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=323458.6666666667, ans=0.2 2024-06-21 04:57:53,001 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.920e+02 2.048e+02 2.257e+02 2.924e+02, threshold=4.096e+02, percent-clipped=0.0 2024-06-21 04:57:54,969 INFO [train.py:1028] (1/2) Epoch 18, batch 4450, loss[loss=0.1931, simple_loss=0.2441, pruned_loss=0.07102, over 12914.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2404, pruned_loss=0.07125, over 2582562.68 frames. ], batch size: 33, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:58:10,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=323495.3333333333, ans=12.0 2024-06-21 04:58:26,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=323513.6666666667, ans=0.07 2024-06-21 04:58:31,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=323513.6666666667, ans=0.0 2024-06-21 04:58:40,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=323532.0, ans=0.2 2024-06-21 04:58:54,557 INFO [train.py:1028] (1/2) Epoch 18, batch 4500, loss[loss=0.1766, simple_loss=0.2208, pruned_loss=0.06623, over 13211.00 frames. 
], tot_loss[loss=0.1903, simple_loss=0.2392, pruned_loss=0.07068, over 2586577.39 frames. ], batch size: 89, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:59:13,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=323587.0, ans=0.0 2024-06-21 04:59:22,377 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.78 vs. limit=6.0 2024-06-21 04:59:26,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=323623.6666666667, ans=0.125 2024-06-21 04:59:37,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=323642.0, ans=0.0 2024-06-21 04:59:42,745 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 1.882e+02 2.014e+02 2.131e+02 2.807e+02, threshold=4.027e+02, percent-clipped=0.0 2024-06-21 04:59:45,207 INFO [train.py:1028] (1/2) Epoch 18, batch 4550, loss[loss=0.2039, simple_loss=0.2492, pruned_loss=0.07934, over 13247.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2393, pruned_loss=0.07071, over 2590473.55 frames. ], batch size: 52, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:59:45,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=323660.3333333333, ans=0.0 2024-06-21 04:59:51,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=323660.3333333333, ans=0.0 2024-06-21 05:00:05,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=323678.6666666667, ans=0.025 2024-06-21 05:00:28,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=323715.3333333333, ans=0.125 2024-06-21 05:00:38,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=323733.6666666667, ans=15.0 2024-06-21 05:00:41,690 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2024-06-21 05:00:47,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=323752.0, ans=0.125 2024-06-21 05:00:48,278 INFO [train.py:1028] (1/2) Epoch 18, batch 4600, loss[loss=0.1947, simple_loss=0.2431, pruned_loss=0.07318, over 12507.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2395, pruned_loss=0.07076, over 2585846.71 frames. ], batch size: 202, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:00:49,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=323752.0, ans=0.0 2024-06-21 05:00:56,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=323752.0, ans=0.1 2024-06-21 05:00:59,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=323770.3333333333, ans=0.0 2024-06-21 05:01:00,263 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.11 vs. 
limit=15.0 2024-06-21 05:01:22,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.39 vs. limit=22.5 2024-06-21 05:01:33,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=323825.3333333333, ans=0.0 2024-06-21 05:01:35,447 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.863e+02 1.968e+02 2.146e+02 3.437e+02, threshold=3.936e+02, percent-clipped=0.0 2024-06-21 05:01:37,040 INFO [train.py:1028] (1/2) Epoch 18, batch 4650, loss[loss=0.1898, simple_loss=0.2289, pruned_loss=0.07534, over 13138.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.239, pruned_loss=0.0706, over 2588789.16 frames. ], batch size: 132, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:01:40,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=323843.6666666667, ans=0.07 2024-06-21 05:01:44,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=323843.6666666667, ans=0.2 2024-06-21 05:01:45,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=323843.6666666667, ans=0.05 2024-06-21 05:01:54,071 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.45 vs. limit=10.0 2024-06-21 05:01:57,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=323880.3333333333, ans=0.125 2024-06-21 05:02:20,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2024-06-21 05:02:26,291 INFO [train.py:1028] (1/2) Epoch 18, batch 4700, loss[loss=0.184, simple_loss=0.2328, pruned_loss=0.06755, over 12393.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2393, pruned_loss=0.07073, over 2584255.51 frames. ], batch size: 25, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:02:54,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=323972.0, ans=0.125 2024-06-21 05:02:58,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=323990.3333333333, ans=0.0 2024-06-21 05:03:10,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=324008.6666666667, ans=0.125 2024-06-21 05:03:14,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=324008.6666666667, ans=15.0 2024-06-21 05:03:16,766 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.837e+02 1.943e+02 2.074e+02 2.626e+02, threshold=3.886e+02, percent-clipped=0.0 2024-06-21 05:03:18,542 INFO [train.py:1028] (1/2) Epoch 18, batch 4750, loss[loss=0.2163, simple_loss=0.2589, pruned_loss=0.08684, over 12535.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2392, pruned_loss=0.07087, over 2580490.77 frames. 
], batch size: 202, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:03:22,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=324027.0, ans=0.04949747468305833 2024-06-21 05:03:24,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=324027.0, ans=0.04949747468305833 2024-06-21 05:03:25,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=324027.0, ans=0.025 2024-06-21 05:03:36,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=324045.3333333333, ans=0.125 2024-06-21 05:03:46,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=324063.6666666667, ans=0.025 2024-06-21 05:03:57,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=324082.0, ans=0.0 2024-06-21 05:04:01,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324100.3333333333, ans=0.1 2024-06-21 05:04:05,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=324100.3333333333, ans=0.125 2024-06-21 05:04:10,152 INFO [train.py:1028] (1/2) Epoch 18, batch 4800, loss[loss=0.1937, simple_loss=0.2495, pruned_loss=0.06894, over 13281.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2393, pruned_loss=0.07083, over 2576721.30 frames. ], batch size: 63, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:04:16,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=324118.6666666667, ans=0.0 2024-06-21 05:04:30,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324137.0, ans=0.1 2024-06-21 05:04:34,306 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.21 vs. limit=10.0 2024-06-21 05:05:08,975 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.902e+02 2.110e+02 2.331e+02 3.355e+02, threshold=4.220e+02, percent-clipped=0.0 2024-06-21 05:05:09,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.64 vs. limit=15.0 2024-06-21 05:05:11,026 INFO [train.py:1028] (1/2) Epoch 18, batch 4850, loss[loss=0.1951, simple_loss=0.2419, pruned_loss=0.07414, over 13222.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2389, pruned_loss=0.07058, over 2574466.57 frames. 
2024-06-21 05:05:18,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=324210.3333333333, ans=0.04949747468305833
2024-06-21 05:05:25,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=324228.6666666667, ans=0.125
2024-06-21 05:05:32,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=324247.0, ans=0.125
2024-06-21 05:05:45,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.53 vs. limit=6.0
2024-06-21 05:05:55,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=324283.6666666667, ans=0.125
2024-06-21 05:05:58,927 INFO [train.py:1028] (1/2) Epoch 18, batch 4900, loss[loss=0.1669, simple_loss=0.2227, pruned_loss=0.05558, over 13224.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2384, pruned_loss=0.07026, over 2577041.48 frames. ], batch size: 59, lr: 3.23e-03, grad_scale: 32.0
2024-06-21 05:06:27,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=324338.6666666667, ans=0.125
2024-06-21 05:06:32,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=324338.6666666667, ans=0.0
2024-06-21 05:06:35,469 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.75 vs. limit=22.5
2024-06-21 05:06:57,557 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 1.885e+02 1.988e+02 2.119e+02 3.268e+02, threshold=3.976e+02, percent-clipped=0.0
2024-06-21 05:06:59,564 INFO [train.py:1028] (1/2) Epoch 18, batch 4950, loss[loss=0.2066, simple_loss=0.2487, pruned_loss=0.08222, over 11121.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2388, pruned_loss=0.07067, over 2570781.22 frames. ], batch size: 304, lr: 3.23e-03, grad_scale: 32.0
2024-06-21 05:07:09,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=324412.0, ans=0.125
2024-06-21 05:07:09,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=324412.0, ans=0.035
2024-06-21 05:07:45,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=324448.6666666667, ans=0.07
2024-06-21 05:07:50,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.72 vs. limit=15.0
2024-06-21 05:07:56,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=324467.0, ans=0.2
2024-06-21 05:07:58,914 INFO [train.py:1028] (1/2) Epoch 18, batch 5000, loss[loss=0.1779, simple_loss=0.2242, pruned_loss=0.06586, over 13174.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2384, pruned_loss=0.07049, over 2574859.48 frames. ], batch size: 95, lr: 3.23e-03, grad_scale: 32.0
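The optim.py warnings summarize the distribution of recent gradient norms. Note that the reported threshold is consistently Clipping_scale times the median of the quartiles (e.g. 2.0 * 1.988e+02 = 3.976e+02 in the warning just above), and percent-clipped reports how often a step exceeded it. A sketch of that bookkeeping, assuming a simple sliding window of per-step norms; the window itself and the exact reporting cadence are assumptions:

import torch

def clipping_report(recent_grad_norms, clipping_scale: float = 2.0):
    # recent_grad_norms: per-step gradient norms collected over some window.
    g = torch.tensor(recent_grad_norms, dtype=torch.float32)
    quartiles = [torch.quantile(g, q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
    threshold = clipping_scale * quartiles[2]  # scale times the median norm
    percent_clipped = 100.0 * (g > threshold).float().mean().item()
    return quartiles, threshold, percent_clipped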
2024-06-21 05:08:02,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=324485.3333333333, ans=0.025
2024-06-21 05:08:19,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=324522.0, ans=0.125
2024-06-21 05:08:25,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.81 vs. limit=22.5
2024-06-21 05:08:48,704 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.885e+02 2.009e+02 2.183e+02 3.110e+02, threshold=4.017e+02, percent-clipped=0.0
2024-06-21 05:08:50,533 INFO [train.py:1028] (1/2) Epoch 18, batch 5050, loss[loss=0.1944, simple_loss=0.2528, pruned_loss=0.06803, over 12863.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2382, pruned_loss=0.07005, over 2574824.68 frames. ], batch size: 36, lr: 3.23e-03, grad_scale: 32.0
2024-06-21 05:08:55,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=324577.0, ans=0.125
2024-06-21 05:09:01,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=324595.3333333333, ans=0.0
2024-06-21 05:09:01,601 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0
2024-06-21 05:09:05,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=324595.3333333333, ans=0.025
2024-06-21 05:09:21,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=324632.0, ans=0.125
2024-06-21 05:09:29,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=324632.0, ans=0.0
2024-06-21 05:09:34,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=324650.3333333333, ans=0.025
2024-06-21 05:09:35,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=324650.3333333333, ans=0.125
2024-06-21 05:09:39,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=324650.3333333333, ans=0.0
2024-06-21 05:09:42,553 INFO [train.py:1028] (1/2) Epoch 18, batch 5100, loss[loss=0.1926, simple_loss=0.2446, pruned_loss=0.07025, over 12950.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2384, pruned_loss=0.07067, over 2570881.29 frames. ], batch size: 39, lr: 3.23e-03, grad_scale: 32.0
2024-06-21 05:09:47,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.40 vs. limit=22.5
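The Whitening records compare a per-module whiteness statistic against a (possibly scheduled) limit; most entries are informational, logged when the metric approaches or exceeds the limit. A sketch of one way such a metric can be computed, assuming a single channel group; this mirrors the idea rather than scaling.py's exact formula. The value is 1.0 for perfectly white features (covariance proportional to the identity) and grows as the covariance eigenvalues spread out:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations for one group.
    x = x - x.mean(dim=0, keepdim=True)   # center the features
    cov = x.t() @ x / x.shape[0]          # empirical covariance
    eigs = torch.linalg.eigvalsh(cov)     # eigenvalue spectrum
    # mean(eig^2) / mean(eig)^2 == 1.0 iff all eigenvalues are equal,
    # i.e. the covariance is a multiple of the identity.
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()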
2024-06-21 05:10:00,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=324687.0, ans=0.125
2024-06-21 05:10:15,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=324723.6666666667, ans=0.2
2024-06-21 05:10:40,085 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.919e+02 2.015e+02 2.153e+02 3.751e+02, threshold=4.030e+02, percent-clipped=0.0
2024-06-21 05:10:41,789 INFO [train.py:1028] (1/2) Epoch 18, batch 5150, loss[loss=0.1915, simple_loss=0.2331, pruned_loss=0.07495, over 13096.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2381, pruned_loss=0.0706, over 2571272.97 frames. ], batch size: 132, lr: 3.23e-03, grad_scale: 32.0
2024-06-21 05:10:47,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.97 vs. limit=15.0
2024-06-21 05:10:48,606 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.85 vs. limit=10.0
2024-06-21 05:10:50,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=324760.3333333333, ans=0.125
2024-06-21 05:11:01,349 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.20 vs. limit=22.5
2024-06-21 05:11:03,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=324797.0, ans=0.2
2024-06-21 05:11:04,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=324797.0, ans=0.2
2024-06-21 05:11:04,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=324797.0, ans=0.0
2024-06-21 05:11:08,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=324797.0, ans=0.5
2024-06-21 05:11:09,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=324815.3333333333, ans=0.05
2024-06-21 05:11:12,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=324815.3333333333, ans=0.0
2024-06-21 05:11:16,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324815.3333333333, ans=0.1
2024-06-21 05:11:28,340 INFO [train.py:1028] (1/2) Epoch 18, batch 5200, loss[loss=0.1913, simple_loss=0.2335, pruned_loss=0.07454, over 13186.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2381, pruned_loss=0.07059, over 2574027.21 frames. ], batch size: 95, lr: 3.23e-03, grad_scale: 32.0
2024-06-21 05:11:33,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs.
limit=15.0 2024-06-21 05:11:35,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=324870.3333333333, ans=0.125 2024-06-21 05:11:37,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=324870.3333333333, ans=0.125 2024-06-21 05:11:42,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=324870.3333333333, ans=0.125 2024-06-21 05:11:53,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=324907.0, ans=0.0 2024-06-21 05:11:53,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324907.0, ans=0.1 2024-06-21 05:12:02,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=324925.3333333333, ans=0.125 2024-06-21 05:12:03,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324925.3333333333, ans=0.1 2024-06-21 05:12:08,707 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.891e+02 2.033e+02 2.163e+02 3.297e+02, threshold=4.067e+02, percent-clipped=0.0 2024-06-21 05:12:18,027 INFO [train.py:1028] (1/2) Epoch 18, batch 5250, loss[loss=0.1589, simple_loss=0.2059, pruned_loss=0.05597, over 13314.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2383, pruned_loss=0.07068, over 2570158.53 frames. ], batch size: 52, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:12:27,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=8.34 vs. limit=12.0 2024-06-21 05:12:31,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=324962.0, ans=0.125 2024-06-21 05:12:36,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=324962.0, ans=0.125 2024-06-21 05:12:50,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=324998.6666666667, ans=0.05 2024-06-21 05:12:58,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=325017.0, ans=0.0 2024-06-21 05:13:08,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.17 vs. limit=15.0 2024-06-21 05:13:09,679 INFO [train.py:1028] (1/2) Epoch 18, batch 5300, loss[loss=0.1818, simple_loss=0.2264, pruned_loss=0.06862, over 13049.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2385, pruned_loss=0.07096, over 2566342.50 frames. 
], batch size: 144, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:13:39,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=325072.0, ans=0.125 2024-06-21 05:13:41,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=325072.0, ans=0.125 2024-06-21 05:13:43,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.01 vs. limit=15.0 2024-06-21 05:13:43,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325072.0, ans=0.1 2024-06-21 05:14:00,715 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.920e+02 2.030e+02 2.183e+02 3.054e+02, threshold=4.060e+02, percent-clipped=0.0 2024-06-21 05:14:02,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.98 vs. limit=12.0 2024-06-21 05:14:02,516 INFO [train.py:1028] (1/2) Epoch 18, batch 5350, loss[loss=0.213, simple_loss=0.2672, pruned_loss=0.07944, over 11489.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2384, pruned_loss=0.07083, over 2573547.73 frames. ], batch size: 17, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:14:25,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=325163.6666666667, ans=0.125 2024-06-21 05:14:33,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=325182.0, ans=0.2 2024-06-21 05:14:52,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=325218.6666666667, ans=0.0 2024-06-21 05:14:53,169 INFO [train.py:1028] (1/2) Epoch 18, batch 5400, loss[loss=0.197, simple_loss=0.2387, pruned_loss=0.0776, over 12205.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2382, pruned_loss=0.07085, over 2566629.89 frames. ], batch size: 240, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:14:54,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=325218.6666666667, ans=0.025 2024-06-21 05:15:16,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325237.0, ans=0.1 2024-06-21 05:15:25,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2024-06-21 05:15:27,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=325255.3333333333, ans=0.09899494936611666 2024-06-21 05:15:39,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=325273.6666666667, ans=0.125 2024-06-21 05:15:43,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.54 vs. 
limit=15.0 2024-06-21 05:15:45,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=325292.0, ans=0.125 2024-06-21 05:15:45,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=325292.0, ans=0.0 2024-06-21 05:15:47,445 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 1.938e+02 2.079e+02 2.224e+02 2.503e+02, threshold=4.158e+02, percent-clipped=0.0 2024-06-21 05:15:49,126 INFO [train.py:1028] (1/2) Epoch 18, batch 5450, loss[loss=0.1961, simple_loss=0.2445, pruned_loss=0.07391, over 12512.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2388, pruned_loss=0.071, over 2570847.28 frames. ], batch size: 25, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:15:55,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=325310.3333333333, ans=0.125 2024-06-21 05:15:57,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=325328.6666666667, ans=0.125 2024-06-21 05:16:03,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=325328.6666666667, ans=0.025 2024-06-21 05:16:35,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=325383.6666666667, ans=0.1 2024-06-21 05:16:45,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.00 vs. limit=22.5 2024-06-21 05:16:46,093 INFO [train.py:1028] (1/2) Epoch 18, batch 5500, loss[loss=0.2119, simple_loss=0.2467, pruned_loss=0.08857, over 12234.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2383, pruned_loss=0.07103, over 2565035.45 frames. ], batch size: 241, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:16:56,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=325420.3333333333, ans=0.125 2024-06-21 05:17:02,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=325420.3333333333, ans=0.125 2024-06-21 05:17:15,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325438.6666666667, ans=0.125 2024-06-21 05:17:38,353 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.900e+02 2.074e+02 2.229e+02 3.003e+02, threshold=4.148e+02, percent-clipped=0.0 2024-06-21 05:17:39,955 INFO [train.py:1028] (1/2) Epoch 18, batch 5550, loss[loss=0.1771, simple_loss=0.2318, pruned_loss=0.0612, over 13333.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2384, pruned_loss=0.07071, over 2568277.60 frames. ], batch size: 43, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:17:54,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=325530.3333333333, ans=0.125 2024-06-21 05:18:31,136 INFO [train.py:1028] (1/2) Epoch 18, batch 5600, loss[loss=0.1925, simple_loss=0.2333, pruned_loss=0.07588, over 13274.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2379, pruned_loss=0.07052, over 2570980.26 frames. 
], batch size: 89, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:18:46,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=325603.6666666667, ans=0.0 2024-06-21 05:18:47,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=325603.6666666667, ans=0.125 2024-06-21 05:18:55,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=325622.0, ans=0.0 2024-06-21 05:18:57,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325622.0, ans=0.1 2024-06-21 05:18:58,136 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.47 vs. limit=22.5 2024-06-21 05:19:04,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325640.3333333333, ans=0.125 2024-06-21 05:19:06,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=325640.3333333333, ans=0.2 2024-06-21 05:19:19,858 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.38 vs. limit=15.0 2024-06-21 05:19:22,594 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.922e+02 2.058e+02 2.231e+02 3.529e+02, threshold=4.117e+02, percent-clipped=0.0 2024-06-21 05:19:24,525 INFO [train.py:1028] (1/2) Epoch 18, batch 5650, loss[loss=0.212, simple_loss=0.2537, pruned_loss=0.08514, over 12626.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2376, pruned_loss=0.07044, over 2575739.14 frames. ], batch size: 202, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:19:28,804 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2024-06-21 05:20:03,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=325732.0, ans=0.025 2024-06-21 05:20:13,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=325750.3333333333, ans=0.2 2024-06-21 05:20:15,779 INFO [train.py:1028] (1/2) Epoch 18, batch 5700, loss[loss=0.194, simple_loss=0.2492, pruned_loss=0.0694, over 13310.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2375, pruned_loss=0.07019, over 2578537.70 frames. ], batch size: 63, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:20:16,304 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.59 vs. 
limit=22.5 2024-06-21 05:20:45,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=325823.6666666667, ans=0.125 2024-06-21 05:20:49,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325823.6666666667, ans=0.1 2024-06-21 05:21:04,680 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.928e+02 2.087e+02 2.285e+02 3.074e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-21 05:21:06,651 INFO [train.py:1028] (1/2) Epoch 18, batch 5750, loss[loss=0.2237, simple_loss=0.2639, pruned_loss=0.09174, over 12772.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.239, pruned_loss=0.07085, over 2579597.40 frames. ], batch size: 176, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:21:15,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.45 vs. limit=15.0 2024-06-21 05:21:17,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=325860.3333333333, ans=0.125 2024-06-21 05:21:33,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2024-06-21 05:21:41,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=325915.3333333333, ans=0.2 2024-06-21 05:21:44,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=325915.3333333333, ans=0.125 2024-06-21 05:21:58,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=325952.0, ans=0.125 2024-06-21 05:21:59,409 INFO [train.py:1028] (1/2) Epoch 18, batch 5800, loss[loss=0.1911, simple_loss=0.2401, pruned_loss=0.07103, over 12810.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2401, pruned_loss=0.07148, over 2578963.26 frames. ], batch size: 177, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:22:38,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=326007.0, ans=0.125 2024-06-21 05:22:40,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=326007.0, ans=0.2 2024-06-21 05:22:40,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.76 vs. limit=15.0 2024-06-21 05:22:49,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326025.3333333333, ans=0.1 2024-06-21 05:22:49,973 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.85 vs. 
limit=22.5 2024-06-21 05:22:57,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=326025.3333333333, ans=0.2 2024-06-21 05:22:57,896 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 1.973e+02 2.141e+02 2.299e+02 3.812e+02, threshold=4.282e+02, percent-clipped=0.0 2024-06-21 05:22:59,860 INFO [train.py:1028] (1/2) Epoch 18, batch 5850, loss[loss=0.2113, simple_loss=0.2666, pruned_loss=0.07801, over 12574.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2418, pruned_loss=0.07226, over 2576970.89 frames. ], batch size: 202, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:23:01,301 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2024-06-21 05:23:03,406 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2024-06-21 05:23:08,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=326043.6666666667, ans=0.0 2024-06-21 05:23:25,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=326080.3333333333, ans=0.2 2024-06-21 05:23:31,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2024-06-21 05:23:47,172 INFO [train.py:1028] (1/2) Epoch 18, batch 5900, loss[loss=0.173, simple_loss=0.221, pruned_loss=0.0625, over 13097.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2439, pruned_loss=0.07289, over 2577134.84 frames. ], batch size: 121, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:23:47,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=326135.3333333333, ans=0.2 2024-06-21 05:24:27,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.56 vs. limit=22.5 2024-06-21 05:24:33,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=326208.6666666667, ans=0.1 2024-06-21 05:24:39,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=326208.6666666667, ans=0.2 2024-06-21 05:24:43,441 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 1.966e+02 2.155e+02 2.357e+02 3.049e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-21 05:24:44,702 INFO [train.py:1028] (1/2) Epoch 18, batch 5950, loss[loss=0.1803, simple_loss=0.2248, pruned_loss=0.06785, over 13091.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2451, pruned_loss=0.07344, over 2581983.54 frames. ], batch size: 121, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:25:09,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=326263.6666666667, ans=0.2 2024-06-21 05:25:18,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.49 vs. 
limit=12.0 2024-06-21 05:25:29,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.58 vs. limit=15.0 2024-06-21 05:25:29,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=326300.3333333333, ans=0.125 2024-06-21 05:25:32,150 INFO [train.py:1028] (1/2) Epoch 18, batch 6000, loss[loss=0.2149, simple_loss=0.2613, pruned_loss=0.0842, over 12228.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2469, pruned_loss=0.07418, over 2576328.02 frames. ], batch size: 240, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:25:32,150 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 05:25:44,035 INFO [train.py:1060] (1/2) Epoch 18, validation: loss=0.1882, simple_loss=0.2528, pruned_loss=0.06183, over 351949.00 frames. 2024-06-21 05:25:44,036 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 05:26:23,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=326373.6666666667, ans=0.125 2024-06-21 05:26:25,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.39 vs. limit=15.0 2024-06-21 05:26:26,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=326392.0, ans=0.125 2024-06-21 05:26:37,572 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 1.974e+02 2.109e+02 2.263e+02 3.024e+02, threshold=4.218e+02, percent-clipped=0.0 2024-06-21 05:26:37,616 INFO [train.py:1028] (1/2) Epoch 18, batch 6050, loss[loss=0.1903, simple_loss=0.2494, pruned_loss=0.06566, over 12946.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2484, pruned_loss=0.07473, over 2578217.50 frames. 
], batch size: 39, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:26:37,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=326410.3333333333, ans=0.1 2024-06-21 05:26:38,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=326410.3333333333, ans=0.125 2024-06-21 05:26:52,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326428.6666666667, ans=0.1 2024-06-21 05:26:53,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=326428.6666666667, ans=0.1 2024-06-21 05:27:06,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=326447.0, ans=15.0 2024-06-21 05:27:08,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=326465.3333333333, ans=0.125 2024-06-21 05:27:10,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=326465.3333333333, ans=0.2 2024-06-21 05:27:10,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=326465.3333333333, ans=0.125 2024-06-21 05:27:12,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=326465.3333333333, ans=0.125 2024-06-21 05:27:15,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=326483.6666666667, ans=0.125 2024-06-21 05:27:21,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=326483.6666666667, ans=0.0 2024-06-21 05:27:24,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=326483.6666666667, ans=0.2 2024-06-21 05:27:27,823 INFO [train.py:1028] (1/2) Epoch 18, batch 6100, loss[loss=0.1975, simple_loss=0.2401, pruned_loss=0.07751, over 13130.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2499, pruned_loss=0.0753, over 2580870.71 frames. ], batch size: 121, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:27:30,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=326502.0, ans=0.0 2024-06-21 05:27:40,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=326520.3333333333, ans=0.1 2024-06-21 05:27:55,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=326538.6666666667, ans=0.125 2024-06-21 05:28:18,445 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 1.958e+02 2.118e+02 2.331e+02 3.403e+02, threshold=4.236e+02, percent-clipped=0.0 2024-06-21 05:28:18,481 INFO [train.py:1028] (1/2) Epoch 18, batch 6150, loss[loss=0.2174, simple_loss=0.2585, pruned_loss=0.08814, over 11085.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2516, pruned_loss=0.07603, over 2578837.62 frames. 
], batch size: 304, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:28:28,067 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.07 vs. limit=12.0 2024-06-21 05:28:30,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=326612.0, ans=0.2 2024-06-21 05:28:51,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=326630.3333333333, ans=0.035 2024-06-21 05:28:52,755 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=15.0 2024-06-21 05:29:05,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=326648.6666666667, ans=0.2 2024-06-21 05:29:11,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.15 vs. limit=15.0 2024-06-21 05:29:18,506 INFO [train.py:1028] (1/2) Epoch 18, batch 6200, loss[loss=0.2156, simple_loss=0.2683, pruned_loss=0.08146, over 13267.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2532, pruned_loss=0.07646, over 2578003.97 frames. ], batch size: 89, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:29:19,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=326685.3333333333, ans=0.0 2024-06-21 05:29:19,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=326685.3333333333, ans=0.5 2024-06-21 05:29:32,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.89 vs. limit=22.5 2024-06-21 05:29:44,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=326740.3333333333, ans=0.0 2024-06-21 05:29:51,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=326740.3333333333, ans=0.2 2024-06-21 05:29:56,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=326758.6666666667, ans=0.125 2024-06-21 05:30:05,321 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 05:30:07,664 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 2.002e+02 2.194e+02 2.572e+02 4.041e+02, threshold=4.388e+02, percent-clipped=0.0 2024-06-21 05:30:07,699 INFO [train.py:1028] (1/2) Epoch 18, batch 6250, loss[loss=0.2029, simple_loss=0.2548, pruned_loss=0.07554, over 13173.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2543, pruned_loss=0.07694, over 2570206.62 frames. 
], batch size: 83, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:30:24,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=326777.0, ans=0.0 2024-06-21 05:30:28,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=326795.3333333333, ans=0.0 2024-06-21 05:30:42,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=326813.6666666667, ans=0.125 2024-06-21 05:30:55,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.43 vs. limit=15.0 2024-06-21 05:31:05,806 INFO [train.py:1028] (1/2) Epoch 18, batch 6300, loss[loss=0.1853, simple_loss=0.2359, pruned_loss=0.06736, over 11549.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2547, pruned_loss=0.07719, over 2564744.24 frames. ], batch size: 16, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:31:09,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.70 vs. limit=15.0 2024-06-21 05:31:13,908 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.27 vs. limit=15.0 2024-06-21 05:31:19,103 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.21 vs. limit=22.5 2024-06-21 05:31:30,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=326905.3333333333, ans=0.125 2024-06-21 05:31:58,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=326942.0, ans=0.125 2024-06-21 05:32:00,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326960.3333333333, ans=0.1 2024-06-21 05:32:01,549 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.982e+02 2.137e+02 2.387e+02 3.416e+02, threshold=4.274e+02, percent-clipped=0.0 2024-06-21 05:32:01,594 INFO [train.py:1028] (1/2) Epoch 18, batch 6350, loss[loss=0.2183, simple_loss=0.2695, pruned_loss=0.08353, over 12499.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2561, pruned_loss=0.07719, over 2573625.58 frames. ], batch size: 202, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:32:06,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=326960.3333333333, ans=0.125 2024-06-21 05:32:17,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.67 vs. limit=15.0 2024-06-21 05:32:23,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=326997.0, ans=0.0 2024-06-21 05:32:40,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=327015.3333333333, ans=0.125 2024-06-21 05:32:40,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.88 vs. 
limit=15.0 2024-06-21 05:32:43,161 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.67 vs. limit=15.0 2024-06-21 05:32:53,932 INFO [train.py:1028] (1/2) Epoch 18, batch 6400, loss[loss=0.2044, simple_loss=0.2515, pruned_loss=0.07867, over 13162.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.258, pruned_loss=0.07789, over 2575530.74 frames. ], batch size: 67, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:33:02,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.56 vs. limit=15.0 2024-06-21 05:33:45,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=327125.3333333333, ans=0.2 2024-06-21 05:33:47,708 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.074e+02 2.211e+02 2.412e+02 2.962e+02, threshold=4.421e+02, percent-clipped=0.0 2024-06-21 05:33:47,742 INFO [train.py:1028] (1/2) Epoch 18, batch 6450, loss[loss=0.2247, simple_loss=0.2753, pruned_loss=0.08705, over 12486.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2594, pruned_loss=0.0787, over 2580474.56 frames. ], batch size: 202, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:33:55,601 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.69 vs. limit=12.0 2024-06-21 05:33:58,039 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.52 vs. limit=22.5 2024-06-21 05:34:17,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=327180.3333333333, ans=0.04949747468305833 2024-06-21 05:34:19,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327198.6666666667, ans=0.125 2024-06-21 05:34:40,616 INFO [train.py:1028] (1/2) Epoch 18, batch 6500, loss[loss=0.243, simple_loss=0.2818, pruned_loss=0.1021, over 10889.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2614, pruned_loss=0.07906, over 2583665.22 frames. ], batch size: 304, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:34:46,101 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.99 vs. limit=15.0 2024-06-21 05:35:03,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.54 vs. limit=15.0 2024-06-21 05:35:15,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=12.0 2024-06-21 05:35:27,949 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.038e+02 2.196e+02 2.474e+02 3.910e+02, threshold=4.392e+02, percent-clipped=0.0 2024-06-21 05:35:27,978 INFO [train.py:1028] (1/2) Epoch 18, batch 6550, loss[loss=0.1872, simple_loss=0.2473, pruned_loss=0.0635, over 12771.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2622, pruned_loss=0.07884, over 2588047.25 frames. 
], batch size: 22, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:35:36,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327327.0, ans=0.1 2024-06-21 05:36:10,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=327400.3333333333, ans=0.0 2024-06-21 05:36:12,556 INFO [train.py:1028] (1/2) Epoch 18, batch 6600, loss[loss=0.2063, simple_loss=0.2657, pruned_loss=0.07342, over 13231.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2623, pruned_loss=0.07884, over 2590097.33 frames. ], batch size: 72, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:36:14,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=327418.6666666667, ans=0.2 2024-06-21 05:36:23,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=327437.0, ans=0.0 2024-06-21 05:37:05,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=327492.0, ans=0.09899494936611666 2024-06-21 05:37:10,614 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.042e+02 2.176e+02 2.372e+02 3.948e+02, threshold=4.351e+02, percent-clipped=0.0 2024-06-21 05:37:10,657 INFO [train.py:1028] (1/2) Epoch 18, batch 6650, loss[loss=0.2347, simple_loss=0.2785, pruned_loss=0.09548, over 12956.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2644, pruned_loss=0.07991, over 2584977.78 frames. ], batch size: 158, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:37:31,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327547.0, ans=0.1 2024-06-21 05:37:40,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=327547.0, ans=0.04949747468305833 2024-06-21 05:37:45,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=327565.3333333333, ans=0.0 2024-06-21 05:37:48,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.42 vs. limit=12.0 2024-06-21 05:37:53,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327583.6666666667, ans=0.1 2024-06-21 05:37:53,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=327583.6666666667, ans=0.125 2024-06-21 05:38:06,276 INFO [train.py:1028] (1/2) Epoch 18, batch 6700, loss[loss=0.2203, simple_loss=0.2701, pruned_loss=0.08523, over 12682.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2658, pruned_loss=0.0806, over 2584500.48 frames. 
], batch size: 176, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:38:10,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=327602.0, ans=0.0 2024-06-21 05:38:13,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=327620.3333333333, ans=10.0 2024-06-21 05:38:15,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=327620.3333333333, ans=0.07 2024-06-21 05:38:18,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=327620.3333333333, ans=0.2 2024-06-21 05:38:19,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=327620.3333333333, ans=10.0 2024-06-21 05:38:40,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=327657.0, ans=0.2 2024-06-21 05:38:44,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=327675.3333333333, ans=0.125 2024-06-21 05:38:54,349 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.107e+02 2.360e+02 2.692e+02 3.618e+02, threshold=4.720e+02, percent-clipped=0.0 2024-06-21 05:38:54,387 INFO [train.py:1028] (1/2) Epoch 18, batch 6750, loss[loss=0.2451, simple_loss=0.2895, pruned_loss=0.1004, over 12199.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2665, pruned_loss=0.08082, over 2578635.09 frames. ], batch size: 240, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:38:56,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327693.6666666667, ans=0.125 2024-06-21 05:38:56,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.63 vs. limit=22.5 2024-06-21 05:38:59,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=327693.6666666667, ans=0.0 2024-06-21 05:39:00,048 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 05:39:03,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=327712.0, ans=0.035 2024-06-21 05:39:08,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=327712.0, ans=0.125 2024-06-21 05:39:08,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=327712.0, ans=0.125 2024-06-21 05:39:18,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=327730.3333333333, ans=0.0 2024-06-21 05:39:18,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. 
limit=15.0 2024-06-21 05:39:23,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=327748.6666666667, ans=0.125 2024-06-21 05:39:24,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=327748.6666666667, ans=0.0 2024-06-21 05:39:27,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=327748.6666666667, ans=0.125 2024-06-21 05:39:28,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327748.6666666667, ans=0.1 2024-06-21 05:39:30,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=327748.6666666667, ans=0.0 2024-06-21 05:39:52,374 INFO [train.py:1028] (1/2) Epoch 18, batch 6800, loss[loss=0.2127, simple_loss=0.2702, pruned_loss=0.07761, over 13221.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2677, pruned_loss=0.08134, over 2581098.79 frames. ], batch size: 67, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:40:03,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=327803.6666666667, ans=0.125 2024-06-21 05:40:16,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2024-06-21 05:40:32,458 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 05:40:35,503 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.060e+02 2.214e+02 2.441e+02 3.106e+02, threshold=4.427e+02, percent-clipped=0.0 2024-06-21 05:40:35,554 INFO [train.py:1028] (1/2) Epoch 18, batch 6850, loss[loss=0.2392, simple_loss=0.2939, pruned_loss=0.09225, over 13267.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2684, pruned_loss=0.08145, over 2585059.57 frames. ], batch size: 63, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:40:43,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=327877.0, ans=0.125 2024-06-21 05:41:18,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327932.0, ans=0.125 2024-06-21 05:41:19,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327932.0, ans=0.1 2024-06-21 05:41:27,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=327950.3333333333, ans=0.125 2024-06-21 05:41:28,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=327950.3333333333, ans=0.125 2024-06-21 05:41:32,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=327950.3333333333, ans=0.0 2024-06-21 05:41:35,081 INFO [train.py:1028] (1/2) Epoch 18, batch 6900, loss[loss=0.2256, simple_loss=0.2868, pruned_loss=0.08223, over 13031.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2696, pruned_loss=0.08183, over 2586281.03 frames. 
], batch size: 48, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:41:41,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=327968.6666666667, ans=0.0 2024-06-21 05:41:50,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=327987.0, ans=0.025 2024-06-21 05:41:55,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=328005.3333333333, ans=10.0 2024-06-21 05:41:57,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.00 vs. limit=15.0 2024-06-21 05:42:02,401 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.774e-01 2024-06-21 05:42:04,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=328023.6666666667, ans=0.5 2024-06-21 05:42:09,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=328023.6666666667, ans=0.125 2024-06-21 05:42:20,348 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.036e+02 2.155e+02 2.392e+02 3.183e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-21 05:42:20,391 INFO [train.py:1028] (1/2) Epoch 18, batch 6950, loss[loss=0.1909, simple_loss=0.2454, pruned_loss=0.06817, over 11988.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2696, pruned_loss=0.08176, over 2580299.97 frames. ], batch size: 17, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:42:33,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=328078.6666666667, ans=0.0 2024-06-21 05:42:38,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.53 vs. limit=22.5 2024-06-21 05:42:38,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=328078.6666666667, ans=0.125 2024-06-21 05:42:45,408 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.67 vs. limit=10.0 2024-06-21 05:42:45,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=328097.0, ans=0.0 2024-06-21 05:42:59,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.45 vs. limit=22.5 2024-06-21 05:43:20,084 INFO [train.py:1028] (1/2) Epoch 18, batch 7000, loss[loss=0.2325, simple_loss=0.2836, pruned_loss=0.0907, over 12937.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2693, pruned_loss=0.08111, over 2574657.70 frames. 
], batch size: 158, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:43:26,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=328152.0, ans=0.0 2024-06-21 05:43:36,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=328170.3333333333, ans=0.2 2024-06-21 05:43:43,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=328188.6666666667, ans=0.0 2024-06-21 05:43:54,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=328207.0, ans=10.0 2024-06-21 05:44:13,368 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.044e+02 2.195e+02 2.306e+02 3.731e+02, threshold=4.390e+02, percent-clipped=0.0 2024-06-21 05:44:13,405 INFO [train.py:1028] (1/2) Epoch 18, batch 7050, loss[loss=0.2338, simple_loss=0.2824, pruned_loss=0.09255, over 12756.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2704, pruned_loss=0.08164, over 2582150.70 frames. ], batch size: 176, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:44:15,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=328243.6666666667, ans=0.0 2024-06-21 05:44:20,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=328243.6666666667, ans=0.125 2024-06-21 05:44:20,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=328243.6666666667, ans=0.2 2024-06-21 05:44:35,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=328280.3333333333, ans=0.0 2024-06-21 05:44:36,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=328280.3333333333, ans=0.0 2024-06-21 05:44:39,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=328280.3333333333, ans=0.1 2024-06-21 05:44:40,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.66 vs. limit=22.5 2024-06-21 05:45:04,755 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.59 vs. limit=15.0 2024-06-21 05:45:05,240 INFO [train.py:1028] (1/2) Epoch 18, batch 7100, loss[loss=0.2338, simple_loss=0.2786, pruned_loss=0.09446, over 13143.00 frames. ], tot_loss[loss=0.218, simple_loss=0.271, pruned_loss=0.08244, over 2575166.00 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:45:17,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.88 vs. 
2024-06-21 05:45:21,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=328353.6666666667, ans=0.0
2024-06-21 05:45:26,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=328372.0, ans=0.1
2024-06-21 05:45:28,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=328372.0, ans=0.0
2024-06-21 05:45:35,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=328372.0, ans=0.2
2024-06-21 05:46:07,673 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.083e+02 2.243e+02 2.462e+02 3.681e+02, threshold=4.487e+02, percent-clipped=0.0
2024-06-21 05:46:07,708 INFO [train.py:1028] (1/2) Epoch 18, batch 7150, loss[loss=0.2487, simple_loss=0.2897, pruned_loss=0.1038, over 12545.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2714, pruned_loss=0.08191, over 2573428.05 frames. ], batch size: 202, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:46:13,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=328427.0, ans=0.1
2024-06-21 05:46:27,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=328463.6666666667, ans=0.025
2024-06-21 05:46:29,064 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 05:46:30,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=328463.6666666667, ans=0.125
2024-06-21 05:46:41,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=328482.0, ans=0.0
2024-06-21 05:46:59,156 INFO [train.py:1028] (1/2) Epoch 18, batch 7200, loss[loss=0.2409, simple_loss=0.2852, pruned_loss=0.0983, over 13174.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2727, pruned_loss=0.0822, over 2578552.96 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:47:03,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=328518.6666666667, ans=0.0
2024-06-21 05:47:12,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=328537.0, ans=0.125
2024-06-21 05:47:51,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=328592.0, ans=0.0
2024-06-21 05:47:57,774 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.139e+02 2.254e+02 2.492e+02 2.952e+02, threshold=4.508e+02, percent-clipped=0.0
2024-06-21 05:47:57,810 INFO [train.py:1028] (1/2) Epoch 18, batch 7250, loss[loss=0.2284, simple_loss=0.2827, pruned_loss=0.08711, over 12949.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2736, pruned_loss=0.08231, over 2578585.31 frames. ], batch size: 36, lr: 3.21e-03, grad_scale: 32.0
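In each optim.py WARNING above, the five numbers after "grad-norm quartiles" are the min/25%/median/75%/max of recent gradient norms, and threshold is exactly Clipping_scale (2.0) times the median (e.g. 2 * 2.254e+02 = 4.508e+02); percent-clipped stays at 0.0 because no recent norm exceeded the threshold. A rough sketch of that bookkeeping, assuming the threshold is the scaled running median; class and method names are illustrative, not the actual optimizer internals:

    from collections import deque
    import torch

    class GradNormClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)  # recent total grad norms

        def clip_(self, params) -> float:
            grads = [p.grad.norm() for p in params if p.grad is not None]
            norm = torch.norm(torch.stack(grads))
            self.norms.append(norm.item())
            hist = torch.tensor(list(self.norms))
            # the five quartiles as logged: min, 25%, median, 75%, max
            q = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * q[2].item()
            if norm > threshold:  # would count toward percent-clipped
                for p in params:
                    if p.grad is not None:
                        p.grad.mul_(threshold / norm)
            return threshold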
2024-06-21 05:48:03,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=328610.3333333333, ans=0.125
2024-06-21 05:48:41,032 INFO [train.py:1028] (1/2) Epoch 18, batch 7300, loss[loss=0.2135, simple_loss=0.2701, pruned_loss=0.07851, over 12919.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2751, pruned_loss=0.08315, over 2579143.67 frames. ], batch size: 36, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:48:42,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.35 vs. limit=10.0
2024-06-21 05:48:48,294 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.82 vs. limit=15.0
2024-06-21 05:48:51,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.12 vs. limit=15.0
2024-06-21 05:49:17,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=328757.0, ans=0.125
2024-06-21 05:49:22,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.32 vs. limit=12.0
2024-06-21 05:49:28,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=328775.3333333333, ans=0.125
2024-06-21 05:49:35,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=328793.6666666667, ans=0.1
2024-06-21 05:49:36,475 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.146e+02 2.288e+02 2.461e+02 3.426e+02, threshold=4.576e+02, percent-clipped=0.0
2024-06-21 05:49:36,516 INFO [train.py:1028] (1/2) Epoch 18, batch 7350, loss[loss=0.2522, simple_loss=0.3026, pruned_loss=0.1009, over 13324.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2752, pruned_loss=0.08342, over 2581959.28 frames. ], batch size: 46, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:49:47,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=328812.0, ans=0.07
2024-06-21 05:49:48,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.00 vs. limit=10.0
2024-06-21 05:49:50,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=328812.0, ans=0.025
2024-06-21 05:50:05,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=328848.6666666667, ans=0.025
2024-06-21 05:50:20,248 INFO [train.py:1028] (1/2) Epoch 18, batch 7400, loss[loss=0.2328, simple_loss=0.2871, pruned_loss=0.08923, over 13264.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2745, pruned_loss=0.08288, over 2587106.54 frames. ], batch size: 63, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:50:28,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=328885.3333333333, ans=0.0
2024-06-21 05:50:33,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=328903.6666666667, ans=0.125
2024-06-21 05:51:20,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.098e+02 2.281e+02 2.540e+02 3.820e+02, threshold=4.562e+02, percent-clipped=0.0
2024-06-21 05:51:20,946 INFO [train.py:1028] (1/2) Epoch 18, batch 7450, loss[loss=0.2134, simple_loss=0.2682, pruned_loss=0.07928, over 12714.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2744, pruned_loss=0.08267, over 2580410.77 frames. ], batch size: 29, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:51:29,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=328977.0, ans=0.0
2024-06-21 05:51:54,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=329032.0, ans=0.125
2024-06-21 05:52:00,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.84 vs. limit=10.0
2024-06-21 05:52:19,501 INFO [train.py:1028] (1/2) Epoch 18, batch 7500, loss[loss=0.2047, simple_loss=0.2519, pruned_loss=0.07878, over 10399.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2757, pruned_loss=0.0834, over 2577779.53 frames. ], batch size: 304, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:52:35,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=329087.0, ans=0.125
2024-06-21 05:52:42,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.82 vs. limit=12.0
2024-06-21 05:52:43,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=329105.3333333333, ans=0.0
2024-06-21 05:52:46,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=329105.3333333333, ans=0.0
2024-06-21 05:52:54,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=329123.6666666667, ans=0.125
2024-06-21 05:53:10,307 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.092e+02 2.265e+02 2.570e+02 3.672e+02, threshold=4.530e+02, percent-clipped=0.0
2024-06-21 05:53:10,341 INFO [train.py:1028] (1/2) Epoch 18, batch 7550, loss[loss=0.2267, simple_loss=0.2742, pruned_loss=0.08959, over 12977.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2762, pruned_loss=0.08369, over 2577033.96 frames. ], batch size: 158, lr: 3.21e-03, grad_scale: 32.0
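The tot_loss[... over ~2.58M frames] field is an aggregate over recent batches, and the fractional frame counts (e.g. 2577779.53 frames) suggest a decayed, frame-weighted running average rather than a plain sum. A sketch of that bookkeeping; the decay constant is a guess for illustration, chosen so that with ~13k frames per batch the steady-state window is about 13000 / (1 - 0.995) = 2.6e6 frames, matching the logged counts:

    class RunningLoss:
        """Frame-weighted moving average with exponential decay."""
        def __init__(self, decay: float = 0.995):  # decay value is assumed
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> float:
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames
            return self.loss_sum / self.frames  # the logged tot_loss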
2024-06-21 05:53:14,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329160.3333333333, ans=0.1
2024-06-21 05:53:24,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329178.6666666667, ans=0.1
2024-06-21 05:53:40,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=329215.3333333333, ans=0.025
2024-06-21 05:53:48,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=329215.3333333333, ans=0.125
2024-06-21 05:53:48,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=329215.3333333333, ans=0.1
2024-06-21 05:54:02,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0
2024-06-21 05:54:06,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=329233.6666666667, ans=0.0
2024-06-21 05:54:11,954 INFO [train.py:1028] (1/2) Epoch 18, batch 7600, loss[loss=0.2318, simple_loss=0.2837, pruned_loss=0.08996, over 13244.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2764, pruned_loss=0.08408, over 2576296.72 frames. ], batch size: 83, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:54:12,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=329252.0, ans=0.0
2024-06-21 05:54:14,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=329252.0, ans=0.04949747468305833
2024-06-21 05:54:29,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.99 vs. limit=15.0
2024-06-21 05:54:41,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.51 vs. limit=22.5
2024-06-21 05:54:46,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=329307.0, ans=0.125
2024-06-21 05:54:47,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=329307.0, ans=0.2
2024-06-21 05:54:56,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=329325.3333333333, ans=0.125
2024-06-21 05:55:03,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=329325.3333333333, ans=0.125
2024-06-21 05:55:06,620 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0
2024-06-21 05:55:07,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=329343.6666666667, ans=0.2
2024-06-21 05:55:07,888 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.147e+02 2.403e+02 2.598e+02 3.884e+02, threshold=4.805e+02, percent-clipped=0.0
2024-06-21 05:55:07,923 INFO [train.py:1028] (1/2) Epoch 18, batch 7650, loss[loss=0.2341, simple_loss=0.2905, pruned_loss=0.08878, over 12984.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2765, pruned_loss=0.08408, over 2573823.51 frames. ], batch size: 33, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:55:18,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=329362.0, ans=0.125
2024-06-21 05:55:30,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=329380.3333333333, ans=0.0
2024-06-21 05:55:32,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=329380.3333333333, ans=0.2
2024-06-21 05:55:39,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=329398.6666666667, ans=0.125
2024-06-21 05:55:46,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=329398.6666666667, ans=0.125
2024-06-21 05:55:52,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=329417.0, ans=0.125
2024-06-21 05:56:00,822 INFO [train.py:1028] (1/2) Epoch 18, batch 7700, loss[loss=0.2095, simple_loss=0.2738, pruned_loss=0.07265, over 13261.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2773, pruned_loss=0.08446, over 2571046.94 frames. ], batch size: 63, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:56:07,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=329435.3333333333, ans=0.125
2024-06-21 05:56:10,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=329453.6666666667, ans=0.125
2024-06-21 05:56:15,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329453.6666666667, ans=0.1
2024-06-21 05:56:22,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=329472.0, ans=0.125
2024-06-21 05:56:25,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=329472.0, ans=0.0
2024-06-21 05:56:44,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329508.6666666667, ans=0.1
2024-06-21 05:56:52,432 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.091e+02 2.239e+02 2.424e+02 3.231e+02, threshold=4.478e+02, percent-clipped=0.0
2024-06-21 05:56:52,483 INFO [train.py:1028] (1/2) Epoch 18, batch 7750, loss[loss=0.2213, simple_loss=0.2821, pruned_loss=0.08026, over 13204.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2785, pruned_loss=0.08537, over 2574591.51 frames. ], batch size: 72, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:56:53,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=329527.0, ans=0.125
2024-06-21 05:57:08,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=329527.0, ans=0.0
2024-06-21 05:57:34,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=329582.0, ans=0.2
2024-06-21 05:57:42,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=329600.3333333333, ans=0.125
2024-06-21 05:57:42,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.10 vs. limit=15.0
2024-06-21 05:57:48,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=329600.3333333333, ans=0.125
2024-06-21 05:57:51,500 INFO [train.py:1028] (1/2) Epoch 18, batch 7800, loss[loss=0.2357, simple_loss=0.2838, pruned_loss=0.09381, over 13210.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2789, pruned_loss=0.08538, over 2579362.61 frames. ], batch size: 95, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:57:53,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=329618.6666666667, ans=0.125
2024-06-21 05:58:08,726 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 05:58:10,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=329637.0, ans=0.125
2024-06-21 05:58:11,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=329655.3333333333, ans=0.0
2024-06-21 05:58:25,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=329655.3333333333, ans=0.125
2024-06-21 05:58:29,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=329673.6666666667, ans=0.0
2024-06-21 05:58:39,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=329692.0, ans=0.0
2024-06-21 05:58:43,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.50 vs. limit=15.0
2024-06-21 05:58:44,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=329692.0, ans=0.0
2024-06-21 05:58:49,216 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.121e+02 2.301e+02 2.539e+02 3.490e+02, threshold=4.601e+02, percent-clipped=0.0
2024-06-21 05:58:49,257 INFO [train.py:1028] (1/2) Epoch 18, batch 7850, loss[loss=0.2143, simple_loss=0.2702, pruned_loss=0.07914, over 12180.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2794, pruned_loss=0.08568, over 2573981.51 frames. ], batch size: 18, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:59:11,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.51 vs. limit=22.5
2024-06-21 05:59:13,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=329747.0, ans=0.125
2024-06-21 05:59:30,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0
2024-06-21 05:59:42,765 INFO [train.py:1028] (1/2) Epoch 18, batch 7900, loss[loss=0.2043, simple_loss=0.262, pruned_loss=0.07328, over 13136.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2804, pruned_loss=0.08652, over 2572579.54 frames. ], batch size: 77, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 05:59:45,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.35 vs. limit=15.0
2024-06-21 05:59:50,348 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0
2024-06-21 05:59:54,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.24 vs. limit=22.5
2024-06-21 05:59:59,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=329820.3333333333, ans=0.1
2024-06-21 06:00:25,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.78 vs. limit=15.0
2024-06-21 06:00:26,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=329857.0, ans=0.0
2024-06-21 06:00:30,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=329875.3333333333, ans=0.2
2024-06-21 06:00:32,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329875.3333333333, ans=0.1
2024-06-21 06:00:34,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=329875.3333333333, ans=0.125
2024-06-21 06:00:37,215 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.111e+02 2.272e+02 2.551e+02 3.559e+02, threshold=4.544e+02, percent-clipped=0.0
2024-06-21 06:00:37,274 INFO [train.py:1028] (1/2) Epoch 18, batch 7950, loss[loss=0.2376, simple_loss=0.2781, pruned_loss=0.09862, over 10578.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2807, pruned_loss=0.08633, over 2575275.77 frames. ], batch size: 304, lr: 3.21e-03, grad_scale: 32.0
2024-06-21 06:00:52,399 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:00:52,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=329912.0, ans=0.5
2024-06-21 06:00:52,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=329912.0, ans=0.1
2024-06-21 06:00:56,740 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=8.444e-02
2024-06-21 06:01:27,677 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.71 vs. limit=22.5
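Each Whitening line compares a per-module "whiteness" metric against a limit (metric=21.71 vs. limit=22.5 just above); the metric is 1.0 for perfectly white features (decorrelated channels with equal variance) and grows as the covariance concentrates in a few directions. One plausible form of such a metric, shown purely to illustrate what the logged numbers measure, not as the scaling.py code:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """x: (num_frames, num_channels). Returns >= 1.0, with equality
        iff the within-group covariance is a multiple of the identity."""
        metrics = []
        for g in x.chunk(num_groups, dim=-1):
            g = g - g.mean(dim=0)            # center each channel
            cov = (g.T @ g) / g.shape[0]     # channel covariance
            d = cov.shape[0]
            # by Cauchy-Schwarz, d * sum(eig^2) / (sum(eig))^2 >= 1
            metrics.append((d * (cov * cov).sum() / cov.trace() ** 2).item())
        return max(metrics)

A penalty would be applied only while the metric exceeds the limit, which is why most of the entries above (metric < limit) are purely informational.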
2024-06-21 06:01:31,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=329967.0, ans=0.0
2024-06-21 06:01:33,628 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.10 vs. limit=15.0
2024-06-21 06:01:34,929 INFO [train.py:1028] (1/2) Epoch 18, batch 8000, loss[loss=0.2105, simple_loss=0.2685, pruned_loss=0.07626, over 12660.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2812, pruned_loss=0.0864, over 2571986.85 frames. ], batch size: 29, lr: 3.21e-03, grad_scale: 64.0
2024-06-21 06:02:06,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=330022.0, ans=0.025
2024-06-21 06:02:09,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=330022.0, ans=0.0
2024-06-21 06:02:24,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=330058.6666666667, ans=0.125
2024-06-21 06:02:33,933 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.163e+02 2.316e+02 2.569e+02 3.179e+02, threshold=4.632e+02, percent-clipped=0.0
2024-06-21 06:02:33,971 INFO [train.py:1028] (1/2) Epoch 18, batch 8050, loss[loss=0.2192, simple_loss=0.2772, pruned_loss=0.08054, over 13215.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2805, pruned_loss=0.08583, over 2572030.00 frames. ], batch size: 83, lr: 3.21e-03, grad_scale: 64.0
2024-06-21 06:02:43,638 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.99 vs. limit=15.0
2024-06-21 06:02:48,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=330095.3333333333, ans=0.025
2024-06-21 06:02:49,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=330113.6666666667, ans=0.0
2024-06-21 06:02:51,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=330113.6666666667, ans=0.125
2024-06-21 06:02:53,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=330113.6666666667, ans=0.0
2024-06-21 06:03:27,922 INFO [train.py:1028] (1/2) Epoch 18, batch 8100, loss[loss=0.2256, simple_loss=0.278, pruned_loss=0.08656, over 13167.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.281, pruned_loss=0.08594, over 2576513.78 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 64.0
2024-06-21 06:03:33,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=330168.6666666667, ans=0.05
2024-06-21 06:03:54,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=330205.3333333333, ans=0.2
2024-06-21 06:03:58,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.20 vs. limit=15.0
2024-06-21 06:04:07,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330223.6666666667, ans=0.1
2024-06-21 06:04:21,312 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.104e+02 2.232e+02 2.387e+02 3.049e+02, threshold=4.464e+02, percent-clipped=0.0
2024-06-21 06:04:21,353 INFO [train.py:1028] (1/2) Epoch 18, batch 8150, loss[loss=0.2267, simple_loss=0.2772, pruned_loss=0.08808, over 13097.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2812, pruned_loss=0.08584, over 2580939.60 frames. ], batch size: 121, lr: 3.21e-03, grad_scale: 64.0
2024-06-21 06:04:26,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.99 vs. limit=22.5
2024-06-21 06:04:30,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=330278.6666666667, ans=0.09899494936611666
2024-06-21 06:05:10,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=330333.6666666667, ans=0.0
2024-06-21 06:05:11,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=15.0
2024-06-21 06:05:20,455 INFO [train.py:1028] (1/2) Epoch 18, batch 8200, loss[loss=0.2343, simple_loss=0.2867, pruned_loss=0.09091, over 13142.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2814, pruned_loss=0.08557, over 2584405.82 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 64.0
2024-06-21 06:05:21,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.88 vs. limit=15.0
2024-06-21 06:05:22,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=330352.0, ans=0.0
2024-06-21 06:05:28,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=330352.0, ans=0.125
2024-06-21 06:05:46,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=330388.6666666667, ans=0.2
2024-06-21 06:05:46,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=330388.6666666667, ans=0.07
2024-06-21 06:05:54,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=330407.0, ans=0.2
2024-06-21 06:06:04,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=330425.3333333333, ans=0.0
2024-06-21 06:06:12,744 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.050e+02 2.196e+02 2.395e+02 2.969e+02, threshold=4.392e+02, percent-clipped=0.0
2024-06-21 06:06:12,772 INFO [train.py:1028] (1/2) Epoch 18, batch 8250, loss[loss=0.2105, simple_loss=0.2742, pruned_loss=0.0734, over 13254.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2822, pruned_loss=0.08582, over 2584396.92 frames. ], batch size: 52, lr: 3.20e-03, grad_scale: 64.0
2024-06-21 06:06:28,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=330462.0, ans=0.2
2024-06-21 06:06:31,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=330480.3333333333, ans=0.04949747468305833
2024-06-21 06:06:34,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=330480.3333333333, ans=0.0
2024-06-21 06:06:51,213 INFO [train.py:1028] (1/2) Epoch 18, batch 8300, loss[loss=0.2382, simple_loss=0.2861, pruned_loss=0.09516, over 13006.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2814, pruned_loss=0.08541, over 2581843.13 frames. ], batch size: 102, lr: 3.20e-03, grad_scale: 64.0
2024-06-21 06:07:18,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=330572.0, ans=0.0
2024-06-21 06:07:21,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=330590.3333333333, ans=0.125
2024-06-21 06:07:33,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=330590.3333333333, ans=0.125
2024-06-21 06:07:45,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=330608.6666666667, ans=0.125
2024-06-21 06:07:51,735 INFO [train.py:1028] (1/2) Epoch 18, batch 8350, loss[loss=0.2289, simple_loss=0.2728, pruned_loss=0.0925, over 13192.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2819, pruned_loss=0.0855, over 2582049.52 frames. ], batch size: 112, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:07:52,653 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.154e+02 2.311e+02 2.459e+02 3.428e+02, threshold=4.622e+02, percent-clipped=0.0
2024-06-21 06:07:56,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=330627.0, ans=0.0
2024-06-21 06:08:00,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.86 vs. limit=22.5
2024-06-21 06:08:14,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=330663.6666666667, ans=0.2
2024-06-21 06:08:22,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=330682.0, ans=0.125
2024-06-21 06:08:28,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=330682.0, ans=0.2
2024-06-21 06:08:44,150 INFO [train.py:1028] (1/2) Epoch 18, batch 8400, loss[loss=0.1926, simple_loss=0.2557, pruned_loss=0.06477, over 13259.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2821, pruned_loss=0.0858, over 2577879.80 frames. ], batch size: 40, lr: 3.20e-03, grad_scale: 32.0
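The grad_scale column tracks dynamic fp16 loss scaling: it doubled from 32.0 to 64.0 at batch 8000 and fell back to 32.0 by batch 8350 (and drops to 16.0 around batch 8850 further down). That pattern matches the usual scheme of doubling the scale after a run of overflow-free steps and halving it when inf/nan gradients appear. A sketch with an assumed growth interval, illustrative rather than the trainer's actual scaler:

    class DynamicLossScale:
        def __init__(self, scale: float = 32.0, growth_interval: int = 1000):
            self.scale = scale
            self.growth_interval = growth_interval  # assumed value
            self._good_steps = 0

        def update(self, found_inf: bool) -> float:
            if found_inf:
                self.scale /= 2.0          # e.g. 64.0 -> 32.0 before batch 8350
                self._good_steps = 0
            else:
                self._good_steps += 1
                if self._good_steps % self.growth_interval == 0:
                    self.scale *= 2.0      # e.g. 32.0 -> 64.0 at batch 8000
            return self.scale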
2024-06-21 06:08:53,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=330737.0, ans=0.0
2024-06-21 06:08:57,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=330737.0, ans=0.2
2024-06-21 06:09:00,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=330737.0, ans=0.0
2024-06-21 06:09:12,914 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.73 vs. limit=15.0
2024-06-21 06:09:17,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=330773.6666666667, ans=0.5
2024-06-21 06:09:37,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=330792.0, ans=0.125
2024-06-21 06:09:38,676 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.17 vs. limit=15.0
2024-06-21 06:09:42,724 INFO [train.py:1028] (1/2) Epoch 18, batch 8450, loss[loss=0.2371, simple_loss=0.2899, pruned_loss=0.09211, over 13187.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2822, pruned_loss=0.08559, over 2580306.67 frames. ], batch size: 112, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:09:43,704 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.128e+02 2.308e+02 2.595e+02 3.438e+02, threshold=4.616e+02, percent-clipped=0.0
2024-06-21 06:09:45,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=15.0
2024-06-21 06:09:56,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=330828.6666666667, ans=0.025
2024-06-21 06:09:59,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330828.6666666667, ans=0.1
2024-06-21 06:10:08,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=330847.0, ans=0.1
2024-06-21 06:10:10,228 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.61 vs. limit=15.0
2024-06-21 06:10:10,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330847.0, ans=0.1
2024-06-21 06:10:16,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=330865.3333333333, ans=0.125
2024-06-21 06:10:23,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=330883.6666666667, ans=0.125
2024-06-21 06:10:31,314 INFO [train.py:1028] (1/2) Epoch 18, batch 8500, loss[loss=0.2165, simple_loss=0.2715, pruned_loss=0.08073, over 12556.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2837, pruned_loss=0.08616, over 2579297.02 frames. ], batch size: 29, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:10:43,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=330902.0, ans=0.125
2024-06-21 06:10:59,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330938.6666666667, ans=0.1
2024-06-21 06:11:10,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=330957.0, ans=0.125
2024-06-21 06:11:11,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=330957.0, ans=0.0
2024-06-21 06:11:18,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0
2024-06-21 06:11:30,187 INFO [train.py:1028] (1/2) Epoch 18, batch 8550, loss[loss=0.2183, simple_loss=0.2807, pruned_loss=0.07797, over 12565.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2836, pruned_loss=0.08608, over 2576810.63 frames. ], batch size: 22, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:11:31,078 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.092e+02 2.212e+02 2.480e+02 4.496e+02, threshold=4.425e+02, percent-clipped=0.0
2024-06-21 06:11:31,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. limit=6.0
2024-06-21 06:11:37,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=330993.6666666667, ans=0.0
2024-06-21 06:11:55,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=331030.3333333333, ans=0.125
2024-06-21 06:11:56,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=15.0
2024-06-21 06:12:21,918 INFO [train.py:1028] (1/2) Epoch 18, batch 8600, loss[loss=0.2333, simple_loss=0.2779, pruned_loss=0.09434, over 13077.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2837, pruned_loss=0.08613, over 2574372.53 frames. ], batch size: 121, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:12:24,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=331085.3333333333, ans=0.125
2024-06-21 06:12:58,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=331122.0, ans=0.0
2024-06-21 06:13:01,501 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:13:22,178 INFO [train.py:1028] (1/2) Epoch 18, batch 8650, loss[loss=0.2167, simple_loss=0.2659, pruned_loss=0.08374, over 12985.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2835, pruned_loss=0.08562, over 2576230.80 frames. ], batch size: 102, lr: 3.20e-03, grad_scale: 32.0
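For plotting, the loss trajectory can be recovered from the "Epoch 18, batch N, ..." entries with a small parser (a throwaway script keyed to the exact log format above, not part of icefall):

    import re

    BATCH_RE = re.compile(
        r"Epoch (\d+), batch (\d+),.*?"
        r"tot_loss\[loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+)"
    )

    def parse_log(path):
        """Yield (epoch, batch, tot_loss, simple_loss, pruned_loss) tuples."""
        with open(path) as f:
            for line in f:
                m = BATCH_RE.search(line)
                if m:
                    epoch, batch = int(m.group(1)), int(m.group(2))
                    yield (epoch, batch,
                           *(float(m.group(i)) for i in (3, 4, 5)))

    # e.g. the batch 8600 entry above parses to (18, 8600, 0.228, 0.2837, 0.08613)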
2024-06-21 06:13:23,032 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.131e+02 2.273e+02 2.423e+02 2.858e+02, threshold=4.546e+02, percent-clipped=0.0
2024-06-21 06:13:41,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=331195.3333333333, ans=0.125
2024-06-21 06:13:52,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0
2024-06-21 06:14:03,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=331232.0, ans=0.0
2024-06-21 06:14:14,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=331250.3333333333, ans=10.0
2024-06-21 06:14:18,094 INFO [train.py:1028] (1/2) Epoch 18, batch 8700, loss[loss=0.2287, simple_loss=0.2855, pruned_loss=0.08596, over 13196.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2841, pruned_loss=0.08634, over 2572331.85 frames. ], batch size: 59, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:14:24,956 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=12.0
2024-06-21 06:14:31,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=331287.0, ans=0.2
2024-06-21 06:14:39,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=331305.3333333333, ans=0.025
2024-06-21 06:14:55,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=331342.0, ans=0.5
2024-06-21 06:14:56,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=331342.0, ans=0.0
2024-06-21 06:15:07,534 INFO [train.py:1028] (1/2) Epoch 18, batch 8750, loss[loss=0.2213, simple_loss=0.2746, pruned_loss=0.084, over 13080.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2839, pruned_loss=0.08623, over 2569294.71 frames. ], batch size: 121, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:15:08,311 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.187e+02 2.311e+02 2.588e+02 3.245e+02, threshold=4.622e+02, percent-clipped=0.0
2024-06-21 06:15:14,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=331360.3333333333, ans=0.125
2024-06-21 06:15:20,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331378.6666666667, ans=0.1
2024-06-21 06:15:23,594 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.87 vs. limit=10.0
2024-06-21 06:15:24,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=331378.6666666667, ans=0.1
2024-06-21 06:15:34,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=331397.0, ans=0.2
2024-06-21 06:16:07,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=331452.0, ans=0.05
2024-06-21 06:16:08,150 INFO [train.py:1028] (1/2) Epoch 18, batch 8800, loss[loss=0.228, simple_loss=0.2805, pruned_loss=0.0877, over 13190.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2841, pruned_loss=0.0863, over 2573740.95 frames. ], batch size: 72, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:16:14,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331452.0, ans=0.1
2024-06-21 06:16:52,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=331525.3333333333, ans=0.2
2024-06-21 06:16:53,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0
2024-06-21 06:16:56,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5
2024-06-21 06:17:02,595 INFO [train.py:1028] (1/2) Epoch 18, batch 8850, loss[loss=0.2415, simple_loss=0.2915, pruned_loss=0.09576, over 12563.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2847, pruned_loss=0.08695, over 2564361.87 frames. ], batch size: 202, lr: 3.20e-03, grad_scale: 16.0
2024-06-21 06:17:04,370 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.169e+02 2.295e+02 2.512e+02 3.208e+02, threshold=4.590e+02, percent-clipped=0.0
2024-06-21 06:17:13,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=331562.0, ans=0.125
2024-06-21 06:17:29,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.07 vs. limit=10.0
2024-06-21 06:17:56,389 INFO [train.py:1028] (1/2) Epoch 18, batch 8900, loss[loss=0.2218, simple_loss=0.2767, pruned_loss=0.08351, over 13037.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2852, pruned_loss=0.0873, over 2562300.71 frames. ], batch size: 33, lr: 3.20e-03, grad_scale: 16.0
2024-06-21 06:18:24,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.26 vs. limit=15.0
2024-06-21 06:18:55,805 INFO [train.py:1028] (1/2) Epoch 18, batch 8950, loss[loss=0.2496, simple_loss=0.3009, pruned_loss=0.09912, over 12484.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2855, pruned_loss=0.08721, over 2562463.67 frames. ], batch size: 202, lr: 3.20e-03, grad_scale: 16.0
2024-06-21 06:18:57,559 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.172e+02 2.329e+02 2.469e+02 3.325e+02, threshold=4.658e+02, percent-clipped=0.0
2024-06-21 06:19:03,352 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=8.56 vs. limit=12.0
2024-06-21 06:19:18,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.33 vs. limit=22.5
2024-06-21 06:19:27,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=331782.0, ans=0.0
2024-06-21 06:19:30,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=331782.0, ans=12.0
2024-06-21 06:19:47,156 INFO [train.py:1028] (1/2) Epoch 18, batch 9000, loss[loss=0.234, simple_loss=0.2955, pruned_loss=0.08632, over 13318.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2854, pruned_loss=0.08694, over 2568629.04 frames. ], batch size: 46, lr: 3.20e-03, grad_scale: 16.0
2024-06-21 06:19:47,157 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 06:19:55,011 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.5853, 4.4948, 5.1403, 4.8413], device='cuda:1')
2024-06-21 06:19:56,434 INFO [train.py:1060] (1/2) Epoch 18, validation: loss=0.187, simple_loss=0.2519, pruned_loss=0.06106, over 351949.00 frames.
2024-06-21 06:19:56,434 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 06:19:58,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=331818.6666666667, ans=0.1
2024-06-21 06:20:02,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=331818.6666666667, ans=0.0
2024-06-21 06:20:16,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=331837.0, ans=0.2
2024-06-21 06:20:28,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=331855.3333333333, ans=0.125
2024-06-21 06:20:31,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. limit=6.0
2024-06-21 06:20:36,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.17 vs. limit=15.0
2024-06-21 06:20:49,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=331892.0, ans=0.125
2024-06-21 06:20:55,064 INFO [train.py:1028] (1/2) Epoch 18, batch 9050, loss[loss=0.2084, simple_loss=0.2694, pruned_loss=0.07371, over 10566.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2861, pruned_loss=0.08719, over 2566138.26 frames. ], batch size: 16, lr: 3.20e-03, grad_scale: 16.0
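During the validation pass above, zipformer.py dumps attn_weights_entropy, one value per attention head of the named module. Entropy of the softmaxed attention weights is a standard collapse diagnostic: values near log(key_len) mean nearly uniform attention, values near 0 mean attention piled on a single key. A sketch of how such values can be computed (illustrative, not the zipformer.py source):

    import torch

    def attn_weights_entropy(attn: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
        """attn: (num_heads, query_len, key_len), rows already softmaxed.
        Returns one mean entropy per head, like the logged tensor values."""
        ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (num_heads, query_len)
        return ent.mean(dim=-1)

    attn = torch.softmax(torch.randn(4, 300, 300), dim=-1)
    print(attn_weights_entropy(attn))  # ~log(300) = 5.7 for near-uniform heads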
2024-06-21 06:20:56,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=331910.3333333333, ans=0.0
2024-06-21 06:20:57,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=331910.3333333333, ans=15.0
2024-06-21 06:20:57,383 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 2.123e+02 2.252e+02 2.459e+02 3.045e+02, threshold=4.503e+02, percent-clipped=0.0
2024-06-21 06:21:10,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=331928.6666666667, ans=0.125
2024-06-21 06:21:27,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.48 vs. limit=22.5
2024-06-21 06:21:30,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=331965.3333333333, ans=0.07
2024-06-21 06:21:40,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=331983.6666666667, ans=0.2
2024-06-21 06:21:45,616 INFO [train.py:1028] (1/2) Epoch 18, batch 9100, loss[loss=0.2332, simple_loss=0.2922, pruned_loss=0.08711, over 13066.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2856, pruned_loss=0.0867, over 2567441.10 frames. ], batch size: 71, lr: 3.20e-03, grad_scale: 16.0
2024-06-21 06:21:50,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332002.0, ans=0.1
2024-06-21 06:22:04,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=332038.6666666667, ans=0.04949747468305833
2024-06-21 06:22:11,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=332038.6666666667, ans=0.5
2024-06-21 06:22:13,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=332057.0, ans=0.0
2024-06-21 06:22:13,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=332057.0, ans=0.125
2024-06-21 06:22:13,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0
2024-06-21 06:22:22,491 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.35 vs. limit=22.5
2024-06-21 06:22:28,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=332075.3333333333, ans=0.0
2024-06-21 06:22:29,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.74 vs. limit=15.0
2024-06-21 06:22:32,019 INFO [train.py:1028] (1/2) Epoch 18, batch 9150, loss[loss=0.2156, simple_loss=0.2758, pruned_loss=0.07776, over 13152.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2857, pruned_loss=0.08664, over 2568860.80 frames. ], batch size: 77, lr: 3.20e-03, grad_scale: 16.0
2024-06-21 06:22:34,073 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.104e+02 2.273e+02 2.392e+02 2.929e+02, threshold=4.545e+02, percent-clipped=0.0
2024-06-21 06:22:44,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=332112.0, ans=0.5
2024-06-21 06:22:45,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=332112.0, ans=0.2
2024-06-21 06:22:52,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=332130.3333333333, ans=0.0
2024-06-21 06:22:57,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332130.3333333333, ans=0.1
2024-06-21 06:23:03,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=332148.6666666667, ans=0.125
2024-06-21 06:23:08,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=332148.6666666667, ans=0.0
2024-06-21 06:23:18,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=332167.0, ans=0.125
2024-06-21 06:23:21,922 INFO [train.py:1028] (1/2) Epoch 18, batch 9200, loss[loss=0.2181, simple_loss=0.28, pruned_loss=0.07808, over 12941.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2853, pruned_loss=0.08609, over 2571993.98 frames. ], batch size: 36, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:23:35,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=332203.6666666667, ans=0.0
2024-06-21 06:23:39,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=332203.6666666667, ans=0.0
2024-06-21 06:23:39,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=332203.6666666667, ans=0.025
2024-06-21 06:23:48,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=332222.0, ans=0.0
2024-06-21 06:24:10,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=332258.6666666667, ans=0.2
2024-06-21 06:24:12,003 INFO [train.py:1028] (1/2) Epoch 18, batch 9250, loss[loss=0.2217, simple_loss=0.2767, pruned_loss=0.08339, over 13230.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.285, pruned_loss=0.08602, over 2572920.57 frames. ], batch size: 67, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:24:14,194 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.076e+02 2.234e+02 2.439e+02 3.603e+02, threshold=4.468e+02, percent-clipped=0.0
2024-06-21 06:24:28,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=332295.3333333333, ans=0.0
2024-06-21 06:24:29,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0
2024-06-21 06:24:30,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=332295.3333333333, ans=0.2
2024-06-21 06:24:43,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=332332.0, ans=0.0
2024-06-21 06:24:59,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=332368.6666666667, ans=0.125
2024-06-21 06:24:59,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332368.6666666667, ans=0.1
2024-06-21 06:24:59,887 INFO [train.py:1028] (1/2) Epoch 18, batch 9300, loss[loss=0.2139, simple_loss=0.2698, pruned_loss=0.07898, over 12989.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2856, pruned_loss=0.08621, over 2568757.40 frames. ], batch size: 39, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:25:05,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0
2024-06-21 06:25:10,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=332387.0, ans=0.015
2024-06-21 06:25:11,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0
2024-06-21 06:25:13,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332387.0, ans=0.1
2024-06-21 06:25:28,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.50 vs. limit=22.5
2024-06-21 06:25:33,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=332423.6666666667, ans=0.125
2024-06-21 06:25:36,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=332423.6666666667, ans=0.125
2024-06-21 06:25:38,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=332442.0, ans=0.025
2024-06-21 06:25:40,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0
2024-06-21 06:25:44,231 INFO [train.py:1028] (1/2) Epoch 18, batch 9350, loss[loss=0.2124, simple_loss=0.2712, pruned_loss=0.07681, over 12601.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2851, pruned_loss=0.08639, over 2567770.75 frames. ], batch size: 22, lr: 3.19e-03, grad_scale: 32.0
2024-06-21 06:25:45,388 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.166e+02 2.296e+02 2.538e+02 3.596e+02, threshold=4.591e+02, percent-clipped=0.0
2024-06-21 06:25:53,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=332478.6666666667, ans=0.125
2024-06-21 06:25:54,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=332478.6666666667, ans=0.125
2024-06-21 06:25:54,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=332478.6666666667, ans=0.125
2024-06-21 06:25:57,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=332478.6666666667, ans=0.2
2024-06-21 06:26:00,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=332497.0, ans=0.025
2024-06-21 06:26:01,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.34 vs. limit=12.0
2024-06-21 06:26:06,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=332515.3333333333, ans=0.125
2024-06-21 06:26:12,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=332533.6666666667, ans=0.125
2024-06-21 06:26:13,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=332533.6666666667, ans=0.125
2024-06-21 06:26:13,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=332533.6666666667, ans=0.0
2024-06-21 06:26:14,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=332533.6666666667, ans=0.2
2024-06-21 06:26:16,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.79 vs. limit=22.5
2024-06-21 06:26:16,811 INFO [train.py:1028] (1/2) Epoch 18, batch 9400, loss[loss=0.216, simple_loss=0.2824, pruned_loss=0.0748, over 13259.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2852, pruned_loss=0.08654, over 2567436.65 frames. ], batch size: 52, lr: 3.19e-03, grad_scale: 32.0
2024-06-21 06:26:24,303 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.99 vs. limit=15.0
2024-06-21 06:26:28,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.61 vs. limit=15.0
2024-06-21 06:26:29,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332588.6666666667, ans=0.1
2024-06-21 06:26:29,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=332588.6666666667, ans=0.2
2024-06-21 06:26:30,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=332588.6666666667, ans=0.125
2024-06-21 06:26:33,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=332588.6666666667, ans=0.125
2024-06-21 06:26:38,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332607.0, ans=0.1
2024-06-21 06:26:43,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.93 vs. limit=22.5
2024-06-21 06:26:46,634 INFO [train.py:1028] (1/2) Epoch 18, batch 9450, loss[loss=0.2384, simple_loss=0.2953, pruned_loss=0.09078, over 12555.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2857, pruned_loss=0.08687, over 2567423.72 frames. ], batch size: 22, lr: 3.19e-03, grad_scale: 32.0
2024-06-21 06:26:47,791 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.113e+02 2.300e+02 2.541e+02 3.122e+02, threshold=4.599e+02, percent-clipped=0.0
2024-06-21 06:26:48,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=332643.6666666667, ans=0.025
2024-06-21 06:27:03,486 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.90 vs. limit=15.0
2024-06-21 06:27:05,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=332698.6666666667, ans=0.0
2024-06-21 06:27:06,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=332698.6666666667, ans=0.125
2024-06-21 06:27:11,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=332717.0, ans=0.125
2024-06-21 06:27:16,060 INFO [train.py:1028] (1/2) Epoch 18, batch 9500, loss[loss=0.2351, simple_loss=0.2928, pruned_loss=0.08864, over 13244.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2853, pruned_loss=0.08613, over 2576684.99 frames. ], batch size: 43, lr: 3.19e-03, grad_scale: 32.0
2024-06-21 06:27:19,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=332735.3333333333, ans=0.0
2024-06-21 06:27:25,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=332753.6666666667, ans=0.125
2024-06-21 06:27:32,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.67 vs. limit=15.0
2024-06-21 06:27:32,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.38 vs.
limit=6.0 2024-06-21 06:27:48,515 INFO [train.py:1028] (1/2) Epoch 18, batch 9550, loss[loss=0.2048, simple_loss=0.2651, pruned_loss=0.0722, over 12962.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2853, pruned_loss=0.08621, over 2571424.16 frames. ], batch size: 39, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:27:49,895 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.114e+02 2.250e+02 2.517e+02 3.191e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-21 06:27:50,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=332827.0, ans=0.125 2024-06-21 06:27:52,398 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.63 vs. limit=15.0 2024-06-21 06:28:05,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=332863.6666666667, ans=0.0 2024-06-21 06:28:06,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. limit=15.0 2024-06-21 06:28:10,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=332882.0, ans=0.125 2024-06-21 06:28:11,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=332882.0, ans=0.125 2024-06-21 06:28:12,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332900.3333333333, ans=0.1 2024-06-21 06:28:13,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332900.3333333333, ans=0.1 2024-06-21 06:28:14,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=332900.3333333333, ans=0.125 2024-06-21 06:28:21,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=332918.6666666667, ans=0.125 2024-06-21 06:28:22,228 INFO [train.py:1028] (1/2) Epoch 18, batch 9600, loss[loss=0.2511, simple_loss=0.294, pruned_loss=0.1041, over 10350.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2846, pruned_loss=0.0857, over 2570158.45 frames. ], batch size: 303, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:28:26,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=332918.6666666667, ans=0.0 2024-06-21 06:28:30,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.85 vs. 
limit=15.0 2024-06-21 06:28:33,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=332937.0, ans=0.025 2024-06-21 06:28:38,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=332955.3333333333, ans=0.0 2024-06-21 06:28:39,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=332955.3333333333, ans=0.1 2024-06-21 06:28:53,202 INFO [train.py:1028] (1/2) Epoch 18, batch 9650, loss[loss=0.2193, simple_loss=0.2701, pruned_loss=0.08424, over 13111.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2849, pruned_loss=0.08641, over 2560012.77 frames. ], batch size: 132, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:28:54,394 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.136e+02 2.290e+02 2.620e+02 3.254e+02, threshold=4.580e+02, percent-clipped=0.0 2024-06-21 06:28:58,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=12.0 2024-06-21 06:29:02,508 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.437e-01 2024-06-21 06:29:12,145 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.70 vs. limit=22.5 2024-06-21 06:29:24,153 INFO [train.py:1028] (1/2) Epoch 18, batch 9700, loss[loss=0.2154, simple_loss=0.2663, pruned_loss=0.08226, over 13047.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2839, pruned_loss=0.08625, over 2554448.47 frames. ], batch size: 144, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:29:26,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=333102.0, ans=0.025 2024-06-21 06:29:29,966 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:29:31,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=333120.3333333333, ans=0.2 2024-06-21 06:29:38,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=333138.6666666667, ans=0.0 2024-06-21 06:29:43,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=333157.0, ans=0.2 2024-06-21 06:29:46,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=333157.0, ans=0.025 2024-06-21 06:29:55,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=333193.6666666667, ans=0.125 2024-06-21 06:29:55,659 INFO [train.py:1028] (1/2) Epoch 18, batch 9750, loss[loss=0.221, simple_loss=0.2695, pruned_loss=0.08622, over 13094.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.283, pruned_loss=0.08598, over 2550515.25 frames. 
], batch size: 132, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:29:55,743 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:29:55,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=333193.6666666667, ans=0.125 2024-06-21 06:29:56,847 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 2.126e+02 2.250e+02 2.401e+02 3.043e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-21 06:30:07,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=333212.0, ans=15.0 2024-06-21 06:30:25,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=333267.0, ans=0.025 2024-06-21 06:30:29,205 INFO [train.py:1028] (1/2) Epoch 18, batch 9800, loss[loss=0.2176, simple_loss=0.2738, pruned_loss=0.0807, over 12962.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2826, pruned_loss=0.08562, over 2544974.71 frames. ], batch size: 39, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:30:35,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=333303.6666666667, ans=0.025 2024-06-21 06:30:43,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.55 vs. limit=15.0 2024-06-21 06:30:45,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=333322.0, ans=12.0 2024-06-21 06:30:49,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=333340.3333333333, ans=0.09899494936611666 2024-06-21 06:30:56,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=333358.6666666667, ans=0.025 2024-06-21 06:30:59,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.81 vs. limit=6.0 2024-06-21 06:30:59,845 INFO [train.py:1028] (1/2) Epoch 18, batch 9850, loss[loss=0.2417, simple_loss=0.2899, pruned_loss=0.09677, over 13024.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2822, pruned_loss=0.08559, over 2537059.92 frames. ], batch size: 102, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:31:01,038 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.091e+02 2.238e+02 2.447e+02 3.018e+02, threshold=4.477e+02, percent-clipped=0.0 2024-06-21 06:31:15,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=333413.6666666667, ans=0.125 2024-06-21 06:31:17,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=333413.6666666667, ans=0.2 2024-06-21 06:31:30,989 INFO [train.py:1028] (1/2) Epoch 18, batch 9900, loss[loss=0.2121, simple_loss=0.2726, pruned_loss=0.07583, over 13012.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.282, pruned_loss=0.08579, over 2529371.55 frames. 
], batch size: 39, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:31:35,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=333468.6666666667, ans=0.1 2024-06-21 06:31:43,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=333487.0, ans=0.125 2024-06-21 06:31:50,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=333505.3333333333, ans=0.125 2024-06-21 06:32:03,687 INFO [train.py:1028] (1/2) Epoch 18, batch 9950, loss[loss=0.2266, simple_loss=0.2831, pruned_loss=0.08505, over 12663.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2802, pruned_loss=0.08515, over 2522066.84 frames. ], batch size: 29, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:32:04,990 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.051e+02 2.279e+02 2.503e+02 4.129e+02, threshold=4.558e+02, percent-clipped=0.0 2024-06-21 06:32:15,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=333578.6666666667, ans=0.1 2024-06-21 06:32:16,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=333597.0, ans=0.2 2024-06-21 06:32:18,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=333597.0, ans=0.0 2024-06-21 06:32:18,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=333597.0, ans=0.125 2024-06-21 06:32:23,886 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:32:34,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333633.6666666667, ans=0.1 2024-06-21 06:32:36,006 INFO [train.py:1028] (1/2) Epoch 18, batch 10000, loss[loss=0.2191, simple_loss=0.2782, pruned_loss=0.08007, over 12434.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2808, pruned_loss=0.08574, over 2485270.80 frames. ], batch size: 22, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:32:46,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=333670.3333333333, ans=10.0 2024-06-21 06:33:07,002 INFO [train.py:1028] (1/2) Epoch 18, batch 10050, loss[loss=0.2302, simple_loss=0.2848, pruned_loss=0.08781, over 12501.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2812, pruned_loss=0.08646, over 2443678.89 frames. ], batch size: 22, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:33:08,661 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.138e+02 2.350e+02 2.532e+02 3.465e+02, threshold=4.701e+02, percent-clipped=0.0 2024-06-21 06:33:11,352 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.88 vs. 
limit=15.0 2024-06-21 06:33:19,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=333780.3333333333, ans=0.125 2024-06-21 06:33:25,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=333798.6666666667, ans=0.125 2024-06-21 06:33:36,456 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.35 vs. limit=15.0 2024-06-21 06:33:37,376 INFO [train.py:1028] (1/2) Epoch 18, batch 10100, loss[loss=0.2172, simple_loss=0.2768, pruned_loss=0.07886, over 12945.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2808, pruned_loss=0.08595, over 2428237.60 frames. ], batch size: 20, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:33:41,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=333835.3333333333, ans=0.025 2024-06-21 06:35:51,145 INFO [train.py:1028] (1/2) Epoch 19, batch 0, loss[loss=0.1832, simple_loss=0.2407, pruned_loss=0.06287, over 12892.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.2407, pruned_loss=0.06287, over 12892.00 frames. ], batch size: 36, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:35:51,146 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 06:35:54,959 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.9908, 2.7257, 1.8307, 2.7477], device='cuda:1') 2024-06-21 06:35:58,385 INFO [train.py:1060] (1/2) Epoch 19, validation: loss=0.1875, simple_loss=0.2524, pruned_loss=0.06132, over 351949.00 frames. 2024-06-21 06:35:58,386 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 06:36:02,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.18 vs. limit=15.0 2024-06-21 06:36:07,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.65 vs. limit=15.0 2024-06-21 06:36:21,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=333919.6666666667, ans=0.125 2024-06-21 06:36:23,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333919.6666666667, ans=0.1 2024-06-21 06:36:23,540 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.030e+02 2.223e+02 2.404e+02 3.272e+02, threshold=4.447e+02, percent-clipped=0.0 2024-06-21 06:36:32,153 INFO [train.py:1028] (1/2) Epoch 19, batch 50, loss[loss=0.215, simple_loss=0.2703, pruned_loss=0.07991, over 13035.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2645, pruned_loss=0.07952, over 574346.60 frames. 
], batch size: 30, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:36:36,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=333956.3333333333, ans=0.125 2024-06-21 06:36:41,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=333974.6666666667, ans=0.125 2024-06-21 06:36:51,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=334011.3333333333, ans=0.125 2024-06-21 06:36:57,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.90 vs. limit=12.0 2024-06-21 06:37:03,809 INFO [train.py:1028] (1/2) Epoch 19, batch 100, loss[loss=0.1819, simple_loss=0.2458, pruned_loss=0.05899, over 13255.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2607, pruned_loss=0.07711, over 1017933.01 frames. ], batch size: 46, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:37:07,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=334048.0, ans=0.125 2024-06-21 06:37:08,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=334048.0, ans=0.125 2024-06-21 06:37:14,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=334066.3333333333, ans=0.125 2024-06-21 06:37:14,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=334066.3333333333, ans=0.125 2024-06-21 06:37:21,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=334103.0, ans=0.025 2024-06-21 06:37:22,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=334103.0, ans=0.125 2024-06-21 06:37:33,006 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.56 vs. limit=6.0 2024-06-21 06:37:33,145 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.005e+02 2.167e+02 2.304e+02 2.832e+02, threshold=4.335e+02, percent-clipped=0.0 2024-06-21 06:37:41,318 INFO [train.py:1028] (1/2) Epoch 19, batch 150, loss[loss=0.1838, simple_loss=0.2413, pruned_loss=0.06317, over 12681.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2601, pruned_loss=0.07629, over 1365550.75 frames. 
], batch size: 29, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:37:44,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=334139.6666666667, ans=0.2 2024-06-21 06:37:45,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=334139.6666666667, ans=0.125 2024-06-21 06:37:56,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334176.3333333333, ans=0.1 2024-06-21 06:38:02,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=334194.6666666667, ans=0.125 2024-06-21 06:38:04,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.14 vs. limit=22.5 2024-06-21 06:38:06,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=334194.6666666667, ans=0.09899494936611666 2024-06-21 06:38:08,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.34 vs. limit=22.5 2024-06-21 06:38:11,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=334213.0, ans=0.0 2024-06-21 06:38:12,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=334213.0, ans=0.2 2024-06-21 06:38:13,813 INFO [train.py:1028] (1/2) Epoch 19, batch 200, loss[loss=0.2376, simple_loss=0.283, pruned_loss=0.09612, over 12562.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2607, pruned_loss=0.0768, over 1635443.88 frames. ], batch size: 202, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:38:15,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=334231.3333333333, ans=0.0 2024-06-21 06:38:29,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334268.0, ans=0.1 2024-06-21 06:38:30,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=334268.0, ans=0.1 2024-06-21 06:38:31,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=334268.0, ans=0.09899494936611666 2024-06-21 06:38:34,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=334286.3333333333, ans=0.0 2024-06-21 06:38:34,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.18 vs. limit=15.0 2024-06-21 06:38:37,490 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.986e+02 2.104e+02 2.298e+02 3.037e+02, threshold=4.208e+02, percent-clipped=0.0 2024-06-21 06:38:44,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=334304.6666666667, ans=0.04949747468305833 2024-06-21 06:38:45,953 INFO [train.py:1028] (1/2) Epoch 19, batch 250, loss[loss=0.1986, simple_loss=0.2418, pruned_loss=0.07768, over 13019.00 frames. 
], tot_loss[loss=0.2077, simple_loss=0.2611, pruned_loss=0.07714, over 1847861.37 frames. ], batch size: 144, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:38:46,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=334323.0, ans=0.125 2024-06-21 06:38:49,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=334323.0, ans=0.0 2024-06-21 06:38:50,965 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.55 vs. limit=15.0 2024-06-21 06:39:03,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=334359.6666666667, ans=0.0 2024-06-21 06:39:11,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=334396.3333333333, ans=0.125 2024-06-21 06:39:12,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=334396.3333333333, ans=0.0 2024-06-21 06:39:18,495 INFO [train.py:1028] (1/2) Epoch 19, batch 300, loss[loss=0.2124, simple_loss=0.2544, pruned_loss=0.0852, over 13169.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2607, pruned_loss=0.07671, over 2010942.00 frames. ], batch size: 112, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:39:25,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334414.6666666667, ans=0.1 2024-06-21 06:39:34,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=334433.0, ans=0.0 2024-06-21 06:39:39,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=334451.3333333333, ans=0.0 2024-06-21 06:39:48,270 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 1.980e+02 2.106e+02 2.270e+02 3.333e+02, threshold=4.211e+02, percent-clipped=0.0 2024-06-21 06:39:53,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=334488.0, ans=0.0 2024-06-21 06:39:56,124 INFO [train.py:1028] (1/2) Epoch 19, batch 350, loss[loss=0.2141, simple_loss=0.2652, pruned_loss=0.08146, over 12961.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2612, pruned_loss=0.07717, over 2140200.22 frames. ], batch size: 33, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:40:02,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.85 vs. 
limit=22.5 2024-06-21 06:40:06,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=334524.6666666667, ans=0.1 2024-06-21 06:40:07,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=334524.6666666667, ans=0.125 2024-06-21 06:40:10,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=334543.0, ans=0.125 2024-06-21 06:40:15,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=334561.3333333333, ans=0.2 2024-06-21 06:40:16,149 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:40:25,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=334579.6666666667, ans=0.05 2024-06-21 06:40:27,307 INFO [train.py:1028] (1/2) Epoch 19, batch 400, loss[loss=0.1988, simple_loss=0.2574, pruned_loss=0.07008, over 13267.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2602, pruned_loss=0.07631, over 2240000.31 frames. ], batch size: 63, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:40:29,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=15.0 2024-06-21 06:40:29,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=334598.0, ans=0.125 2024-06-21 06:40:44,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=334634.6666666667, ans=0.125 2024-06-21 06:40:51,163 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.984e+02 2.110e+02 2.297e+02 2.944e+02, threshold=4.220e+02, percent-clipped=0.0 2024-06-21 06:40:52,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0 2024-06-21 06:40:52,946 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.51 vs. limit=12.0 2024-06-21 06:40:53,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=334671.3333333333, ans=0.125 2024-06-21 06:40:55,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=334671.3333333333, ans=0.0 2024-06-21 06:40:59,048 INFO [train.py:1028] (1/2) Epoch 19, batch 450, loss[loss=0.2098, simple_loss=0.2648, pruned_loss=0.0774, over 13207.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.26, pruned_loss=0.07613, over 2314511.52 frames. 
], batch size: 67, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:41:03,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=334689.6666666667, ans=0.125 2024-06-21 06:41:03,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=334689.6666666667, ans=0.125 2024-06-21 06:41:04,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=334689.6666666667, ans=0.125 2024-06-21 06:41:08,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=334708.0, ans=0.0 2024-06-21 06:41:09,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=334708.0, ans=0.125 2024-06-21 06:41:14,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=334726.3333333333, ans=0.0 2024-06-21 06:41:15,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=334726.3333333333, ans=0.125 2024-06-21 06:41:21,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334744.6666666667, ans=0.1 2024-06-21 06:41:22,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=12.0 2024-06-21 06:41:29,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=334763.0, ans=0.125 2024-06-21 06:41:36,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=334763.0, ans=0.125 2024-06-21 06:41:38,362 INFO [train.py:1028] (1/2) Epoch 19, batch 500, loss[loss=0.2138, simple_loss=0.2589, pruned_loss=0.08435, over 13169.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2604, pruned_loss=0.07632, over 2376859.64 frames. ], batch size: 121, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:41:43,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=334781.3333333333, ans=0.125 2024-06-21 06:41:52,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=334818.0, ans=0.125 2024-06-21 06:41:57,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=334836.3333333333, ans=0.0 2024-06-21 06:41:58,833 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.93 vs. limit=15.0 2024-06-21 06:42:00,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=334836.3333333333, ans=0.2 2024-06-21 06:42:00,943 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.70 vs. limit=15.0 2024-06-21 06:42:02,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.18 vs. 
limit=12.0 2024-06-21 06:42:03,031 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 1.947e+02 2.048e+02 2.204e+02 3.030e+02, threshold=4.095e+02, percent-clipped=0.0 2024-06-21 06:42:03,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=334854.6666666667, ans=0.125 2024-06-21 06:42:05,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=334854.6666666667, ans=0.125 2024-06-21 06:42:05,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=334854.6666666667, ans=22.5 2024-06-21 06:42:05,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=334854.6666666667, ans=0.125 2024-06-21 06:42:09,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334854.6666666667, ans=0.1 2024-06-21 06:42:10,525 INFO [train.py:1028] (1/2) Epoch 19, batch 550, loss[loss=0.2107, simple_loss=0.2636, pruned_loss=0.07887, over 12965.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2608, pruned_loss=0.07656, over 2422342.40 frames. ], batch size: 158, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:42:12,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=334873.0, ans=0.0 2024-06-21 06:42:21,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=334891.3333333333, ans=0.0 2024-06-21 06:42:24,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=334909.6666666667, ans=0.2 2024-06-21 06:42:27,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=334909.6666666667, ans=0.07 2024-06-21 06:42:32,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=334928.0, ans=0.1 2024-06-21 06:42:34,181 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-21 06:42:36,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=334946.3333333333, ans=0.025 2024-06-21 06:42:42,229 INFO [train.py:1028] (1/2) Epoch 19, batch 600, loss[loss=0.1832, simple_loss=0.2325, pruned_loss=0.06695, over 13016.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2613, pruned_loss=0.07638, over 2459157.91 frames. ], batch size: 144, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:42:51,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=334983.0, ans=0.0 2024-06-21 06:43:01,407 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.50 vs. limit=12.0 2024-06-21 06:43:03,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.78 vs. 
limit=15.0 2024-06-21 06:43:06,742 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 2.003e+02 2.149e+02 2.414e+02 2.992e+02, threshold=4.298e+02, percent-clipped=0.0 2024-06-21 06:43:08,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335038.0, ans=0.1 2024-06-21 06:43:09,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=335038.0, ans=0.2 2024-06-21 06:43:14,362 INFO [train.py:1028] (1/2) Epoch 19, batch 650, loss[loss=0.1981, simple_loss=0.2585, pruned_loss=0.06885, over 13258.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2612, pruned_loss=0.07624, over 2490310.18 frames. ], batch size: 59, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:43:19,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=335056.3333333333, ans=0.0 2024-06-21 06:43:32,434 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.93 vs. limit=6.0 2024-06-21 06:43:42,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=335111.3333333333, ans=0.125 2024-06-21 06:43:43,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=335111.3333333333, ans=0.125 2024-06-21 06:43:46,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=335129.6666666667, ans=0.125 2024-06-21 06:43:47,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=335129.6666666667, ans=0.1 2024-06-21 06:43:51,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335148.0, ans=0.1 2024-06-21 06:43:52,001 INFO [train.py:1028] (1/2) Epoch 19, batch 700, loss[loss=0.2089, simple_loss=0.2631, pruned_loss=0.07736, over 13253.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.261, pruned_loss=0.07656, over 2511690.29 frames. 
], batch size: 46, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:43:57,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=335148.0, ans=0.95 2024-06-21 06:44:06,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=335184.6666666667, ans=0.025 2024-06-21 06:44:10,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335203.0, ans=0.1 2024-06-21 06:44:14,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=335203.0, ans=0.0 2024-06-21 06:44:16,012 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 1.968e+02 2.123e+02 2.241e+02 3.000e+02, threshold=4.247e+02, percent-clipped=0.0 2024-06-21 06:44:20,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=335221.3333333333, ans=0.2 2024-06-21 06:44:23,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=335239.6666666667, ans=0.0 2024-06-21 06:44:23,507 INFO [train.py:1028] (1/2) Epoch 19, batch 750, loss[loss=0.2029, simple_loss=0.2656, pruned_loss=0.07006, over 13242.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2611, pruned_loss=0.07626, over 2527490.86 frames. ], batch size: 63, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:44:26,294 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0 2024-06-21 06:44:41,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=335276.3333333333, ans=0.0 2024-06-21 06:44:55,691 INFO [train.py:1028] (1/2) Epoch 19, batch 800, loss[loss=0.1917, simple_loss=0.2526, pruned_loss=0.06544, over 12891.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2612, pruned_loss=0.07622, over 2540684.62 frames. ], batch size: 36, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:44:59,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=335331.3333333333, ans=0.025 2024-06-21 06:45:01,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.42 vs. limit=15.0 2024-06-21 06:45:02,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=335349.6666666667, ans=0.0 2024-06-21 06:45:12,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.56 vs. limit=15.0 2024-06-21 06:45:17,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=335386.3333333333, ans=0.0 2024-06-21 06:45:20,210 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 1.995e+02 2.111e+02 2.221e+02 4.184e+02, threshold=4.222e+02, percent-clipped=0.0 2024-06-21 06:45:28,363 INFO [train.py:1028] (1/2) Epoch 19, batch 850, loss[loss=0.1972, simple_loss=0.2536, pruned_loss=0.07039, over 13132.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2613, pruned_loss=0.07624, over 2550287.19 frames. 
], batch size: 95, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:45:29,968 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.61 vs. limit=15.0 2024-06-21 06:45:30,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=335423.0, ans=0.025 2024-06-21 06:45:41,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.95 vs. limit=6.0 2024-06-21 06:45:45,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=335441.3333333333, ans=0.0 2024-06-21 06:45:50,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=335459.6666666667, ans=0.2 2024-06-21 06:45:50,676 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.09 vs. limit=15.0 2024-06-21 06:45:52,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=335459.6666666667, ans=0.125 2024-06-21 06:45:52,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=335459.6666666667, ans=10.0 2024-06-21 06:45:55,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=335478.0, ans=0.0 2024-06-21 06:45:59,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=335478.0, ans=0.125 2024-06-21 06:46:02,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=335496.3333333333, ans=0.125 2024-06-21 06:46:02,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=335496.3333333333, ans=0.07 2024-06-21 06:46:02,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=335496.3333333333, ans=0.125 2024-06-21 06:46:02,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=335496.3333333333, ans=0.125 2024-06-21 06:46:03,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0 2024-06-21 06:46:08,734 INFO [train.py:1028] (1/2) Epoch 19, batch 900, loss[loss=0.2224, simple_loss=0.2804, pruned_loss=0.08214, over 12924.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2609, pruned_loss=0.07623, over 2555528.58 frames. ], batch size: 36, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:46:12,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=15.0 2024-06-21 06:46:21,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335551.3333333333, ans=0.1 2024-06-21 06:46:27,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=335569.6666666667, ans=0.09899494936611666 2024-06-21 06:46:31,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=335569.6666666667, ans=0.125 2024-06-21 06:46:33,165 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 1.987e+02 2.123e+02 2.258e+02 3.013e+02, threshold=4.245e+02, percent-clipped=0.0 2024-06-21 06:46:34,110 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2024-06-21 06:46:40,892 INFO [train.py:1028] (1/2) Epoch 19, batch 950, loss[loss=0.2207, simple_loss=0.2752, pruned_loss=0.08305, over 12915.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2601, pruned_loss=0.07589, over 2559418.41 frames. ], batch size: 39, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:46:49,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=335624.6666666667, ans=0.0 2024-06-21 06:46:54,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=335643.0, ans=0.025 2024-06-21 06:46:57,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=335643.0, ans=0.125 2024-06-21 06:47:04,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=335661.3333333333, ans=0.0 2024-06-21 06:47:09,846 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.84 vs. limit=15.0 2024-06-21 06:47:11,985 INFO [train.py:1028] (1/2) Epoch 19, batch 1000, loss[loss=0.2203, simple_loss=0.2767, pruned_loss=0.082, over 13261.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2594, pruned_loss=0.07591, over 2561644.40 frames. ], batch size: 49, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:47:16,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=335698.0, ans=0.07 2024-06-21 06:47:16,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=335698.0, ans=0.0 2024-06-21 06:47:38,928 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.047e+02 2.169e+02 2.393e+02 2.853e+02, threshold=4.339e+02, percent-clipped=0.0 2024-06-21 06:47:43,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335771.3333333333, ans=0.1 2024-06-21 06:47:49,338 INFO [train.py:1028] (1/2) Epoch 19, batch 1050, loss[loss=0.208, simple_loss=0.271, pruned_loss=0.07249, over 13160.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2598, pruned_loss=0.07588, over 2566190.61 frames. 
], batch size: 77, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:47:52,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=335789.6666666667, ans=0.125 2024-06-21 06:47:53,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=335789.6666666667, ans=0.125 2024-06-21 06:47:54,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=335789.6666666667, ans=0.125 2024-06-21 06:47:55,858 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.16 vs. limit=12.0 2024-06-21 06:48:00,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=335808.0, ans=0.025 2024-06-21 06:48:08,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=335844.6666666667, ans=0.125 2024-06-21 06:48:22,207 INFO [train.py:1028] (1/2) Epoch 19, batch 1100, loss[loss=0.2053, simple_loss=0.2618, pruned_loss=0.07435, over 13201.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2601, pruned_loss=0.07591, over 2570600.14 frames. ], batch size: 52, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:48:27,172 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.11 vs. limit=10.0 2024-06-21 06:48:34,804 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.83 vs. limit=15.0 2024-06-21 06:48:40,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=335918.0, ans=0.0 2024-06-21 06:48:47,357 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.973e+02 2.124e+02 2.300e+02 3.230e+02, threshold=4.247e+02, percent-clipped=0.0 2024-06-21 06:48:49,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=335954.6666666667, ans=0.0 2024-06-21 06:48:51,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=335954.6666666667, ans=0.125 2024-06-21 06:48:53,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.35 vs. limit=15.0 2024-06-21 06:48:55,222 INFO [train.py:1028] (1/2) Epoch 19, batch 1150, loss[loss=0.2159, simple_loss=0.2765, pruned_loss=0.07766, over 13281.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2608, pruned_loss=0.07632, over 2571340.36 frames. ], batch size: 52, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:48:56,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=335973.0, ans=0.2 2024-06-21 06:48:57,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. 
limit=6.0 2024-06-21 06:48:58,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=335973.0, ans=0.125 2024-06-21 06:49:19,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=336028.0, ans=0.125 2024-06-21 06:49:21,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=336046.3333333333, ans=0.0 2024-06-21 06:49:29,817 INFO [train.py:1028] (1/2) Epoch 19, batch 1200, loss[loss=0.2125, simple_loss=0.2727, pruned_loss=0.07609, over 13169.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2609, pruned_loss=0.07666, over 2574271.69 frames. ], batch size: 77, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:49:52,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=336119.6666666667, ans=0.0 2024-06-21 06:49:57,060 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.988e+02 2.130e+02 2.298e+02 2.759e+02, threshold=4.261e+02, percent-clipped=0.0 2024-06-21 06:50:04,646 INFO [train.py:1028] (1/2) Epoch 19, batch 1250, loss[loss=0.1977, simple_loss=0.2488, pruned_loss=0.07332, over 13157.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2607, pruned_loss=0.07647, over 2583572.11 frames. ], batch size: 112, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:50:21,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=336193.0, ans=0.2 2024-06-21 06:50:27,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.54 vs. limit=12.0 2024-06-21 06:50:36,381 INFO [train.py:1028] (1/2) Epoch 19, batch 1300, loss[loss=0.2223, simple_loss=0.269, pruned_loss=0.0878, over 12829.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2615, pruned_loss=0.07674, over 2584463.01 frames. 
], batch size: 177, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:50:37,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=336248.0, ans=0.0 2024-06-21 06:50:40,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=336248.0, ans=0.0 2024-06-21 06:50:46,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=336266.3333333333, ans=0.0 2024-06-21 06:50:52,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=336284.6666666667, ans=0.1 2024-06-21 06:50:57,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=336303.0, ans=0.0 2024-06-21 06:51:00,776 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.032e+02 2.117e+02 2.336e+02 3.053e+02, threshold=4.234e+02, percent-clipped=0.0 2024-06-21 06:51:04,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=336321.3333333333, ans=0.125 2024-06-21 06:51:05,826 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:51:08,755 INFO [train.py:1028] (1/2) Epoch 19, batch 1350, loss[loss=0.2026, simple_loss=0.2623, pruned_loss=0.07141, over 13169.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2618, pruned_loss=0.07684, over 2586298.93 frames. ], batch size: 59, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:51:35,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=336394.6666666667, ans=0.025 2024-06-21 06:51:46,835 INFO [train.py:1028] (1/2) Epoch 19, batch 1400, loss[loss=0.21, simple_loss=0.2653, pruned_loss=0.07739, over 12543.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2619, pruned_loss=0.07695, over 2588007.40 frames. ], batch size: 25, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:52:04,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=336468.0, ans=0.1 2024-06-21 06:52:10,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=336486.3333333333, ans=0.05 2024-06-21 06:52:11,271 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 1.985e+02 2.096e+02 2.205e+02 3.142e+02, threshold=4.192e+02, percent-clipped=0.0 2024-06-21 06:52:12,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=336504.6666666667, ans=0.0 2024-06-21 06:52:14,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=336504.6666666667, ans=0.125 2024-06-21 06:52:16,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=336504.6666666667, ans=0.125 2024-06-21 06:52:19,095 INFO [train.py:1028] (1/2) Epoch 19, batch 1450, loss[loss=0.1998, simple_loss=0.2482, pruned_loss=0.07568, over 13115.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2616, pruned_loss=0.07689, over 2587713.81 frames. 
], batch size: 121, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:52:30,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2024-06-21 06:52:33,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=336559.6666666667, ans=0.125 2024-06-21 06:52:39,987 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:52:42,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=336578.0, ans=0.1 2024-06-21 06:52:51,331 INFO [train.py:1028] (1/2) Epoch 19, batch 1500, loss[loss=0.2309, simple_loss=0.2792, pruned_loss=0.0913, over 13229.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2619, pruned_loss=0.07722, over 2589963.72 frames. ], batch size: 83, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:52:55,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=336614.6666666667, ans=0.2 2024-06-21 06:53:02,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=336633.0, ans=0.125 2024-06-21 06:53:11,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.62 vs. limit=10.0 2024-06-21 06:53:12,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2024-06-21 06:53:15,408 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.027e+02 2.146e+02 2.360e+02 2.942e+02, threshold=4.292e+02, percent-clipped=0.0 2024-06-21 06:53:20,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=336688.0, ans=0.0 2024-06-21 06:53:26,726 INFO [train.py:1028] (1/2) Epoch 19, batch 1550, loss[loss=0.2227, simple_loss=0.273, pruned_loss=0.08617, over 13035.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2619, pruned_loss=0.07728, over 2585047.85 frames. ], batch size: 102, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:53:26,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=336706.3333333333, ans=0.0 2024-06-21 06:53:28,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=336706.3333333333, ans=0.0 2024-06-21 06:53:31,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=336706.3333333333, ans=0.2 2024-06-21 06:53:33,290 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.70 vs. 
limit=15.0 2024-06-21 06:53:38,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=336724.6666666667, ans=0.125 2024-06-21 06:53:42,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=336743.0, ans=0.0 2024-06-21 06:53:48,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=336743.0, ans=0.1 2024-06-21 06:53:49,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.97 vs. limit=15.0 2024-06-21 06:53:50,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=336761.3333333333, ans=0.125 2024-06-21 06:53:56,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2024-06-21 06:54:01,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=336779.6666666667, ans=0.0 2024-06-21 06:54:02,585 INFO [train.py:1028] (1/2) Epoch 19, batch 1600, loss[loss=0.2046, simple_loss=0.2587, pruned_loss=0.07524, over 13183.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2621, pruned_loss=0.07724, over 2580232.89 frames. ], batch size: 77, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:54:06,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=336798.0, ans=0.0 2024-06-21 06:54:10,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=336816.3333333333, ans=0.125 2024-06-21 06:54:21,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=336853.0, ans=0.125 2024-06-21 06:54:22,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=336853.0, ans=0.0 2024-06-21 06:54:26,413 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.072e+02 2.212e+02 2.413e+02 3.055e+02, threshold=4.423e+02, percent-clipped=0.0 2024-06-21 06:54:28,652 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.71 vs. limit=15.0 2024-06-21 06:54:34,159 INFO [train.py:1028] (1/2) Epoch 19, batch 1650, loss[loss=0.206, simple_loss=0.2514, pruned_loss=0.08027, over 13134.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.262, pruned_loss=0.07742, over 2575591.74 frames. ], batch size: 95, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:54:36,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=336889.6666666667, ans=0.125 2024-06-21 06:54:40,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=336908.0, ans=0.025 2024-06-21 06:54:45,631 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.64 vs. 
limit=5.0 2024-06-21 06:55:01,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=336963.0, ans=0.1 2024-06-21 06:55:07,560 INFO [train.py:1028] (1/2) Epoch 19, batch 1700, loss[loss=0.2101, simple_loss=0.269, pruned_loss=0.07563, over 12631.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2622, pruned_loss=0.07738, over 2580473.33 frames. ], batch size: 25, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:55:16,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=336999.6666666667, ans=0.125 2024-06-21 06:55:18,362 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2024-06-21 06:55:33,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=337036.3333333333, ans=0.125 2024-06-21 06:55:33,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337036.3333333333, ans=0.1 2024-06-21 06:55:35,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=337036.3333333333, ans=0.0 2024-06-21 06:55:36,812 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.103e+02 2.260e+02 2.509e+02 3.791e+02, threshold=4.520e+02, percent-clipped=0.0 2024-06-21 06:55:42,529 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.48 vs. limit=22.5 2024-06-21 06:55:48,944 INFO [train.py:1028] (1/2) Epoch 19, batch 1750, loss[loss=0.2419, simple_loss=0.3022, pruned_loss=0.09076, over 12526.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2623, pruned_loss=0.07739, over 2581380.63 frames. ], batch size: 22, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:55:58,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=337091.3333333333, ans=0.0 2024-06-21 06:56:07,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=337109.6666666667, ans=0.025 2024-06-21 06:56:15,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=337146.3333333333, ans=0.0 2024-06-21 06:56:16,656 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2024-06-21 06:56:21,850 INFO [train.py:1028] (1/2) Epoch 19, batch 1800, loss[loss=0.2053, simple_loss=0.258, pruned_loss=0.07626, over 13192.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2624, pruned_loss=0.07734, over 2582159.55 frames. ], batch size: 67, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:56:27,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.18 vs. limit=12.0 2024-06-21 06:56:32,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. 
limit=15.0 2024-06-21 06:56:43,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=337219.6666666667, ans=0.1 2024-06-21 06:56:44,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=6.0 2024-06-21 06:56:46,643 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.001e+02 2.118e+02 2.267e+02 4.544e+02, threshold=4.236e+02, percent-clipped=1.0 2024-06-21 06:56:54,790 INFO [train.py:1028] (1/2) Epoch 19, batch 1850, loss[loss=0.2218, simple_loss=0.2754, pruned_loss=0.08407, over 13257.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2619, pruned_loss=0.0771, over 2583213.12 frames. ], batch size: 83, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:56:59,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=337256.3333333333, ans=0.04949747468305833 2024-06-21 06:57:00,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=337256.3333333333, ans=0.1 2024-06-21 06:57:16,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=337311.3333333333, ans=0.0 2024-06-21 06:57:28,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=337329.6666666667, ans=0.035 2024-06-21 06:57:35,757 INFO [train.py:1028] (1/2) Epoch 19, batch 1900, loss[loss=0.2153, simple_loss=0.2666, pruned_loss=0.08201, over 13153.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2616, pruned_loss=0.07705, over 2585726.45 frames. ], batch size: 95, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:57:45,599 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=12.0 2024-06-21 06:57:55,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=337384.6666666667, ans=0.0 2024-06-21 06:58:05,360 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 1.991e+02 2.133e+02 2.313e+02 2.789e+02, threshold=4.265e+02, percent-clipped=0.0 2024-06-21 06:58:05,581 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:58:13,039 INFO [train.py:1028] (1/2) Epoch 19, batch 1950, loss[loss=0.187, simple_loss=0.2514, pruned_loss=0.06128, over 13253.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2617, pruned_loss=0.07725, over 2591060.03 frames. ], batch size: 52, lr: 3.09e-03, grad_scale: 32.0 2024-06-21 06:58:15,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2024-06-21 06:58:21,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=337458.0, ans=0.0 2024-06-21 06:58:44,804 INFO [train.py:1028] (1/2) Epoch 19, batch 2000, loss[loss=0.2194, simple_loss=0.2824, pruned_loss=0.07821, over 12912.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2616, pruned_loss=0.07716, over 2587732.11 frames. 
], batch size: 22, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 06:58:46,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=337531.3333333333, ans=0.1 2024-06-21 06:58:48,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=337531.3333333333, ans=0.125 2024-06-21 06:58:57,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=337568.0, ans=0.125 2024-06-21 06:59:05,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=337586.3333333333, ans=0.125 2024-06-21 06:59:09,744 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.045e+02 2.213e+02 2.420e+02 3.112e+02, threshold=4.427e+02, percent-clipped=0.0 2024-06-21 06:59:17,331 INFO [train.py:1028] (1/2) Epoch 19, batch 2050, loss[loss=0.2184, simple_loss=0.2711, pruned_loss=0.08285, over 12519.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2613, pruned_loss=0.07718, over 2582307.04 frames. ], batch size: 29, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 06:59:27,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=337641.3333333333, ans=0.125 2024-06-21 06:59:34,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.66 vs. limit=22.5 2024-06-21 06:59:45,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337678.0, ans=0.1 2024-06-21 06:59:48,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=337696.3333333333, ans=0.2 2024-06-21 06:59:53,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=337696.3333333333, ans=0.125 2024-06-21 06:59:55,537 INFO [train.py:1028] (1/2) Epoch 19, batch 2100, loss[loss=0.1965, simple_loss=0.2558, pruned_loss=0.06859, over 13203.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2613, pruned_loss=0.07697, over 2585109.92 frames. 
], batch size: 59, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 07:00:02,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=337733.0, ans=0.0 2024-06-21 07:00:03,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=337733.0, ans=0.0 2024-06-21 07:00:04,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=337733.0, ans=0.0 2024-06-21 07:00:05,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=337733.0, ans=0.125 2024-06-21 07:00:18,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=337769.6666666667, ans=0.04949747468305833 2024-06-21 07:00:18,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=337769.6666666667, ans=0.125 2024-06-21 07:00:20,649 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.004e+02 2.106e+02 2.256e+02 2.796e+02, threshold=4.212e+02, percent-clipped=0.0 2024-06-21 07:00:20,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=337769.6666666667, ans=0.125 2024-06-21 07:00:21,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=337788.0, ans=0.0 2024-06-21 07:00:23,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=337788.0, ans=0.125 2024-06-21 07:00:29,125 INFO [train.py:1028] (1/2) Epoch 19, batch 2150, loss[loss=0.1899, simple_loss=0.2536, pruned_loss=0.06309, over 13292.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2616, pruned_loss=0.0767, over 2587937.12 frames. ], batch size: 52, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 07:00:30,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=337806.3333333333, ans=0.125 2024-06-21 07:00:36,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=337824.6666666667, ans=0.125 2024-06-21 07:00:45,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=337843.0, ans=0.025 2024-06-21 07:00:49,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337861.3333333333, ans=0.1 2024-06-21 07:01:03,322 INFO [train.py:1028] (1/2) Epoch 19, batch 2200, loss[loss=0.2032, simple_loss=0.2503, pruned_loss=0.07805, over 13188.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2613, pruned_loss=0.07639, over 2588540.63 frames. ], batch size: 83, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 07:01:06,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=337898.0, ans=0.0 2024-06-21 07:01:25,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.76 vs. 
limit=8.0 2024-06-21 07:01:28,398 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 1.962e+02 2.150e+02 2.342e+02 4.060e+02, threshold=4.301e+02, percent-clipped=0.0 2024-06-21 07:01:31,478 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:01:36,718 INFO [train.py:1028] (1/2) Epoch 19, batch 2250, loss[loss=0.202, simple_loss=0.2584, pruned_loss=0.07278, over 13261.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2614, pruned_loss=0.07683, over 2588670.93 frames. ], batch size: 63, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 07:01:39,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=337989.6666666667, ans=0.0 2024-06-21 07:01:43,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=337989.6666666667, ans=0.2 2024-06-21 07:01:43,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=337989.6666666667, ans=0.0 2024-06-21 07:02:03,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=338044.6666666667, ans=0.09899494936611666 2024-06-21 07:02:07,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338063.0, ans=0.1 2024-06-21 07:02:11,180 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0 2024-06-21 07:02:14,778 INFO [train.py:1028] (1/2) Epoch 19, batch 2300, loss[loss=0.1712, simple_loss=0.2311, pruned_loss=0.05561, over 12877.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.262, pruned_loss=0.07723, over 2582576.95 frames. ], batch size: 33, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 07:02:25,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.13 vs. limit=15.0 2024-06-21 07:02:34,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.88 vs. limit=15.0 2024-06-21 07:02:39,596 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.032e+02 2.142e+02 2.325e+02 3.906e+02, threshold=4.284e+02, percent-clipped=0.0 2024-06-21 07:02:41,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.58 vs. limit=6.0 2024-06-21 07:02:47,380 INFO [train.py:1028] (1/2) Epoch 19, batch 2350, loss[loss=0.2131, simple_loss=0.2676, pruned_loss=0.07936, over 13194.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2616, pruned_loss=0.07717, over 2585173.93 frames. 
], batch size: 67, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 07:02:47,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=338173.0, ans=0.0 2024-06-21 07:02:51,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=338173.0, ans=0.2 2024-06-21 07:02:54,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338191.3333333333, ans=0.1 2024-06-21 07:03:02,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=338209.6666666667, ans=0.125 2024-06-21 07:03:07,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338228.0, ans=0.1 2024-06-21 07:03:09,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2024-06-21 07:03:14,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=338246.3333333333, ans=0.125 2024-06-21 07:03:16,696 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2024-06-21 07:03:20,402 INFO [train.py:1028] (1/2) Epoch 19, batch 2400, loss[loss=0.2017, simple_loss=0.2579, pruned_loss=0.0728, over 13302.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.261, pruned_loss=0.07698, over 2588386.70 frames. ], batch size: 46, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:03:20,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=338264.6666666667, ans=0.2 2024-06-21 07:03:21,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=338264.6666666667, ans=0.125 2024-06-21 07:03:23,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=338264.6666666667, ans=0.125 2024-06-21 07:03:24,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=338264.6666666667, ans=0.1 2024-06-21 07:03:27,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. 
limit=15.0 2024-06-21 07:03:28,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=338283.0, ans=0.0 2024-06-21 07:03:42,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=338319.6666666667, ans=0.125 2024-06-21 07:03:50,421 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 1.989e+02 2.102e+02 2.305e+02 2.886e+02, threshold=4.204e+02, percent-clipped=0.0 2024-06-21 07:03:52,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=338338.0, ans=0.025 2024-06-21 07:03:54,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=338338.0, ans=0.0 2024-06-21 07:03:58,147 INFO [train.py:1028] (1/2) Epoch 19, batch 2450, loss[loss=0.2056, simple_loss=0.2646, pruned_loss=0.07325, over 13288.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2607, pruned_loss=0.07738, over 2585482.06 frames. ], batch size: 63, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:04:14,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=338393.0, ans=0.0 2024-06-21 07:04:21,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=338411.3333333333, ans=0.125 2024-06-21 07:04:30,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=338448.0, ans=0.125 2024-06-21 07:04:31,143 INFO [train.py:1028] (1/2) Epoch 19, batch 2500, loss[loss=0.2031, simple_loss=0.2556, pruned_loss=0.07529, over 13252.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2602, pruned_loss=0.07714, over 2586972.60 frames. ], batch size: 83, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:04:40,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=338466.3333333333, ans=0.0 2024-06-21 07:04:56,442 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.988e+02 2.092e+02 2.207e+02 4.147e+02, threshold=4.184e+02, percent-clipped=0.0 2024-06-21 07:04:58,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=338521.3333333333, ans=0.0 2024-06-21 07:05:00,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=338521.3333333333, ans=0.125 2024-06-21 07:05:04,350 INFO [train.py:1028] (1/2) Epoch 19, batch 2550, loss[loss=0.2271, simple_loss=0.2865, pruned_loss=0.08382, over 12434.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2595, pruned_loss=0.07678, over 2587733.56 frames. ], batch size: 22, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:05:22,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338576.3333333333, ans=0.1 2024-06-21 07:05:42,690 INFO [train.py:1028] (1/2) Epoch 19, batch 2600, loss[loss=0.2049, simple_loss=0.262, pruned_loss=0.07391, over 13265.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2581, pruned_loss=0.07629, over 2586970.22 frames. 
], batch size: 52, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:05:44,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=338631.3333333333, ans=0.0 2024-06-21 07:05:52,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=338649.6666666667, ans=0.0 2024-06-21 07:06:04,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338686.3333333333, ans=0.1 2024-06-21 07:06:07,360 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 2.030e+02 2.187e+02 2.443e+02 3.083e+02, threshold=4.375e+02, percent-clipped=0.0 2024-06-21 07:06:12,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=338704.6666666667, ans=0.125 2024-06-21 07:06:12,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=338704.6666666667, ans=0.125 2024-06-21 07:06:14,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=338704.6666666667, ans=0.125 2024-06-21 07:06:15,356 INFO [train.py:1028] (1/2) Epoch 19, batch 2650, loss[loss=0.1815, simple_loss=0.2291, pruned_loss=0.06698, over 13024.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2564, pruned_loss=0.07582, over 2585863.55 frames. ], batch size: 144, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:06:24,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=22.5 2024-06-21 07:06:24,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=338741.3333333333, ans=0.125 2024-06-21 07:06:28,995 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.28 vs. limit=15.0 2024-06-21 07:06:46,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=338796.3333333333, ans=0.125 2024-06-21 07:06:48,196 INFO [train.py:1028] (1/2) Epoch 19, batch 2700, loss[loss=0.1977, simple_loss=0.2482, pruned_loss=0.07362, over 13232.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2551, pruned_loss=0.07562, over 2583441.66 frames. ], batch size: 89, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:06:48,246 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:06:51,023 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2024-06-21 07:06:54,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.76 vs. 
limit=15.0 2024-06-21 07:07:08,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=338869.6666666667, ans=0.2 2024-06-21 07:07:13,117 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 1.953e+02 2.102e+02 2.262e+02 2.942e+02, threshold=4.204e+02, percent-clipped=0.0 2024-06-21 07:07:24,385 INFO [train.py:1028] (1/2) Epoch 19, batch 2750, loss[loss=0.1894, simple_loss=0.2461, pruned_loss=0.06636, over 13330.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2542, pruned_loss=0.07494, over 2580186.13 frames. ], batch size: 43, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:07:27,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=338906.3333333333, ans=0.0 2024-06-21 07:07:32,159 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. limit=6.0 2024-06-21 07:07:32,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=338924.6666666667, ans=0.2 2024-06-21 07:07:46,815 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.93 vs. limit=15.0 2024-06-21 07:07:53,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=338961.3333333333, ans=0.1 2024-06-21 07:07:57,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=338979.6666666667, ans=0.125 2024-06-21 07:08:01,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=338998.0, ans=0.2 2024-06-21 07:08:01,912 INFO [train.py:1028] (1/2) Epoch 19, batch 2800, loss[loss=0.2112, simple_loss=0.2562, pruned_loss=0.08313, over 10764.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2537, pruned_loss=0.07479, over 2578122.45 frames. ], batch size: 303, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:08:16,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=339034.6666666667, ans=15.0 2024-06-21 07:08:19,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=339034.6666666667, ans=0.0 2024-06-21 07:08:23,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=339053.0, ans=0.0 2024-06-21 07:08:26,907 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 1.988e+02 2.133e+02 2.273e+02 2.773e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 07:08:34,984 INFO [train.py:1028] (1/2) Epoch 19, batch 2850, loss[loss=0.217, simple_loss=0.2649, pruned_loss=0.08459, over 13013.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2524, pruned_loss=0.07425, over 2576119.50 frames. 
], batch size: 48, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:08:39,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339089.6666666667, ans=0.1 2024-06-21 07:08:40,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=339108.0, ans=0.95 2024-06-21 07:08:58,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=339144.6666666667, ans=0.0 2024-06-21 07:08:59,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=339144.6666666667, ans=0.125 2024-06-21 07:09:06,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=339163.0, ans=0.125 2024-06-21 07:09:07,431 INFO [train.py:1028] (1/2) Epoch 19, batch 2900, loss[loss=0.1859, simple_loss=0.2375, pruned_loss=0.06715, over 13150.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2506, pruned_loss=0.07378, over 2584097.89 frames. ], batch size: 55, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:09:15,317 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=28.30 vs. limit=22.5 2024-06-21 07:09:34,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=339218.0, ans=0.0 2024-06-21 07:09:40,591 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 1.923e+02 2.018e+02 2.205e+02 2.853e+02, threshold=4.037e+02, percent-clipped=0.0 2024-06-21 07:09:44,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=339254.6666666667, ans=0.5 2024-06-21 07:09:48,719 INFO [train.py:1028] (1/2) Epoch 19, batch 2950, loss[loss=0.1801, simple_loss=0.2335, pruned_loss=0.06339, over 13267.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2508, pruned_loss=0.07394, over 2579168.64 frames. ], batch size: 43, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:09:49,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=339273.0, ans=0.125 2024-06-21 07:09:50,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=339273.0, ans=0.1 2024-06-21 07:09:54,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=339273.0, ans=0.05 2024-06-21 07:09:56,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.60 vs. 
limit=12.0 2024-06-21 07:10:07,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=339309.6666666667, ans=0.2 2024-06-21 07:10:07,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=339309.6666666667, ans=0.1 2024-06-21 07:10:10,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=339328.0, ans=0.125 2024-06-21 07:10:11,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=339328.0, ans=0.0 2024-06-21 07:10:22,641 INFO [train.py:1028] (1/2) Epoch 19, batch 3000, loss[loss=0.192, simple_loss=0.2531, pruned_loss=0.06542, over 13196.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2496, pruned_loss=0.07312, over 2578649.21 frames. ], batch size: 59, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:10:22,642 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 07:10:27,178 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.9033, 2.5919, 2.6082, 2.2818, 2.4815, 2.5042, 2.6343, 2.1280], device='cuda:1') 2024-06-21 07:10:30,410 INFO [train.py:1060] (1/2) Epoch 19, validation: loss=0.186, simple_loss=0.2507, pruned_loss=0.06065, over 351949.00 frames. 2024-06-21 07:10:30,411 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 07:10:31,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=339364.6666666667, ans=0.125 2024-06-21 07:10:32,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=339364.6666666667, ans=0.125 2024-06-21 07:10:36,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=339383.0, ans=0.125 2024-06-21 07:10:37,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=339383.0, ans=0.125 2024-06-21 07:10:42,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=339401.3333333333, ans=0.0 2024-06-21 07:10:45,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. 
limit=15.0 2024-06-21 07:10:45,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=339401.3333333333, ans=0.125 2024-06-21 07:10:45,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=339401.3333333333, ans=0.0 2024-06-21 07:10:47,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=339401.3333333333, ans=0.5 2024-06-21 07:10:48,154 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:10:55,043 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.023e+02 2.182e+02 2.347e+02 3.005e+02, threshold=4.364e+02, percent-clipped=0.0 2024-06-21 07:11:02,757 INFO [train.py:1028] (1/2) Epoch 19, batch 3050, loss[loss=0.1892, simple_loss=0.2466, pruned_loss=0.06592, over 13284.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2496, pruned_loss=0.07334, over 2578477.94 frames. ], batch size: 46, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:11:10,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=339456.3333333333, ans=0.125 2024-06-21 07:11:11,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=339474.6666666667, ans=0.2 2024-06-21 07:11:15,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=339474.6666666667, ans=0.2 2024-06-21 07:11:24,991 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.43 vs. limit=15.0 2024-06-21 07:11:27,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=339511.3333333333, ans=0.0 2024-06-21 07:11:41,456 INFO [train.py:1028] (1/2) Epoch 19, batch 3100, loss[loss=0.1882, simple_loss=0.2375, pruned_loss=0.06948, over 13033.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2488, pruned_loss=0.07295, over 2579016.13 frames. ], batch size: 144, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:11:42,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=339548.0, ans=0.125 2024-06-21 07:11:50,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=339566.3333333333, ans=0.2 2024-06-21 07:12:01,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=339603.0, ans=0.0 2024-06-21 07:12:06,992 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.957e+02 2.090e+02 2.363e+02 2.907e+02, threshold=4.180e+02, percent-clipped=0.0 2024-06-21 07:12:10,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=339621.3333333333, ans=0.125 2024-06-21 07:12:14,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.33 vs. limit=15.0 2024-06-21 07:12:14,583 INFO [train.py:1028] (1/2) Epoch 19, batch 3150, loss[loss=0.1827, simple_loss=0.2302, pruned_loss=0.0676, over 12941.00 frames. 
], tot_loss[loss=0.1964, simple_loss=0.2478, pruned_loss=0.07254, over 2582015.18 frames. ], batch size: 158, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:12:24,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=339658.0, ans=0.125 2024-06-21 07:12:25,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=339658.0, ans=0.09899494936611666 2024-06-21 07:12:27,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=339676.3333333333, ans=0.125 2024-06-21 07:12:29,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=339676.3333333333, ans=0.125 2024-06-21 07:12:33,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=339676.3333333333, ans=0.0 2024-06-21 07:12:34,161 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. limit=6.0 2024-06-21 07:12:35,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=339694.6666666667, ans=0.2 2024-06-21 07:12:46,972 INFO [train.py:1028] (1/2) Epoch 19, batch 3200, loss[loss=0.1804, simple_loss=0.238, pruned_loss=0.0614, over 13128.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.247, pruned_loss=0.07215, over 2581702.24 frames. ], batch size: 55, lr: 3.07e-03, grad_scale: 64.0 2024-06-21 07:12:53,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=339749.6666666667, ans=0.1 2024-06-21 07:12:54,745 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2024-06-21 07:13:02,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=339768.0, ans=0.125 2024-06-21 07:13:04,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=339768.0, ans=0.04949747468305833 2024-06-21 07:13:11,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=339786.3333333333, ans=0.0 2024-06-21 07:13:13,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=339786.3333333333, ans=0.0 2024-06-21 07:13:14,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=339786.3333333333, ans=0.125 2024-06-21 07:13:14,704 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.946e+02 2.055e+02 2.198e+02 2.815e+02, threshold=4.109e+02, percent-clipped=0.0 2024-06-21 07:13:22,360 INFO [train.py:1028] (1/2) Epoch 19, batch 3250, loss[loss=0.1893, simple_loss=0.2434, pruned_loss=0.06761, over 13227.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2461, pruned_loss=0.072, over 2584877.07 frames. 
], batch size: 72, lr: 3.07e-03, grad_scale: 64.0 2024-06-21 07:13:24,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=339823.0, ans=0.125 2024-06-21 07:13:38,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. limit=12.0 2024-06-21 07:13:50,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=339878.0, ans=0.2 2024-06-21 07:13:56,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=339896.3333333333, ans=0.2 2024-06-21 07:13:59,205 INFO [train.py:1028] (1/2) Epoch 19, batch 3300, loss[loss=0.1951, simple_loss=0.2433, pruned_loss=0.07348, over 12745.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.246, pruned_loss=0.07183, over 2581026.28 frames. ], batch size: 176, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:14:13,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=339951.3333333333, ans=0.0 2024-06-21 07:14:14,783 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.83 vs. limit=15.0 2024-06-21 07:14:24,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339969.6666666667, ans=0.1 2024-06-21 07:14:24,822 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.009e+02 2.139e+02 2.349e+02 3.100e+02, threshold=4.277e+02, percent-clipped=0.0 2024-06-21 07:14:25,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=339988.0, ans=0.0 2024-06-21 07:14:27,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=339988.0, ans=0.0 2024-06-21 07:14:32,137 INFO [train.py:1028] (1/2) Epoch 19, batch 3350, loss[loss=0.1874, simple_loss=0.2376, pruned_loss=0.06865, over 12933.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2457, pruned_loss=0.07214, over 2576467.09 frames. ], batch size: 158, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:14:40,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2024-06-21 07:14:41,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0 2024-06-21 07:14:47,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.62 vs. limit=12.0 2024-06-21 07:14:50,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=340043.0, ans=0.125 2024-06-21 07:14:58,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=340061.3333333333, ans=0.125 2024-06-21 07:15:05,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340079.6666666667, ans=0.1 2024-06-21 07:15:08,165 INFO [train.py:1028] (1/2) Epoch 19, batch 3400, loss[loss=0.2049, simple_loss=0.256, pruned_loss=0.07685, over 12541.00 frames. 
], tot_loss[loss=0.1949, simple_loss=0.2453, pruned_loss=0.07226, over 2574863.26 frames. ], batch size: 22, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:15:13,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.05 vs. limit=15.0 2024-06-21 07:15:15,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=340116.3333333333, ans=0.0 2024-06-21 07:15:32,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=340153.0, ans=0.125 2024-06-21 07:15:36,743 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 1.945e+02 2.046e+02 2.317e+02 2.778e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-21 07:15:41,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=340171.3333333333, ans=0.2 2024-06-21 07:15:44,117 INFO [train.py:1028] (1/2) Epoch 19, batch 3450, loss[loss=0.2013, simple_loss=0.2484, pruned_loss=0.07711, over 12694.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2448, pruned_loss=0.07207, over 2576250.54 frames. ], batch size: 176, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:15:48,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.93 vs. limit=15.0 2024-06-21 07:15:50,068 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2024-06-21 07:15:52,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=340208.0, ans=0.09899494936611666 2024-06-21 07:15:54,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=340208.0, ans=0.0 2024-06-21 07:16:12,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=340263.0, ans=0.0 2024-06-21 07:16:16,568 INFO [train.py:1028] (1/2) Epoch 19, batch 3500, loss[loss=0.2103, simple_loss=0.2605, pruned_loss=0.08008, over 12990.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2448, pruned_loss=0.07195, over 2575760.66 frames. ], batch size: 33, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:16:42,578 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 1.899e+02 2.001e+02 2.180e+02 3.338e+02, threshold=4.001e+02, percent-clipped=0.0 2024-06-21 07:16:54,646 INFO [train.py:1028] (1/2) Epoch 19, batch 3550, loss[loss=0.1876, simple_loss=0.2382, pruned_loss=0.06856, over 13182.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2445, pruned_loss=0.07175, over 2577733.92 frames. ], batch size: 95, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:16:59,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=340373.0, ans=0.0 2024-06-21 07:16:59,698 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. 
limit=15.0 2024-06-21 07:17:03,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340391.3333333333, ans=0.1 2024-06-21 07:17:07,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=340409.6666666667, ans=0.125 2024-06-21 07:17:11,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340409.6666666667, ans=0.1 2024-06-21 07:17:30,342 INFO [train.py:1028] (1/2) Epoch 19, batch 3600, loss[loss=0.1857, simple_loss=0.2439, pruned_loss=0.06374, over 13310.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2444, pruned_loss=0.07199, over 2581033.52 frames. ], batch size: 49, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:17:46,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.24 vs. limit=15.0 2024-06-21 07:17:48,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=340501.3333333333, ans=15.0 2024-06-21 07:17:56,005 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.999e+02 2.212e+02 2.432e+02 3.452e+02, threshold=4.424e+02, percent-clipped=0.0 2024-06-21 07:17:56,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=340538.0, ans=0.2 2024-06-21 07:17:58,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=340538.0, ans=0.125 2024-06-21 07:18:02,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=340556.3333333333, ans=0.0 2024-06-21 07:18:03,162 INFO [train.py:1028] (1/2) Epoch 19, batch 3650, loss[loss=0.2028, simple_loss=0.2436, pruned_loss=0.081, over 13012.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2445, pruned_loss=0.07216, over 2579278.48 frames. ], batch size: 102, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:18:16,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=340593.0, ans=0.125 2024-06-21 07:18:18,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=340593.0, ans=0.5 2024-06-21 07:18:22,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=340611.3333333333, ans=0.125 2024-06-21 07:18:29,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=340629.6666666667, ans=0.2 2024-06-21 07:18:36,076 INFO [train.py:1028] (1/2) Epoch 19, batch 3700, loss[loss=0.1842, simple_loss=0.2356, pruned_loss=0.06635, over 13234.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2432, pruned_loss=0.07171, over 2585044.79 frames. 
], batch size: 72, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:18:43,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=340666.3333333333, ans=0.125 2024-06-21 07:18:55,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=340703.0, ans=0.125 2024-06-21 07:19:05,075 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.972e+02 2.096e+02 2.246e+02 2.799e+02, threshold=4.192e+02, percent-clipped=0.0 2024-06-21 07:19:12,301 INFO [train.py:1028] (1/2) Epoch 19, batch 3750, loss[loss=0.2222, simple_loss=0.273, pruned_loss=0.08573, over 12809.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2434, pruned_loss=0.07168, over 2587813.10 frames. ], batch size: 22, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:19:18,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=340758.0, ans=0.0 2024-06-21 07:19:46,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=340813.0, ans=0.1 2024-06-21 07:19:47,504 INFO [train.py:1028] (1/2) Epoch 19, batch 3800, loss[loss=0.1788, simple_loss=0.2302, pruned_loss=0.06366, over 13194.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2431, pruned_loss=0.0714, over 2585389.57 frames. ], batch size: 83, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:19:47,692 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:19:52,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2024-06-21 07:19:59,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340849.6666666667, ans=0.1 2024-06-21 07:20:05,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=340868.0, ans=0.2 2024-06-21 07:20:12,560 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.38 vs. limit=12.0 2024-06-21 07:20:12,714 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.882e+02 2.058e+02 2.202e+02 2.940e+02, threshold=4.116e+02, percent-clipped=0.0 2024-06-21 07:20:12,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=340904.6666666667, ans=0.125 2024-06-21 07:20:16,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=340904.6666666667, ans=0.0 2024-06-21 07:20:20,148 INFO [train.py:1028] (1/2) Epoch 19, batch 3850, loss[loss=0.1938, simple_loss=0.2386, pruned_loss=0.07455, over 13065.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2423, pruned_loss=0.07091, over 2585038.10 frames. 
], batch size: 144, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:20:34,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=340959.6666666667, ans=0.125 2024-06-21 07:20:34,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=340959.6666666667, ans=0.0 2024-06-21 07:20:41,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=340978.0, ans=0.125 2024-06-21 07:20:43,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=340978.0, ans=0.125 2024-06-21 07:20:43,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=340978.0, ans=0.0 2024-06-21 07:20:44,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=340978.0, ans=0.125 2024-06-21 07:20:52,658 INFO [train.py:1028] (1/2) Epoch 19, batch 3900, loss[loss=0.2025, simple_loss=0.2503, pruned_loss=0.07733, over 13242.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2422, pruned_loss=0.07096, over 2587146.58 frames. ], batch size: 83, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:20:56,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=341014.6666666667, ans=0.0 2024-06-21 07:21:21,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=341069.6666666667, ans=0.5 2024-06-21 07:21:21,497 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 1.938e+02 2.061e+02 2.241e+02 2.813e+02, threshold=4.122e+02, percent-clipped=0.0 2024-06-21 07:21:28,721 INFO [train.py:1028] (1/2) Epoch 19, batch 3950, loss[loss=0.1563, simple_loss=0.2041, pruned_loss=0.05424, over 13072.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2409, pruned_loss=0.07031, over 2589557.90 frames. ], batch size: 132, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:21:42,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=341124.6666666667, ans=0.125 2024-06-21 07:21:42,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=341124.6666666667, ans=0.0 2024-06-21 07:21:46,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=341143.0, ans=0.125 2024-06-21 07:21:49,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.60 vs. limit=22.5 2024-06-21 07:21:51,201 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.33 vs. limit=22.5 2024-06-21 07:21:54,335 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.544e-01 2024-06-21 07:22:04,975 INFO [train.py:1028] (1/2) Epoch 19, batch 4000, loss[loss=0.1992, simple_loss=0.2555, pruned_loss=0.07145, over 12986.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2406, pruned_loss=0.07024, over 2583883.74 frames. 
], batch size: 39, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:22:20,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=341234.6666666667, ans=0.125 2024-06-21 07:22:21,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=341234.6666666667, ans=0.0 2024-06-21 07:22:29,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341253.0, ans=0.1 2024-06-21 07:22:30,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 1.947e+02 2.062e+02 2.205e+02 3.340e+02, threshold=4.124e+02, percent-clipped=0.0 2024-06-21 07:22:31,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=12.0 2024-06-21 07:22:37,709 INFO [train.py:1028] (1/2) Epoch 19, batch 4050, loss[loss=0.1939, simple_loss=0.2349, pruned_loss=0.07642, over 10866.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2407, pruned_loss=0.0703, over 2581111.99 frames. ], batch size: 303, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:22:38,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.22 vs. limit=15.0 2024-06-21 07:22:38,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=341289.6666666667, ans=0.07 2024-06-21 07:22:45,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=341308.0, ans=0.025 2024-06-21 07:22:48,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=341308.0, ans=0.125 2024-06-21 07:23:07,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2024-06-21 07:23:08,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.25 vs. limit=15.0 2024-06-21 07:23:13,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=341381.3333333333, ans=0.125 2024-06-21 07:23:13,789 INFO [train.py:1028] (1/2) Epoch 19, batch 4100, loss[loss=0.1958, simple_loss=0.2413, pruned_loss=0.07521, over 13149.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2405, pruned_loss=0.07033, over 2578476.61 frames. ], batch size: 103, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:23:17,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=341381.3333333333, ans=0.125 2024-06-21 07:23:18,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.57 vs. 
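limit=15.0

The "ScheduledFloat: name=..., batch_count=..., ans=..." entries that dominate this log are periodic samples of hyperparameters (dropout probabilities, skip rates, balancer probabilities, bypass scales) that are annealed as a function of the global batch count rather than held fixed. A toy sketch in that spirit follows, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; both the class body and the breakpoints below are illustrative, not the recipe's actual schedules.

class ScheduledFloat:
    # Toy piecewise-linear schedule over batch_count: holds the first
    # value before the first breakpoint and the last value after the
    # last breakpoint.
    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) pairs

    def value(self, batch_count: float) -> float:
        (x0, y0) = self.points[0]
        if batch_count <= x0:
            return y0
        for (x1, y1) in self.points[1:]:
            if batch_count <= x1:
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
            (x0, y0) = (x1, y1)
        return y0

dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(341528.0))  # -> 0.1: this deep into training most
                                  # schedules sit at their final value,
                                  # consistent with the many ans=0.1 readings
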
2024-06-21 07:23:18,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=341381.3333333333, ans=0.0 2024-06-21 07:23:29,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=341418.0, ans=0.07 2024-06-21 07:23:31,146 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.66 vs. limit=15.0 2024-06-21 07:23:40,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=341436.3333333333, ans=0.04949747468305833 2024-06-21 07:23:43,225 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 1.993e+02 2.146e+02 2.420e+02 3.460e+02, threshold=4.292e+02, percent-clipped=0.0 2024-06-21 07:23:50,662 INFO [train.py:1028] (1/2) Epoch 19, batch 4150, loss[loss=0.1945, simple_loss=0.2394, pruned_loss=0.07484, over 13165.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2405, pruned_loss=0.07029, over 2577097.61 frames. ], batch size: 55, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:24:03,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=341509.6666666667, ans=0.0 2024-06-21 07:24:06,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=341509.6666666667, ans=0.125 2024-06-21 07:24:08,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=341509.6666666667, ans=0.035 2024-06-21 07:24:11,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341528.0, ans=0.1 2024-06-21 07:24:23,575 INFO [train.py:1028] (1/2) Epoch 19, batch 4200, loss[loss=0.1902, simple_loss=0.2297, pruned_loss=0.07534, over 12998.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2394, pruned_loss=0.06983, over 2579772.60 frames. ], batch size: 102, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:24:41,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.48 vs. limit=22.5 2024-06-21 07:24:49,099 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 1.889e+02 2.049e+02 2.187e+02 2.672e+02, threshold=4.099e+02, percent-clipped=0.0 2024-06-21 07:24:53,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=341638.0, ans=0.07 2024-06-21 07:24:53,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=341638.0, ans=0.125 2024-06-21 07:24:53,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=341638.0, ans=0.125 2024-06-21 07:24:55,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=341656.3333333333, ans=0.025 2024-06-21 07:24:56,438 INFO [train.py:1028] (1/2) Epoch 19, batch 4250, loss[loss=0.1921, simple_loss=0.2483, pruned_loss=0.06798, over 13286.00 frames. ], tot_loss[loss=0.189, simple_loss=0.239, pruned_loss=0.06945, over 2582300.84 frames.
], batch size: 46, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:24:56,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=341656.3333333333, ans=0.125 2024-06-21 07:25:27,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=341729.6666666667, ans=0.125 2024-06-21 07:25:28,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0 2024-06-21 07:25:32,348 INFO [train.py:1028] (1/2) Epoch 19, batch 4300, loss[loss=0.1939, simple_loss=0.2504, pruned_loss=0.06867, over 13200.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2391, pruned_loss=0.06959, over 2582333.12 frames. ], batch size: 59, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:25:35,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=341748.0, ans=0.0 2024-06-21 07:25:42,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=341766.3333333333, ans=15.0 2024-06-21 07:25:50,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=341784.6666666667, ans=0.0 2024-06-21 07:25:51,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=341784.6666666667, ans=0.125 2024-06-21 07:26:00,217 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.956e+02 2.044e+02 2.173e+02 2.894e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-21 07:26:05,206 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2024-06-21 07:26:07,373 INFO [train.py:1028] (1/2) Epoch 19, batch 4350, loss[loss=0.1948, simple_loss=0.237, pruned_loss=0.07634, over 13231.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2383, pruned_loss=0.06963, over 2586855.37 frames. ], batch size: 59, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:26:12,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=341839.6666666667, ans=0.125 2024-06-21 07:26:28,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.48 vs. limit=12.0 2024-06-21 07:26:32,998 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.47 vs. limit=6.0 2024-06-21 07:26:40,078 INFO [train.py:1028] (1/2) Epoch 19, batch 4400, loss[loss=0.2043, simple_loss=0.2479, pruned_loss=0.08035, over 13284.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2378, pruned_loss=0.06942, over 2586935.76 frames. ], batch size: 83, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:26:41,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.80 vs. 
limit=10.0 2024-06-21 07:26:45,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=341949.6666666667, ans=0.0 2024-06-21 07:27:05,284 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 1.892e+02 1.967e+02 2.124e+02 2.885e+02, threshold=3.934e+02, percent-clipped=0.0 2024-06-21 07:27:15,963 INFO [train.py:1028] (1/2) Epoch 19, batch 4450, loss[loss=0.1947, simple_loss=0.2476, pruned_loss=0.07088, over 12887.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2379, pruned_loss=0.06939, over 2582267.29 frames. ], batch size: 33, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:27:18,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.43 vs. limit=15.0 2024-06-21 07:27:25,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=342041.3333333333, ans=0.0 2024-06-21 07:27:30,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=342059.6666666667, ans=0.125 2024-06-21 07:27:31,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=342059.6666666667, ans=0.125 2024-06-21 07:27:38,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=12.0 2024-06-21 07:27:39,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=342078.0, ans=0.1 2024-06-21 07:27:43,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=342078.0, ans=0.125 2024-06-21 07:27:48,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=342096.3333333333, ans=0.0 2024-06-21 07:27:48,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=342096.3333333333, ans=0.1 2024-06-21 07:27:49,802 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:27:51,118 INFO [train.py:1028] (1/2) Epoch 19, batch 4500, loss[loss=0.1884, simple_loss=0.2331, pruned_loss=0.07182, over 13259.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2377, pruned_loss=0.06924, over 2586408.87 frames. ], batch size: 89, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:27:55,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=342114.6666666667, ans=0.0 2024-06-21 07:27:57,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.74 vs. 
limit=12.0 2024-06-21 07:27:57,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=342133.0, ans=0.1 2024-06-21 07:27:57,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=342133.0, ans=0.025 2024-06-21 07:28:04,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=342151.3333333333, ans=0.0 2024-06-21 07:28:16,870 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 1.948e+02 2.036e+02 2.180e+02 2.904e+02, threshold=4.072e+02, percent-clipped=0.0 2024-06-21 07:28:24,241 INFO [train.py:1028] (1/2) Epoch 19, batch 4550, loss[loss=0.194, simple_loss=0.2469, pruned_loss=0.07059, over 13283.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2382, pruned_loss=0.06914, over 2590639.11 frames. ], batch size: 52, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:28:42,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=342243.0, ans=0.125 2024-06-21 07:28:49,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.83 vs. limit=10.0 2024-06-21 07:28:56,704 INFO [train.py:1028] (1/2) Epoch 19, batch 4600, loss[loss=0.2056, simple_loss=0.2509, pruned_loss=0.0802, over 12565.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2388, pruned_loss=0.06916, over 2586287.36 frames. ], batch size: 202, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:29:02,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=342316.3333333333, ans=0.0 2024-06-21 07:29:08,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=342316.3333333333, ans=0.125 2024-06-21 07:29:08,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=342316.3333333333, ans=0.0 2024-06-21 07:29:09,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=342316.3333333333, ans=0.0 2024-06-21 07:29:12,283 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2024-06-21 07:29:14,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=342334.6666666667, ans=0.2 2024-06-21 07:29:14,799 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:29:19,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=342353.0, ans=0.125 2024-06-21 07:29:25,640 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 1.889e+02 2.049e+02 2.297e+02 2.729e+02, threshold=4.098e+02, percent-clipped=0.0 2024-06-21 07:29:32,969 INFO [train.py:1028] (1/2) Epoch 19, batch 4650, loss[loss=0.1743, simple_loss=0.216, pruned_loss=0.06624, over 13143.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2376, pruned_loss=0.06886, over 2589724.01 frames. 
], batch size: 132, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:29:40,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2024-06-21 07:29:45,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=342408.0, ans=0.0 2024-06-21 07:29:48,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.88 vs. limit=15.0 2024-06-21 07:30:09,802 INFO [train.py:1028] (1/2) Epoch 19, batch 4700, loss[loss=0.1853, simple_loss=0.2457, pruned_loss=0.06246, over 12536.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2382, pruned_loss=0.06922, over 2583625.99 frames. ], batch size: 25, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:30:21,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.34 vs. limit=15.0 2024-06-21 07:30:23,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=342518.0, ans=0.0 2024-06-21 07:30:31,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=342536.3333333333, ans=0.0 2024-06-21 07:30:36,590 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.980e+02 2.165e+02 2.370e+02 2.910e+02, threshold=4.331e+02, percent-clipped=0.0 2024-06-21 07:30:36,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.12 vs. limit=6.0 2024-06-21 07:30:43,776 INFO [train.py:1028] (1/2) Epoch 19, batch 4750, loss[loss=0.1934, simple_loss=0.2449, pruned_loss=0.07092, over 12459.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2379, pruned_loss=0.06938, over 2580373.59 frames. ], batch size: 202, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:30:46,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=342573.0, ans=0.1 2024-06-21 07:30:50,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=342591.3333333333, ans=0.09899494936611666 2024-06-21 07:30:58,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=342609.6666666667, ans=0.0 2024-06-21 07:31:07,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=342628.0, ans=0.125 2024-06-21 07:31:21,695 INFO [train.py:1028] (1/2) Epoch 19, batch 4800, loss[loss=0.1874, simple_loss=0.2348, pruned_loss=0.07, over 13269.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2382, pruned_loss=0.06961, over 2576720.68 frames. 
], batch size: 63, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:31:32,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=342683.0, ans=0.2 2024-06-21 07:31:47,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=342719.6666666667, ans=0.125 2024-06-21 07:31:51,725 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.953e+02 2.094e+02 2.323e+02 2.994e+02, threshold=4.188e+02, percent-clipped=0.0 2024-06-21 07:31:57,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=342738.0, ans=0.025 2024-06-21 07:31:58,148 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.52 vs. limit=22.5 2024-06-21 07:31:59,022 INFO [train.py:1028] (1/2) Epoch 19, batch 4850, loss[loss=0.1795, simple_loss=0.2287, pruned_loss=0.06518, over 13252.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2378, pruned_loss=0.06924, over 2575287.47 frames. ], batch size: 89, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:32:10,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=342774.6666666667, ans=0.125 2024-06-21 07:32:17,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=342793.0, ans=0.0 2024-06-21 07:32:26,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=342829.6666666667, ans=0.2 2024-06-21 07:32:31,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=342829.6666666667, ans=0.125 2024-06-21 07:32:32,902 INFO [train.py:1028] (1/2) Epoch 19, batch 4900, loss[loss=0.1769, simple_loss=0.2265, pruned_loss=0.06362, over 13188.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.238, pruned_loss=0.06947, over 2576526.15 frames. ], batch size: 59, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:32:33,409 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. 
limit=6.0 2024-06-21 07:32:34,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=342848.0, ans=0.125 2024-06-21 07:32:40,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=342866.3333333333, ans=0.025 2024-06-21 07:32:47,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=342884.6666666667, ans=0.2 2024-06-21 07:32:53,635 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:32:55,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=342903.0, ans=0.125 2024-06-21 07:33:01,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=342903.0, ans=0.0 2024-06-21 07:33:02,254 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.897e+02 2.022e+02 2.196e+02 2.774e+02, threshold=4.044e+02, percent-clipped=0.0 2024-06-21 07:33:06,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2024-06-21 07:33:08,645 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.54 vs. limit=15.0 2024-06-21 07:33:09,485 INFO [train.py:1028] (1/2) Epoch 19, batch 4950, loss[loss=0.1947, simple_loss=0.2305, pruned_loss=0.07948, over 10989.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2381, pruned_loss=0.06982, over 2570214.14 frames. ], batch size: 304, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:33:17,246 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.55 vs. limit=15.0 2024-06-21 07:33:18,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=342958.0, ans=0.125 2024-06-21 07:33:45,535 INFO [train.py:1028] (1/2) Epoch 19, batch 5000, loss[loss=0.1812, simple_loss=0.2237, pruned_loss=0.06935, over 13086.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2375, pruned_loss=0.06921, over 2574214.68 frames. ], batch size: 95, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:33:46,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=343031.3333333333, ans=0.125 2024-06-21 07:33:46,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=343031.3333333333, ans=0.2 2024-06-21 07:33:50,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=343031.3333333333, ans=0.09899494936611666 2024-06-21 07:33:59,420 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.04 vs. 
limit=15.0 2024-06-21 07:33:59,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=343068.0, ans=0.125 2024-06-21 07:34:11,659 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.843e+02 1.957e+02 2.118e+02 2.953e+02, threshold=3.913e+02, percent-clipped=0.0 2024-06-21 07:34:14,840 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2024-06-21 07:34:15,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=343104.6666666667, ans=0.125 2024-06-21 07:34:18,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=343123.0, ans=0.035 2024-06-21 07:34:18,925 INFO [train.py:1028] (1/2) Epoch 19, batch 5050, loss[loss=0.1765, simple_loss=0.2282, pruned_loss=0.06239, over 12957.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2375, pruned_loss=0.06898, over 2572050.67 frames. ], batch size: 36, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:34:24,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2024-06-21 07:34:35,212 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:34:39,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=343178.0, ans=0.0 2024-06-21 07:34:47,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=343196.3333333333, ans=0.125 2024-06-21 07:34:47,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2024-06-21 07:34:52,186 INFO [train.py:1028] (1/2) Epoch 19, batch 5100, loss[loss=0.1972, simple_loss=0.2506, pruned_loss=0.07194, over 13266.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.238, pruned_loss=0.0699, over 2568238.87 frames. 
], batch size: 40, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:35:08,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=343251.3333333333, ans=0.1 2024-06-21 07:35:10,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=343251.3333333333, ans=0.0 2024-06-21 07:35:11,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=343251.3333333333, ans=0.0 2024-06-21 07:35:12,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=343251.3333333333, ans=0.125 2024-06-21 07:35:15,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=343269.6666666667, ans=0.0 2024-06-21 07:35:18,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=343269.6666666667, ans=0.0 2024-06-21 07:35:19,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=343269.6666666667, ans=0.125 2024-06-21 07:35:21,249 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.41 vs. limit=10.0 2024-06-21 07:35:21,386 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 1.928e+02 2.122e+02 2.324e+02 3.199e+02, threshold=4.244e+02, percent-clipped=0.0 2024-06-21 07:35:21,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=343288.0, ans=0.1 2024-06-21 07:35:28,536 INFO [train.py:1028] (1/2) Epoch 19, batch 5150, loss[loss=0.1863, simple_loss=0.2271, pruned_loss=0.07273, over 13087.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2375, pruned_loss=0.07008, over 2570446.06 frames. ], batch size: 132, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:35:28,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=343306.3333333333, ans=0.025 2024-06-21 07:35:35,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=343306.3333333333, ans=0.125 2024-06-21 07:35:47,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=343343.0, ans=0.2 2024-06-21 07:35:53,754 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.29 vs. limit=22.5 2024-06-21 07:35:54,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=343361.3333333333, ans=0.125 2024-06-21 07:35:55,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343361.3333333333, ans=0.1 2024-06-21 07:36:00,705 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=15.0 2024-06-21 07:36:05,376 INFO [train.py:1028] (1/2) Epoch 19, batch 5200, loss[loss=0.1745, simple_loss=0.2245, pruned_loss=0.06227, over 13133.00 frames. 
], tot_loss[loss=0.1885, simple_loss=0.2372, pruned_loss=0.06984, over 2574831.12 frames. ], batch size: 95, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:36:31,527 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.889e+02 2.044e+02 2.199e+02 2.726e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-21 07:36:31,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=343471.3333333333, ans=0.015 2024-06-21 07:36:31,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343471.3333333333, ans=0.1 2024-06-21 07:36:38,672 INFO [train.py:1028] (1/2) Epoch 19, batch 5250, loss[loss=0.1988, simple_loss=0.2458, pruned_loss=0.07589, over 13218.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2371, pruned_loss=0.0698, over 2571590.16 frames. ], batch size: 52, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:37:00,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=343526.3333333333, ans=0.125 2024-06-21 07:37:07,197 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2024-06-21 07:37:09,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=343563.0, ans=0.2 2024-06-21 07:37:10,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=343563.0, ans=0.125 2024-06-21 07:37:10,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.56 vs. limit=12.0 2024-06-21 07:37:14,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=343563.0, ans=0.125 2024-06-21 07:37:15,459 INFO [train.py:1028] (1/2) Epoch 19, batch 5300, loss[loss=0.1917, simple_loss=0.2383, pruned_loss=0.07258, over 13041.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2367, pruned_loss=0.06953, over 2567693.31 frames. ], batch size: 144, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:37:23,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=343599.6666666667, ans=0.0 2024-06-21 07:37:44,974 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 1.924e+02 2.069e+02 2.270e+02 3.512e+02, threshold=4.138e+02, percent-clipped=0.0 2024-06-21 07:37:45,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=343654.6666666667, ans=0.125 2024-06-21 07:37:46,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=343654.6666666667, ans=0.0 2024-06-21 07:37:52,897 INFO [train.py:1028] (1/2) Epoch 19, batch 5350, loss[loss=0.1999, simple_loss=0.2496, pruned_loss=0.0751, over 12000.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2368, pruned_loss=0.06982, over 2574981.95 frames. ], batch size: 17, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:38:03,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.57 vs. 
limit=10.0 2024-06-21 07:38:08,294 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2024-06-21 07:38:10,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2024-06-21 07:38:13,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=343728.0, ans=0.2 2024-06-21 07:38:15,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=343728.0, ans=0.0 2024-06-21 07:38:24,818 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=8.0 2024-06-21 07:38:24,904 INFO [train.py:1028] (1/2) Epoch 19, batch 5400, loss[loss=0.2012, simple_loss=0.2447, pruned_loss=0.07888, over 12199.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.237, pruned_loss=0.06996, over 2567422.14 frames. ], batch size: 240, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:38:27,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=343764.6666666667, ans=0.125 2024-06-21 07:38:54,118 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.913e+02 2.027e+02 2.190e+02 2.653e+02, threshold=4.054e+02, percent-clipped=0.0 2024-06-21 07:38:55,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.84 vs. limit=10.0 2024-06-21 07:38:56,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=343838.0, ans=10.0 2024-06-21 07:39:00,459 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.46 vs. limit=22.5 2024-06-21 07:39:01,368 INFO [train.py:1028] (1/2) Epoch 19, batch 5450, loss[loss=0.1728, simple_loss=0.2169, pruned_loss=0.06432, over 12228.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2371, pruned_loss=0.06979, over 2569740.04 frames. ], batch size: 25, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:39:09,100 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.97 vs. limit=22.5 2024-06-21 07:39:09,108 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.60 vs. 
limit=12.0 2024-06-21 07:39:23,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=343911.3333333333, ans=0.95 2024-06-21 07:39:28,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=343911.3333333333, ans=0.125 2024-06-21 07:39:33,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=343929.6666666667, ans=0.025 2024-06-21 07:39:34,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=343929.6666666667, ans=0.0 2024-06-21 07:39:36,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=343929.6666666667, ans=0.125 2024-06-21 07:39:37,296 INFO [train.py:1028] (1/2) Epoch 19, batch 5500, loss[loss=0.1996, simple_loss=0.2391, pruned_loss=0.08001, over 12204.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2369, pruned_loss=0.06961, over 2564262.86 frames. ], batch size: 240, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:39:43,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=343966.3333333333, ans=0.1 2024-06-21 07:39:44,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.36 vs. limit=22.5 2024-06-21 07:39:49,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2024-06-21 07:40:00,245 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:40:02,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=344003.0, ans=0.125 2024-06-21 07:40:02,702 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.901e+02 2.066e+02 2.390e+02 3.746e+02, threshold=4.132e+02, percent-clipped=0.0 2024-06-21 07:40:10,240 INFO [train.py:1028] (1/2) Epoch 19, batch 5550, loss[loss=0.1634, simple_loss=0.2179, pruned_loss=0.05445, over 13259.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2361, pruned_loss=0.06897, over 2568119.09 frames. ], batch size: 43, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:40:15,820 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.29 vs. limit=22.5 2024-06-21 07:40:25,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=344076.3333333333, ans=0.0 2024-06-21 07:40:34,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=344094.6666666667, ans=0.125 2024-06-21 07:40:37,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344113.0, ans=0.1 2024-06-21 07:40:42,237 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. 
limit=15.0 2024-06-21 07:40:42,503 INFO [train.py:1028] (1/2) Epoch 19, batch 5600, loss[loss=0.187, simple_loss=0.2335, pruned_loss=0.07023, over 13261.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2355, pruned_loss=0.06887, over 2570208.19 frames. ], batch size: 89, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:40:44,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344131.3333333333, ans=0.1 2024-06-21 07:40:56,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=344149.6666666667, ans=0.125 2024-06-21 07:41:12,046 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.903e+02 2.029e+02 2.180e+02 2.583e+02, threshold=4.059e+02, percent-clipped=0.0 2024-06-21 07:41:19,077 INFO [train.py:1028] (1/2) Epoch 19, batch 5650, loss[loss=0.2059, simple_loss=0.2517, pruned_loss=0.07999, over 12642.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2355, pruned_loss=0.06892, over 2575595.17 frames. ], batch size: 202, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:41:28,162 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.05 vs. limit=22.5 2024-06-21 07:41:39,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=344259.6666666667, ans=0.0 2024-06-21 07:41:44,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=344278.0, ans=0.125 2024-06-21 07:41:50,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=344296.3333333333, ans=0.0 2024-06-21 07:41:54,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=344296.3333333333, ans=0.0 2024-06-21 07:41:55,451 INFO [train.py:1028] (1/2) Epoch 19, batch 5700, loss[loss=0.2083, simple_loss=0.2603, pruned_loss=0.07817, over 13280.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.235, pruned_loss=0.06876, over 2579593.11 frames. ], batch size: 63, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:42:00,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=344314.6666666667, ans=0.125 2024-06-21 07:42:01,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2024-06-21 07:42:05,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=344333.0, ans=0.0 2024-06-21 07:42:11,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.08 vs. 
limit=10.0 2024-06-21 07:42:14,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=344369.6666666667, ans=0.125 2024-06-21 07:42:20,458 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.888e+02 2.025e+02 2.203e+02 2.999e+02, threshold=4.051e+02, percent-clipped=0.0 2024-06-21 07:42:27,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2024-06-21 07:42:27,738 INFO [train.py:1028] (1/2) Epoch 19, batch 5750, loss[loss=0.1877, simple_loss=0.2332, pruned_loss=0.07109, over 12769.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2352, pruned_loss=0.06864, over 2580661.33 frames. ], batch size: 176, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:42:28,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=344406.3333333333, ans=0.125 2024-06-21 07:42:31,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=344406.3333333333, ans=0.0 2024-06-21 07:42:48,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=344443.0, ans=0.2 2024-06-21 07:42:51,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344461.3333333333, ans=0.1 2024-06-21 07:43:03,726 INFO [train.py:1028] (1/2) Epoch 19, batch 5800, loss[loss=0.2056, simple_loss=0.2537, pruned_loss=0.07878, over 12755.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2364, pruned_loss=0.06932, over 2578917.17 frames. ], batch size: 176, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:43:03,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=344498.0, ans=0.025 2024-06-21 07:43:13,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=344516.3333333333, ans=0.125 2024-06-21 07:43:17,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344534.6666666667, ans=0.1 2024-06-21 07:43:18,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=344534.6666666667, ans=0.0 2024-06-21 07:43:20,949 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:43:29,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=344553.0, ans=0.125 2024-06-21 07:43:33,358 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 1.958e+02 2.123e+02 2.335e+02 3.232e+02, threshold=4.247e+02, percent-clipped=0.0 2024-06-21 07:43:38,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=27.01 vs. limit=22.5 2024-06-21 07:43:40,554 INFO [train.py:1028] (1/2) Epoch 19, batch 5850, loss[loss=0.2157, simple_loss=0.26, pruned_loss=0.08575, over 12631.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2387, pruned_loss=0.07005, over 2577458.04 frames. 
], batch size: 202, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:43:42,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=344589.6666666667, ans=0.125 2024-06-21 07:43:43,715 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=27.75 vs. limit=22.5 2024-06-21 07:43:54,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=344626.3333333333, ans=0.125 2024-06-21 07:44:04,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=344644.6666666667, ans=0.125 2024-06-21 07:44:18,478 INFO [train.py:1028] (1/2) Epoch 19, batch 5900, loss[loss=0.1999, simple_loss=0.248, pruned_loss=0.07588, over 13117.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2409, pruned_loss=0.07075, over 2576453.40 frames. ], batch size: 121, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:44:19,376 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.147e+01 2024-06-21 07:44:23,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=344681.3333333333, ans=0.125 2024-06-21 07:44:25,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=344699.6666666667, ans=0.0 2024-06-21 07:44:36,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=344718.0, ans=0.125 2024-06-21 07:44:44,389 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.955e+02 2.100e+02 2.342e+02 3.281e+02, threshold=4.200e+02, percent-clipped=0.0 2024-06-21 07:44:55,060 INFO [train.py:1028] (1/2) Epoch 19, batch 5950, loss[loss=0.1903, simple_loss=0.23, pruned_loss=0.07533, over 13079.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2421, pruned_loss=0.07124, over 2582733.25 frames. ], batch size: 121, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:45:01,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=344791.3333333333, ans=0.125 2024-06-21 07:45:10,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=344809.6666666667, ans=0.125 2024-06-21 07:45:12,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.46 vs. limit=15.0 2024-06-21 07:45:19,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=344828.0, ans=0.025 2024-06-21 07:45:27,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=344846.3333333333, ans=0.125 2024-06-21 07:45:28,467 INFO [train.py:1028] (1/2) Epoch 19, batch 6000, loss[loss=0.2522, simple_loss=0.292, pruned_loss=0.1062, over 12183.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2439, pruned_loss=0.07179, over 2576650.69 frames. 
], batch size: 240, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:45:28,467 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 07:45:36,184 INFO [train.py:1060] (1/2) Epoch 19, validation: loss=0.1869, simple_loss=0.2515, pruned_loss=0.06121, over 351949.00 frames. 2024-06-21 07:45:36,185 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 07:45:47,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=344883.0, ans=0.0 2024-06-21 07:45:52,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=344883.0, ans=0.0 2024-06-21 07:46:02,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=344919.6666666667, ans=0.125 2024-06-21 07:46:05,921 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.949e+02 2.070e+02 2.211e+02 3.861e+02, threshold=4.141e+02, percent-clipped=0.0 2024-06-21 07:46:06,307 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2024-06-21 07:46:08,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=344938.0, ans=0.0 2024-06-21 07:46:13,181 INFO [train.py:1028] (1/2) Epoch 19, batch 6050, loss[loss=0.2106, simple_loss=0.2584, pruned_loss=0.08143, over 12913.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2459, pruned_loss=0.07247, over 2579164.35 frames. ], batch size: 39, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:46:13,309 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:46:15,142 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.21 vs. limit=22.5 2024-06-21 07:46:21,569 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:46:23,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=344974.6666666667, ans=0.2 2024-06-21 07:46:24,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=344974.6666666667, ans=0.025 2024-06-21 07:46:33,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.80 vs. limit=10.0 2024-06-21 07:46:44,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=345029.6666666667, ans=0.04949747468305833 2024-06-21 07:46:46,813 INFO [train.py:1028] (1/2) Epoch 19, batch 6100, loss[loss=0.1934, simple_loss=0.2391, pruned_loss=0.07382, over 13117.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2473, pruned_loss=0.07304, over 2581944.31 frames. 
], batch size: 121, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:46:48,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=345048.0, ans=0.125 2024-06-21 07:46:49,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=345048.0, ans=0.125 2024-06-21 07:47:00,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=345066.3333333333, ans=0.2 2024-06-21 07:47:04,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=345084.6666666667, ans=0.025 2024-06-21 07:47:05,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=345084.6666666667, ans=0.025 2024-06-21 07:47:15,924 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 2.007e+02 2.141e+02 2.368e+02 3.897e+02, threshold=4.282e+02, percent-clipped=0.0 2024-06-21 07:47:20,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=345121.3333333333, ans=0.1 2024-06-21 07:47:21,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=345121.3333333333, ans=0.125 2024-06-21 07:47:23,328 INFO [train.py:1028] (1/2) Epoch 19, batch 6150, loss[loss=0.2159, simple_loss=0.252, pruned_loss=0.08985, over 10898.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2491, pruned_loss=0.07354, over 2580461.01 frames. ], batch size: 304, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:47:59,361 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.85 vs. limit=15.0 2024-06-21 07:48:00,269 INFO [train.py:1028] (1/2) Epoch 19, batch 6200, loss[loss=0.217, simple_loss=0.2708, pruned_loss=0.08157, over 13242.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2507, pruned_loss=0.07421, over 2577117.64 frames. ], batch size: 89, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:48:16,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=345268.0, ans=0.0 2024-06-21 07:48:22,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.38 vs. limit=15.0 2024-06-21 07:48:24,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.30 vs. limit=15.0 2024-06-21 07:48:26,228 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.100e+02 2.345e+02 2.601e+02 3.428e+02, threshold=4.690e+02, percent-clipped=0.0 2024-06-21 07:48:32,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.96 vs. limit=15.0 2024-06-21 07:48:33,736 INFO [train.py:1028] (1/2) Epoch 19, batch 6250, loss[loss=0.2199, simple_loss=0.2758, pruned_loss=0.08202, over 13223.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2519, pruned_loss=0.07482, over 2569457.28 frames. 
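The recurring optim.py warning prints the quartiles (min, 25%, median, 75%, max) of recently observed gradient norms. In every instance in this log the threshold equals Clipping_scale times the median, e.g. 4.690e+02 = 2.0 * 2.345e+02 in the entry above, and percent-clipped stays at 0.0, so clipping is armed but never fires in this stretch. A simplified sketch under those assumptions follows; clipping_report is a hypothetical helper, and ScaledAdam's real implementation in icefall's optim.py keeps more state.

# Simplified reconstruction of the "Clipping_scale=2.0, grad-norm
# quartiles ... threshold=... percent-clipped=..." warning, assuming
# threshold = clipping_scale * median of the recent gradient norms.
import torch

def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]                       # 2x the median
    pct = (grad_norms > threshold).float().mean() * 100.0   # percent-clipped
    return q, threshold, pct

Gradients whose norm exceeds the threshold would be scaled down to it before the optimizer step, which is what percent-clipped counts.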
], batch size: 83, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:48:53,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=345359.6666666667, ans=0.125 2024-06-21 07:49:09,170 INFO [train.py:1028] (1/2) Epoch 19, batch 6300, loss[loss=0.2216, simple_loss=0.286, pruned_loss=0.07866, over 11306.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2532, pruned_loss=0.07522, over 2564290.80 frames. ], batch size: 16, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:49:13,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=345414.6666666667, ans=0.2 2024-06-21 07:49:17,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=345433.0, ans=0.125 2024-06-21 07:49:19,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=345433.0, ans=0.125 2024-06-21 07:49:19,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=345433.0, ans=0.09899494936611666 2024-06-21 07:49:23,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=345451.3333333333, ans=0.125 2024-06-21 07:49:32,546 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.12 vs. limit=6.0 2024-06-21 07:49:34,601 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.023e+02 2.163e+02 2.379e+02 3.057e+02, threshold=4.325e+02, percent-clipped=0.0 2024-06-21 07:49:41,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.37 vs. limit=22.5 2024-06-21 07:49:45,810 INFO [train.py:1028] (1/2) Epoch 19, batch 6350, loss[loss=0.2113, simple_loss=0.2624, pruned_loss=0.08008, over 12476.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2541, pruned_loss=0.07523, over 2574005.08 frames. ], batch size: 202, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:50:02,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=345543.0, ans=0.04949747468305833 2024-06-21 07:50:07,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=345561.3333333333, ans=0.0 2024-06-21 07:50:12,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=345579.6666666667, ans=0.0 2024-06-21 07:50:18,193 INFO [train.py:1028] (1/2) Epoch 19, batch 6400, loss[loss=0.1986, simple_loss=0.2541, pruned_loss=0.07152, over 13248.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2563, pruned_loss=0.07622, over 2575467.40 frames. 
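The logged batch size is the number of utterances in the batch, and it swings widely (304 at batch 6150 above, 16 at batch 6300) while the per-batch frame counts stay near 13k: the dynamic bucketing sampler packs each batch up to a duration budget, so a batch holds either a few long utterances or many short ones. Assuming the logged frames are counted after the 4x encoder subsampling of 10 ms features, the duration budget is easy to recover:

# Rough arithmetic under the assumption that logged frames are
# post-subsampling: 10 ms fbank frames, subsampling_factor 4,
# hence 25 encoder frames per second.
frames_per_second = 100 / 4
print(13117 / frames_per_second)  # ~525 s of audio, close to the 550 s max_duration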
], batch size: 67, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:50:27,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=345616.3333333333, ans=0.125 2024-06-21 07:50:35,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=345634.6666666667, ans=0.125 2024-06-21 07:50:35,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=345634.6666666667, ans=0.125 2024-06-21 07:50:35,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=345634.6666666667, ans=0.125 2024-06-21 07:50:37,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=345653.0, ans=0.2 2024-06-21 07:50:39,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=345653.0, ans=0.0 2024-06-21 07:50:43,305 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.065e+02 2.280e+02 2.608e+02 4.185e+02, threshold=4.561e+02, percent-clipped=0.0 2024-06-21 07:50:46,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=345671.3333333333, ans=0.125 2024-06-21 07:50:50,580 INFO [train.py:1028] (1/2) Epoch 19, batch 6450, loss[loss=0.2403, simple_loss=0.286, pruned_loss=0.09732, over 12585.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2577, pruned_loss=0.07679, over 2581191.31 frames. ], batch size: 202, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:51:04,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=345708.0, ans=0.125 2024-06-21 07:51:11,461 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.366e-01 2024-06-21 07:51:13,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=15.0 2024-06-21 07:51:15,622 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.24 vs. limit=22.5 2024-06-21 07:51:18,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=345744.6666666667, ans=0.2 2024-06-21 07:51:19,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=345744.6666666667, ans=0.125 2024-06-21 07:51:28,607 INFO [train.py:1028] (1/2) Epoch 19, batch 6500, loss[loss=0.2129, simple_loss=0.254, pruned_loss=0.08583, over 10759.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2589, pruned_loss=0.07691, over 2584938.00 frames. 
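The Whitening lines compare a measured statistic against a limit, e.g. metric=17.24 vs. limit=22.5 above, and only unusual values get logged. A plausible reading, after the Whiten module in icefall's scaling.py: the metric measures how far the channel covariance of a layer's activations is from a multiple of the identity, reaching 1.0 for perfectly white activations and at most num_channels when a single direction carries all the energy, with a penalty gradient applied only while the metric exceeds the limit. The formula below is an assumption in that spirit, not a verified copy of the implementation.

# Assumed form of the whitening metric: num_channels * sum(lambda^2) /
# sum(lambda)^2 over the eigenvalues of the channel covariance; 1.0 when
# the covariance is proportional to the identity, num_channels when one
# eigenvalue dominates.
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels) activations for one whitening group
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]
    lam = torch.linalg.eigvalsh(cov)
    return x.shape[1] * (lam ** 2).sum() / lam.sum() ** 2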
], batch size: 304, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:51:35,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345799.6666666667, ans=0.1 2024-06-21 07:51:38,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=345799.6666666667, ans=0.125 2024-06-21 07:51:43,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=345818.0, ans=0.125 2024-06-21 07:51:44,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=345818.0, ans=0.125 2024-06-21 07:51:46,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=345818.0, ans=0.05 2024-06-21 07:51:52,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345836.3333333333, ans=0.1 2024-06-21 07:51:55,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=345836.3333333333, ans=0.2 2024-06-21 07:51:57,867 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.101e+02 2.298e+02 2.636e+02 3.938e+02, threshold=4.596e+02, percent-clipped=0.0 2024-06-21 07:52:05,032 INFO [train.py:1028] (1/2) Epoch 19, batch 6550, loss[loss=0.1936, simple_loss=0.2549, pruned_loss=0.06613, over 12959.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2593, pruned_loss=0.07669, over 2589152.04 frames. ], batch size: 23, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:52:12,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=345891.3333333333, ans=0.125 2024-06-21 07:52:25,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=345928.0, ans=0.125 2024-06-21 07:52:32,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=345946.3333333333, ans=0.125 2024-06-21 07:52:35,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=345946.3333333333, ans=0.1 2024-06-21 07:52:36,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=345946.3333333333, ans=15.0 2024-06-21 07:52:36,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=345964.6666666667, ans=0.125 2024-06-21 07:52:37,334 INFO [train.py:1028] (1/2) Epoch 19, batch 6600, loss[loss=0.196, simple_loss=0.2547, pruned_loss=0.06869, over 13244.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2596, pruned_loss=0.07658, over 2591810.56 frames. ], batch size: 72, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:52:37,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=345964.6666666667, ans=0.1 2024-06-21 07:52:42,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.70 vs. 
limit=15.0 2024-06-21 07:52:57,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=346019.6666666667, ans=0.5 2024-06-21 07:53:03,063 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.038e+02 2.225e+02 2.438e+02 3.295e+02, threshold=4.451e+02, percent-clipped=0.0 2024-06-21 07:53:08,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346038.0, ans=0.1 2024-06-21 07:53:08,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=346038.0, ans=0.025 2024-06-21 07:53:13,722 INFO [train.py:1028] (1/2) Epoch 19, batch 6650, loss[loss=0.2301, simple_loss=0.2781, pruned_loss=0.09108, over 12930.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2616, pruned_loss=0.07752, over 2587923.23 frames. ], batch size: 158, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:53:17,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=346056.3333333333, ans=0.125 2024-06-21 07:53:17,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=346056.3333333333, ans=0.125 2024-06-21 07:53:20,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.10 vs. limit=22.5 2024-06-21 07:53:20,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=346074.6666666667, ans=0.125 2024-06-21 07:53:20,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=346074.6666666667, ans=0.125 2024-06-21 07:53:20,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=346074.6666666667, ans=0.125 2024-06-21 07:53:28,245 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:53:30,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.18 vs. limit=15.0 2024-06-21 07:53:32,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=346111.3333333333, ans=0.125 2024-06-21 07:53:36,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=346111.3333333333, ans=0.0 2024-06-21 07:53:46,894 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2024-06-21 07:53:47,090 INFO [train.py:1028] (1/2) Epoch 19, batch 6700, loss[loss=0.2162, simple_loss=0.2723, pruned_loss=0.07998, over 12737.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2629, pruned_loss=0.07814, over 2586140.51 frames. 
], batch size: 176, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:54:07,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346184.6666666667, ans=0.1 2024-06-21 07:54:12,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=346203.0, ans=0.0 2024-06-21 07:54:16,710 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.076e+02 2.216e+02 2.445e+02 3.352e+02, threshold=4.433e+02, percent-clipped=0.0 2024-06-21 07:54:18,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=346221.3333333333, ans=0.125 2024-06-21 07:54:19,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=346221.3333333333, ans=0.1 2024-06-21 07:54:22,520 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=8.057e-02 2024-06-21 07:54:24,466 INFO [train.py:1028] (1/2) Epoch 19, batch 6750, loss[loss=0.273, simple_loss=0.3146, pruned_loss=0.1157, over 12209.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2635, pruned_loss=0.07856, over 2579011.49 frames. ], batch size: 240, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:54:30,613 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=11.53 vs. limit=15.0 2024-06-21 07:54:30,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=346258.0, ans=0.125 2024-06-21 07:54:31,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=346258.0, ans=0.125 2024-06-21 07:54:32,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=346258.0, ans=0.0 2024-06-21 07:54:39,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=346276.3333333333, ans=0.015 2024-06-21 07:54:48,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=346294.6666666667, ans=0.0 2024-06-21 07:54:56,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346331.3333333333, ans=0.1 2024-06-21 07:54:57,064 INFO [train.py:1028] (1/2) Epoch 19, batch 6800, loss[loss=0.2348, simple_loss=0.2857, pruned_loss=0.09197, over 13206.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2645, pruned_loss=0.07862, over 2580553.71 frames. 
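The very frequent ScheduledFloat lines print, for a named scalar inside the model (balancer probabilities, skip rates, dropout rates, whitening limits), the value ans it takes at the current batch_count. These scalars follow a schedule over training rather than staying constant, and the fractional batch_count (346276.3333...) suggests a duration-adjusted batch index rather than a raw one; by this point most schedules have settled (0.125, 0.1, 0.025, 0.0, ...). Below is a minimal piecewise-linear sketch with a made-up two-point schedule; the breakpoints are illustrative, not taken from the recipe.

# Sketch of a scheduled hyperparameter (after ScheduledFloat in icefall's
# scaling.py, details assumed): piecewise-linear in batch_count between
# (batch_count, value) breakpoints, constant outside them.
def scheduled_float(batch_count, points=((0.0, 0.3), (20000.0, 0.1))):
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            return y0 + (batch_count - x0) * (y1 - y0) / (x1 - x0)

print(scheduled_float(346276.3))  # 0.1: far past the last breakpoint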
], batch size: 67, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:55:02,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=346331.3333333333, ans=0.125 2024-06-21 07:55:04,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346349.6666666667, ans=0.1 2024-06-21 07:55:11,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=346349.6666666667, ans=0.125 2024-06-21 07:55:13,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=346368.0, ans=0.09899494936611666 2024-06-21 07:55:25,893 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.017e+02 2.126e+02 2.343e+02 3.420e+02, threshold=4.251e+02, percent-clipped=0.0 2024-06-21 07:55:26,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.54 vs. limit=22.5 2024-06-21 07:55:26,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0 2024-06-21 07:55:29,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.21 vs. limit=15.0 2024-06-21 07:55:32,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=346404.6666666667, ans=0.0 2024-06-21 07:55:33,259 INFO [train.py:1028] (1/2) Epoch 19, batch 6850, loss[loss=0.2154, simple_loss=0.2754, pruned_loss=0.07777, over 13239.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2646, pruned_loss=0.07836, over 2583840.99 frames. ], batch size: 63, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:55:54,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=346478.0, ans=0.125 2024-06-21 07:56:02,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346496.3333333333, ans=0.1 2024-06-21 07:56:03,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=346496.3333333333, ans=0.125 2024-06-21 07:56:08,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=346496.3333333333, ans=0.125 2024-06-21 07:56:09,519 INFO [train.py:1028] (1/2) Epoch 19, batch 6900, loss[loss=0.2073, simple_loss=0.2606, pruned_loss=0.077, over 13290.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2656, pruned_loss=0.07862, over 2585759.13 frames. ], batch size: 49, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:56:34,622 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.092e+02 2.216e+02 2.491e+02 3.460e+02, threshold=4.431e+02, percent-clipped=0.0 2024-06-21 07:56:34,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=346588.0, ans=0.2 2024-06-21 07:56:42,073 INFO [train.py:1028] (1/2) Epoch 19, batch 6950, loss[loss=0.1806, simple_loss=0.2402, pruned_loss=0.06049, over 11154.00 frames. 
], tot_loss[loss=0.2113, simple_loss=0.2656, pruned_loss=0.07853, over 2579991.11 frames. ], batch size: 16, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:56:43,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=346606.3333333333, ans=0.125 2024-06-21 07:56:46,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=346606.3333333333, ans=0.05 2024-06-21 07:56:47,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=346624.6666666667, ans=0.025 2024-06-21 07:56:48,783 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.16 vs. limit=15.0 2024-06-21 07:56:56,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=346643.0, ans=0.125 2024-06-21 07:56:57,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=346643.0, ans=0.2 2024-06-21 07:56:57,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346643.0, ans=0.1 2024-06-21 07:57:17,979 INFO [train.py:1028] (1/2) Epoch 19, batch 7000, loss[loss=0.2091, simple_loss=0.2604, pruned_loss=0.07886, over 12969.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2665, pruned_loss=0.07891, over 2577070.99 frames. ], batch size: 158, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:57:22,876 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2024-06-21 07:57:32,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=346734.6666666667, ans=0.125 2024-06-21 07:57:44,882 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.166e+02 2.379e+02 2.652e+02 3.633e+02, threshold=4.758e+02, percent-clipped=0.0 2024-06-21 07:57:51,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=346789.6666666667, ans=0.125 2024-06-21 07:57:51,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=346789.6666666667, ans=0.2 2024-06-21 07:57:52,021 INFO [train.py:1028] (1/2) Epoch 19, batch 7050, loss[loss=0.2338, simple_loss=0.2815, pruned_loss=0.09304, over 12792.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2678, pruned_loss=0.07949, over 2583585.39 frames. ], batch size: 177, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:57:57,179 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. limit=10.0 2024-06-21 07:57:58,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=346808.0, ans=0.0 2024-06-21 07:58:05,164 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.27 vs. 
limit=22.5 2024-06-21 07:58:06,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=346808.0, ans=0.125 2024-06-21 07:58:13,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.78 vs. limit=22.5 2024-06-21 07:58:14,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.55 vs. limit=15.0 2024-06-21 07:58:18,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=346844.6666666667, ans=0.95 2024-06-21 07:58:27,664 INFO [train.py:1028] (1/2) Epoch 19, batch 7100, loss[loss=0.2257, simple_loss=0.2792, pruned_loss=0.08604, over 13221.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2683, pruned_loss=0.07998, over 2575656.93 frames. ], batch size: 112, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:58:32,946 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.74 vs. limit=15.0 2024-06-21 07:58:34,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=346899.6666666667, ans=0.125 2024-06-21 07:58:45,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.32 vs. limit=12.0 2024-06-21 07:58:52,512 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.099e+02 2.259e+02 2.470e+02 3.444e+02, threshold=4.518e+02, percent-clipped=0.0 2024-06-21 07:58:54,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=346954.6666666667, ans=0.125 2024-06-21 07:58:59,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=346973.0, ans=0.125 2024-06-21 07:58:59,676 INFO [train.py:1028] (1/2) Epoch 19, batch 7150, loss[loss=0.2488, simple_loss=0.2964, pruned_loss=0.1005, over 12501.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2683, pruned_loss=0.07958, over 2573193.83 frames. 
], batch size: 202, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:59:01,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=346973.0, ans=0.125 2024-06-21 07:59:10,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=346991.3333333333, ans=0.0 2024-06-21 07:59:13,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=347009.6666666667, ans=0.0 2024-06-21 07:59:21,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=347009.6666666667, ans=0.09899494936611666 2024-06-21 07:59:24,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=347028.0, ans=10.0 2024-06-21 07:59:25,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=347028.0, ans=10.0 2024-06-21 07:59:28,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=347046.3333333333, ans=0.0 2024-06-21 07:59:29,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=347046.3333333333, ans=0.2 2024-06-21 07:59:31,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=347046.3333333333, ans=0.125 2024-06-21 07:59:34,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2024-06-21 07:59:35,435 INFO [train.py:1028] (1/2) Epoch 19, batch 7200, loss[loss=0.2381, simple_loss=0.2941, pruned_loss=0.09107, over 13161.00 frames. ], tot_loss[loss=0.214, simple_loss=0.269, pruned_loss=0.07956, over 2578799.34 frames. ], batch size: 112, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:59:36,598 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.78 vs. limit=12.0 2024-06-21 07:59:44,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=347083.0, ans=0.1 2024-06-21 08:00:00,978 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.111e+02 2.252e+02 2.466e+02 3.220e+02, threshold=4.503e+02, percent-clipped=0.0 2024-06-21 08:00:01,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=347138.0, ans=0.125 2024-06-21 08:00:02,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=347138.0, ans=0.125 2024-06-21 08:00:08,688 INFO [train.py:1028] (1/2) Epoch 19, batch 7250, loss[loss=0.1988, simple_loss=0.2657, pruned_loss=0.06592, over 12999.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.27, pruned_loss=0.07957, over 2579221.46 frames. 
], batch size: 36, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:00:16,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=347156.3333333333, ans=0.0 2024-06-21 08:00:18,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=347174.6666666667, ans=0.0 2024-06-21 08:00:34,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=347211.3333333333, ans=0.1 2024-06-21 08:00:41,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=347229.6666666667, ans=12.0 2024-06-21 08:00:42,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=347229.6666666667, ans=0.125 2024-06-21 08:00:47,418 INFO [train.py:1028] (1/2) Epoch 19, batch 7300, loss[loss=0.219, simple_loss=0.2768, pruned_loss=0.08058, over 12960.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.271, pruned_loss=0.08023, over 2578515.36 frames. ], batch size: 36, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:00:55,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=347266.3333333333, ans=0.025 2024-06-21 08:01:05,673 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.39 vs. limit=22.5 2024-06-21 08:01:07,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=347303.0, ans=15.0 2024-06-21 08:01:10,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=347303.0, ans=0.1 2024-06-21 08:01:13,692 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.070e+02 2.210e+02 2.407e+02 3.874e+02, threshold=4.419e+02, percent-clipped=0.0 2024-06-21 08:01:21,226 INFO [train.py:1028] (1/2) Epoch 19, batch 7350, loss[loss=0.2274, simple_loss=0.2894, pruned_loss=0.08268, over 13292.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.272, pruned_loss=0.08056, over 2580076.10 frames. ], batch size: 46, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:01:28,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=347339.6666666667, ans=0.125 2024-06-21 08:01:37,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.28 vs. limit=15.0 2024-06-21 08:02:00,043 INFO [train.py:1028] (1/2) Epoch 19, batch 7400, loss[loss=0.2277, simple_loss=0.2945, pruned_loss=0.08041, over 13254.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2722, pruned_loss=0.0805, over 2586085.14 frames. 
], batch size: 63, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:02:00,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=347431.3333333333, ans=0.125 2024-06-21 08:02:01,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=347431.3333333333, ans=0.125 2024-06-21 08:02:12,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=347449.6666666667, ans=0.0 2024-06-21 08:02:15,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=347468.0, ans=0.025 2024-06-21 08:02:30,115 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.138e+02 2.343e+02 2.549e+02 3.300e+02, threshold=4.685e+02, percent-clipped=0.0 2024-06-21 08:02:34,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=347504.6666666667, ans=0.1 2024-06-21 08:02:37,791 INFO [train.py:1028] (1/2) Epoch 19, batch 7450, loss[loss=0.208, simple_loss=0.2688, pruned_loss=0.07362, over 12578.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2722, pruned_loss=0.08069, over 2579636.87 frames. ], batch size: 29, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:02:39,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=347523.0, ans=0.125 2024-06-21 08:02:42,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=347523.0, ans=0.0 2024-06-21 08:02:42,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=347523.0, ans=0.125 2024-06-21 08:02:42,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=347523.0, ans=0.125 2024-06-21 08:02:56,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2024-06-21 08:03:01,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=347578.0, ans=0.125 2024-06-21 08:03:04,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=347578.0, ans=0.0 2024-06-21 08:03:04,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=347596.3333333333, ans=0.0 2024-06-21 08:03:05,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=347596.3333333333, ans=0.015 2024-06-21 08:03:06,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=347596.3333333333, ans=0.125 2024-06-21 08:03:12,312 INFO [train.py:1028] (1/2) Epoch 19, batch 7500, loss[loss=0.2115, simple_loss=0.2567, pruned_loss=0.08319, over 10686.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2733, pruned_loss=0.08125, over 2577037.89 frames. ], batch size: 303, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:03:20,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.31 vs. 
limit=22.5 2024-06-21 08:03:38,805 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.65 vs. limit=12.0 2024-06-21 08:03:40,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=347669.6666666667, ans=0.125 2024-06-21 08:03:43,282 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.130e+02 2.256e+02 2.505e+02 3.086e+02, threshold=4.512e+02, percent-clipped=0.0 2024-06-21 08:03:50,331 INFO [train.py:1028] (1/2) Epoch 19, batch 7550, loss[loss=0.2313, simple_loss=0.2852, pruned_loss=0.08876, over 12948.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2737, pruned_loss=0.0817, over 2576617.06 frames. ], batch size: 158, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:04:05,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=347743.0, ans=0.125 2024-06-21 08:04:11,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=347761.3333333333, ans=0.125 2024-06-21 08:04:20,556 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:04:21,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.34 vs. limit=15.0 2024-06-21 08:04:27,789 INFO [train.py:1028] (1/2) Epoch 19, batch 7600, loss[loss=0.2166, simple_loss=0.2729, pruned_loss=0.08017, over 13225.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2744, pruned_loss=0.08211, over 2574473.01 frames. ], batch size: 83, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:04:29,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=347798.0, ans=0.1 2024-06-21 08:04:46,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.84 vs. limit=15.0 2024-06-21 08:04:47,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=347853.0, ans=0.125 2024-06-21 08:04:49,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. limit=6.0 2024-06-21 08:04:50,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=347853.0, ans=0.125 2024-06-21 08:04:53,253 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.224e+02 2.502e+02 2.898e+02 4.118e+02, threshold=5.003e+02, percent-clipped=0.0 2024-06-21 08:04:57,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.79 vs. limit=10.0 2024-06-21 08:05:01,003 INFO [train.py:1028] (1/2) Epoch 19, batch 7650, loss[loss=0.2112, simple_loss=0.2694, pruned_loss=0.07647, over 12844.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2749, pruned_loss=0.08231, over 2571220.08 frames. 
], batch size: 33, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:05:01,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=347889.6666666667, ans=0.125 2024-06-21 08:05:01,954 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:05:11,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=347908.0, ans=0.1 2024-06-21 08:05:13,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=347908.0, ans=0.95 2024-06-21 08:05:30,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=347963.0, ans=0.125 2024-06-21 08:05:33,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=347963.0, ans=0.025 2024-06-21 08:05:37,787 INFO [train.py:1028] (1/2) Epoch 19, batch 7700, loss[loss=0.2253, simple_loss=0.287, pruned_loss=0.08182, over 13286.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2745, pruned_loss=0.08201, over 2568902.04 frames. ], batch size: 63, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:05:39,657 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.02 vs. limit=15.0 2024-06-21 08:05:53,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=348018.0, ans=0.125 2024-06-21 08:06:03,151 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.111e+02 2.206e+02 2.416e+02 3.191e+02, threshold=4.411e+02, percent-clipped=0.0 2024-06-21 08:06:10,341 INFO [train.py:1028] (1/2) Epoch 19, batch 7750, loss[loss=0.1991, simple_loss=0.261, pruned_loss=0.06854, over 13245.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2756, pruned_loss=0.08274, over 2572876.50 frames. ], batch size: 72, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:06:31,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=348109.6666666667, ans=0.0 2024-06-21 08:06:32,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=348109.6666666667, ans=0.025 2024-06-21 08:06:36,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=348128.0, ans=0.125 2024-06-21 08:06:43,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=348146.3333333333, ans=0.125 2024-06-21 08:06:46,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=348164.6666666667, ans=0.125 2024-06-21 08:06:46,703 INFO [train.py:1028] (1/2) Epoch 19, batch 7800, loss[loss=0.2339, simple_loss=0.2871, pruned_loss=0.09031, over 13123.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.276, pruned_loss=0.08274, over 2578212.06 frames. ], batch size: 95, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:07:04,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. 
limit=6.0 2024-06-21 08:07:08,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=348219.6666666667, ans=0.0 2024-06-21 08:07:08,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=348219.6666666667, ans=0.02 2024-06-21 08:07:09,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.74 vs. limit=15.0 2024-06-21 08:07:16,901 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.127e+02 2.318e+02 2.553e+02 3.908e+02, threshold=4.636e+02, percent-clipped=0.0 2024-06-21 08:07:19,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=348238.0, ans=0.125 2024-06-21 08:07:20,529 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.07 vs. limit=15.0 2024-06-21 08:07:23,381 INFO [train.py:1028] (1/2) Epoch 19, batch 7850, loss[loss=0.1874, simple_loss=0.2442, pruned_loss=0.06528, over 11837.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2764, pruned_loss=0.08304, over 2573091.36 frames. ], batch size: 17, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:07:37,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.40 vs. limit=22.5 2024-06-21 08:07:48,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348311.3333333333, ans=0.1 2024-06-21 08:07:52,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348329.6666666667, ans=0.1 2024-06-21 08:07:56,383 INFO [train.py:1028] (1/2) Epoch 19, batch 7900, loss[loss=0.2127, simple_loss=0.2694, pruned_loss=0.07805, over 13138.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2765, pruned_loss=0.08309, over 2573033.16 frames. ], batch size: 77, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:08:06,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=348366.3333333333, ans=0.5 2024-06-21 08:08:06,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=348366.3333333333, ans=0.125 2024-06-21 08:08:24,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=348403.0, ans=0.125 2024-06-21 08:08:26,016 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.169e+02 2.314e+02 2.473e+02 3.845e+02, threshold=4.627e+02, percent-clipped=0.0 2024-06-21 08:08:26,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348421.3333333333, ans=0.1 2024-06-21 08:08:29,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. 
limit=15.0 2024-06-21 08:08:31,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348439.6666666667, ans=0.1 2024-06-21 08:08:32,391 INFO [train.py:1028] (1/2) Epoch 19, batch 7950, loss[loss=0.2204, simple_loss=0.2714, pruned_loss=0.08472, over 10482.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2764, pruned_loss=0.08286, over 2575839.12 frames. ], batch size: 303, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:08:38,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=348458.0, ans=0.2 2024-06-21 08:08:55,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=348494.6666666667, ans=0.125 2024-06-21 08:08:56,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=348494.6666666667, ans=0.0 2024-06-21 08:09:05,018 INFO [train.py:1028] (1/2) Epoch 19, batch 8000, loss[loss=0.2022, simple_loss=0.2574, pruned_loss=0.07356, over 12764.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2779, pruned_loss=0.08337, over 2573245.85 frames. ], batch size: 29, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:09:06,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=348531.3333333333, ans=0.0 2024-06-21 08:09:09,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=348531.3333333333, ans=0.2 2024-06-21 08:09:15,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=348549.6666666667, ans=0.0 2024-06-21 08:09:16,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=348549.6666666667, ans=0.2 2024-06-21 08:09:16,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.60 vs. 
limit=15.0 2024-06-21 08:09:20,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=348549.6666666667, ans=15.0 2024-06-21 08:09:20,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=348568.0, ans=0.125 2024-06-21 08:09:23,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348568.0, ans=0.1 2024-06-21 08:09:24,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=348568.0, ans=0.0 2024-06-21 08:09:27,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348586.3333333333, ans=0.1 2024-06-21 08:09:34,268 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.155e+02 2.286e+02 2.574e+02 3.438e+02, threshold=4.571e+02, percent-clipped=0.0 2024-06-21 08:09:35,033 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:09:39,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=348604.6666666667, ans=0.2 2024-06-21 08:09:40,723 INFO [train.py:1028] (1/2) Epoch 19, batch 8050, loss[loss=0.2077, simple_loss=0.2669, pruned_loss=0.07421, over 13249.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2773, pruned_loss=0.08304, over 2572734.77 frames. ], batch size: 83, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:09:46,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=348623.0, ans=15.0 2024-06-21 08:09:47,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348641.3333333333, ans=0.1 2024-06-21 08:09:56,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.90 vs. limit=15.0 2024-06-21 08:10:03,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=348678.0, ans=0.125 2024-06-21 08:10:04,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2024-06-21 08:10:15,858 INFO [train.py:1028] (1/2) Epoch 19, batch 8100, loss[loss=0.2273, simple_loss=0.2795, pruned_loss=0.08756, over 13148.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2782, pruned_loss=0.08341, over 2576890.27 frames. ], batch size: 112, lr: 3.04e-03, grad_scale: 32.0 2024-06-21 08:10:21,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=348714.6666666667, ans=0.125 2024-06-21 08:10:42,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348788.0, ans=0.1 2024-06-21 08:10:43,082 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.117e+02 2.249e+02 2.407e+02 3.517e+02, threshold=4.499e+02, percent-clipped=0.0 2024-06-21 08:10:49,413 INFO [train.py:1028] (1/2) Epoch 19, batch 8150, loss[loss=0.2181, simple_loss=0.2691, pruned_loss=0.08355, over 13086.00 frames. 
], tot_loss[loss=0.2223, simple_loss=0.2786, pruned_loss=0.08305, over 2580107.49 frames. ], batch size: 121, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:10:50,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=15.0 2024-06-21 08:10:55,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=348824.6666666667, ans=0.0 2024-06-21 08:11:02,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=348843.0, ans=0.125 2024-06-21 08:11:05,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=348843.0, ans=0.0 2024-06-21 08:11:21,250 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=12.0 2024-06-21 08:11:22,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=348879.6666666667, ans=0.0 2024-06-21 08:11:22,870 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0 2024-06-21 08:11:23,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=348879.6666666667, ans=0.125 2024-06-21 08:11:25,093 INFO [train.py:1028] (1/2) Epoch 19, batch 8200, loss[loss=0.2337, simple_loss=0.2904, pruned_loss=0.08854, over 13191.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2787, pruned_loss=0.0832, over 2583297.13 frames. ], batch size: 112, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:11:40,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=348934.6666666667, ans=0.125 2024-06-21 08:11:41,188 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.93 vs. limit=22.5 2024-06-21 08:11:52,724 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.175e+02 2.320e+02 2.656e+02 3.379e+02, threshold=4.640e+02, percent-clipped=0.0 2024-06-21 08:11:58,647 INFO [train.py:1028] (1/2) Epoch 19, batch 8250, loss[loss=0.2207, simple_loss=0.2858, pruned_loss=0.07782, over 13244.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2794, pruned_loss=0.08372, over 2582887.92 frames. ], batch size: 52, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:12:10,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.55 vs. 
limit=22.5 2024-06-21 08:12:11,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=349026.3333333333, ans=0.2 2024-06-21 08:12:29,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=349063.0, ans=0.125 2024-06-21 08:12:30,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=349063.0, ans=0.0 2024-06-21 08:12:34,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349063.0, ans=0.1 2024-06-21 08:12:35,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349063.0, ans=0.1 2024-06-21 08:12:36,217 INFO [train.py:1028] (1/2) Epoch 19, batch 8300, loss[loss=0.2201, simple_loss=0.2749, pruned_loss=0.08266, over 13048.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.279, pruned_loss=0.0836, over 2580375.93 frames. ], batch size: 102, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:12:42,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=349081.3333333333, ans=0.125 2024-06-21 08:12:43,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349099.6666666667, ans=0.1 2024-06-21 08:12:44,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349099.6666666667, ans=0.1 2024-06-21 08:12:45,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=349099.6666666667, ans=0.07 2024-06-21 08:12:50,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.50 vs. limit=22.5 2024-06-21 08:12:52,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=349118.0, ans=0.09899494936611666 2024-06-21 08:12:59,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=349136.3333333333, ans=0.0 2024-06-21 08:13:00,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.08 vs. limit=22.5 2024-06-21 08:13:03,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.149e+02 2.314e+02 2.548e+02 3.509e+02, threshold=4.628e+02, percent-clipped=0.0 2024-06-21 08:13:09,573 INFO [train.py:1028] (1/2) Epoch 19, batch 8350, loss[loss=0.2229, simple_loss=0.2744, pruned_loss=0.08571, over 13177.00 frames. ], tot_loss[loss=0.223, simple_loss=0.279, pruned_loss=0.08344, over 2580282.74 frames. 
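The trailing grad_scale field tracks the fp16 dynamic loss scale. It sat at 64.0 through most of this stretch, doubled to 128.0 around batch 7300, then halved back to 64.0 near batch 7750 and again to 32.0 by batch 8100. Movement in powers of two like this is characteristic of AMP-style dynamic scaling, sketched below; LossScaler, the growth interval, and the exact policy are assumptions, not read from the training code.

# Generic dynamic loss scaling of the kind that produces the grad_scale
# trace above: grow the scale after a long run of finite-gradient steps,
# halve it (and skip the update) whenever an overflow is detected.
class LossScaler:
    def __init__(self, scale: float = 64.0, growth_interval: int = 2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, found_inf: bool) -> None:
        if found_inf:
            self.scale *= 0.5  # back off and skip this optimizer step
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                self.scale *= 2.0
                self.good_steps = 0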
], batch size: 112, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:13:18,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=349173.0, ans=0.07 2024-06-21 08:13:20,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=349191.3333333333, ans=0.125 2024-06-21 08:13:23,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=349191.3333333333, ans=0.2 2024-06-21 08:13:42,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=349246.3333333333, ans=0.125 2024-06-21 08:13:46,279 INFO [train.py:1028] (1/2) Epoch 19, batch 8400, loss[loss=0.2041, simple_loss=0.2604, pruned_loss=0.07388, over 12860.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2789, pruned_loss=0.08341, over 2575699.98 frames. ], batch size: 39, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:13:52,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=349283.0, ans=0.0 2024-06-21 08:13:53,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=349283.0, ans=0.95 2024-06-21 08:14:02,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=349301.3333333333, ans=0.125 2024-06-21 08:14:12,771 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.113e+02 2.250e+02 2.433e+02 3.273e+02, threshold=4.500e+02, percent-clipped=0.0 2024-06-21 08:14:15,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=349338.0, ans=0.0 2024-06-21 08:14:21,781 INFO [train.py:1028] (1/2) Epoch 19, batch 8450, loss[loss=0.2297, simple_loss=0.2848, pruned_loss=0.08733, over 13163.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2797, pruned_loss=0.08363, over 2576956.02 frames. ], batch size: 112, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:14:23,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349356.3333333333, ans=0.1 2024-06-21 08:14:31,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=349374.6666666667, ans=0.125 2024-06-21 08:14:33,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=349374.6666666667, ans=0.125 2024-06-21 08:14:34,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=349393.0, ans=0.125 2024-06-21 08:14:46,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349411.3333333333, ans=0.1 2024-06-21 08:14:53,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=349429.6666666667, ans=0.1 2024-06-21 08:14:54,926 INFO [train.py:1028] (1/2) Epoch 19, batch 8500, loss[loss=0.197, simple_loss=0.2621, pruned_loss=0.06591, over 12569.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2809, pruned_loss=0.08412, over 2575634.05 frames. 
], batch size: 29, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:14:56,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=349448.0, ans=0.1 2024-06-21 08:14:58,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=349448.0, ans=0.0 2024-06-21 08:15:04,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=349466.3333333333, ans=0.125 2024-06-21 08:15:09,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.79 vs. limit=10.0 2024-06-21 08:15:20,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=349503.0, ans=0.125 2024-06-21 08:15:25,494 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.133e+02 2.237e+02 2.425e+02 3.564e+02, threshold=4.474e+02, percent-clipped=0.0 2024-06-21 08:15:31,370 INFO [train.py:1028] (1/2) Epoch 19, batch 8550, loss[loss=0.225, simple_loss=0.2791, pruned_loss=0.08547, over 12593.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2804, pruned_loss=0.08358, over 2575713.35 frames. ], batch size: 22, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:15:39,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349558.0, ans=0.1 2024-06-21 08:15:44,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.37 vs. limit=15.0 2024-06-21 08:15:52,079 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2024-06-21 08:16:04,997 INFO [train.py:1028] (1/2) Epoch 19, batch 8600, loss[loss=0.2299, simple_loss=0.2777, pruned_loss=0.09101, over 13073.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2816, pruned_loss=0.08407, over 2573818.52 frames. ], batch size: 121, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:16:05,418 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.33 vs. limit=15.0 2024-06-21 08:16:17,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=349649.6666666667, ans=0.125 2024-06-21 08:16:22,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=349668.0, ans=15.0 2024-06-21 08:16:36,083 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.162e+02 2.401e+02 2.690e+02 3.858e+02, threshold=4.802e+02, percent-clipped=0.0 2024-06-21 08:16:42,327 INFO [train.py:1028] (1/2) Epoch 19, batch 8650, loss[loss=0.2362, simple_loss=0.2869, pruned_loss=0.09278, over 12984.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2816, pruned_loss=0.08378, over 2575951.83 frames. ], batch size: 102, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:16:47,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.24 vs. 
limit=10.0 2024-06-21 08:16:54,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=349741.3333333333, ans=0.05 2024-06-21 08:16:56,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=349759.6666666667, ans=0.0 2024-06-21 08:17:00,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=349759.6666666667, ans=0.125 2024-06-21 08:17:05,398 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2024-06-21 08:17:11,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=349796.3333333333, ans=0.0 2024-06-21 08:17:18,998 INFO [train.py:1028] (1/2) Epoch 19, batch 8700, loss[loss=0.2127, simple_loss=0.2794, pruned_loss=0.07298, over 13159.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.282, pruned_loss=0.08415, over 2573020.20 frames. ], batch size: 59, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:17:19,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=349814.6666666667, ans=0.125 2024-06-21 08:17:23,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=349814.6666666667, ans=0.0 2024-06-21 08:17:35,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.16 vs. limit=22.5 2024-06-21 08:17:36,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=349851.3333333333, ans=0.0 2024-06-21 08:17:36,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-21 08:17:46,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349888.0, ans=0.1 2024-06-21 08:17:46,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=349888.0, ans=0.0 2024-06-21 08:17:46,699 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.152e+02 2.260e+02 2.429e+02 3.156e+02, threshold=4.520e+02, percent-clipped=0.0 2024-06-21 08:17:49,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=349888.0, ans=0.125 2024-06-21 08:17:49,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=349888.0, ans=0.125 2024-06-21 08:17:52,636 INFO [train.py:1028] (1/2) Epoch 19, batch 8750, loss[loss=0.2484, simple_loss=0.298, pruned_loss=0.0994, over 13126.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2821, pruned_loss=0.08433, over 2568912.68 frames. 
], batch size: 121, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:17:55,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=349906.3333333333, ans=0.125 2024-06-21 08:17:59,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=349924.6666666667, ans=0.125 2024-06-21 08:17:59,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.17 vs. limit=10.0 2024-06-21 08:18:04,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=349924.6666666667, ans=0.125 2024-06-21 08:18:24,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. limit=5.0 2024-06-21 08:18:26,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=349979.6666666667, ans=0.025 2024-06-21 08:18:27,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=12.0 2024-06-21 08:18:29,297 INFO [train.py:1028] (1/2) Epoch 19, batch 8800, loss[loss=0.2287, simple_loss=0.2845, pruned_loss=0.08641, over 13063.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.283, pruned_loss=0.08473, over 2573834.58 frames. ], batch size: 71, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:18:38,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=350016.3333333333, ans=0.1 2024-06-21 08:18:43,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2024-06-21 08:18:51,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=350053.0, ans=0.05 2024-06-21 08:18:52,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=350053.0, ans=0.125 2024-06-21 08:18:57,114 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.783e+02 2.127e+02 2.296e+02 2.456e+02 2.976e+02, threshold=4.593e+02, percent-clipped=0.0 2024-06-21 08:18:58,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=350071.3333333333, ans=0.2 2024-06-21 08:19:03,283 INFO [train.py:1028] (1/2) Epoch 19, batch 8850, loss[loss=0.2315, simple_loss=0.2863, pruned_loss=0.08839, over 12493.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2833, pruned_loss=0.08517, over 2562283.36 frames. ], batch size: 202, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:19:05,974 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.93 vs. limit=22.5 2024-06-21 08:19:06,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=350089.6666666667, ans=0.025 2024-06-21 08:19:17,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. 
limit=15.0 2024-06-21 08:19:20,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=350126.3333333333, ans=0.2 2024-06-21 08:19:27,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=350144.6666666667, ans=0.125 2024-06-21 08:19:30,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=350144.6666666667, ans=0.0 2024-06-21 08:19:33,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=350163.0, ans=0.125 2024-06-21 08:19:36,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=350163.0, ans=0.05 2024-06-21 08:19:36,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=350163.0, ans=0.2 2024-06-21 08:19:39,893 INFO [train.py:1028] (1/2) Epoch 19, batch 8900, loss[loss=0.2246, simple_loss=0.2945, pruned_loss=0.07731, over 12951.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2838, pruned_loss=0.08541, over 2561304.73 frames. ], batch size: 33, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:19:42,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=350181.3333333333, ans=0.0 2024-06-21 08:19:44,161 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.59 vs. limit=15.0 2024-06-21 08:19:45,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=350199.6666666667, ans=0.125 2024-06-21 08:20:06,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=350254.6666666667, ans=0.125 2024-06-21 08:20:06,606 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.163e+02 2.301e+02 2.546e+02 3.527e+02, threshold=4.602e+02, percent-clipped=0.0 2024-06-21 08:20:06,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=350254.6666666667, ans=0.025 2024-06-21 08:20:09,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=350254.6666666667, ans=0.125 2024-06-21 08:20:17,181 INFO [train.py:1028] (1/2) Epoch 19, batch 8950, loss[loss=0.2526, simple_loss=0.3022, pruned_loss=0.1015, over 12570.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2837, pruned_loss=0.08522, over 2560668.83 frames. ], batch size: 202, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:20:17,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=350273.0, ans=0.125 2024-06-21 08:20:27,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.86 vs. 
limit=22.5 2024-06-21 08:20:37,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=350328.0, ans=0.1 2024-06-21 08:20:40,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=350328.0, ans=0.0 2024-06-21 08:20:49,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=350346.3333333333, ans=0.125 2024-06-21 08:20:50,492 INFO [train.py:1028] (1/2) Epoch 19, batch 9000, loss[loss=0.2094, simple_loss=0.2716, pruned_loss=0.07357, over 13256.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2835, pruned_loss=0.08489, over 2566726.59 frames. ], batch size: 46, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:20:50,493 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 08:20:58,230 INFO [train.py:1060] (1/2) Epoch 19, validation: loss=0.1869, simple_loss=0.2513, pruned_loss=0.06122, over 351949.00 frames. 2024-06-21 08:20:58,231 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 08:21:10,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=350401.3333333333, ans=0.125 2024-06-21 08:21:14,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350401.3333333333, ans=0.1 2024-06-21 08:21:16,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=350401.3333333333, ans=0.125 2024-06-21 08:21:17,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=350419.6666666667, ans=0.125 2024-06-21 08:21:19,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=350419.6666666667, ans=0.0 2024-06-21 08:21:24,434 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.167e+02 2.393e+02 2.729e+02 4.002e+02, threshold=4.787e+02, percent-clipped=0.0 2024-06-21 08:21:29,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=350438.0, ans=0.2 2024-06-21 08:21:32,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=350438.0, ans=0.125 2024-06-21 08:21:33,480 INFO [train.py:1028] (1/2) Epoch 19, batch 9050, loss[loss=0.1953, simple_loss=0.2489, pruned_loss=0.07088, over 12079.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2841, pruned_loss=0.08519, over 2566939.13 frames. ], batch size: 18, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:21:40,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=350474.6666666667, ans=10.0 2024-06-21 08:21:41,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.27 vs. 
limit=15.0 2024-06-21 08:21:47,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=350493.0, ans=0.125 2024-06-21 08:21:57,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=350511.3333333333, ans=0.0 2024-06-21 08:21:58,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=350529.6666666667, ans=0.125 2024-06-21 08:22:00,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350529.6666666667, ans=0.1 2024-06-21 08:22:01,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350529.6666666667, ans=0.1 2024-06-21 08:22:05,644 INFO [train.py:1028] (1/2) Epoch 19, batch 9100, loss[loss=0.2365, simple_loss=0.2908, pruned_loss=0.09112, over 13211.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2841, pruned_loss=0.08492, over 2567243.22 frames. ], batch size: 72, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:22:07,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=350548.0, ans=0.1 2024-06-21 08:22:12,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=350566.3333333333, ans=0.125 2024-06-21 08:22:14,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=350566.3333333333, ans=0.09899494936611666 2024-06-21 08:22:24,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=350603.0, ans=0.0 2024-06-21 08:22:29,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=350603.0, ans=0.125 2024-06-21 08:22:31,633 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.108e+02 2.251e+02 2.454e+02 3.174e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-21 08:22:37,216 INFO [train.py:1028] (1/2) Epoch 19, batch 9150, loss[loss=0.2364, simple_loss=0.2914, pruned_loss=0.09072, over 13199.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2844, pruned_loss=0.08541, over 2568019.41 frames. ], batch size: 77, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:22:55,989 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.90 vs. limit=22.5 2024-06-21 08:22:58,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=350694.6666666667, ans=0.125 2024-06-21 08:23:05,541 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.64 vs. limit=22.5 2024-06-21 08:23:08,841 INFO [train.py:1028] (1/2) Epoch 19, batch 9200, loss[loss=0.2145, simple_loss=0.277, pruned_loss=0.076, over 12955.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2836, pruned_loss=0.0846, over 2571726.21 frames. 
], batch size: 36, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:23:10,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=350731.3333333333, ans=0.125 2024-06-21 08:23:11,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=350731.3333333333, ans=0.0 2024-06-21 08:23:12,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=350731.3333333333, ans=0.2 2024-06-21 08:23:20,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=350768.0, ans=0.0 2024-06-21 08:23:26,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=350786.3333333333, ans=0.0 2024-06-21 08:23:34,324 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.106e+02 2.278e+02 2.450e+02 3.188e+02, threshold=4.556e+02, percent-clipped=0.0 2024-06-21 08:23:34,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=350804.6666666667, ans=0.2 2024-06-21 08:23:43,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=350804.6666666667, ans=0.125 2024-06-21 08:23:44,376 INFO [train.py:1028] (1/2) Epoch 19, batch 9250, loss[loss=0.2057, simple_loss=0.2688, pruned_loss=0.07134, over 13242.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2834, pruned_loss=0.08444, over 2573377.49 frames. ], batch size: 67, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:23:50,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=350841.3333333333, ans=0.07 2024-06-21 08:23:51,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=350841.3333333333, ans=15.0 2024-06-21 08:24:01,560 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.47 vs. limit=15.0 2024-06-21 08:24:11,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=350896.3333333333, ans=0.0 2024-06-21 08:24:13,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.86 vs. limit=10.0 2024-06-21 08:24:15,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-21 08:24:16,054 INFO [train.py:1028] (1/2) Epoch 19, batch 9300, loss[loss=0.2338, simple_loss=0.294, pruned_loss=0.08685, over 13014.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2838, pruned_loss=0.0846, over 2568860.80 frames. ], batch size: 39, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:24:18,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350914.6666666667, ans=0.1 2024-06-21 08:24:18,399 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.52 vs. 
limit=10.0 2024-06-21 08:24:37,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=350969.6666666667, ans=0.04949747468305833 2024-06-21 08:24:41,345 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.120e+02 2.240e+02 2.415e+02 3.625e+02, threshold=4.481e+02, percent-clipped=0.0 2024-06-21 08:24:45,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=350988.0, ans=10.0 2024-06-21 08:24:47,024 INFO [train.py:1028] (1/2) Epoch 19, batch 9350, loss[loss=0.2218, simple_loss=0.2851, pruned_loss=0.07923, over 12585.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2842, pruned_loss=0.08489, over 2566739.67 frames. ], batch size: 22, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:24:59,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2024-06-21 08:25:15,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.08 vs. limit=15.0 2024-06-21 08:25:20,300 INFO [train.py:1028] (1/2) Epoch 19, batch 9400, loss[loss=0.2257, simple_loss=0.2816, pruned_loss=0.08486, over 13274.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2839, pruned_loss=0.08471, over 2565747.77 frames. ], batch size: 52, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:25:20,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=351098.0, ans=0.0 2024-06-21 08:25:20,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=351098.0, ans=0.125 2024-06-21 08:25:22,238 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:25:22,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=351098.0, ans=0.5 2024-06-21 08:25:25,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=351098.0, ans=0.0 2024-06-21 08:25:37,855 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2024-06-21 08:25:39,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.91 vs. limit=15.0 2024-06-21 08:25:44,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=351171.3333333333, ans=0.2 2024-06-21 08:25:45,450 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.231e+02 2.406e+02 2.615e+02 3.702e+02, threshold=4.813e+02, percent-clipped=0.0 2024-06-21 08:25:50,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.64 vs. limit=10.0 2024-06-21 08:25:50,941 INFO [train.py:1028] (1/2) Epoch 19, batch 9450, loss[loss=0.2509, simple_loss=0.308, pruned_loss=0.09689, over 12732.00 frames. ], tot_loss[loss=0.228, simple_loss=0.285, pruned_loss=0.08553, over 2566022.89 frames. 
], batch size: 22, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:25:53,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=351189.6666666667, ans=0.125 2024-06-21 08:25:55,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=351189.6666666667, ans=0.125 2024-06-21 08:25:56,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351189.6666666667, ans=0.1 2024-06-21 08:25:58,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351208.0, ans=0.1 2024-06-21 08:26:00,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=351208.0, ans=0.125 2024-06-21 08:26:18,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=351263.0, ans=0.125 2024-06-21 08:26:18,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=351263.0, ans=0.05 2024-06-21 08:26:21,210 INFO [train.py:1028] (1/2) Epoch 19, batch 9500, loss[loss=0.2377, simple_loss=0.2944, pruned_loss=0.0905, over 13222.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2844, pruned_loss=0.08477, over 2574211.04 frames. ], batch size: 43, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:26:27,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=22.5 2024-06-21 08:26:27,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.68 vs. limit=22.5 2024-06-21 08:26:31,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=351299.6666666667, ans=0.0 2024-06-21 08:26:33,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=351318.0, ans=0.125 2024-06-21 08:26:34,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=351318.0, ans=0.125 2024-06-21 08:26:38,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=351318.0, ans=0.0 2024-06-21 08:26:48,175 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.133e+02 2.289e+02 2.499e+02 3.226e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 08:26:51,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=351354.6666666667, ans=0.125 2024-06-21 08:26:52,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=351354.6666666667, ans=0.125 2024-06-21 08:26:53,661 INFO [train.py:1028] (1/2) Epoch 19, batch 9550, loss[loss=0.201, simple_loss=0.265, pruned_loss=0.06848, over 13208.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2845, pruned_loss=0.08483, over 2569321.17 frames. 
], batch size: 40, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:26:55,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=351373.0, ans=0.125 2024-06-21 08:26:57,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=351373.0, ans=0.125 2024-06-21 08:27:03,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=351391.3333333333, ans=0.125 2024-06-21 08:27:06,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=351409.6666666667, ans=0.0 2024-06-21 08:27:06,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=351409.6666666667, ans=0.04949747468305833 2024-06-21 08:27:09,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=351409.6666666667, ans=22.5 2024-06-21 08:27:10,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=351409.6666666667, ans=0.125 2024-06-21 08:27:15,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=351428.0, ans=0.125 2024-06-21 08:27:18,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=351446.3333333333, ans=0.125 2024-06-21 08:27:23,738 INFO [train.py:1028] (1/2) Epoch 19, batch 9600, loss[loss=0.2205, simple_loss=0.2675, pruned_loss=0.08681, over 10354.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.284, pruned_loss=0.0846, over 2570051.23 frames. ], batch size: 303, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:27:24,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=351464.6666666667, ans=0.125 2024-06-21 08:27:27,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=351464.6666666667, ans=0.125 2024-06-21 08:27:30,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351483.0, ans=0.1 2024-06-21 08:27:39,218 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:27:40,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351501.3333333333, ans=0.1 2024-06-21 08:27:50,458 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.164e+02 2.321e+02 2.550e+02 3.114e+02, threshold=4.641e+02, percent-clipped=0.0 2024-06-21 08:27:56,084 INFO [train.py:1028] (1/2) Epoch 19, batch 9650, loss[loss=0.1928, simple_loss=0.2435, pruned_loss=0.07102, over 13056.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2841, pruned_loss=0.08522, over 2560753.23 frames. ], batch size: 132, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:27:59,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.55 vs. 
limit=12.0 2024-06-21 08:28:07,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=351593.0, ans=0.2 2024-06-21 08:28:15,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.60 vs. limit=15.0 2024-06-21 08:28:18,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.51 vs. limit=15.0 2024-06-21 08:28:21,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=351629.6666666667, ans=0.0 2024-06-21 08:28:23,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=351629.6666666667, ans=0.025 2024-06-21 08:28:26,107 INFO [train.py:1028] (1/2) Epoch 19, batch 9700, loss[loss=0.2245, simple_loss=0.2797, pruned_loss=0.08464, over 13019.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2833, pruned_loss=0.08486, over 2555351.85 frames. ], batch size: 144, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:28:29,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=351648.0, ans=0.125 2024-06-21 08:28:32,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=15.0 2024-06-21 08:28:43,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351684.6666666667, ans=0.1 2024-06-21 08:28:46,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=351703.0, ans=0.125 2024-06-21 08:28:46,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=351703.0, ans=0.0 2024-06-21 08:28:52,355 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.176e+02 2.339e+02 2.643e+02 3.345e+02, threshold=4.678e+02, percent-clipped=0.0 2024-06-21 08:28:57,728 INFO [train.py:1028] (1/2) Epoch 19, batch 9750, loss[loss=0.221, simple_loss=0.2762, pruned_loss=0.08294, over 13157.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2818, pruned_loss=0.08405, over 2552204.66 frames. ], batch size: 132, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:28:58,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=351739.6666666667, ans=0.125 2024-06-21 08:29:07,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=351758.0, ans=0.125 2024-06-21 08:29:23,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=351813.0, ans=0.125 2024-06-21 08:29:28,423 INFO [train.py:1028] (1/2) Epoch 19, batch 9800, loss[loss=0.2184, simple_loss=0.2711, pruned_loss=0.08285, over 13235.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2813, pruned_loss=0.08349, over 2545106.83 frames. ], batch size: 40, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:29:30,003 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.83 vs. 
limit=15.0 2024-06-21 08:29:33,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.12 vs. limit=10.0 2024-06-21 08:29:34,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.86 vs. limit=22.5 2024-06-21 08:29:38,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351849.6666666667, ans=0.1 2024-06-21 08:29:39,510 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.20 vs. limit=12.0 2024-06-21 08:29:41,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=351868.0, ans=0.125 2024-06-21 08:29:48,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=351886.3333333333, ans=0.07 2024-06-21 08:29:54,172 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.110e+02 2.278e+02 2.479e+02 2.878e+02, threshold=4.556e+02, percent-clipped=0.0 2024-06-21 08:29:58,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.46 vs. limit=15.0 2024-06-21 08:29:59,517 INFO [train.py:1028] (1/2) Epoch 19, batch 9850, loss[loss=0.2233, simple_loss=0.2805, pruned_loss=0.08301, over 13071.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2805, pruned_loss=0.08333, over 2537365.97 frames. ], batch size: 102, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:30:04,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=351923.0, ans=15.0 2024-06-21 08:30:05,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=351941.3333333333, ans=0.125 2024-06-21 08:30:07,986 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.50 vs. limit=10.0 2024-06-21 08:30:09,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=351941.3333333333, ans=0.2 2024-06-21 08:30:12,404 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.51 vs. limit=12.0 2024-06-21 08:30:15,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=351959.6666666667, ans=0.0 2024-06-21 08:30:32,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.71 vs. limit=22.5 2024-06-21 08:30:36,590 INFO [train.py:1028] (1/2) Epoch 19, batch 9900, loss[loss=0.1953, simple_loss=0.2555, pruned_loss=0.06751, over 12960.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2796, pruned_loss=0.08327, over 2530177.98 frames. 
], batch size: 39, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:30:51,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=352051.3333333333, ans=0.125 2024-06-21 08:31:03,659 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.142e+02 2.293e+02 2.443e+02 3.034e+02, threshold=4.585e+02, percent-clipped=0.0 2024-06-21 08:31:04,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=352088.0, ans=0.125 2024-06-21 08:31:08,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352106.3333333333, ans=0.0 2024-06-21 08:31:09,380 INFO [train.py:1028] (1/2) Epoch 19, batch 9950, loss[loss=0.2313, simple_loss=0.2918, pruned_loss=0.08539, over 12766.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2789, pruned_loss=0.0834, over 2521899.99 frames. ], batch size: 29, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:31:11,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2024-06-21 08:31:27,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=352161.3333333333, ans=0.025 2024-06-21 08:31:35,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=352179.6666666667, ans=0.1 2024-06-21 08:31:39,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=352179.6666666667, ans=0.125 2024-06-21 08:31:41,488 INFO [train.py:1028] (1/2) Epoch 19, batch 10000, loss[loss=0.2454, simple_loss=0.3062, pruned_loss=0.09228, over 12586.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2791, pruned_loss=0.08369, over 2486889.27 frames. ], batch size: 22, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:31:48,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=352216.3333333333, ans=0.0 2024-06-21 08:31:50,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=352216.3333333333, ans=0.125 2024-06-21 08:31:58,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=352234.6666666667, ans=0.125 2024-06-21 08:32:03,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2024-06-21 08:32:03,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=352253.0, ans=0.07 2024-06-21 08:32:07,411 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.225e+02 2.432e+02 2.702e+02 3.815e+02, threshold=4.865e+02, percent-clipped=0.0 2024-06-21 08:32:13,401 INFO [train.py:1028] (1/2) Epoch 19, batch 10050, loss[loss=0.2394, simple_loss=0.3054, pruned_loss=0.08665, over 12599.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2793, pruned_loss=0.0845, over 2443420.55 frames. 
], batch size: 22, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:32:24,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=352308.0, ans=0.1 2024-06-21 08:32:31,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=352344.6666666667, ans=0.0 2024-06-21 08:32:43,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=352363.0, ans=0.2 2024-06-21 08:32:44,307 INFO [train.py:1028] (1/2) Epoch 19, batch 10100, loss[loss=0.1984, simple_loss=0.2638, pruned_loss=0.06652, over 11493.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2784, pruned_loss=0.08387, over 2422947.00 frames. ], batch size: 17, lr: 3.02e-03, grad_scale: 64.0 2024-06-21 08:35:01,697 INFO [train.py:1028] (1/2) Epoch 20, batch 0, loss[loss=0.1882, simple_loss=0.2492, pruned_loss=0.06355, over 12982.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2492, pruned_loss=0.06355, over 12982.00 frames. ], batch size: 36, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:35:01,697 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 08:35:05,513 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.5028, 3.1910, 2.4443, 3.4271], device='cuda:1') 2024-06-21 08:35:08,566 INFO [train.py:1060] (1/2) Epoch 20, validation: loss=0.1882, simple_loss=0.2529, pruned_loss=0.06178, over 351949.00 frames. 2024-06-21 08:35:08,566 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 08:35:16,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=352430.8333333333, ans=0.0 2024-06-21 08:35:25,167 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.051e+02 2.258e+02 2.481e+02 3.459e+02, threshold=4.516e+02, percent-clipped=0.0 2024-06-21 08:35:33,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=352467.5, ans=0.07 2024-06-21 08:35:38,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.01 vs. limit=22.5 2024-06-21 08:35:39,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=352485.8333333333, ans=0.0 2024-06-21 08:35:42,860 INFO [train.py:1028] (1/2) Epoch 20, batch 50, loss[loss=0.1933, simple_loss=0.252, pruned_loss=0.06727, over 12687.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2631, pruned_loss=0.07863, over 575144.95 frames. ], batch size: 29, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:35:44,401 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.26 vs. limit=10.0 2024-06-21 08:35:56,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=352540.8333333333, ans=0.125 2024-06-21 08:35:58,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.94 vs. 
limit=15.0 2024-06-21 08:36:15,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=352595.8333333333, ans=0.125 2024-06-21 08:36:16,486 INFO [train.py:1028] (1/2) Epoch 20, batch 100, loss[loss=0.2169, simple_loss=0.2742, pruned_loss=0.07984, over 13327.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2625, pruned_loss=0.07735, over 1017964.27 frames. ], batch size: 46, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:36:22,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=352614.1666666667, ans=0.2 2024-06-21 08:36:25,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=352614.1666666667, ans=0.2 2024-06-21 08:36:27,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=352614.1666666667, ans=0.0 2024-06-21 08:36:31,799 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.041e+02 2.153e+02 2.355e+02 3.255e+02, threshold=4.307e+02, percent-clipped=0.0 2024-06-21 08:36:42,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=352650.8333333333, ans=0.95 2024-06-21 08:36:45,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=352669.1666666667, ans=0.0 2024-06-21 08:36:48,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=352669.1666666667, ans=0.0 2024-06-21 08:36:52,161 INFO [train.py:1028] (1/2) Epoch 20, batch 150, loss[loss=0.1952, simple_loss=0.2574, pruned_loss=0.06653, over 12630.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.26, pruned_loss=0.07525, over 1365708.20 frames. ], batch size: 29, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:36:57,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.16 vs. limit=15.0 2024-06-21 08:37:03,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=352705.8333333333, ans=0.125 2024-06-21 08:37:09,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352724.1666666667, ans=0.0 2024-06-21 08:37:15,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352742.5, ans=0.1 2024-06-21 08:37:24,306 INFO [train.py:1028] (1/2) Epoch 20, batch 200, loss[loss=0.2191, simple_loss=0.2645, pruned_loss=0.0868, over 12565.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2599, pruned_loss=0.07547, over 1635585.55 frames. 
], batch size: 202, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:37:39,863 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 1.997e+02 2.133e+02 2.256e+02 3.157e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 08:37:41,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=352815.8333333333, ans=0.125 2024-06-21 08:37:41,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=352815.8333333333, ans=10.0 2024-06-21 08:37:43,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352834.1666666667, ans=0.1 2024-06-21 08:37:48,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=352834.1666666667, ans=0.2 2024-06-21 08:37:49,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=352834.1666666667, ans=0.125 2024-06-21 08:37:55,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=352852.5, ans=0.2 2024-06-21 08:37:56,518 INFO [train.py:1028] (1/2) Epoch 20, batch 250, loss[loss=0.2037, simple_loss=0.2522, pruned_loss=0.07757, over 13031.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2598, pruned_loss=0.07551, over 1847730.69 frames. ], batch size: 144, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:38:02,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=352889.1666666667, ans=0.1 2024-06-21 08:38:07,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=352889.1666666667, ans=0.025 2024-06-21 08:38:09,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=352907.5, ans=0.2 2024-06-21 08:38:10,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=352907.5, ans=0.125 2024-06-21 08:38:14,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=352907.5, ans=6.0 2024-06-21 08:38:15,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=352907.5, ans=0.2 2024-06-21 08:38:30,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=352944.1666666667, ans=0.125 2024-06-21 08:38:33,896 INFO [train.py:1028] (1/2) Epoch 20, batch 300, loss[loss=0.1932, simple_loss=0.2438, pruned_loss=0.07127, over 13179.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2599, pruned_loss=0.07538, over 2010107.18 frames. 
], batch size: 112, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:38:44,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=352980.8333333333, ans=0.125 2024-06-21 08:38:52,661 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.027e+02 2.166e+02 2.374e+02 3.059e+02, threshold=4.333e+02, percent-clipped=0.0 2024-06-21 08:38:53,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.85 vs. limit=15.0 2024-06-21 08:39:07,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=353035.8333333333, ans=0.0 2024-06-21 08:39:08,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=353054.1666666667, ans=0.07 2024-06-21 08:39:09,295 INFO [train.py:1028] (1/2) Epoch 20, batch 350, loss[loss=0.1967, simple_loss=0.2566, pruned_loss=0.06842, over 13012.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2594, pruned_loss=0.07514, over 2139061.04 frames. ], batch size: 33, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:39:18,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=353072.5, ans=0.0 2024-06-21 08:39:21,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=353090.8333333333, ans=0.125 2024-06-21 08:39:23,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2024-06-21 08:39:24,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=353090.8333333333, ans=0.125 2024-06-21 08:39:41,320 INFO [train.py:1028] (1/2) Epoch 20, batch 400, loss[loss=0.2046, simple_loss=0.2665, pruned_loss=0.07139, over 13286.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2597, pruned_loss=0.0751, over 2239779.63 frames. ], batch size: 63, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:39:43,736 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.01 vs. 
limit=15.0 2024-06-21 08:39:44,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=353145.8333333333, ans=0.125 2024-06-21 08:39:45,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=353145.8333333333, ans=0.125 2024-06-21 08:39:47,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=353164.1666666667, ans=0.1 2024-06-21 08:39:51,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=353164.1666666667, ans=0.125 2024-06-21 08:39:53,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=353182.5, ans=0.125 2024-06-21 08:39:56,375 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 1.993e+02 2.159e+02 2.410e+02 3.746e+02, threshold=4.317e+02, percent-clipped=0.0 2024-06-21 08:39:58,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=15.0 2024-06-21 08:39:58,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=12.0 2024-06-21 08:40:09,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=353219.1666666667, ans=0.125 2024-06-21 08:40:11,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353219.1666666667, ans=0.1 2024-06-21 08:40:12,695 INFO [train.py:1028] (1/2) Epoch 20, batch 450, loss[loss=0.1985, simple_loss=0.2573, pruned_loss=0.06992, over 13216.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.26, pruned_loss=0.07509, over 2313410.58 frames. ], batch size: 67, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:40:14,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=353237.5, ans=0.0 2024-06-21 08:40:16,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=353237.5, ans=0.125 2024-06-21 08:40:16,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=353237.5, ans=0.0 2024-06-21 08:40:19,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353255.8333333333, ans=0.1 2024-06-21 08:40:36,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.00 vs. limit=15.0 2024-06-21 08:40:53,479 INFO [train.py:1028] (1/2) Epoch 20, batch 500, loss[loss=0.199, simple_loss=0.2482, pruned_loss=0.07486, over 13064.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2607, pruned_loss=0.07529, over 2376569.06 frames. ], batch size: 121, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:40:53,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2024-06-21 08:40:56,383 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.72 vs. 
limit=15.0 2024-06-21 08:41:06,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=353365.8333333333, ans=0.125 2024-06-21 08:41:07,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=353365.8333333333, ans=0.025 2024-06-21 08:41:08,687 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.038e+02 2.188e+02 2.436e+02 3.019e+02, threshold=4.375e+02, percent-clipped=0.0 2024-06-21 08:41:16,655 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.23 vs. limit=10.0 2024-06-21 08:41:25,667 INFO [train.py:1028] (1/2) Epoch 20, batch 550, loss[loss=0.2223, simple_loss=0.2695, pruned_loss=0.08753, over 12946.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2606, pruned_loss=0.07547, over 2421447.80 frames. ], batch size: 158, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:41:27,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=353420.8333333333, ans=0.05 2024-06-21 08:41:37,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.86 vs. limit=22.5 2024-06-21 08:41:40,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353457.5, ans=0.1 2024-06-21 08:41:43,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353457.5, ans=0.1 2024-06-21 08:41:43,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=353457.5, ans=0.0 2024-06-21 08:41:44,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=353475.8333333333, ans=0.2 2024-06-21 08:41:54,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=353494.1666666667, ans=0.2 2024-06-21 08:41:56,492 INFO [train.py:1028] (1/2) Epoch 20, batch 600, loss[loss=0.1933, simple_loss=0.2397, pruned_loss=0.07342, over 13009.00 frames. ], tot_loss[loss=0.205, simple_loss=0.26, pruned_loss=0.07501, over 2458825.53 frames. 
], batch size: 144, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:41:57,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=353512.5, ans=0.125 2024-06-21 08:42:04,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=353530.8333333333, ans=0.125 2024-06-21 08:42:11,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=353549.1666666667, ans=0.0 2024-06-21 08:42:11,662 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 1.985e+02 2.100e+02 2.261e+02 2.880e+02, threshold=4.199e+02, percent-clipped=0.0 2024-06-21 08:42:16,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=353567.5, ans=0.025 2024-06-21 08:42:16,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=353567.5, ans=0.0 2024-06-21 08:42:18,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=353567.5, ans=0.125 2024-06-21 08:42:28,433 INFO [train.py:1028] (1/2) Epoch 20, batch 650, loss[loss=0.2095, simple_loss=0.2624, pruned_loss=0.07834, over 13209.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2594, pruned_loss=0.07445, over 2490270.89 frames. ], batch size: 59, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:42:37,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.95 vs. limit=15.0 2024-06-21 08:42:42,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=353622.5, ans=0.125 2024-06-21 08:42:45,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=353640.8333333333, ans=0.125 2024-06-21 08:42:53,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=353659.1666666667, ans=0.95 2024-06-21 08:43:06,236 INFO [train.py:1028] (1/2) Epoch 20, batch 700, loss[loss=0.2095, simple_loss=0.2731, pruned_loss=0.07292, over 13317.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2589, pruned_loss=0.07428, over 2512598.01 frames. 
], batch size: 46, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:43:08,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=353695.8333333333, ans=0.0 2024-06-21 08:43:08,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=353695.8333333333, ans=0.125 2024-06-21 08:43:09,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=353695.8333333333, ans=0.125 2024-06-21 08:43:14,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=353714.1666666667, ans=0.0 2024-06-21 08:43:21,354 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.058e+02 2.232e+02 2.397e+02 4.050e+02, threshold=4.465e+02, percent-clipped=0.0 2024-06-21 08:43:23,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=353732.5, ans=0.0 2024-06-21 08:43:33,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=353769.1666666667, ans=0.0 2024-06-21 08:43:38,020 INFO [train.py:1028] (1/2) Epoch 20, batch 750, loss[loss=0.1907, simple_loss=0.2548, pruned_loss=0.06333, over 13299.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2591, pruned_loss=0.07439, over 2528950.06 frames. ], batch size: 63, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:43:49,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=353824.1666666667, ans=0.125 2024-06-21 08:43:57,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=353842.5, ans=0.125 2024-06-21 08:44:00,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.54 vs. limit=22.5 2024-06-21 08:44:04,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=353860.8333333333, ans=0.025 2024-06-21 08:44:04,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=353860.8333333333, ans=0.0 2024-06-21 08:44:04,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=353860.8333333333, ans=0.0 2024-06-21 08:44:09,953 INFO [train.py:1028] (1/2) Epoch 20, batch 800, loss[loss=0.1865, simple_loss=0.251, pruned_loss=0.06105, over 12885.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2595, pruned_loss=0.0745, over 2542597.67 frames. 
], batch size: 36, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:44:20,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=353897.5, ans=0.0 2024-06-21 08:44:25,285 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.033e+02 2.142e+02 2.349e+02 3.193e+02, threshold=4.284e+02, percent-clipped=0.0 2024-06-21 08:44:26,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=353915.8333333333, ans=0.125 2024-06-21 08:44:27,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=353915.8333333333, ans=0.0 2024-06-21 08:44:41,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=353952.5, ans=0.125 2024-06-21 08:44:46,909 INFO [train.py:1028] (1/2) Epoch 20, batch 850, loss[loss=0.1875, simple_loss=0.2435, pruned_loss=0.0657, over 13197.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2589, pruned_loss=0.07416, over 2554010.77 frames. ], batch size: 95, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:44:47,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353970.8333333333, ans=0.1 2024-06-21 08:44:49,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=353970.8333333333, ans=0.125 2024-06-21 08:44:49,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=353970.8333333333, ans=0.125 2024-06-21 08:45:00,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=353989.1666666667, ans=0.125 2024-06-21 08:45:21,569 INFO [train.py:1028] (1/2) Epoch 20, batch 900, loss[loss=0.1806, simple_loss=0.2363, pruned_loss=0.06247, over 12984.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2588, pruned_loss=0.07433, over 2558716.12 frames. ], batch size: 36, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:45:25,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=354062.5, ans=0.125 2024-06-21 08:45:36,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=354099.1666666667, ans=0.125 2024-06-21 08:45:36,685 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.997e+02 2.099e+02 2.263e+02 3.423e+02, threshold=4.199e+02, percent-clipped=0.0 2024-06-21 08:45:41,648 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:45:43,064 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.62 vs. limit=10.0 2024-06-21 08:45:51,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=354135.8333333333, ans=0.125 2024-06-21 08:45:53,618 INFO [train.py:1028] (1/2) Epoch 20, batch 950, loss[loss=0.192, simple_loss=0.2591, pruned_loss=0.06248, over 12978.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2581, pruned_loss=0.07427, over 2561144.97 frames. 
], batch size: 39, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:45:56,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=354154.1666666667, ans=0.1 2024-06-21 08:45:57,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=354154.1666666667, ans=0.125 2024-06-21 08:46:09,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=354190.8333333333, ans=0.125 2024-06-21 08:46:11,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=354190.8333333333, ans=0.125 2024-06-21 08:46:11,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=354190.8333333333, ans=0.125 2024-06-21 08:46:20,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=354227.5, ans=0.125 2024-06-21 08:46:25,932 INFO [train.py:1028] (1/2) Epoch 20, batch 1000, loss[loss=0.2107, simple_loss=0.2734, pruned_loss=0.07399, over 13314.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2585, pruned_loss=0.07462, over 2563366.12 frames. ], batch size: 49, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:46:44,642 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.041e+02 2.142e+02 2.410e+02 3.076e+02, threshold=4.285e+02, percent-clipped=0.0 2024-06-21 08:46:45,005 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.19 vs. limit=12.0 2024-06-21 08:46:53,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=354300.8333333333, ans=0.1 2024-06-21 08:47:04,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=354337.5, ans=0.04949747468305833 2024-06-21 08:47:04,467 INFO [train.py:1028] (1/2) Epoch 20, batch 1050, loss[loss=0.1948, simple_loss=0.2526, pruned_loss=0.06847, over 13135.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2587, pruned_loss=0.07452, over 2566028.82 frames. ], batch size: 77, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:47:12,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=354355.8333333333, ans=0.125 2024-06-21 08:47:13,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2024-06-21 08:47:14,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=354355.8333333333, ans=0.0 2024-06-21 08:47:17,497 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.18 vs. limit=22.5 2024-06-21 08:47:22,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=354374.1666666667, ans=0.0 2024-06-21 08:47:36,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.02 vs. 
limit=12.0 2024-06-21 08:47:36,897 INFO [train.py:1028] (1/2) Epoch 20, batch 1100, loss[loss=0.2072, simple_loss=0.2575, pruned_loss=0.07848, over 13293.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2594, pruned_loss=0.0746, over 2570731.22 frames. ], batch size: 52, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:47:52,609 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.054e+02 2.192e+02 2.330e+02 2.871e+02, threshold=4.383e+02, percent-clipped=0.0 2024-06-21 08:47:58,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.24 vs. limit=15.0 2024-06-21 08:48:06,612 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.73 vs. limit=10.0 2024-06-21 08:48:09,291 INFO [train.py:1028] (1/2) Epoch 20, batch 1150, loss[loss=0.1971, simple_loss=0.2598, pruned_loss=0.06727, over 13286.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2603, pruned_loss=0.07496, over 2571350.33 frames. ], batch size: 52, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:48:11,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=354520.8333333333, ans=0.2 2024-06-21 08:48:17,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=354539.1666666667, ans=0.125 2024-06-21 08:48:30,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=354557.5, ans=0.0 2024-06-21 08:48:32,645 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.80 vs. limit=22.5 2024-06-21 08:48:33,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=354575.8333333333, ans=0.2 2024-06-21 08:48:35,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=354575.8333333333, ans=0.125 2024-06-21 08:48:38,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354594.1666666667, ans=0.1 2024-06-21 08:48:44,393 INFO [train.py:1028] (1/2) Epoch 20, batch 1200, loss[loss=0.2005, simple_loss=0.2579, pruned_loss=0.07162, over 13235.00 frames. ], tot_loss[loss=0.205, simple_loss=0.26, pruned_loss=0.075, over 2573825.82 frames. ], batch size: 77, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:48:53,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354612.5, ans=0.1 2024-06-21 08:48:56,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.56 vs. 
limit=15.0 2024-06-21 08:48:58,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=354630.8333333333, ans=0.125 2024-06-21 08:49:03,449 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.081e+02 2.258e+02 2.489e+02 3.694e+02, threshold=4.517e+02, percent-clipped=0.0 2024-06-21 08:49:13,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=354685.8333333333, ans=0.1 2024-06-21 08:49:19,772 INFO [train.py:1028] (1/2) Epoch 20, batch 1250, loss[loss=0.2068, simple_loss=0.2598, pruned_loss=0.07688, over 13154.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2602, pruned_loss=0.07506, over 2582964.22 frames. ], batch size: 112, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:49:29,839 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2024-06-21 08:49:34,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=354740.8333333333, ans=0.125 2024-06-21 08:49:42,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=354759.1666666667, ans=0.125 2024-06-21 08:49:51,992 INFO [train.py:1028] (1/2) Epoch 20, batch 1300, loss[loss=0.2114, simple_loss=0.2576, pruned_loss=0.08257, over 12734.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2603, pruned_loss=0.07519, over 2582841.32 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:49:52,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=354795.8333333333, ans=0.125 2024-06-21 08:49:53,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=354795.8333333333, ans=0.125 2024-06-21 08:49:55,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=354795.8333333333, ans=0.2 2024-06-21 08:49:55,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=354795.8333333333, ans=0.0 2024-06-21 08:49:57,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2024-06-21 08:50:04,666 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. 
limit=22.5 2024-06-21 08:50:05,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=354832.5, ans=0.125 2024-06-21 08:50:07,570 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.031e+02 2.138e+02 2.259e+02 3.212e+02, threshold=4.275e+02, percent-clipped=0.0 2024-06-21 08:50:14,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=354850.8333333333, ans=0.125 2024-06-21 08:50:20,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=354869.1666666667, ans=0.0 2024-06-21 08:50:21,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=354869.1666666667, ans=0.1 2024-06-21 08:50:25,003 INFO [train.py:1028] (1/2) Epoch 20, batch 1350, loss[loss=0.1985, simple_loss=0.2557, pruned_loss=0.07068, over 13187.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2603, pruned_loss=0.07495, over 2584036.79 frames. ], batch size: 59, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:50:32,237 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.26 vs. limit=22.5 2024-06-21 08:50:35,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=354905.8333333333, ans=0.0 2024-06-21 08:50:36,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=354905.8333333333, ans=0.125 2024-06-21 08:50:37,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354905.8333333333, ans=0.1 2024-06-21 08:50:37,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=354924.1666666667, ans=0.125 2024-06-21 08:50:44,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=354924.1666666667, ans=0.0 2024-06-21 08:50:49,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=354942.5, ans=0.125 2024-06-21 08:50:49,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=354942.5, ans=0.1 2024-06-21 08:50:53,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354942.5, ans=0.1 2024-06-21 08:50:56,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=354942.5, ans=0.0 2024-06-21 08:50:57,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=12.0 2024-06-21 08:51:00,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=354960.8333333333, ans=0.015 2024-06-21 08:51:04,205 INFO [train.py:1028] (1/2) Epoch 20, batch 1400, loss[loss=0.2194, simple_loss=0.2772, pruned_loss=0.08084, over 12535.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.261, pruned_loss=0.07531, over 2585868.54 frames. 
], batch size: 25, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:51:05,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=354979.1666666667, ans=0.125 2024-06-21 08:51:06,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=354979.1666666667, ans=0.125 2024-06-21 08:51:17,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.36 vs. limit=10.0 2024-06-21 08:51:19,647 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.050e+02 2.150e+02 2.262e+02 2.982e+02, threshold=4.301e+02, percent-clipped=0.0 2024-06-21 08:51:20,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=355015.8333333333, ans=0.2 2024-06-21 08:51:36,778 INFO [train.py:1028] (1/2) Epoch 20, batch 1450, loss[loss=0.1939, simple_loss=0.242, pruned_loss=0.07291, over 13168.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2605, pruned_loss=0.07528, over 2586292.17 frames. ], batch size: 121, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:51:40,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=355070.8333333333, ans=0.125 2024-06-21 08:51:42,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=355070.8333333333, ans=0.125 2024-06-21 08:51:47,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=355089.1666666667, ans=0.025 2024-06-21 08:51:53,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=355107.5, ans=0.0 2024-06-21 08:52:03,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=355144.1666666667, ans=0.125 2024-06-21 08:52:03,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=355144.1666666667, ans=0.0 2024-06-21 08:52:09,271 INFO [train.py:1028] (1/2) Epoch 20, batch 1500, loss[loss=0.2113, simple_loss=0.2591, pruned_loss=0.08172, over 13197.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2605, pruned_loss=0.07554, over 2588812.76 frames. ], batch size: 83, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:52:11,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355162.5, ans=0.1 2024-06-21 08:52:19,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=355180.8333333333, ans=0.125 2024-06-21 08:52:25,120 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.113e+02 2.227e+02 2.437e+02 3.305e+02, threshold=4.455e+02, percent-clipped=0.0 2024-06-21 08:52:44,879 INFO [train.py:1028] (1/2) Epoch 20, batch 1550, loss[loss=0.2038, simple_loss=0.2583, pruned_loss=0.07463, over 13069.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2603, pruned_loss=0.07537, over 2584597.41 frames. 
], batch size: 102, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:52:45,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=355254.1666666667, ans=0.2 2024-06-21 08:53:01,313 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.48 vs. limit=10.0 2024-06-21 08:53:07,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=355309.1666666667, ans=0.0 2024-06-21 08:53:08,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355309.1666666667, ans=0.1 2024-06-21 08:53:11,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=355309.1666666667, ans=0.0 2024-06-21 08:53:18,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=355327.5, ans=0.0 2024-06-21 08:53:20,708 INFO [train.py:1028] (1/2) Epoch 20, batch 1600, loss[loss=0.2236, simple_loss=0.2759, pruned_loss=0.08561, over 13160.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.26, pruned_loss=0.07506, over 2579196.61 frames. ], batch size: 77, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:53:20,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=355345.8333333333, ans=0.0 2024-06-21 08:53:32,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=355364.1666666667, ans=0.0 2024-06-21 08:53:35,776 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.018e+02 2.149e+02 2.335e+02 2.816e+02, threshold=4.297e+02, percent-clipped=0.0 2024-06-21 08:53:37,527 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.11 vs. limit=15.0 2024-06-21 08:53:37,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=355382.5, ans=0.025 2024-06-21 08:53:38,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=355400.8333333333, ans=0.2 2024-06-21 08:53:43,615 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:53:47,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=355419.1666666667, ans=0.07 2024-06-21 08:53:47,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2024-06-21 08:53:52,676 INFO [train.py:1028] (1/2) Epoch 20, batch 1650, loss[loss=0.1957, simple_loss=0.2495, pruned_loss=0.07092, over 13140.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2598, pruned_loss=0.07519, over 2576009.35 frames. ], batch size: 95, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:53:57,771 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=12.0 2024-06-21 08:54:13,180 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. 
limit=15.0 2024-06-21 08:54:16,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=355492.5, ans=10.0 2024-06-21 08:54:17,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=355492.5, ans=0.125 2024-06-21 08:54:19,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355510.8333333333, ans=0.1 2024-06-21 08:54:20,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=355510.8333333333, ans=0.125 2024-06-21 08:54:25,679 INFO [train.py:1028] (1/2) Epoch 20, batch 1700, loss[loss=0.2176, simple_loss=0.2786, pruned_loss=0.07824, over 12934.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2595, pruned_loss=0.07501, over 2581376.34 frames. ], batch size: 26, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:54:27,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355529.1666666667, ans=0.1 2024-06-21 08:54:28,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=355529.1666666667, ans=0.0 2024-06-21 08:54:30,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=355529.1666666667, ans=0.125 2024-06-21 08:54:38,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=355547.5, ans=0.125 2024-06-21 08:54:44,154 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 1.992e+02 2.090e+02 2.291e+02 3.156e+02, threshold=4.180e+02, percent-clipped=0.0 2024-06-21 08:54:44,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=355565.8333333333, ans=0.2 2024-06-21 08:54:53,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=355584.1666666667, ans=0.125 2024-06-21 08:55:03,594 INFO [train.py:1028] (1/2) Epoch 20, batch 1750, loss[loss=0.2217, simple_loss=0.2822, pruned_loss=0.08066, over 12586.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2597, pruned_loss=0.07502, over 2582159.77 frames. ], batch size: 22, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:55:19,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355657.5, ans=0.1 2024-06-21 08:55:25,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=355675.8333333333, ans=0.0 2024-06-21 08:55:29,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=355694.1666666667, ans=0.0 2024-06-21 08:55:34,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=355694.1666666667, ans=0.0 2024-06-21 08:55:34,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=355694.1666666667, ans=0.1 2024-06-21 08:55:35,754 INFO [train.py:1028] (1/2) Epoch 20, batch 1800, loss[loss=0.2065, simple_loss=0.2656, pruned_loss=0.07372, over 13239.00 frames. 
], tot_loss[loss=0.2047, simple_loss=0.2598, pruned_loss=0.07485, over 2581896.29 frames. ], batch size: 67, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:55:38,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=355712.5, ans=0.1 2024-06-21 08:55:45,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.15 vs. limit=22.5 2024-06-21 08:55:51,340 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.090e+02 2.199e+02 2.361e+02 3.493e+02, threshold=4.398e+02, percent-clipped=0.0 2024-06-21 08:55:53,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=355749.1666666667, ans=0.0 2024-06-21 08:55:55,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.50 vs. limit=10.0 2024-06-21 08:56:00,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=355767.5, ans=0.0 2024-06-21 08:56:08,531 INFO [train.py:1028] (1/2) Epoch 20, batch 1850, loss[loss=0.2138, simple_loss=0.2631, pruned_loss=0.08231, over 13241.00 frames. ], tot_loss[loss=0.205, simple_loss=0.26, pruned_loss=0.07499, over 2583250.19 frames. ], batch size: 83, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:56:11,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=355804.1666666667, ans=0.2 2024-06-21 08:56:14,629 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.59 vs. limit=15.0 2024-06-21 08:56:18,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0 2024-06-21 08:56:19,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=355822.5, ans=0.1 2024-06-21 08:56:21,852 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:56:28,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=355840.8333333333, ans=0.2 2024-06-21 08:56:31,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=355859.1666666667, ans=0.0 2024-06-21 08:56:34,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=355859.1666666667, ans=0.125 2024-06-21 08:56:36,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=355877.5, ans=0.125 2024-06-21 08:56:43,298 INFO [train.py:1028] (1/2) Epoch 20, batch 1900, loss[loss=0.1881, simple_loss=0.2401, pruned_loss=0.06807, over 13183.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2591, pruned_loss=0.07476, over 2585801.07 frames. 
], batch size: 95, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:56:46,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=355895.8333333333, ans=0.07 2024-06-21 08:57:02,547 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 2.046e+02 2.127e+02 2.298e+02 2.982e+02, threshold=4.254e+02, percent-clipped=0.0 2024-06-21 08:57:06,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=355950.8333333333, ans=0.05 2024-06-21 08:57:13,852 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.60 vs. limit=22.5 2024-06-21 08:57:14,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2024-06-21 08:57:15,658 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.43 vs. limit=15.0 2024-06-21 08:57:19,092 INFO [train.py:1028] (1/2) Epoch 20, batch 1950, loss[loss=0.196, simple_loss=0.2589, pruned_loss=0.06657, over 13244.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2586, pruned_loss=0.07466, over 2591542.52 frames. ], batch size: 52, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:57:19,208 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:57:23,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=355987.5, ans=0.125 2024-06-21 08:57:31,274 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2024-06-21 08:57:35,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=356024.1666666667, ans=0.0 2024-06-21 08:57:40,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=356042.5, ans=0.2 2024-06-21 08:57:47,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=356060.8333333333, ans=0.1 2024-06-21 08:57:51,747 INFO [train.py:1028] (1/2) Epoch 20, batch 2000, loss[loss=0.216, simple_loss=0.2764, pruned_loss=0.07785, over 12521.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2588, pruned_loss=0.07506, over 2587074.84 frames. ], batch size: 22, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:58:04,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=356115.8333333333, ans=0.125 2024-06-21 08:58:07,239 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.026e+02 2.133e+02 2.230e+02 3.147e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 08:58:26,839 INFO [train.py:1028] (1/2) Epoch 20, batch 2050, loss[loss=0.2198, simple_loss=0.2763, pruned_loss=0.08163, over 12620.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2591, pruned_loss=0.07543, over 2583542.27 frames. 
], batch size: 29, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:58:27,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.27 vs. limit=22.5 2024-06-21 08:59:02,426 INFO [train.py:1028] (1/2) Epoch 20, batch 2100, loss[loss=0.2112, simple_loss=0.2716, pruned_loss=0.07535, over 13230.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2599, pruned_loss=0.07526, over 2586567.49 frames. ], batch size: 59, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:59:06,556 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.83 vs. limit=12.0 2024-06-21 08:59:12,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356280.8333333333, ans=0.1 2024-06-21 08:59:18,067 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 1.999e+02 2.147e+02 2.325e+02 2.808e+02, threshold=4.294e+02, percent-clipped=0.0 2024-06-21 08:59:19,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2024-06-21 08:59:25,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=356317.5, ans=0.0 2024-06-21 08:59:27,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2024-06-21 08:59:29,380 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.49 vs. limit=22.5 2024-06-21 08:59:34,714 INFO [train.py:1028] (1/2) Epoch 20, batch 2150, loss[loss=0.2108, simple_loss=0.2704, pruned_loss=0.07557, over 13288.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2596, pruned_loss=0.07514, over 2589436.74 frames. ], batch size: 52, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:59:42,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=356372.5, ans=0.125 2024-06-21 08:59:45,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=356372.5, ans=0.125 2024-06-21 08:59:55,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=356409.1666666667, ans=0.125 2024-06-21 08:59:57,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=356409.1666666667, ans=15.0 2024-06-21 09:00:01,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.36 vs. limit=15.0 2024-06-21 09:00:02,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=356427.5, ans=0.125 2024-06-21 09:00:08,205 INFO [train.py:1028] (1/2) Epoch 20, batch 2200, loss[loss=0.2279, simple_loss=0.2789, pruned_loss=0.08841, over 13210.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2602, pruned_loss=0.07559, over 2589537.26 frames. 
], batch size: 83, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:00:15,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.09 vs. limit=22.5 2024-06-21 09:00:21,769 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:00:24,261 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.042e+02 2.153e+02 2.369e+02 3.097e+02, threshold=4.306e+02, percent-clipped=0.0 2024-06-21 09:00:28,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=356500.8333333333, ans=0.0 2024-06-21 09:00:45,090 INFO [train.py:1028] (1/2) Epoch 20, batch 2250, loss[loss=0.1906, simple_loss=0.2511, pruned_loss=0.06501, over 13292.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.26, pruned_loss=0.07545, over 2587819.89 frames. ], batch size: 63, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:00:50,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=356537.5, ans=0.2 2024-06-21 09:00:51,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=356555.8333333333, ans=0.0 2024-06-21 09:00:52,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2024-06-21 09:00:54,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=356555.8333333333, ans=0.125 2024-06-21 09:00:56,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=356555.8333333333, ans=0.125 2024-06-21 09:01:04,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=356574.1666666667, ans=0.1 2024-06-21 09:01:06,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=356574.1666666667, ans=0.125 2024-06-21 09:01:06,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356574.1666666667, ans=0.1 2024-06-21 09:01:11,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=356592.5, ans=0.125 2024-06-21 09:01:22,546 INFO [train.py:1028] (1/2) Epoch 20, batch 2300, loss[loss=0.2039, simple_loss=0.2589, pruned_loss=0.07448, over 12845.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.26, pruned_loss=0.07541, over 2581864.94 frames. 
], batch size: 33, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:01:23,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=356629.1666666667, ans=0.0 2024-06-21 09:01:30,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=356647.5, ans=0.125 2024-06-21 09:01:32,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=356647.5, ans=0.2 2024-06-21 09:01:38,473 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.053e+02 2.182e+02 2.448e+02 3.072e+02, threshold=4.365e+02, percent-clipped=0.0 2024-06-21 09:01:44,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356684.1666666667, ans=0.1 2024-06-21 09:01:47,325 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.43 vs. limit=10.0 2024-06-21 09:01:55,875 INFO [train.py:1028] (1/2) Epoch 20, batch 2350, loss[loss=0.1945, simple_loss=0.2551, pruned_loss=0.06698, over 13184.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2604, pruned_loss=0.07576, over 2584926.71 frames. ], batch size: 67, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:02:01,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=356720.8333333333, ans=15.0 2024-06-21 09:02:07,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.33 vs. limit=22.5 2024-06-21 09:02:24,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=356794.1666666667, ans=0.125 2024-06-21 09:02:25,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=356794.1666666667, ans=0.2 2024-06-21 09:02:28,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=356812.5, ans=0.125 2024-06-21 09:02:28,974 INFO [train.py:1028] (1/2) Epoch 20, batch 2400, loss[loss=0.195, simple_loss=0.2551, pruned_loss=0.0674, over 13375.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2595, pruned_loss=0.07528, over 2587790.04 frames. 
], batch size: 46, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:02:37,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=356812.5, ans=0.2 2024-06-21 09:02:44,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=356830.8333333333, ans=0.0 2024-06-21 09:02:45,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=356849.1666666667, ans=0.125 2024-06-21 09:02:48,497 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.028e+02 2.159e+02 2.269e+02 2.784e+02, threshold=4.319e+02, percent-clipped=0.0 2024-06-21 09:02:56,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356867.5, ans=0.1 2024-06-21 09:03:06,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=356885.8333333333, ans=0.0 2024-06-21 09:03:08,265 INFO [train.py:1028] (1/2) Epoch 20, batch 2450, loss[loss=0.2345, simple_loss=0.2895, pruned_loss=0.08981, over 13251.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2587, pruned_loss=0.07522, over 2583926.04 frames. ], batch size: 63, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:03:08,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=356904.1666666667, ans=0.125 2024-06-21 09:03:13,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=356904.1666666667, ans=0.2 2024-06-21 09:03:22,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=356940.8333333333, ans=0.5 2024-06-21 09:03:40,889 INFO [train.py:1028] (1/2) Epoch 20, batch 2500, loss[loss=0.2189, simple_loss=0.2586, pruned_loss=0.08961, over 13210.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.258, pruned_loss=0.07518, over 2586758.93 frames. ], batch size: 83, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:03:44,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=356995.8333333333, ans=0.125 2024-06-21 09:03:48,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=357014.1666666667, ans=0.125 2024-06-21 09:03:50,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=357014.1666666667, ans=0.125 2024-06-21 09:03:56,347 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.068e+02 2.198e+02 2.397e+02 3.315e+02, threshold=4.395e+02, percent-clipped=0.0 2024-06-21 09:03:57,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=357032.5, ans=0.0 2024-06-21 09:03:57,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.32 vs. 
limit=15.0 2024-06-21 09:03:58,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=357032.5, ans=0.125 2024-06-21 09:04:00,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=357050.8333333333, ans=0.025 2024-06-21 09:04:00,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=357050.8333333333, ans=0.09899494936611666 2024-06-21 09:04:01,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2024-06-21 09:04:03,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=357050.8333333333, ans=0.2 2024-06-21 09:04:13,272 INFO [train.py:1028] (1/2) Epoch 20, batch 2550, loss[loss=0.2221, simple_loss=0.2809, pruned_loss=0.08163, over 12387.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2569, pruned_loss=0.07471, over 2585990.81 frames. ], batch size: 22, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:04:25,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=357105.8333333333, ans=0.125 2024-06-21 09:04:27,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=357105.8333333333, ans=0.2 2024-06-21 09:04:35,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=357142.5, ans=0.0 2024-06-21 09:04:45,325 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:04:46,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=357160.8333333333, ans=0.07 2024-06-21 09:04:50,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.53 vs. limit=15.0 2024-06-21 09:04:51,727 INFO [train.py:1028] (1/2) Epoch 20, batch 2600, loss[loss=0.2032, simple_loss=0.2602, pruned_loss=0.07308, over 13263.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2551, pruned_loss=0.07409, over 2585725.29 frames. 
], batch size: 52, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:04:53,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=357179.1666666667, ans=0.125 2024-06-21 09:04:57,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=357179.1666666667, ans=0.025 2024-06-21 09:05:04,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=357215.8333333333, ans=0.125 2024-06-21 09:05:07,335 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 2.027e+02 2.159e+02 2.358e+02 3.160e+02, threshold=4.319e+02, percent-clipped=0.0 2024-06-21 09:05:07,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=357215.8333333333, ans=0.125 2024-06-21 09:05:09,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=357215.8333333333, ans=0.125 2024-06-21 09:05:12,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.90 vs. limit=10.0 2024-06-21 09:05:16,690 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.56 vs. limit=15.0 2024-06-21 09:05:24,340 INFO [train.py:1028] (1/2) Epoch 20, batch 2650, loss[loss=0.1978, simple_loss=0.244, pruned_loss=0.07578, over 13004.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2538, pruned_loss=0.07351, over 2586713.25 frames. ], batch size: 144, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:05:34,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=357289.1666666667, ans=0.125 2024-06-21 09:05:37,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=357307.5, ans=0.0 2024-06-21 09:05:43,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=357325.8333333333, ans=0.125 2024-06-21 09:05:45,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=357325.8333333333, ans=0.125 2024-06-21 09:05:46,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=357325.8333333333, ans=0.0 2024-06-21 09:05:46,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=357325.8333333333, ans=0.04949747468305833 2024-06-21 09:05:51,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.95 vs. limit=15.0 2024-06-21 09:05:51,876 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=12.0 2024-06-21 09:05:57,200 INFO [train.py:1028] (1/2) Epoch 20, batch 2700, loss[loss=0.1897, simple_loss=0.2439, pruned_loss=0.06772, over 13228.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2525, pruned_loss=0.07326, over 2584792.41 frames. 
], batch size: 89, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:05:59,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=357362.5, ans=0.0 2024-06-21 09:06:05,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=357380.8333333333, ans=0.125 2024-06-21 09:06:10,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.26 vs. limit=15.0 2024-06-21 09:06:12,169 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2024-06-21 09:06:12,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=357399.1666666667, ans=0.2 2024-06-21 09:06:13,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=357399.1666666667, ans=0.125 2024-06-21 09:06:13,722 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.010e+02 2.109e+02 2.292e+02 2.711e+02, threshold=4.218e+02, percent-clipped=0.0 2024-06-21 09:06:30,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=357435.8333333333, ans=0.125 2024-06-21 09:06:33,784 INFO [train.py:1028] (1/2) Epoch 20, batch 2750, loss[loss=0.2099, simple_loss=0.2568, pruned_loss=0.08149, over 13246.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2513, pruned_loss=0.07237, over 2582423.91 frames. ], batch size: 43, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:06:39,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=357454.1666666667, ans=0.2 2024-06-21 09:06:39,806 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.94 vs. limit=15.0 2024-06-21 09:06:49,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=357490.8333333333, ans=0.0 2024-06-21 09:06:51,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357490.8333333333, ans=0.1 2024-06-21 09:07:01,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. limit=10.0 2024-06-21 09:07:09,914 INFO [train.py:1028] (1/2) Epoch 20, batch 2800, loss[loss=0.2071, simple_loss=0.2495, pruned_loss=0.08235, over 11018.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2508, pruned_loss=0.07259, over 2579749.47 frames. 
], batch size: 304, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:07:15,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=357545.8333333333, ans=0.025 2024-06-21 09:07:18,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=357564.1666666667, ans=0.0 2024-06-21 09:07:18,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=357564.1666666667, ans=0.05 2024-06-21 09:07:25,691 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.006e+02 2.169e+02 2.325e+02 2.896e+02, threshold=4.339e+02, percent-clipped=0.0 2024-06-21 09:07:28,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=357600.8333333333, ans=0.125 2024-06-21 09:07:37,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=357619.1666666667, ans=0.0 2024-06-21 09:07:38,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=357619.1666666667, ans=0.125 2024-06-21 09:07:41,957 INFO [train.py:1028] (1/2) Epoch 20, batch 2850, loss[loss=0.2037, simple_loss=0.2593, pruned_loss=0.07407, over 12989.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2502, pruned_loss=0.07251, over 2577018.56 frames. ], batch size: 48, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:07:43,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=357637.5, ans=0.125 2024-06-21 09:07:44,643 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:07:46,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=357637.5, ans=0.0 2024-06-21 09:07:49,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=357655.8333333333, ans=0.1 2024-06-21 09:07:49,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=357655.8333333333, ans=0.2 2024-06-21 09:08:05,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=357692.5, ans=0.0 2024-06-21 09:08:15,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=357710.8333333333, ans=0.0 2024-06-21 09:08:16,661 INFO [train.py:1028] (1/2) Epoch 20, batch 2900, loss[loss=0.1975, simple_loss=0.2498, pruned_loss=0.0726, over 13122.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.249, pruned_loss=0.07234, over 2585011.75 frames. 
], batch size: 55, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:08:32,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=357765.8333333333, ans=0.0 2024-06-21 09:08:35,857 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.029e+02 2.261e+02 2.450e+02 3.381e+02, threshold=4.523e+02, percent-clipped=0.0 2024-06-21 09:08:50,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=357802.5, ans=0.05 2024-06-21 09:08:52,250 INFO [train.py:1028] (1/2) Epoch 20, batch 2950, loss[loss=0.1888, simple_loss=0.2441, pruned_loss=0.06675, over 13246.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.249, pruned_loss=0.07238, over 2580371.12 frames. ], batch size: 43, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:08:54,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=357820.8333333333, ans=0.1 2024-06-21 09:09:04,077 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:09:18,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=357875.8333333333, ans=0.125 2024-06-21 09:09:19,248 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.28 vs. limit=15.0 2024-06-21 09:09:25,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=357912.5, ans=0.2 2024-06-21 09:09:26,058 INFO [train.py:1028] (1/2) Epoch 20, batch 3000, loss[loss=0.1885, simple_loss=0.2415, pruned_loss=0.06777, over 13151.00 frames. ], tot_loss[loss=0.196, simple_loss=0.248, pruned_loss=0.07199, over 2578770.55 frames. ], batch size: 59, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:09:26,058 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 09:09:33,922 INFO [train.py:1060] (1/2) Epoch 20, validation: loss=0.1864, simple_loss=0.2507, pruned_loss=0.06101, over 351949.00 frames. 2024-06-21 09:09:33,923 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 09:09:38,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=357912.5, ans=0.2 2024-06-21 09:09:39,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.48 vs. 
limit=12.0 2024-06-21 09:09:44,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=357930.8333333333, ans=0.025 2024-06-21 09:09:45,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357930.8333333333, ans=0.1 2024-06-21 09:09:46,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=357930.8333333333, ans=0.125 2024-06-21 09:09:48,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=357949.1666666667, ans=0.025 2024-06-21 09:09:50,594 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.000e+02 2.115e+02 2.305e+02 3.438e+02, threshold=4.230e+02, percent-clipped=0.0 2024-06-21 09:09:51,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.59 vs. limit=15.0 2024-06-21 09:09:51,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=357949.1666666667, ans=0.0 2024-06-21 09:09:53,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=357967.5, ans=0.125 2024-06-21 09:10:09,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=358004.1666666667, ans=0.125 2024-06-21 09:10:10,148 INFO [train.py:1028] (1/2) Epoch 20, batch 3050, loss[loss=0.1864, simple_loss=0.2406, pruned_loss=0.06606, over 13303.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2473, pruned_loss=0.07217, over 2579833.22 frames. ], batch size: 46, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:10:18,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=358022.5, ans=0.0 2024-06-21 09:10:23,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=358040.8333333333, ans=0.0 2024-06-21 09:10:44,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=358077.5, ans=0.1 2024-06-21 09:10:46,115 INFO [train.py:1028] (1/2) Epoch 20, batch 3100, loss[loss=0.1845, simple_loss=0.2387, pruned_loss=0.06515, over 13068.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2469, pruned_loss=0.07179, over 2580521.32 frames. ], batch size: 144, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:10:48,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=358095.8333333333, ans=0.125 2024-06-21 09:10:49,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=358095.8333333333, ans=0.125 2024-06-21 09:11:02,534 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.009e+02 2.183e+02 2.380e+02 3.057e+02, threshold=4.366e+02, percent-clipped=0.0 2024-06-21 09:11:06,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.16 vs. 
limit=12.0 2024-06-21 09:11:07,315 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.073e-01 2024-06-21 09:11:09,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.86 vs. limit=12.0 2024-06-21 09:11:10,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=358150.8333333333, ans=0.0 2024-06-21 09:11:14,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=358169.1666666667, ans=0.0 2024-06-21 09:11:14,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=358169.1666666667, ans=0.0 2024-06-21 09:11:19,250 INFO [train.py:1028] (1/2) Epoch 20, batch 3150, loss[loss=0.1906, simple_loss=0.2376, pruned_loss=0.07181, over 12938.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2457, pruned_loss=0.07081, over 2581949.75 frames. ], batch size: 158, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:11:35,816 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:11:36,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.17 vs. limit=15.0 2024-06-21 09:11:50,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=358260.8333333333, ans=0.0 2024-06-21 09:11:51,854 INFO [train.py:1028] (1/2) Epoch 20, batch 3200, loss[loss=0.1903, simple_loss=0.2402, pruned_loss=0.07023, over 13174.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.245, pruned_loss=0.07036, over 2581497.52 frames. ], batch size: 55, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:12:05,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=358297.5, ans=0.025 2024-06-21 09:12:08,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=358315.8333333333, ans=0.125 2024-06-21 09:12:10,965 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.992e+02 2.108e+02 2.252e+02 2.950e+02, threshold=4.215e+02, percent-clipped=0.0 2024-06-21 09:12:11,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=358315.8333333333, ans=0.0 2024-06-21 09:12:14,323 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.83 vs. 
limit=15.0 2024-06-21 09:12:20,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=358352.5, ans=0.1 2024-06-21 09:12:21,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=358352.5, ans=0.2 2024-06-21 09:12:23,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=358352.5, ans=0.125 2024-06-21 09:12:25,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=358352.5, ans=0.0 2024-06-21 09:12:26,736 INFO [train.py:1028] (1/2) Epoch 20, batch 3250, loss[loss=0.182, simple_loss=0.2443, pruned_loss=0.05989, over 13240.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2444, pruned_loss=0.0701, over 2586460.10 frames. ], batch size: 72, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:12:28,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=358370.8333333333, ans=0.0 2024-06-21 09:12:37,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=358389.1666666667, ans=0.2 2024-06-21 09:12:41,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358389.1666666667, ans=0.1 2024-06-21 09:12:48,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=358407.5, ans=0.0 2024-06-21 09:12:54,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=358425.8333333333, ans=0.0 2024-06-21 09:13:03,227 INFO [train.py:1028] (1/2) Epoch 20, batch 3300, loss[loss=0.2112, simple_loss=0.2575, pruned_loss=0.08239, over 12694.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2439, pruned_loss=0.06986, over 2581820.12 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:13:10,020 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. 
limit=15.0 2024-06-21 09:13:12,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=358480.8333333333, ans=0.125 2024-06-21 09:13:18,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=358499.1666666667, ans=0.125 2024-06-21 09:13:19,345 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 1.998e+02 2.151e+02 2.365e+02 3.190e+02, threshold=4.303e+02, percent-clipped=0.0 2024-06-21 09:13:20,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358499.1666666667, ans=0.1 2024-06-21 09:13:23,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=358517.5, ans=0.125 2024-06-21 09:13:24,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=358517.5, ans=0.125 2024-06-21 09:13:25,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=358517.5, ans=0.125 2024-06-21 09:13:26,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=358517.5, ans=0.95 2024-06-21 09:13:33,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358535.8333333333, ans=0.1 2024-06-21 09:13:33,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.16 vs. limit=10.0 2024-06-21 09:13:35,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=358554.1666666667, ans=0.0 2024-06-21 09:13:35,478 INFO [train.py:1028] (1/2) Epoch 20, batch 3350, loss[loss=0.2018, simple_loss=0.2457, pruned_loss=0.07899, over 12948.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2438, pruned_loss=0.0702, over 2577240.35 frames. ], batch size: 158, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:13:44,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=358572.5, ans=0.0 2024-06-21 09:13:51,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358590.8333333333, ans=0.1 2024-06-21 09:14:08,314 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.47 vs. limit=10.0 2024-06-21 09:14:12,795 INFO [train.py:1028] (1/2) Epoch 20, batch 3400, loss[loss=0.1884, simple_loss=0.243, pruned_loss=0.06694, over 12514.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2432, pruned_loss=0.07023, over 2575947.00 frames. 
], batch size: 22, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:14:13,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=358645.8333333333, ans=0.125 2024-06-21 09:14:32,098 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 1.966e+02 2.093e+02 2.290e+02 3.085e+02, threshold=4.187e+02, percent-clipped=0.0 2024-06-21 09:14:37,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=358700.8333333333, ans=0.0 2024-06-21 09:14:37,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=358700.8333333333, ans=0.125 2024-06-21 09:14:44,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=358719.1666666667, ans=0.0 2024-06-21 09:14:48,751 INFO [train.py:1028] (1/2) Epoch 20, batch 3450, loss[loss=0.1793, simple_loss=0.231, pruned_loss=0.06376, over 12727.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2429, pruned_loss=0.07004, over 2576845.50 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:14:51,061 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2024-06-21 09:14:51,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=358737.5, ans=0.2 2024-06-21 09:14:58,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358755.8333333333, ans=0.1 2024-06-21 09:14:59,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=358755.8333333333, ans=0.125 2024-06-21 09:14:59,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=358755.8333333333, ans=0.125 2024-06-21 09:15:06,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358774.1666666667, ans=0.1 2024-06-21 09:15:09,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=358792.5, ans=0.0 2024-06-21 09:15:11,672 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.72 vs. limit=15.0 2024-06-21 09:15:15,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=358810.8333333333, ans=0.0 2024-06-21 09:15:21,816 INFO [train.py:1028] (1/2) Epoch 20, batch 3500, loss[loss=0.1756, simple_loss=0.2216, pruned_loss=0.0648, over 13018.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2425, pruned_loss=0.06969, over 2577865.69 frames. 
], batch size: 33, lr: 2.92e-03, grad_scale: 64.0 2024-06-21 09:15:27,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358829.1666666667, ans=0.1 2024-06-21 09:15:32,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358847.5, ans=0.1 2024-06-21 09:15:34,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358865.8333333333, ans=0.1 2024-06-21 09:15:35,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2024-06-21 09:15:38,664 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 1.926e+02 2.032e+02 2.224e+02 2.906e+02, threshold=4.063e+02, percent-clipped=0.0 2024-06-21 09:15:39,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=358865.8333333333, ans=0.125 2024-06-21 09:15:44,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358884.1666666667, ans=0.1 2024-06-21 09:15:44,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.37 vs. limit=6.0 2024-06-21 09:15:58,570 INFO [train.py:1028] (1/2) Epoch 20, batch 3550, loss[loss=0.1964, simple_loss=0.2463, pruned_loss=0.07328, over 13152.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2418, pruned_loss=0.06957, over 2578412.13 frames. ], batch size: 95, lr: 2.91e-03, grad_scale: 64.0 2024-06-21 09:16:07,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.43 vs. limit=22.5 2024-06-21 09:16:34,155 INFO [train.py:1028] (1/2) Epoch 20, batch 3600, loss[loss=0.1564, simple_loss=0.2182, pruned_loss=0.04733, over 13322.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2415, pruned_loss=0.06976, over 2581433.98 frames. ], batch size: 49, lr: 2.91e-03, grad_scale: 64.0 2024-06-21 09:16:37,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=359012.5, ans=0.125 2024-06-21 09:16:39,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=359012.5, ans=0.0 2024-06-21 09:16:50,384 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 1.957e+02 2.064e+02 2.266e+02 3.230e+02, threshold=4.128e+02, percent-clipped=0.0 2024-06-21 09:16:52,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. 
limit=10.0 2024-06-21 09:16:53,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=359067.5, ans=0.025 2024-06-21 09:16:57,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=359067.5, ans=0.125 2024-06-21 09:17:06,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=359104.1666666667, ans=0.0 2024-06-21 09:17:06,651 INFO [train.py:1028] (1/2) Epoch 20, batch 3650, loss[loss=0.1911, simple_loss=0.236, pruned_loss=0.07314, over 13037.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2415, pruned_loss=0.06958, over 2580185.63 frames. ], batch size: 102, lr: 2.91e-03, grad_scale: 64.0 2024-06-21 09:17:08,328 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2024-06-21 09:17:10,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=359104.1666666667, ans=0.125 2024-06-21 09:17:10,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.07 vs. limit=22.5 2024-06-21 09:17:10,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=359104.1666666667, ans=6.0 2024-06-21 09:17:12,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=359122.5, ans=0.125 2024-06-21 09:17:22,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.80 vs. limit=10.0 2024-06-21 09:17:23,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=359140.8333333333, ans=0.125 2024-06-21 09:17:24,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=359140.8333333333, ans=0.2 2024-06-21 09:17:26,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=359159.1666666667, ans=0.125 2024-06-21 09:17:34,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=359177.5, ans=0.1 2024-06-21 09:17:39,109 INFO [train.py:1028] (1/2) Epoch 20, batch 3700, loss[loss=0.1927, simple_loss=0.2541, pruned_loss=0.06569, over 13226.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2406, pruned_loss=0.06906, over 2584598.14 frames. 
], batch size: 72, lr: 2.91e-03, grad_scale: 64.0 2024-06-21 09:17:40,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=359195.8333333333, ans=0.125 2024-06-21 09:17:46,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=359214.1666666667, ans=0.1 2024-06-21 09:17:49,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=359214.1666666667, ans=0.125 2024-06-21 09:17:58,665 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 1.938e+02 2.020e+02 2.165e+02 2.733e+02, threshold=4.040e+02, percent-clipped=0.0 2024-06-21 09:18:01,471 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2024-06-21 09:18:05,461 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.40 vs. limit=12.0 2024-06-21 09:18:06,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.51 vs. limit=22.5 2024-06-21 09:18:08,306 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2024-06-21 09:18:15,086 INFO [train.py:1028] (1/2) Epoch 20, batch 3750, loss[loss=0.2215, simple_loss=0.2746, pruned_loss=0.08418, over 12644.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2403, pruned_loss=0.06909, over 2586779.84 frames. ], batch size: 22, lr: 2.91e-03, grad_scale: 64.0 2024-06-21 09:18:18,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=359287.5, ans=0.125 2024-06-21 09:18:25,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2024-06-21 09:18:27,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=359324.1666666667, ans=0.125 2024-06-21 09:18:28,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=359324.1666666667, ans=0.125 2024-06-21 09:18:32,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=359324.1666666667, ans=0.2 2024-06-21 09:18:55,356 INFO [train.py:1028] (1/2) Epoch 20, batch 3800, loss[loss=0.1749, simple_loss=0.2271, pruned_loss=0.06134, over 13187.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2401, pruned_loss=0.06875, over 2585339.48 frames. ], batch size: 83, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:19:12,001 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 1.947e+02 2.134e+02 2.264e+02 3.069e+02, threshold=4.268e+02, percent-clipped=0.0 2024-06-21 09:19:23,161 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.01 vs. 
limit=22.5 2024-06-21 09:19:27,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=359470.8333333333, ans=0.07 2024-06-21 09:19:28,042 INFO [train.py:1028] (1/2) Epoch 20, batch 3850, loss[loss=0.1787, simple_loss=0.2276, pruned_loss=0.06488, over 13055.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2395, pruned_loss=0.06834, over 2585344.07 frames. ], batch size: 144, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:19:41,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=359507.5, ans=0.125 2024-06-21 09:19:45,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=359507.5, ans=0.125 2024-06-21 09:20:00,830 INFO [train.py:1028] (1/2) Epoch 20, batch 3900, loss[loss=0.1708, simple_loss=0.2215, pruned_loss=0.06006, over 13250.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.24, pruned_loss=0.06888, over 2588363.66 frames. ], batch size: 83, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:20:03,586 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:20:03,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=359562.5, ans=0.2 2024-06-21 09:20:14,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=359580.8333333333, ans=0.0 2024-06-21 09:20:21,254 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.974e+02 2.120e+02 2.265e+02 2.992e+02, threshold=4.240e+02, percent-clipped=0.0 2024-06-21 09:20:24,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=359617.5, ans=0.0 2024-06-21 09:20:25,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=359617.5, ans=0.0 2024-06-21 09:20:30,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=359635.8333333333, ans=0.0 2024-06-21 09:20:31,049 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.90 vs. limit=12.0 2024-06-21 09:20:37,125 INFO [train.py:1028] (1/2) Epoch 20, batch 3950, loss[loss=0.1814, simple_loss=0.2259, pruned_loss=0.06849, over 13151.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2389, pruned_loss=0.06828, over 2589060.94 frames. ], batch size: 132, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:20:46,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=359672.5, ans=0.0 2024-06-21 09:20:46,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.33 vs. 
limit=15.0 2024-06-21 09:20:55,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=359690.8333333333, ans=0.04949747468305833 2024-06-21 09:20:59,127 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:21:01,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=359709.1666666667, ans=0.04949747468305833 2024-06-21 09:21:02,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=359709.1666666667, ans=15.0 2024-06-21 09:21:13,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=359745.8333333333, ans=0.125 2024-06-21 09:21:13,721 INFO [train.py:1028] (1/2) Epoch 20, batch 4000, loss[loss=0.1866, simple_loss=0.2373, pruned_loss=0.06798, over 12951.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2386, pruned_loss=0.06824, over 2583689.42 frames. ], batch size: 39, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:21:21,134 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:21:29,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=359782.5, ans=0.125 2024-06-21 09:21:31,010 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 1.941e+02 2.111e+02 2.278e+02 3.436e+02, threshold=4.223e+02, percent-clipped=0.0 2024-06-21 09:21:32,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=359782.5, ans=0.125 2024-06-21 09:21:37,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=359800.8333333333, ans=0.09899494936611666 2024-06-21 09:21:41,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=359819.1666666667, ans=0.125 2024-06-21 09:21:45,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=359819.1666666667, ans=0.025 2024-06-21 09:21:46,961 INFO [train.py:1028] (1/2) Epoch 20, batch 4050, loss[loss=0.1938, simple_loss=0.239, pruned_loss=0.07426, over 11013.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2385, pruned_loss=0.06823, over 2581417.41 frames. 
], batch size: 304, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:21:49,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=359837.5, ans=15.0 2024-06-21 09:21:59,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=359874.1666666667, ans=0.125 2024-06-21 09:22:13,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=359892.5, ans=0.02 2024-06-21 09:22:16,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=359910.8333333333, ans=15.0 2024-06-21 09:22:16,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=359910.8333333333, ans=0.125 2024-06-21 09:22:18,332 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=12.0 2024-06-21 09:22:22,683 INFO [train.py:1028] (1/2) Epoch 20, batch 4100, loss[loss=0.1851, simple_loss=0.2299, pruned_loss=0.07013, over 13078.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2385, pruned_loss=0.06843, over 2578249.50 frames. ], batch size: 102, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:22:24,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=359929.1666666667, ans=0.125 2024-06-21 09:22:27,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=359929.1666666667, ans=0.2 2024-06-21 09:22:34,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=359947.5, ans=0.0 2024-06-21 09:22:43,107 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 1.979e+02 2.099e+02 2.305e+02 2.832e+02, threshold=4.197e+02, percent-clipped=0.0 2024-06-21 09:22:49,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=359984.1666666667, ans=0.0 2024-06-21 09:22:58,838 INFO [train.py:1028] (1/2) Epoch 20, batch 4150, loss[loss=0.1755, simple_loss=0.2275, pruned_loss=0.06174, over 13150.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2384, pruned_loss=0.06827, over 2576803.49 frames. 
], batch size: 55, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:23:02,231 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:23:02,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=360020.8333333333, ans=0.2 2024-06-21 09:23:03,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=360020.8333333333, ans=0.125 2024-06-21 09:23:20,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=360075.8333333333, ans=0.0 2024-06-21 09:23:23,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360075.8333333333, ans=0.1 2024-06-21 09:23:23,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=360075.8333333333, ans=0.125 2024-06-21 09:23:24,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=360075.8333333333, ans=0.125 2024-06-21 09:23:24,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=360094.1666666667, ans=0.125 2024-06-21 09:23:30,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=360094.1666666667, ans=0.0 2024-06-21 09:23:31,829 INFO [train.py:1028] (1/2) Epoch 20, batch 4200, loss[loss=0.1996, simple_loss=0.2443, pruned_loss=0.07742, over 13032.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2381, pruned_loss=0.0683, over 2578804.86 frames. ], batch size: 102, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:23:39,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=360130.8333333333, ans=0.125 2024-06-21 09:23:48,599 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 1.960e+02 2.062e+02 2.237e+02 3.205e+02, threshold=4.125e+02, percent-clipped=0.0 2024-06-21 09:23:54,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.23 vs. limit=22.5 2024-06-21 09:24:01,735 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:24:04,227 INFO [train.py:1028] (1/2) Epoch 20, batch 4250, loss[loss=0.186, simple_loss=0.2348, pruned_loss=0.06864, over 13210.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2376, pruned_loss=0.06808, over 2581286.17 frames. ], batch size: 46, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:24:16,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=360222.5, ans=0.125 2024-06-21 09:24:17,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. 
limit=15.0 2024-06-21 09:24:20,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360240.8333333333, ans=0.1 2024-06-21 09:24:25,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=360240.8333333333, ans=0.025 2024-06-21 09:24:29,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=360259.1666666667, ans=0.125 2024-06-21 09:24:40,616 INFO [train.py:1028] (1/2) Epoch 20, batch 4300, loss[loss=0.1751, simple_loss=0.2284, pruned_loss=0.06085, over 13220.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2372, pruned_loss=0.06779, over 2581284.16 frames. ], batch size: 59, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:24:46,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360295.8333333333, ans=0.1 2024-06-21 09:24:53,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=360314.1666666667, ans=0.125 2024-06-21 09:24:53,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=360314.1666666667, ans=0.125 2024-06-21 09:24:54,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=360314.1666666667, ans=0.0 2024-06-21 09:25:00,449 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 1.979e+02 2.066e+02 2.282e+02 3.069e+02, threshold=4.132e+02, percent-clipped=0.0 2024-06-21 09:25:01,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=360332.5, ans=0.0 2024-06-21 09:25:15,549 INFO [train.py:1028] (1/2) Epoch 20, batch 4350, loss[loss=0.1865, simple_loss=0.2351, pruned_loss=0.06894, over 13220.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2367, pruned_loss=0.06794, over 2585906.26 frames. ], batch size: 59, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:25:17,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=360387.5, ans=0.125 2024-06-21 09:25:21,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=360405.8333333333, ans=0.125 2024-06-21 09:25:24,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=360405.8333333333, ans=0.0 2024-06-21 09:25:41,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360460.8333333333, ans=0.1 2024-06-21 09:25:48,126 INFO [train.py:1028] (1/2) Epoch 20, batch 4400, loss[loss=0.2056, simple_loss=0.2553, pruned_loss=0.07794, over 13208.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2367, pruned_loss=0.06794, over 2586548.38 frames. 
], batch size: 83, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:25:50,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=360479.1666666667, ans=0.0 2024-06-21 09:25:52,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=360479.1666666667, ans=0.2 2024-06-21 09:25:52,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=360479.1666666667, ans=0.125 2024-06-21 09:25:52,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360479.1666666667, ans=0.1 2024-06-21 09:25:55,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0 2024-06-21 09:26:01,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360515.8333333333, ans=0.1 2024-06-21 09:26:03,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=360515.8333333333, ans=0.09899494936611666 2024-06-21 09:26:04,342 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 1.980e+02 2.151e+02 2.336e+02 3.377e+02, threshold=4.301e+02, percent-clipped=0.0 2024-06-21 09:26:07,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=360515.8333333333, ans=0.125 2024-06-21 09:26:09,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=360515.8333333333, ans=0.0 2024-06-21 09:26:09,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=360534.1666666667, ans=0.2 2024-06-21 09:26:11,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.03 vs. limit=15.0 2024-06-21 09:26:15,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=360534.1666666667, ans=0.2 2024-06-21 09:26:23,415 INFO [train.py:1028] (1/2) Epoch 20, batch 4450, loss[loss=0.1749, simple_loss=0.2323, pruned_loss=0.05877, over 12924.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2375, pruned_loss=0.06825, over 2581920.34 frames. ], batch size: 33, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:26:23,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=360570.8333333333, ans=0.2 2024-06-21 09:26:29,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2024-06-21 09:26:36,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. 
limit=15.0 2024-06-21 09:26:48,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=360625.8333333333, ans=0.125 2024-06-21 09:26:56,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360644.1666666667, ans=0.1 2024-06-21 09:26:56,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=360644.1666666667, ans=0.125 2024-06-21 09:27:01,232 INFO [train.py:1028] (1/2) Epoch 20, batch 4500, loss[loss=0.181, simple_loss=0.2302, pruned_loss=0.06596, over 13252.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2376, pruned_loss=0.06831, over 2585881.71 frames. ], batch size: 89, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:27:08,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.87 vs. limit=22.5 2024-06-21 09:27:19,125 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.926e+02 2.034e+02 2.197e+02 2.807e+02, threshold=4.069e+02, percent-clipped=0.0 2024-06-21 09:27:19,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360699.1666666667, ans=0.1 2024-06-21 09:27:21,774 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:27:26,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2024-06-21 09:27:28,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=360735.8333333333, ans=0.07 2024-06-21 09:27:32,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=360735.8333333333, ans=0.125 2024-06-21 09:27:33,975 INFO [train.py:1028] (1/2) Epoch 20, batch 4550, loss[loss=0.1839, simple_loss=0.2406, pruned_loss=0.06358, over 13273.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2376, pruned_loss=0.0682, over 2589980.75 frames. ], batch size: 52, lr: 2.91e-03, grad_scale: 16.0 2024-06-21 09:27:34,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=360754.1666666667, ans=0.125 2024-06-21 09:27:41,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360772.5, ans=0.1 2024-06-21 09:27:48,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360790.8333333333, ans=0.1 2024-06-21 09:28:03,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=360827.5, ans=0.2 2024-06-21 09:28:11,151 INFO [train.py:1028] (1/2) Epoch 20, batch 4600, loss[loss=0.2013, simple_loss=0.2463, pruned_loss=0.07818, over 12527.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2373, pruned_loss=0.06805, over 2584425.69 frames. 
], batch size: 202, lr: 2.91e-03, grad_scale: 16.0 2024-06-21 09:28:26,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=360882.5, ans=0.125 2024-06-21 09:28:28,498 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.026e+02 2.145e+02 2.422e+02 3.065e+02, threshold=4.289e+02, percent-clipped=0.0 2024-06-21 09:28:46,604 INFO [train.py:1028] (1/2) Epoch 20, batch 4650, loss[loss=0.1786, simple_loss=0.2199, pruned_loss=0.06862, over 13061.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2362, pruned_loss=0.06775, over 2586481.73 frames. ], batch size: 132, lr: 2.91e-03, grad_scale: 16.0 2024-06-21 09:28:48,409 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0 2024-06-21 09:28:51,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=360937.5, ans=0.125 2024-06-21 09:28:57,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360955.8333333333, ans=0.1 2024-06-21 09:29:00,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=360974.1666666667, ans=0.125 2024-06-21 09:29:13,599 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.41 vs. limit=15.0 2024-06-21 09:29:19,187 INFO [train.py:1028] (1/2) Epoch 20, batch 4700, loss[loss=0.1887, simple_loss=0.2463, pruned_loss=0.06552, over 12873.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2361, pruned_loss=0.06759, over 2582106.32 frames. ], batch size: 26, lr: 2.91e-03, grad_scale: 16.0 2024-06-21 09:29:36,316 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 1.991e+02 2.163e+02 2.422e+02 3.211e+02, threshold=4.325e+02, percent-clipped=0.0 2024-06-21 09:29:39,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=361084.1666666667, ans=0.1 2024-06-21 09:29:39,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361084.1666666667, ans=0.1 2024-06-21 09:29:41,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=361084.1666666667, ans=0.125 2024-06-21 09:29:42,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=361084.1666666667, ans=0.025 2024-06-21 09:29:42,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=361084.1666666667, ans=0.09899494936611666 2024-06-21 09:29:42,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=361084.1666666667, ans=0.0 2024-06-21 09:29:51,771 INFO [train.py:1028] (1/2) Epoch 20, batch 4750, loss[loss=0.2084, simple_loss=0.2465, pruned_loss=0.08518, over 12503.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2361, pruned_loss=0.06791, over 2579475.40 frames. 
], batch size: 202, lr: 2.91e-03, grad_scale: 16.0 2024-06-21 09:30:06,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.43 vs. limit=15.0 2024-06-21 09:30:25,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=361194.1666666667, ans=0.125 2024-06-21 09:30:28,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=361194.1666666667, ans=0.0 2024-06-21 09:30:29,596 INFO [train.py:1028] (1/2) Epoch 20, batch 4800, loss[loss=0.193, simple_loss=0.2447, pruned_loss=0.07067, over 13299.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2362, pruned_loss=0.06785, over 2576239.02 frames. ], batch size: 63, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:30:47,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=361230.8333333333, ans=0.125 2024-06-21 09:30:53,585 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.956e+02 2.139e+02 2.366e+02 3.039e+02, threshold=4.278e+02, percent-clipped=0.0 2024-06-21 09:30:55,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=361267.5, ans=0.125 2024-06-21 09:30:55,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=361267.5, ans=0.2 2024-06-21 09:31:05,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=361285.8333333333, ans=0.125 2024-06-21 09:31:06,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=361285.8333333333, ans=0.2 2024-06-21 09:31:09,271 INFO [train.py:1028] (1/2) Epoch 20, batch 4850, loss[loss=0.1864, simple_loss=0.2387, pruned_loss=0.06704, over 13250.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2359, pruned_loss=0.06778, over 2574391.16 frames. ], batch size: 89, lr: 2.91e-03, grad_scale: 32.0 2024-06-21 09:31:17,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=361322.5, ans=0.125 2024-06-21 09:31:18,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=361322.5, ans=0.0 2024-06-21 09:31:18,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=361322.5, ans=0.07 2024-06-21 09:31:23,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=361340.8333333333, ans=0.2 2024-06-21 09:31:39,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=361377.5, ans=0.07 2024-06-21 09:31:43,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=361395.8333333333, ans=0.015 2024-06-21 09:31:44,284 INFO [train.py:1028] (1/2) Epoch 20, batch 4900, loss[loss=0.1677, simple_loss=0.2271, pruned_loss=0.0541, over 13170.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2361, pruned_loss=0.06782, over 2574894.93 frames. 
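Most of the INFO traffic in this section is scaling.py logging ScheduledFloat values: a named per-module hyperparameter (a skip rate, balancer probability, bypass scale, dropout_p, ...) evaluated at the current batch_count, printed as ans=. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the breakpoints below are illustrative, not the recipe's actual schedule:

    def scheduled_float(batch_count, points):
        """Piecewise-linear schedule over batch_count.
        points: [(batch_count, value), ...] sorted by batch_count."""
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (b0, v0), (b1, v1) in zip(points, points[1:]):
            if b0 <= batch_count <= b1:
                frac = (batch_count - b0) / (b1 - b0)
                return v0 + frac * (v1 - v0)

    # Illustrative: a dropout_p that anneals from 0.3 to 0.1 over the first 20k
    # batches, then stays flat, consistent with the constant ans=0.1 dropout_p
    # readings at batch_count ~360k logged throughout this epoch.
    dropout_schedule = [(0.0, 0.3), (20000.0, 0.1)]
    print(scheduled_float(360644.17, dropout_schedule))  # -> 0.1
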
], batch size: 59, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:31:46,169 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.56 vs. limit=15.0 2024-06-21 09:31:52,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=361414.1666666667, ans=0.0 2024-06-21 09:32:07,304 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.960e+02 2.104e+02 2.344e+02 3.133e+02, threshold=4.209e+02, percent-clipped=0.0 2024-06-21 09:32:12,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=361450.8333333333, ans=0.125 2024-06-21 09:32:18,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=361469.1666666667, ans=0.125 2024-06-21 09:32:22,341 INFO [train.py:1028] (1/2) Epoch 20, batch 4950, loss[loss=0.1874, simple_loss=0.2238, pruned_loss=0.07557, over 11200.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2359, pruned_loss=0.06822, over 2569952.16 frames. ], batch size: 304, lr: 2.90e-03, grad_scale: 16.0 2024-06-21 09:32:23,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=361487.5, ans=0.0 2024-06-21 09:32:24,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=361487.5, ans=0.125 2024-06-21 09:32:30,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=361505.8333333333, ans=0.0 2024-06-21 09:32:36,806 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.76 vs. limit=22.5 2024-06-21 09:32:37,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361524.1666666667, ans=0.1 2024-06-21 09:32:48,527 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.65 vs. limit=15.0 2024-06-21 09:32:52,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=361560.8333333333, ans=0.0 2024-06-21 09:32:58,462 INFO [train.py:1028] (1/2) Epoch 20, batch 5000, loss[loss=0.1908, simple_loss=0.2366, pruned_loss=0.07248, over 13188.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2357, pruned_loss=0.0681, over 2574586.18 frames. ], batch size: 95, lr: 2.90e-03, grad_scale: 16.0 2024-06-21 09:33:04,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=361579.1666666667, ans=0.125 2024-06-21 09:33:06,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=361597.5, ans=0.2 2024-06-21 09:33:08,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=361597.5, ans=0.025 2024-06-21 09:33:10,170 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.04 vs. 
limit=12.0 2024-06-21 09:33:11,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=361597.5, ans=0.09899494936611666 2024-06-21 09:33:12,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=361615.8333333333, ans=0.2 2024-06-21 09:33:12,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.43 vs. limit=22.5 2024-06-21 09:33:16,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=361615.8333333333, ans=0.125 2024-06-21 09:33:17,822 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.929e+02 2.036e+02 2.247e+02 3.575e+02, threshold=4.073e+02, percent-clipped=0.0 2024-06-21 09:33:19,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=361634.1666666667, ans=0.125 2024-06-21 09:33:23,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361634.1666666667, ans=0.1 2024-06-21 09:33:24,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=361634.1666666667, ans=0.125 2024-06-21 09:33:24,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=361634.1666666667, ans=0.0 2024-06-21 09:33:26,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.12 vs. limit=15.0 2024-06-21 09:33:32,409 INFO [train.py:1028] (1/2) Epoch 20, batch 5050, loss[loss=0.1901, simple_loss=0.2422, pruned_loss=0.06899, over 12965.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2358, pruned_loss=0.06778, over 2574867.56 frames. ], batch size: 36, lr: 2.90e-03, grad_scale: 16.0 2024-06-21 09:34:01,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=361744.1666666667, ans=0.125 2024-06-21 09:34:02,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=361744.1666666667, ans=0.05 2024-06-21 09:34:08,336 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.50 vs. limit=22.5 2024-06-21 09:34:09,239 INFO [train.py:1028] (1/2) Epoch 20, batch 5100, loss[loss=0.2031, simple_loss=0.2627, pruned_loss=0.07172, over 12898.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2364, pruned_loss=0.0684, over 2570680.89 frames. ], batch size: 39, lr: 2.90e-03, grad_scale: 16.0 2024-06-21 09:34:11,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. 
limit=6.0 2024-06-21 09:34:16,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=361780.8333333333, ans=0.125 2024-06-21 09:34:24,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=361799.1666666667, ans=0.125 2024-06-21 09:34:24,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=361799.1666666667, ans=0.2 2024-06-21 09:34:25,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=361799.1666666667, ans=0.0 2024-06-21 09:34:27,932 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.920e+02 2.021e+02 2.178e+02 2.707e+02, threshold=4.042e+02, percent-clipped=0.0 2024-06-21 09:34:30,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.96 vs. limit=22.5 2024-06-21 09:34:32,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=361817.5, ans=0.125 2024-06-21 09:34:39,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=361835.8333333333, ans=0.125 2024-06-21 09:34:45,383 INFO [train.py:1028] (1/2) Epoch 20, batch 5150, loss[loss=0.183, simple_loss=0.2257, pruned_loss=0.07011, over 13073.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2358, pruned_loss=0.0681, over 2572329.09 frames. ], batch size: 132, lr: 2.90e-03, grad_scale: 16.0 2024-06-21 09:34:58,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=361890.8333333333, ans=0.125 2024-06-21 09:35:00,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.89 vs. limit=22.5 2024-06-21 09:35:02,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=361890.8333333333, ans=0.0 2024-06-21 09:35:03,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=361890.8333333333, ans=0.125 2024-06-21 09:35:06,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=361909.1666666667, ans=0.025 2024-06-21 09:35:06,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=361909.1666666667, ans=0.125 2024-06-21 09:35:07,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=361909.1666666667, ans=0.2 2024-06-21 09:35:08,828 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2024-06-21 09:35:09,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.92 vs. limit=15.0 2024-06-21 09:35:13,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=361927.5, ans=10.0 2024-06-21 09:35:18,442 INFO [train.py:1028] (1/2) Epoch 20, batch 5200, loss[loss=0.186, simple_loss=0.2274, pruned_loss=0.07231, over 13134.00 frames. 
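The recurring optim.py WARNING lines summarize recent gradient norms as five quantiles (min, 25%, median, 75%, max) plus a clipping threshold, and in every such line above the threshold equals Clipping_scale times the median (e.g. 2.0 * 2.021e+02 = 4.042e+02 in the warning just above), with percent-clipped reporting how often the threshold was exceeded. A sketch of that bookkeeping, assuming a simple buffer of recent per-batch gradient norms:

    import torch

    def clipping_report(recent_grad_norms, clipping_scale=2.0):
        """Quantile summary of recent gradient norms and the derived clip threshold."""
        norms = torch.tensor(recent_grad_norms, dtype=torch.float32)
        quartiles = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2]  # threshold tracks the median
        percent_clipped = 100.0 * (norms > threshold).float().mean()
        return quartiles, threshold, percent_clipped

    # Using the five quartile values from the warning above as the buffer:
    q, th, pc = clipping_report([164.3, 192.0, 202.1, 217.8, 270.7])
    print(th.item(), pc.item())  # -> 404.2 (2 x median), 0.0 percent clipped
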
], tot_loss[loss=0.186, simple_loss=0.2358, pruned_loss=0.0681, over 2575678.76 frames. ], batch size: 95, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:35:31,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=361982.5, ans=0.125 2024-06-21 09:35:36,674 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 1.965e+02 2.076e+02 2.260e+02 3.205e+02, threshold=4.151e+02, percent-clipped=0.0 2024-06-21 09:35:39,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=362000.8333333333, ans=0.125 2024-06-21 09:35:45,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=362019.1666666667, ans=0.125 2024-06-21 09:35:50,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=362019.1666666667, ans=0.0 2024-06-21 09:35:51,182 INFO [train.py:1028] (1/2) Epoch 20, batch 5250, loss[loss=0.1768, simple_loss=0.2275, pruned_loss=0.06308, over 13266.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.236, pruned_loss=0.0681, over 2573978.47 frames. ], batch size: 52, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:36:11,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362074.1666666667, ans=0.1 2024-06-21 09:36:17,248 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.90 vs. limit=22.5 2024-06-21 09:36:26,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=362110.8333333333, ans=0.2 2024-06-21 09:36:28,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.52 vs. limit=15.0 2024-06-21 09:36:28,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=362110.8333333333, ans=0.125 2024-06-21 09:36:29,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2024-06-21 09:36:30,033 INFO [train.py:1028] (1/2) Epoch 20, batch 5300, loss[loss=0.1866, simple_loss=0.2353, pruned_loss=0.0689, over 13039.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2359, pruned_loss=0.06798, over 2569740.54 frames. ], batch size: 144, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:36:33,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=362129.1666666667, ans=0.125 2024-06-21 09:36:34,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=362129.1666666667, ans=0.1 2024-06-21 09:36:34,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.82 vs. 
limit=15.0 2024-06-21 09:36:37,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=362147.5, ans=0.0 2024-06-21 09:36:38,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=362147.5, ans=0.125 2024-06-21 09:36:48,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=362165.8333333333, ans=0.0 2024-06-21 09:36:51,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=362165.8333333333, ans=22.5 2024-06-21 09:36:52,037 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 1.930e+02 1.996e+02 2.162e+02 2.822e+02, threshold=3.992e+02, percent-clipped=0.0 2024-06-21 09:36:54,100 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:37:03,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=362202.5, ans=0.0 2024-06-21 09:37:05,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=362202.5, ans=0.125 2024-06-21 09:37:05,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=362202.5, ans=0.125 2024-06-21 09:37:05,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=362202.5, ans=0.125 2024-06-21 09:37:06,924 INFO [train.py:1028] (1/2) Epoch 20, batch 5350, loss[loss=0.1958, simple_loss=0.2407, pruned_loss=0.07538, over 11351.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2347, pruned_loss=0.06736, over 2575964.56 frames. ], batch size: 16, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:37:12,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=362220.8333333333, ans=0.0 2024-06-21 09:37:12,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.31 vs. limit=22.5 2024-06-21 09:37:26,067 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2024-06-21 09:37:39,332 INFO [train.py:1028] (1/2) Epoch 20, batch 5400, loss[loss=0.1929, simple_loss=0.2333, pruned_loss=0.07624, over 12208.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2347, pruned_loss=0.06759, over 2568425.42 frames. ], batch size: 240, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:37:39,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=362312.5, ans=0.2 2024-06-21 09:37:40,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.39 vs. limit=10.0 2024-06-21 09:37:40,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=362312.5, ans=0.125 2024-06-21 09:37:49,285 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. 
limit=6.0 2024-06-21 09:37:56,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=362349.1666666667, ans=0.04949747468305833 2024-06-21 09:38:02,640 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 1.962e+02 2.110e+02 2.280e+02 2.839e+02, threshold=4.221e+02, percent-clipped=0.0 2024-06-21 09:38:05,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=362367.5, ans=0.09899494936611666 2024-06-21 09:38:13,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=362385.8333333333, ans=0.07 2024-06-21 09:38:17,513 INFO [train.py:1028] (1/2) Epoch 20, batch 5450, loss[loss=0.2039, simple_loss=0.2425, pruned_loss=0.08268, over 12799.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.235, pruned_loss=0.06762, over 2572874.59 frames. ], batch size: 26, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:38:21,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=362404.1666666667, ans=0.0 2024-06-21 09:38:21,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=362404.1666666667, ans=0.1 2024-06-21 09:38:22,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=362404.1666666667, ans=0.0 2024-06-21 09:38:25,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=362422.5, ans=0.2 2024-06-21 09:38:43,439 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:38:47,876 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2024-06-21 09:38:54,287 INFO [train.py:1028] (1/2) Epoch 20, batch 5500, loss[loss=0.2116, simple_loss=0.2469, pruned_loss=0.08813, over 12234.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2356, pruned_loss=0.06796, over 2566529.09 frames. ], batch size: 241, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:38:58,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=362495.8333333333, ans=0.0 2024-06-21 09:39:03,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=362514.1666666667, ans=0.0 2024-06-21 09:39:11,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=362532.5, ans=0.125 2024-06-21 09:39:13,057 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.990e+02 2.133e+02 2.331e+02 2.975e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 09:39:14,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362550.8333333333, ans=0.1 2024-06-21 09:39:24,226 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.70 vs. 
limit=15.0 2024-06-21 09:39:27,595 INFO [train.py:1028] (1/2) Epoch 20, batch 5550, loss[loss=0.1912, simple_loss=0.2478, pruned_loss=0.06732, over 13266.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2352, pruned_loss=0.06777, over 2570278.15 frames. ], batch size: 43, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:39:29,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=362587.5, ans=0.125 2024-06-21 09:39:32,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=362587.5, ans=0.025 2024-06-21 09:39:36,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=362605.8333333333, ans=0.0 2024-06-21 09:39:36,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.84 vs. limit=10.0 2024-06-21 09:39:42,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=362624.1666666667, ans=0.07 2024-06-21 09:39:45,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=362624.1666666667, ans=0.125 2024-06-21 09:40:00,418 INFO [train.py:1028] (1/2) Epoch 20, batch 5600, loss[loss=0.1725, simple_loss=0.2263, pruned_loss=0.05931, over 13208.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2348, pruned_loss=0.0677, over 2571706.47 frames. ], batch size: 89, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:40:00,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=362679.1666666667, ans=0.0 2024-06-21 09:40:05,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=362679.1666666667, ans=0.05 2024-06-21 09:40:11,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=362697.5, ans=0.0 2024-06-21 09:40:18,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=362715.8333333333, ans=0.0 2024-06-21 09:40:21,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=362715.8333333333, ans=0.125 2024-06-21 09:40:22,465 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.910e+02 2.001e+02 2.140e+02 3.055e+02, threshold=4.002e+02, percent-clipped=0.0 2024-06-21 09:40:35,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.82 vs. limit=10.0 2024-06-21 09:40:36,635 INFO [train.py:1028] (1/2) Epoch 20, batch 5650, loss[loss=0.2026, simple_loss=0.2528, pruned_loss=0.07626, over 12524.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2348, pruned_loss=0.06745, over 2576141.60 frames. 
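The Whitening lines compare a per-module statistic against a scheduled limit (metric=13.70 vs. limit=15.0 just above); when the metric exceeds the limit, the module's activations are penalized back toward a "white" covariance, i.e. decorrelated channels of equal variance. The exact statistic lives in scaling.py; as an illustrative stand-in with the same behaviour, the eigenvalue dispersion of the channel covariance is 1.0 for perfectly white features and grows as channels become correlated or unequal in scale:

    import torch

    def whiteness_metric(feats):
        """feats: (num_frames, num_channels).
        Returns mean(eig^2) / mean(eig)^2 over eigenvalues of the channel
        covariance: 1.0 iff the covariance is a multiple of the identity."""
        feats = feats - feats.mean(dim=0)
        cov = feats.t() @ feats / feats.shape[0]
        eig = torch.linalg.eigvalsh(cov)
        return ((eig ** 2).mean() / eig.mean() ** 2).item()

    torch.manual_seed(0)
    white = torch.randn(10000, 384)             # ~decorrelated channels
    correlated = white @ torch.randn(384, 384)  # mixes channels together
    print(whiteness_metric(white))       # close to 1.0
    print(whiteness_metric(correlated))  # noticeably larger, as in the lines above
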
], batch size: 202, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:40:49,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=362789.1666666667, ans=0.0 2024-06-21 09:40:53,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=362807.5, ans=0.125 2024-06-21 09:41:06,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=362844.1666666667, ans=0.125 2024-06-21 09:41:07,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=362844.1666666667, ans=0.125 2024-06-21 09:41:11,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=362844.1666666667, ans=0.125 2024-06-21 09:41:12,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=362862.5, ans=0.125 2024-06-21 09:41:13,448 INFO [train.py:1028] (1/2) Epoch 20, batch 5700, loss[loss=0.1857, simple_loss=0.2411, pruned_loss=0.0651, over 13245.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2343, pruned_loss=0.06732, over 2578808.62 frames. ], batch size: 63, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:41:13,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=362862.5, ans=0.0 2024-06-21 09:41:21,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=362880.8333333333, ans=0.95 2024-06-21 09:41:24,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=362880.8333333333, ans=0.125 2024-06-21 09:41:25,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=362880.8333333333, ans=0.125 2024-06-21 09:41:31,450 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.955e+02 2.086e+02 2.304e+02 2.978e+02, threshold=4.172e+02, percent-clipped=0.0 2024-06-21 09:41:37,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0 2024-06-21 09:41:37,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=362917.5, ans=0.125 2024-06-21 09:41:39,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=362935.8333333333, ans=0.2 2024-06-21 09:41:43,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=362935.8333333333, ans=0.025 2024-06-21 09:41:45,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=362954.1666666667, ans=0.0 2024-06-21 09:41:45,588 INFO [train.py:1028] (1/2) Epoch 20, batch 5750, loss[loss=0.2094, simple_loss=0.2574, pruned_loss=0.08074, over 12752.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2353, pruned_loss=0.06777, over 2579092.32 frames. 
], batch size: 176, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:41:47,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=362954.1666666667, ans=0.0 2024-06-21 09:41:51,299 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.22 vs. limit=15.0 2024-06-21 09:41:54,089 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.66 vs. limit=15.0 2024-06-21 09:42:09,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363009.1666666667, ans=0.1 2024-06-21 09:42:21,507 INFO [train.py:1028] (1/2) Epoch 20, batch 5800, loss[loss=0.199, simple_loss=0.2433, pruned_loss=0.07739, over 12753.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2366, pruned_loss=0.06872, over 2578574.43 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:42:43,274 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.062e+02 2.209e+02 2.485e+02 3.394e+02, threshold=4.418e+02, percent-clipped=0.0 2024-06-21 09:42:58,168 INFO [train.py:1028] (1/2) Epoch 20, batch 5850, loss[loss=0.2102, simple_loss=0.2568, pruned_loss=0.08179, over 12554.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2381, pruned_loss=0.06929, over 2576075.38 frames. ], batch size: 202, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:43:22,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=363192.5, ans=0.125 2024-06-21 09:43:24,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=363210.8333333333, ans=10.0 2024-06-21 09:43:27,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=363210.8333333333, ans=0.125 2024-06-21 09:43:28,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=363210.8333333333, ans=0.025 2024-06-21 09:43:29,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=12.0 2024-06-21 09:43:31,190 INFO [train.py:1028] (1/2) Epoch 20, batch 5900, loss[loss=0.1882, simple_loss=0.235, pruned_loss=0.07064, over 13110.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2396, pruned_loss=0.06947, over 2575899.33 frames. ], batch size: 121, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:43:35,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=363229.1666666667, ans=0.035 2024-06-21 09:43:37,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=363247.5, ans=0.125 2024-06-21 09:43:42,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=363247.5, ans=0.1 2024-06-21 09:43:49,830 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 1.999e+02 2.139e+02 2.355e+02 3.485e+02, threshold=4.278e+02, percent-clipped=0.0 2024-06-21 09:43:55,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.99 vs. 
limit=10.0 2024-06-21 09:44:02,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=363302.5, ans=0.1 2024-06-21 09:44:04,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=363302.5, ans=0.125 2024-06-21 09:44:08,174 INFO [train.py:1028] (1/2) Epoch 20, batch 5950, loss[loss=0.193, simple_loss=0.243, pruned_loss=0.07148, over 13088.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2409, pruned_loss=0.06984, over 2580679.99 frames. ], batch size: 121, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:44:08,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=363320.8333333333, ans=0.125 2024-06-21 09:44:13,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=363320.8333333333, ans=0.025 2024-06-21 09:44:44,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=363412.5, ans=0.0 2024-06-21 09:44:44,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=363412.5, ans=15.0 2024-06-21 09:44:44,803 INFO [train.py:1028] (1/2) Epoch 20, batch 6000, loss[loss=0.2297, simple_loss=0.266, pruned_loss=0.09675, over 12230.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2426, pruned_loss=0.0707, over 2573901.48 frames. ], batch size: 241, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:44:44,803 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 09:44:50,539 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.8118, 2.4892, 3.2106, 1.7203], device='cuda:1') 2024-06-21 09:44:52,768 INFO [train.py:1060] (1/2) Epoch 20, validation: loss=0.1874, simple_loss=0.2514, pruned_loss=0.06175, over 351949.00 frames. 2024-06-21 09:44:52,768 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 09:45:03,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=363430.8333333333, ans=0.05 2024-06-21 09:45:08,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=363449.1666666667, ans=0.125 2024-06-21 09:45:10,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. 
limit=15.0 2024-06-21 09:45:11,468 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.069e+02 2.236e+02 2.466e+02 3.790e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-21 09:45:15,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=363467.5, ans=0.0 2024-06-21 09:45:17,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=363467.5, ans=0.1 2024-06-21 09:45:18,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=363467.5, ans=0.125 2024-06-21 09:45:23,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=363485.8333333333, ans=0.125 2024-06-21 09:45:26,232 INFO [train.py:1028] (1/2) Epoch 20, batch 6050, loss[loss=0.1747, simple_loss=0.2319, pruned_loss=0.05876, over 12880.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2437, pruned_loss=0.07087, over 2577483.13 frames. ], batch size: 39, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:45:36,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=363522.5, ans=0.1 2024-06-21 09:45:43,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=363540.8333333333, ans=0.1 2024-06-21 09:45:43,468 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2024-06-21 09:45:55,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=363577.5, ans=0.025 2024-06-21 09:45:58,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.47 vs. limit=22.5 2024-06-21 09:45:59,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=363595.8333333333, ans=0.125 2024-06-21 09:45:59,940 INFO [train.py:1028] (1/2) Epoch 20, batch 6100, loss[loss=0.1932, simple_loss=0.2424, pruned_loss=0.07194, over 13141.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2448, pruned_loss=0.07107, over 2579732.19 frames. ], batch size: 121, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:46:21,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=363632.5, ans=0.125 2024-06-21 09:46:23,691 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 2.015e+02 2.173e+02 2.355e+02 4.142e+02, threshold=4.346e+02, percent-clipped=0.0 2024-06-21 09:46:25,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=363650.8333333333, ans=0.0 2024-06-21 09:46:33,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=363669.1666666667, ans=0.0 2024-06-21 09:46:33,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=363669.1666666667, ans=0.0 2024-06-21 09:46:38,120 INFO [train.py:1028] (1/2) Epoch 20, batch 6150, loss[loss=0.2062, simple_loss=0.2485, pruned_loss=0.08198, over 10819.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2466, pruned_loss=0.07202, over 2578563.71 frames. 
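During the validation pass at batch 6000 above, the trainer also printed attn_weights_entropy for one self-attention module (tensor([4.8118, 2.4892, 3.2106, 1.7203])): roughly one entropy value per attention head, where low entropy means the head concentrates on few keys and high entropy means it spreads attention broadly. A hedged sketch of how such a per-head diagnostic can be computed; the actual hook lives in zipformer.py, and the shapes below are assumptions:

    import torch

    def per_head_attention_entropy(attn_weights):
        """attn_weights: (num_heads, num_queries, num_keys), rows summing to 1.
        Returns the mean entropy (in nats) of each head's attention distribution."""
        entropy = -(attn_weights * (attn_weights + 1e-20).log()).sum(dim=-1)
        return entropy.mean(dim=-1)  # one value per head

    torch.manual_seed(0)
    attn = torch.softmax(torch.randn(4, 16, 64), dim=-1)  # 4 heads, assumed sizes
    print(per_head_attention_entropy(attn))  # one entropy per head, as logged above
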
], batch size: 303, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:46:40,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.75 vs. limit=22.5 2024-06-21 09:46:46,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363705.8333333333, ans=0.1 2024-06-21 09:46:50,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=363705.8333333333, ans=0.0 2024-06-21 09:47:03,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=363742.5, ans=0.1 2024-06-21 09:47:14,741 INFO [train.py:1028] (1/2) Epoch 20, batch 6200, loss[loss=0.2082, simple_loss=0.2665, pruned_loss=0.07491, over 13265.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2482, pruned_loss=0.07241, over 2576326.66 frames. ], batch size: 89, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:47:24,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.80 vs. limit=15.0 2024-06-21 09:47:30,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=363815.8333333333, ans=0.125 2024-06-21 09:47:33,340 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.066e+02 2.295e+02 2.576e+02 4.288e+02, threshold=4.590e+02, percent-clipped=0.0 2024-06-21 09:47:40,442 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.92 vs. limit=22.5 2024-06-21 09:47:48,058 INFO [train.py:1028] (1/2) Epoch 20, batch 6250, loss[loss=0.2115, simple_loss=0.2615, pruned_loss=0.08073, over 13234.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2499, pruned_loss=0.07321, over 2568986.97 frames. ], batch size: 83, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:48:03,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363907.5, ans=0.1 2024-06-21 09:48:03,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=363907.5, ans=0.0 2024-06-21 09:48:04,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=363907.5, ans=0.125 2024-06-21 09:48:15,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=363925.8333333333, ans=0.125 2024-06-21 09:48:17,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.31 vs. limit=22.5 2024-06-21 09:48:20,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=363944.1666666667, ans=0.2 2024-06-21 09:48:21,657 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. 
limit=12.0 2024-06-21 09:48:22,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=363944.1666666667, ans=0.125 2024-06-21 09:48:23,683 INFO [train.py:1028] (1/2) Epoch 20, batch 6300, loss[loss=0.208, simple_loss=0.2624, pruned_loss=0.07683, over 11351.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2512, pruned_loss=0.07356, over 2563371.81 frames. ], batch size: 16, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:48:31,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=363980.8333333333, ans=0.025 2024-06-21 09:48:34,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=363980.8333333333, ans=0.125 2024-06-21 09:48:34,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=363980.8333333333, ans=0.025 2024-06-21 09:48:36,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=363999.1666666667, ans=0.0 2024-06-21 09:48:36,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=363999.1666666667, ans=0.0 2024-06-21 09:48:42,317 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.710e+02 2.091e+02 2.323e+02 2.620e+02 4.679e+02, threshold=4.647e+02, percent-clipped=1.0 2024-06-21 09:48:50,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=364017.5, ans=0.0 2024-06-21 09:48:58,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=364035.8333333333, ans=0.125 2024-06-21 09:48:59,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=364035.8333333333, ans=0.0 2024-06-21 09:49:01,031 INFO [train.py:1028] (1/2) Epoch 20, batch 6350, loss[loss=0.2182, simple_loss=0.2688, pruned_loss=0.08382, over 12571.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2533, pruned_loss=0.07388, over 2572896.13 frames. ], batch size: 202, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:49:01,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=364054.1666666667, ans=0.0 2024-06-21 09:49:01,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=364054.1666666667, ans=0.2 2024-06-21 09:49:09,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=364072.5, ans=0.125 2024-06-21 09:49:11,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.62 vs. limit=22.5 2024-06-21 09:49:12,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=364072.5, ans=0.0 2024-06-21 09:49:13,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.79 vs. limit=15.0 2024-06-21 09:49:17,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.46 vs. 
limit=15.0 2024-06-21 09:49:19,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=364090.8333333333, ans=0.125 2024-06-21 09:49:30,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.25 vs. limit=15.0 2024-06-21 09:49:33,718 INFO [train.py:1028] (1/2) Epoch 20, batch 6400, loss[loss=0.1998, simple_loss=0.2577, pruned_loss=0.07099, over 13246.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2546, pruned_loss=0.07431, over 2574720.37 frames. ], batch size: 67, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:49:42,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=364164.1666666667, ans=0.125 2024-06-21 09:49:43,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=364164.1666666667, ans=0.025 2024-06-21 09:49:49,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=364182.5, ans=0.2 2024-06-21 09:49:52,045 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.063e+02 2.243e+02 2.479e+02 3.217e+02, threshold=4.485e+02, percent-clipped=0.0 2024-06-21 09:49:52,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=364200.8333333333, ans=0.025 2024-06-21 09:49:57,795 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.22 vs. limit=15.0 2024-06-21 09:49:59,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=364219.1666666667, ans=0.125 2024-06-21 09:50:06,520 INFO [train.py:1028] (1/2) Epoch 20, batch 6450, loss[loss=0.2457, simple_loss=0.2932, pruned_loss=0.09914, over 12540.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.257, pruned_loss=0.07556, over 2580032.95 frames. ], batch size: 202, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:50:07,553 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.676e-02 2024-06-21 09:50:13,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=364237.5, ans=0.2 2024-06-21 09:50:16,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=364237.5, ans=0.0 2024-06-21 09:50:30,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=364292.5, ans=0.2 2024-06-21 09:50:37,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=364310.8333333333, ans=0.0 2024-06-21 09:50:41,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=364310.8333333333, ans=0.125 2024-06-21 09:50:44,390 INFO [train.py:1028] (1/2) Epoch 20, batch 6500, loss[loss=0.2214, simple_loss=0.2638, pruned_loss=0.08954, over 10588.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2585, pruned_loss=0.07589, over 2582581.44 frames. 
], batch size: 303, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:50:47,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=364329.1666666667, ans=0.125
2024-06-21 09:50:52,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=364347.5, ans=0.015
2024-06-21 09:51:06,098 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.114e+02 2.223e+02 2.494e+02 3.269e+02, threshold=4.445e+02, percent-clipped=0.0
2024-06-21 09:51:16,061 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.34 vs. limit=6.0
2024-06-21 09:51:17,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=364402.5, ans=0.125
2024-06-21 09:51:20,550 INFO [train.py:1028] (1/2) Epoch 20, batch 6550, loss[loss=0.1795, simple_loss=0.242, pruned_loss=0.05849, over 12604.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2593, pruned_loss=0.07565, over 2587268.68 frames. ], batch size: 22, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:51:26,177 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.97 vs. limit=15.0
2024-06-21 09:51:27,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=364439.1666666667, ans=0.2
2024-06-21 09:51:31,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=364439.1666666667, ans=0.125
2024-06-21 09:51:35,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.67 vs. limit=22.5
2024-06-21 09:51:53,286 INFO [train.py:1028] (1/2) Epoch 20, batch 6600, loss[loss=0.2202, simple_loss=0.2743, pruned_loss=0.0831, over 13106.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2592, pruned_loss=0.07564, over 2589793.94 frames. ], batch size: 71, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:51:55,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=364512.5, ans=0.125
2024-06-21 09:51:59,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=364530.8333333333, ans=0.0
2024-06-21 09:52:05,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0
2024-06-21 09:52:12,282 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.095e+02 2.268e+02 2.540e+02 3.528e+02, threshold=4.537e+02, percent-clipped=0.0
2024-06-21 09:52:12,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=364549.1666666667, ans=0.0
2024-06-21 09:52:25,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=364585.8333333333, ans=0.125
2024-06-21 09:52:31,186 INFO [train.py:1028] (1/2) Epoch 20, batch 6650, loss[loss=0.2187, simple_loss=0.2726, pruned_loss=0.08234, over 12937.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2611, pruned_loss=0.07637, over 2583746.70 frames. ], batch size: 158, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:52:33,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=364604.1666666667, ans=0.125
2024-06-21 09:52:35,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=364604.1666666667, ans=0.125
2024-06-21 09:52:39,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=364622.5, ans=0.125
2024-06-21 09:52:44,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=364640.8333333333, ans=0.1
2024-06-21 09:52:55,755 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.76 vs. limit=22.5
2024-06-21 09:53:07,200 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.56 vs. limit=15.0
2024-06-21 09:53:08,141 INFO [train.py:1028] (1/2) Epoch 20, batch 6700, loss[loss=0.231, simple_loss=0.2843, pruned_loss=0.08886, over 12762.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2618, pruned_loss=0.07665, over 2583520.56 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:53:08,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=364695.8333333333, ans=0.1
2024-06-21 09:53:13,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=364695.8333333333, ans=0.2
2024-06-21 09:53:24,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=364732.5, ans=0.125
2024-06-21 09:53:26,949 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.099e+02 2.245e+02 2.533e+02 3.822e+02, threshold=4.490e+02, percent-clipped=0.0
2024-06-21 09:53:39,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=364769.1666666667, ans=0.125
2024-06-21 09:53:40,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=364787.5, ans=0.2
2024-06-21 09:53:41,367 INFO [train.py:1028] (1/2) Epoch 20, batch 6750, loss[loss=0.269, simple_loss=0.3048, pruned_loss=0.1166, over 12168.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2632, pruned_loss=0.07747, over 2578103.39 frames. ], batch size: 240, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:53:52,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364805.8333333333, ans=0.1
2024-06-21 09:54:11,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=364860.8333333333, ans=0.125
2024-06-21 09:54:14,125 INFO [train.py:1028] (1/2) Epoch 20, batch 6800, loss[loss=0.1822, simple_loss=0.2343, pruned_loss=0.06502, over 13249.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2632, pruned_loss=0.07723, over 2580562.88 frames. ], batch size: 67, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:54:29,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=364897.5, ans=0.025
2024-06-21 09:54:30,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364915.8333333333, ans=0.1
2024-06-21 09:54:35,714 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.103e+02 2.246e+02 2.436e+02 3.776e+02, threshold=4.493e+02, percent-clipped=0.0
2024-06-21 09:54:45,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=364952.5, ans=0.125
2024-06-21 09:54:46,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0
2024-06-21 09:54:50,548 INFO [train.py:1028] (1/2) Epoch 20, batch 6850, loss[loss=0.2113, simple_loss=0.2766, pruned_loss=0.07299, over 13255.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2634, pruned_loss=0.0769, over 2584836.60 frames. ], batch size: 63, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:54:59,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=364989.1666666667, ans=0.125
2024-06-21 09:55:15,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=365025.8333333333, ans=0.2
2024-06-21 09:55:15,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=365025.8333333333, ans=0.125
2024-06-21 09:55:20,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=365044.1666666667, ans=0.07
2024-06-21 09:55:26,612 INFO [train.py:1028] (1/2) Epoch 20, batch 6900, loss[loss=0.2104, simple_loss=0.2714, pruned_loss=0.07466, over 13264.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.264, pruned_loss=0.07711, over 2586423.25 frames. ], batch size: 49, lr: 2.89e-03, grad_scale: 32.0
2024-06-21 09:55:27,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=15.0
2024-06-21 09:55:31,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=365062.5, ans=0.125
2024-06-21 09:55:35,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0
2024-06-21 09:55:44,613 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.109e+02 2.233e+02 2.426e+02 3.290e+02, threshold=4.467e+02, percent-clipped=0.0
2024-06-21 09:55:52,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=365135.8333333333, ans=0.125
2024-06-21 09:55:59,287 INFO [train.py:1028] (1/2) Epoch 20, batch 6950, loss[loss=0.183, simple_loss=0.2397, pruned_loss=0.06319, over 11692.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2648, pruned_loss=0.07709, over 2580344.73 frames. ], batch size: 17, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 09:56:08,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0
2024-06-21 09:56:09,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=365172.5, ans=0.125
2024-06-21 09:56:15,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=365190.8333333333, ans=0.2
2024-06-21 09:56:15,898 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=5.152e-03
2024-06-21 09:56:17,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=365190.8333333333, ans=0.05
2024-06-21 09:56:31,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0
2024-06-21 09:56:35,531 INFO [train.py:1028] (1/2) Epoch 20, batch 7000, loss[loss=0.218, simple_loss=0.2762, pruned_loss=0.07992, over 12932.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2648, pruned_loss=0.07695, over 2576080.55 frames. ], batch size: 158, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 09:56:40,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=365245.8333333333, ans=0.0
2024-06-21 09:56:45,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.35 vs. limit=22.5
2024-06-21 09:56:53,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365282.5, ans=0.1
2024-06-21 09:56:53,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=365282.5, ans=0.0
2024-06-21 09:56:54,404 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.802e+02 2.092e+02 2.205e+02 2.416e+02 3.281e+02, threshold=4.410e+02, percent-clipped=0.0
2024-06-21 09:56:54,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=365282.5, ans=0.0
2024-06-21 09:57:14,386 INFO [train.py:1028] (1/2) Epoch 20, batch 7050, loss[loss=0.2149, simple_loss=0.2691, pruned_loss=0.08033, over 12770.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2667, pruned_loss=0.07794, over 2582820.71 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 09:57:16,955 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.87 vs. limit=15.0
2024-06-21 09:57:28,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=365374.1666666667, ans=0.025
2024-06-21 09:57:29,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.19 vs. limit=10.0
2024-06-21 09:57:32,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=365374.1666666667, ans=0.125
2024-06-21 09:57:37,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=365392.5, ans=0.125
2024-06-21 09:57:43,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=365410.8333333333, ans=0.025
2024-06-21 09:57:46,916 INFO [train.py:1028] (1/2) Epoch 20, batch 7100, loss[loss=0.2391, simple_loss=0.2929, pruned_loss=0.09262, over 13170.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2674, pruned_loss=0.07862, over 2575659.48 frames. ], batch size: 112, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 09:57:47,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=365429.1666666667, ans=0.125
2024-06-21 09:57:56,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.63 vs. limit=15.0
2024-06-21 09:57:56,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365447.5, ans=0.1
2024-06-21 09:57:58,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=365447.5, ans=0.125
2024-06-21 09:58:05,313 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.127e+02 2.309e+02 2.474e+02 3.679e+02, threshold=4.619e+02, percent-clipped=0.0
2024-06-21 09:58:07,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=365484.1666666667, ans=0.125
2024-06-21 09:58:07,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=365484.1666666667, ans=0.125
2024-06-21 09:58:07,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=365484.1666666667, ans=0.025
2024-06-21 09:58:12,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=365502.5, ans=0.125
2024-06-21 09:58:19,855 INFO [train.py:1028] (1/2) Epoch 20, batch 7150, loss[loss=0.2485, simple_loss=0.2938, pruned_loss=0.1016, over 12509.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2681, pruned_loss=0.07868, over 2574605.36 frames. ], batch size: 202, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 09:58:24,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=365520.8333333333, ans=0.0
2024-06-21 09:58:25,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=365520.8333333333, ans=0.125
2024-06-21 09:58:37,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365557.5, ans=0.1
2024-06-21 09:58:40,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=365557.5, ans=0.0
2024-06-21 09:58:41,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=365557.5, ans=0.0
2024-06-21 09:58:55,985 INFO [train.py:1028] (1/2) Epoch 20, batch 7200, loss[loss=0.2319, simple_loss=0.2853, pruned_loss=0.08924, over 13177.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2689, pruned_loss=0.07874, over 2578554.45 frames. ], batch size: 112, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 09:59:02,888 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.67 vs. limit=15.0
2024-06-21 09:59:10,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=365649.1666666667, ans=0.125
2024-06-21 09:59:16,620 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.43 vs. limit=12.0
2024-06-21 09:59:17,430 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.134e+02 2.298e+02 2.582e+02 4.025e+02, threshold=4.597e+02, percent-clipped=0.0
2024-06-21 09:59:17,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=365649.1666666667, ans=0.125
2024-06-21 09:59:17,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=365649.1666666667, ans=0.125
2024-06-21 09:59:31,001 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:59:32,237 INFO [train.py:1028] (1/2) Epoch 20, batch 7250, loss[loss=0.2141, simple_loss=0.281, pruned_loss=0.07362, over 12919.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2692, pruned_loss=0.07848, over 2578808.81 frames. ], batch size: 36, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 09:59:33,921 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.09 vs. limit=22.5
2024-06-21 09:59:37,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=365704.1666666667, ans=0.1
2024-06-21 09:59:42,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=365722.5, ans=0.2
2024-06-21 09:59:49,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=365740.8333333333, ans=0.1
2024-06-21 09:59:54,466 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0
2024-06-21 10:00:04,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.79 vs. limit=10.0
2024-06-21 10:00:04,842 INFO [train.py:1028] (1/2) Epoch 20, batch 7300, loss[loss=0.2239, simple_loss=0.2887, pruned_loss=0.07955, over 12893.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2705, pruned_loss=0.0789, over 2579041.21 frames. ], batch size: 36, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 10:00:06,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=365795.8333333333, ans=0.5
2024-06-21 10:00:07,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=365795.8333333333, ans=0.0
2024-06-21 10:00:09,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=365795.8333333333, ans=0.025
2024-06-21 10:00:14,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=365814.1666666667, ans=0.04949747468305833
2024-06-21 10:00:16,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=365814.1666666667, ans=0.05
2024-06-21 10:00:23,785 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.125e+02 2.306e+02 2.527e+02 3.412e+02, threshold=4.613e+02, percent-clipped=0.0
2024-06-21 10:00:34,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=365869.1666666667, ans=0.125
2024-06-21 10:00:37,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=365869.1666666667, ans=0.1
2024-06-21 10:00:38,808 INFO [train.py:1028] (1/2) Epoch 20, batch 7350, loss[loss=0.2378, simple_loss=0.2972, pruned_loss=0.08918, over 13345.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2711, pruned_loss=0.07908, over 2581559.78 frames. ], batch size: 46, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 10:00:53,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=365905.8333333333, ans=0.125
2024-06-21 10:01:08,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=365942.5, ans=22.5
2024-06-21 10:01:16,004 INFO [train.py:1028] (1/2) Epoch 20, batch 7400, loss[loss=0.2355, simple_loss=0.297, pruned_loss=0.08695, over 13265.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2709, pruned_loss=0.07885, over 2586589.16 frames. ], batch size: 63, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 10:01:38,869 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.173e+02 2.370e+02 2.613e+02 3.530e+02, threshold=4.740e+02, percent-clipped=0.0
2024-06-21 10:01:53,637 INFO [train.py:1028] (1/2) Epoch 20, batch 7450, loss[loss=0.2025, simple_loss=0.2641, pruned_loss=0.07044, over 12639.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2711, pruned_loss=0.07881, over 2580411.12 frames. ], batch size: 29, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 10:01:54,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.77 vs. limit=15.0
2024-06-21 10:01:55,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=366070.8333333333, ans=0.125
2024-06-21 10:02:00,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.92 vs. limit=22.5
2024-06-21 10:02:19,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.11 vs. limit=22.5
2024-06-21 10:02:27,376 INFO [train.py:1028] (1/2) Epoch 20, batch 7500, loss[loss=0.2349, simple_loss=0.2807, pruned_loss=0.0945, over 10824.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2717, pruned_loss=0.07922, over 2578314.03 frames. ], batch size: 304, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 10:02:28,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=366162.5, ans=0.125
2024-06-21 10:02:29,029 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.87 vs. limit=6.0
2024-06-21 10:02:38,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=366180.8333333333, ans=0.125
2024-06-21 10:02:40,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=366199.1666666667, ans=0.05
2024-06-21 10:02:44,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=366199.1666666667, ans=0.0
2024-06-21 10:02:44,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366199.1666666667, ans=0.1
2024-06-21 10:02:48,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=366199.1666666667, ans=0.125
2024-06-21 10:02:49,103 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.107e+02 2.234e+02 2.363e+02 3.209e+02, threshold=4.469e+02, percent-clipped=0.0
2024-06-21 10:03:03,848 INFO [train.py:1028] (1/2) Epoch 20, batch 7550, loss[loss=0.2288, simple_loss=0.2835, pruned_loss=0.08698, over 12906.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2722, pruned_loss=0.07977, over 2576987.50 frames. ], batch size: 158, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 10:03:11,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=366272.5, ans=0.125
2024-06-21 10:03:17,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=366290.8333333333, ans=0.125
2024-06-21 10:03:17,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=366290.8333333333, ans=0.02
2024-06-21 10:03:18,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=366290.8333333333, ans=0.125
2024-06-21 10:03:30,706 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.06 vs. limit=6.0
2024-06-21 10:03:40,624 INFO [train.py:1028] (1/2) Epoch 20, batch 7600, loss[loss=0.2216, simple_loss=0.2745, pruned_loss=0.08433, over 13233.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2733, pruned_loss=0.08035, over 2577013.53 frames. ], batch size: 83, lr: 2.89e-03, grad_scale: 64.0
2024-06-21 10:03:54,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366382.5, ans=0.1
2024-06-21 10:03:59,186 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.211e+02 2.418e+02 2.659e+02 3.492e+02, threshold=4.837e+02, percent-clipped=0.0
2024-06-21 10:04:00,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=366400.8333333333, ans=0.0
2024-06-21 10:04:01,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=366400.8333333333, ans=0.125
2024-06-21 10:04:04,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=366400.8333333333, ans=0.0
2024-06-21 10:04:12,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=12.0
2024-06-21 10:04:13,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=366437.5, ans=0.125
2024-06-21 10:04:13,914 INFO [train.py:1028] (1/2) Epoch 20, batch 7650, loss[loss=0.1938, simple_loss=0.255, pruned_loss=0.06629, over 12855.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2734, pruned_loss=0.08041, over 2572759.55 frames. ], batch size: 33, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:04:23,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366455.8333333333, ans=0.1
2024-06-21 10:04:25,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.23 vs. limit=15.0
2024-06-21 10:04:45,719 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.41 vs. limit=12.0
2024-06-21 10:04:51,324 INFO [train.py:1028] (1/2) Epoch 20, batch 7700, loss[loss=0.2066, simple_loss=0.2681, pruned_loss=0.07253, over 13271.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2741, pruned_loss=0.0808, over 2570326.52 frames. ], batch size: 63, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:04:58,011 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.51 vs. limit=15.0
2024-06-21 10:04:59,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=366547.5, ans=0.125
2024-06-21 10:05:00,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=366547.5, ans=0.07
2024-06-21 10:05:04,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=366565.8333333333, ans=0.125
2024-06-21 10:05:04,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=366565.8333333333, ans=0.0
2024-06-21 10:05:05,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.28 vs. limit=15.0
2024-06-21 10:05:09,155 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.174e+02 2.359e+02 2.625e+02 3.754e+02, threshold=4.718e+02, percent-clipped=0.0
2024-06-21 10:05:09,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=366565.8333333333, ans=0.125
2024-06-21 10:05:25,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=366602.5, ans=0.2
2024-06-21 10:05:25,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=366602.5, ans=10.0
2024-06-21 10:05:26,867 INFO [train.py:1028] (1/2) Epoch 20, batch 7750, loss[loss=0.2007, simple_loss=0.2621, pruned_loss=0.06967, over 13248.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2745, pruned_loss=0.08137, over 2574362.54 frames. ], batch size: 72, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:05:33,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=366639.1666666667, ans=0.125
2024-06-21 10:05:40,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366657.5, ans=0.1
2024-06-21 10:05:51,002 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 10:05:53,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366675.8333333333, ans=0.1
2024-06-21 10:06:02,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=366694.1666666667, ans=0.125
2024-06-21 10:06:05,672 INFO [train.py:1028] (1/2) Epoch 20, batch 7800, loss[loss=0.2393, simple_loss=0.2918, pruned_loss=0.09336, over 13192.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2756, pruned_loss=0.08163, over 2579375.59 frames. ], batch size: 95, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:06:20,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=366749.1666666667, ans=0.1
2024-06-21 10:06:22,522 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0
2024-06-21 10:06:24,962 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.170e+02 2.348e+02 2.597e+02 3.485e+02, threshold=4.696e+02, percent-clipped=0.0
2024-06-21 10:06:29,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=366767.5, ans=15.0
2024-06-21 10:06:39,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0
2024-06-21 10:06:41,959 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.85 vs. limit=22.5
2024-06-21 10:06:42,971 INFO [train.py:1028] (1/2) Epoch 20, batch 7850, loss[loss=0.1837, simple_loss=0.2414, pruned_loss=0.06298, over 11452.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2761, pruned_loss=0.08179, over 2572964.36 frames. ], batch size: 17, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:06:45,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=366804.1666666667, ans=0.125
2024-06-21 10:06:46,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=15.0
2024-06-21 10:06:48,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=366804.1666666667, ans=0.0
2024-06-21 10:06:48,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.15 vs. limit=15.0
2024-06-21 10:06:49,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=366822.5, ans=0.125
2024-06-21 10:06:55,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=366822.5, ans=0.0
2024-06-21 10:06:55,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.61 vs. limit=15.0
2024-06-21 10:07:01,841 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0
2024-06-21 10:07:03,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=366859.1666666667, ans=0.07
2024-06-21 10:07:06,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=366859.1666666667, ans=0.125
2024-06-21 10:07:09,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=366877.5, ans=0.0
2024-06-21 10:07:19,210 INFO [train.py:1028] (1/2) Epoch 20, batch 7900, loss[loss=0.2096, simple_loss=0.2685, pruned_loss=0.0754, over 13187.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2761, pruned_loss=0.08185, over 2572717.49 frames. ], batch size: 77, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:07:33,173 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0
2024-06-21 10:07:37,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=366932.5, ans=0.125
2024-06-21 10:07:38,048 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.166e+02 2.375e+02 2.582e+02 3.560e+02, threshold=4.751e+02, percent-clipped=0.0
2024-06-21 10:07:47,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=366969.1666666667, ans=0.2
2024-06-21 10:07:52,817 INFO [train.py:1028] (1/2) Epoch 20, batch 7950, loss[loss=0.2368, simple_loss=0.2738, pruned_loss=0.0999, over 10706.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2764, pruned_loss=0.08191, over 2575580.76 frames. ], batch size: 303, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:07:56,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=366987.5, ans=0.125
2024-06-21 10:08:05,658 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.08 vs. limit=15.0
2024-06-21 10:08:07,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=367024.1666666667, ans=0.125
2024-06-21 10:08:18,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=367042.5, ans=0.125
2024-06-21 10:08:18,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.02 vs. limit=22.5
2024-06-21 10:08:24,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=367060.8333333333, ans=0.0
2024-06-21 10:08:25,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=367079.1666666667, ans=0.2
2024-06-21 10:08:26,371 INFO [train.py:1028] (1/2) Epoch 20, batch 8000, loss[loss=0.1964, simple_loss=0.2596, pruned_loss=0.06655, over 12602.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2764, pruned_loss=0.08183, over 2572755.12 frames. ], batch size: 29, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:08:48,011 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.749e+02 2.192e+02 2.342e+02 2.556e+02 3.161e+02, threshold=4.685e+02, percent-clipped=0.0
2024-06-21 10:08:48,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.25 vs. limit=10.0
2024-06-21 10:08:49,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=367134.1666666667, ans=0.125
2024-06-21 10:08:55,957 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.97 vs. limit=10.0
2024-06-21 10:08:56,541 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.99 vs. limit=22.5
2024-06-21 10:09:02,881 INFO [train.py:1028] (1/2) Epoch 20, batch 8050, loss[loss=0.212, simple_loss=0.2652, pruned_loss=0.07938, over 13226.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2758, pruned_loss=0.08155, over 2573026.06 frames. ], batch size: 83, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:09:09,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=12.0
2024-06-21 10:09:29,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=367225.8333333333, ans=0.0
2024-06-21 10:09:31,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=367225.8333333333, ans=0.015
2024-06-21 10:09:33,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=367225.8333333333, ans=0.125
2024-06-21 10:09:34,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=367244.1666666667, ans=0.125
2024-06-21 10:09:36,142 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.439e-01
2024-06-21 10:09:40,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=367262.5, ans=0.2
2024-06-21 10:09:40,886 INFO [train.py:1028] (1/2) Epoch 20, batch 8100, loss[loss=0.2242, simple_loss=0.2767, pruned_loss=0.08587, over 13174.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2762, pruned_loss=0.08174, over 2576935.23 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:09:51,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=367280.8333333333, ans=0.0
2024-06-21 10:09:52,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=367280.8333333333, ans=0.0
2024-06-21 10:09:59,139 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.165e+02 2.309e+02 2.498e+02 3.259e+02, threshold=4.617e+02, percent-clipped=0.0
2024-06-21 10:10:13,616 INFO [train.py:1028] (1/2) Epoch 20, batch 8150, loss[loss=0.2028, simple_loss=0.261, pruned_loss=0.07231, over 13105.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2765, pruned_loss=0.08126, over 2579087.78 frames. ], batch size: 121, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:10:17,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0
2024-06-21 10:10:21,760 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.21 vs. limit=22.5
2024-06-21 10:10:25,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=367390.8333333333, ans=0.0
2024-06-21 10:10:28,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.39 vs. limit=22.5
2024-06-21 10:10:49,096 INFO [train.py:1028] (1/2) Epoch 20, batch 8200, loss[loss=0.2211, simple_loss=0.2864, pruned_loss=0.0779, over 13157.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2763, pruned_loss=0.08121, over 2582934.86 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:10:54,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.77 vs. limit=22.5
2024-06-21 10:11:00,376 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0
2024-06-21 10:11:07,730 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.150e+02 2.286e+02 2.575e+02 3.362e+02, threshold=4.572e+02, percent-clipped=0.0
2024-06-21 10:11:15,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0
2024-06-21 10:11:25,420 INFO [train.py:1028] (1/2) Epoch 20, batch 8250, loss[loss=0.2345, simple_loss=0.2967, pruned_loss=0.08618, over 13213.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2771, pruned_loss=0.08146, over 2584044.38 frames. ], batch size: 52, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:11:25,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=367537.5, ans=0.0
2024-06-21 10:11:35,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=367555.8333333333, ans=0.125
2024-06-21 10:11:37,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0
2024-06-21 10:11:39,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=367574.1666666667, ans=0.2
2024-06-21 10:11:52,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=367592.5, ans=0.125
2024-06-21 10:11:53,949 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.78 vs. limit=15.0
2024-06-21 10:11:58,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=367610.8333333333, ans=0.0
2024-06-21 10:11:58,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0
2024-06-21 10:12:01,590 INFO [train.py:1028] (1/2) Epoch 20, batch 8300, loss[loss=0.2356, simple_loss=0.287, pruned_loss=0.09204, over 13151.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.277, pruned_loss=0.08121, over 2581757.52 frames. ], batch size: 103, lr: 2.88e-03, grad_scale: 64.0
2024-06-21 10:12:04,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=367629.1666666667, ans=0.025
2024-06-21 10:12:18,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=367665.8333333333, ans=0.125
2024-06-21 10:12:23,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.208e+02 2.306e+02 2.518e+02 3.260e+02, threshold=4.612e+02, percent-clipped=0.0
2024-06-21 10:12:31,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=367684.1666666667, ans=0.0
2024-06-21 10:12:40,476 INFO [train.py:1028] (1/2) Epoch 20, batch 8350, loss[loss=0.2321, simple_loss=0.2907, pruned_loss=0.08678, over 13148.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2772, pruned_loss=0.08121, over 2582178.95 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:12:59,562 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.06 vs. limit=22.5
2024-06-21 10:13:07,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=367775.8333333333, ans=0.2
2024-06-21 10:13:07,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=367775.8333333333, ans=0.0
2024-06-21 10:13:11,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=367775.8333333333, ans=0.2
2024-06-21 10:13:23,251 INFO [train.py:1028] (1/2) Epoch 20, batch 8400, loss[loss=0.2139, simple_loss=0.2754, pruned_loss=0.07623, over 12991.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2773, pruned_loss=0.08145, over 2577257.36 frames. ], batch size: 39, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:13:27,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.68 vs. limit=15.0
2024-06-21 10:13:28,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=367812.5, ans=0.125
2024-06-21 10:13:31,599 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.44 vs. limit=22.5
2024-06-21 10:13:39,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=22.5
2024-06-21 10:13:46,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=367849.1666666667, ans=0.125
2024-06-21 10:13:50,110 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.893e+02 2.216e+02 2.349e+02 2.557e+02 3.174e+02, threshold=4.697e+02, percent-clipped=0.0
2024-06-21 10:13:51,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=367867.5, ans=0.0
2024-06-21 10:13:57,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=367867.5, ans=22.5
2024-06-21 10:14:05,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=367885.8333333333, ans=0.125
2024-06-21 10:14:06,531 INFO [train.py:1028] (1/2) Epoch 20, batch 8450, loss[loss=0.2251, simple_loss=0.2747, pruned_loss=0.08771, over 13146.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2779, pruned_loss=0.08169, over 2579928.14 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:14:21,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=367922.5, ans=0.2
2024-06-21 10:14:26,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=367940.8333333333, ans=0.125
2024-06-21 10:14:27,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=367940.8333333333, ans=0.2
2024-06-21 10:14:37,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=367977.5, ans=0.125
2024-06-21 10:14:37,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=367977.5, ans=0.2
2024-06-21 10:14:38,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=367977.5, ans=0.0
2024-06-21 10:14:39,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=367977.5, ans=6.0
2024-06-21 10:14:45,936 INFO [train.py:1028] (1/2) Epoch 20, batch 8500, loss[loss=0.2048, simple_loss=0.2627, pruned_loss=0.07347, over 12569.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2788, pruned_loss=0.08216, over 2578117.26 frames. ], batch size: 29, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:15:08,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.26 vs. limit=15.0
2024-06-21 10:15:08,562 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.229e+02 2.358e+02 2.557e+02 3.464e+02, threshold=4.717e+02, percent-clipped=0.0
2024-06-21 10:15:21,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=368069.1666666667, ans=0.125
2024-06-21 10:15:29,378 INFO [train.py:1028] (1/2) Epoch 20, batch 8550, loss[loss=0.2352, simple_loss=0.2955, pruned_loss=0.08744, over 12647.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2785, pruned_loss=0.08198, over 2576666.82 frames. ], batch size: 22, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:15:36,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=368105.8333333333, ans=0.0
2024-06-21 10:15:55,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=368142.5, ans=0.1
2024-06-21 10:16:04,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0
2024-06-21 10:16:05,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368160.8333333333, ans=0.1
2024-06-21 10:16:09,545 INFO [train.py:1028] (1/2) Epoch 20, batch 8600, loss[loss=0.1999, simple_loss=0.2574, pruned_loss=0.07121, over 13117.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.28, pruned_loss=0.08247, over 2574876.91 frames. ], batch size: 121, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:16:19,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.52 vs. limit=10.0
2024-06-21 10:16:37,754 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.206e+02 2.450e+02 2.702e+02 3.750e+02, threshold=4.899e+02, percent-clipped=0.0
2024-06-21 10:16:55,000 INFO [train.py:1028] (1/2) Epoch 20, batch 8650, loss[loss=0.2266, simple_loss=0.2834, pruned_loss=0.08491, over 13040.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2797, pruned_loss=0.08204, over 2578022.54 frames. ], batch size: 102, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:16:59,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=368270.8333333333, ans=0.0
2024-06-21 10:17:17,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=368307.5, ans=0.025
2024-06-21 10:17:20,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=15.0
2024-06-21 10:17:32,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368344.1666666667, ans=0.1
2024-06-21 10:17:35,239 INFO [train.py:1028] (1/2) Epoch 20, batch 8700, loss[loss=0.2285, simple_loss=0.2919, pruned_loss=0.08258, over 13187.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2806, pruned_loss=0.08289, over 2574015.67 frames. ], batch size: 59, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:18:02,650 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.123e+02 2.265e+02 2.534e+02 3.402e+02, threshold=4.530e+02, percent-clipped=0.0
2024-06-21 10:18:07,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=368417.5, ans=0.025
2024-06-21 10:18:19,914 INFO [train.py:1028] (1/2) Epoch 20, batch 8750, loss[loss=0.2223, simple_loss=0.279, pruned_loss=0.0828, over 13123.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2802, pruned_loss=0.08275, over 2569227.67 frames. ], batch size: 121, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:18:24,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=368454.1666666667, ans=0.125
2024-06-21 10:18:33,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=368472.5, ans=0.0
2024-06-21 10:18:34,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=368472.5, ans=0.0
2024-06-21 10:18:39,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=368490.8333333333, ans=0.025
2024-06-21 10:18:45,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.27 vs. limit=8.0
2024-06-21 10:18:48,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=368509.1666666667, ans=0.125
2024-06-21 10:18:53,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=368509.1666666667, ans=0.125
2024-06-21 10:19:03,788 INFO [train.py:1028] (1/2) Epoch 20, batch 8800, loss[loss=0.2216, simple_loss=0.2852, pruned_loss=0.07903, over 13286.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2803, pruned_loss=0.08267, over 2573619.51 frames. ], batch size: 72, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:19:07,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=368545.8333333333, ans=0.09899494936611666
2024-06-21 10:19:10,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=368545.8333333333, ans=0.125
2024-06-21 10:19:17,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=368564.1666666667, ans=0.125
2024-06-21 10:19:27,039 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.176e+02 2.338e+02 2.533e+02 3.840e+02, threshold=4.677e+02, percent-clipped=0.0
2024-06-21 10:19:32,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=368600.8333333333, ans=0.0
2024-06-21 10:19:35,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=368619.1666666667, ans=0.025
2024-06-21 10:19:38,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=368619.1666666667, ans=0.1
2024-06-21 10:19:44,434 INFO [train.py:1028] (1/2) Epoch 20, batch 8850, loss[loss=0.248, simple_loss=0.3046, pruned_loss=0.09569, over 12622.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2807, pruned_loss=0.08336, over 2562363.13 frames. ], batch size: 202, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:19:57,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=368655.8333333333, ans=0.015
2024-06-21 10:20:00,531 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0
2024-06-21 10:20:03,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=368674.1666666667, ans=0.125
2024-06-21 10:20:09,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=368674.1666666667, ans=0.025
2024-06-21 10:20:16,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=368692.5, ans=0.125
2024-06-21 10:20:17,606 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.82 vs. limit=10.0
2024-06-21 10:20:23,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=368710.8333333333, ans=0.2
2024-06-21 10:20:27,130 INFO [train.py:1028] (1/2) Epoch 20, batch 8900, loss[loss=0.2494, simple_loss=0.3083, pruned_loss=0.09529, over 12897.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2816, pruned_loss=0.08353, over 2559726.55 frames. ], batch size: 33, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:20:30,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=368729.1666666667, ans=0.0
2024-06-21 10:20:34,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=368747.5, ans=0.125
2024-06-21 10:20:40,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=368747.5, ans=0.0
2024-06-21 10:20:41,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=368765.8333333333, ans=0.0
2024-06-21 10:20:49,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=368765.8333333333, ans=0.1
2024-06-21 10:20:49,709 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.197e+02 2.369e+02 2.565e+02 3.635e+02, threshold=4.737e+02, percent-clipped=0.0
2024-06-21 10:21:07,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=368802.5, ans=0.125
2024-06-21 10:21:10,903 INFO [train.py:1028] (1/2) Epoch 20, batch 8950, loss[loss=0.2263, simple_loss=0.282, pruned_loss=0.08533, over 12577.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2817, pruned_loss=0.08284, over 2560484.89 frames. ], batch size: 202, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:21:12,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.56 vs. limit=15.0
2024-06-21 10:21:18,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=368839.1666666667, ans=0.125
2024-06-21 10:21:26,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=368839.1666666667, ans=0.125
2024-06-21 10:21:32,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=368857.5, ans=0.125
2024-06-21 10:21:32,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=368857.5, ans=0.5
2024-06-21 10:21:33,188 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0
2024-06-21 10:21:36,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=368875.8333333333, ans=0.125
2024-06-21 10:21:36,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=368875.8333333333, ans=0.0
2024-06-21 10:21:52,466 INFO [train.py:1028] (1/2) Epoch 20, batch 9000, loss[loss=0.2203, simple_loss=0.2783, pruned_loss=0.08116, over 13360.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2818, pruned_loss=0.0826, over 2566838.29 frames. ], batch size: 46, lr: 2.88e-03, grad_scale: 32.0
2024-06-21 10:21:52,467 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 10:22:01,281 INFO [train.py:1060] (1/2) Epoch 20, validation: loss=0.1881, simple_loss=0.2521, pruned_loss=0.06207, over 351949.00 frames.
2024-06-21 10:22:01,282 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 10:22:04,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.94 vs. limit=15.0
2024-06-21 10:22:28,265 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.175e+02 2.301e+02 2.539e+02 3.214e+02, threshold=4.602e+02, percent-clipped=0.0
2024-06-21 10:22:29,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.44 vs. limit=15.0
2024-06-21 10:22:35,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=368985.8333333333, ans=0.125
2024-06-21 10:22:43,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.54 vs. limit=15.0
2024-06-21 10:22:44,380 INFO [train.py:1028] (1/2) Epoch 20, batch 9050, loss[loss=0.2188, simple_loss=0.2821, pruned_loss=0.07772, over 12029.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2826, pruned_loss=0.08305, over 2567768.58 frames. ], batch size: 18, lr: 2.87e-03, grad_scale: 32.0
2024-06-21 10:22:44,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=369004.1666666667, ans=0.0
2024-06-21 10:22:44,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=369004.1666666667, ans=0.0
2024-06-21 10:22:49,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.88 vs. limit=15.0
2024-06-21 10:22:51,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.70 vs. limit=15.0
2024-06-21 10:23:00,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=369040.8333333333, ans=0.05
2024-06-21 10:23:04,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=369040.8333333333, ans=0.125
2024-06-21 10:23:09,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=369059.1666666667, ans=0.95
2024-06-21 10:23:11,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=369059.1666666667, ans=0.125
2024-06-21 10:23:11,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=369059.1666666667, ans=0.125
2024-06-21 10:23:23,767 INFO [train.py:1028] (1/2) Epoch 20, batch 9100, loss[loss=0.2026, simple_loss=0.269, pruned_loss=0.06808, over 13251.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.282, pruned_loss=0.0825, over 2569230.28 frames. ], batch size: 72, lr: 2.87e-03, grad_scale: 32.0
2024-06-21 10:23:26,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=369095.8333333333, ans=0.125
2024-06-21 10:23:46,560 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.173e+02 2.328e+02 2.501e+02 3.981e+02, threshold=4.655e+02, percent-clipped=0.0
2024-06-21 10:24:01,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369187.5, ans=0.1
2024-06-21 10:24:02,400 INFO [train.py:1028] (1/2) Epoch 20, batch 9150, loss[loss=0.2121, simple_loss=0.2732, pruned_loss=0.07552, over 13167.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2816, pruned_loss=0.0825, over 2569644.99 frames. ], batch size: 77, lr: 2.87e-03, grad_scale: 32.0
2024-06-21 10:24:45,072 INFO [train.py:1028] (1/2) Epoch 20, batch 9200, loss[loss=0.209, simple_loss=0.2754, pruned_loss=0.07129, over 12976.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2814, pruned_loss=0.08225, over 2572653.70 frames. ], batch size: 36, lr: 2.87e-03, grad_scale: 32.0
2024-06-21 10:24:48,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=369279.1666666667, ans=0.125
2024-06-21 10:24:56,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=369297.5, ans=0.125
2024-06-21 10:25:07,139 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.169e+02 2.306e+02 2.454e+02 3.336e+02, threshold=4.613e+02, percent-clipped=0.0
2024-06-21 10:25:15,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.07 vs. limit=22.5
2024-06-21 10:25:15,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=369352.5, ans=0.125
2024-06-21 10:25:20,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=369352.5, ans=0.125
2024-06-21 10:25:22,826 INFO [train.py:1028] (1/2) Epoch 20, batch 9250, loss[loss=0.2211, simple_loss=0.2799, pruned_loss=0.08114, over 13247.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2814, pruned_loss=0.08203, over 2574797.26 frames. ], batch size: 67, lr: 2.87e-03, grad_scale: 32.0
2024-06-21 10:25:28,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.75 vs. limit=6.0
2024-06-21 10:25:44,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=369407.5, ans=0.2
2024-06-21 10:25:48,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=369425.8333333333, ans=0.125
2024-06-21 10:26:00,952 INFO [train.py:1028] (1/2) Epoch 20, batch 9300, loss[loss=0.2239, simple_loss=0.2841, pruned_loss=0.08185, over 12868.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2813, pruned_loss=0.08201, over 2572502.91 frames. ], batch size: 39, lr: 2.87e-03, grad_scale: 32.0
2024-06-21 10:26:04,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=369462.5, ans=0.025
2024-06-21 10:26:10,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=369480.8333333333, ans=0.0
2024-06-21 10:26:18,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369499.1666666667, ans=0.1
2024-06-21 10:26:19,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=369499.1666666667, ans=0.0
2024-06-21 10:26:22,982 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.157e+02 2.326e+02 2.504e+02 4.032e+02, threshold=4.651e+02, percent-clipped=0.0
2024-06-21 10:26:26,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=369517.5, ans=0.125
2024-06-21 10:26:30,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.71 vs.
limit=10.0 2024-06-21 10:26:38,414 INFO [train.py:1028] (1/2) Epoch 20, batch 9350, loss[loss=0.1897, simple_loss=0.2512, pruned_loss=0.06414, over 12467.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2823, pruned_loss=0.08273, over 2568504.68 frames. ], batch size: 22, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:26:39,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=369554.1666666667, ans=0.0 2024-06-21 10:26:43,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369554.1666666667, ans=0.1 2024-06-21 10:26:53,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=369590.8333333333, ans=0.2 2024-06-21 10:26:56,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=369590.8333333333, ans=0.125 2024-06-21 10:26:59,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=369590.8333333333, ans=0.0 2024-06-21 10:27:02,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=369609.1666666667, ans=0.125 2024-06-21 10:27:07,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=369627.5, ans=0.125 2024-06-21 10:27:18,727 INFO [train.py:1028] (1/2) Epoch 20, batch 9400, loss[loss=0.2365, simple_loss=0.2971, pruned_loss=0.08793, over 13290.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2827, pruned_loss=0.08296, over 2568001.77 frames. ], batch size: 52, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:27:29,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=369664.1666666667, ans=0.015 2024-06-21 10:27:40,365 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.216e+02 2.364e+02 2.531e+02 3.150e+02, threshold=4.728e+02, percent-clipped=0.0 2024-06-21 10:27:50,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=369719.1666666667, ans=0.125 2024-06-21 10:27:55,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=369737.5, ans=10.0 2024-06-21 10:27:55,938 INFO [train.py:1028] (1/2) Epoch 20, batch 9450, loss[loss=0.2211, simple_loss=0.2758, pruned_loss=0.08323, over 12957.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2832, pruned_loss=0.08327, over 2569684.45 frames. 
], batch size: 22, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:27:59,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=369737.5, ans=0.0 2024-06-21 10:28:01,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=369737.5, ans=0.125 2024-06-21 10:28:02,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=369737.5, ans=0.125 2024-06-21 10:28:05,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=369755.8333333333, ans=0.125 2024-06-21 10:28:06,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=369755.8333333333, ans=0.125 2024-06-21 10:28:13,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=369774.1666666667, ans=0.025 2024-06-21 10:28:26,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=369810.8333333333, ans=0.0 2024-06-21 10:28:28,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=369810.8333333333, ans=0.0 2024-06-21 10:28:32,198 INFO [train.py:1028] (1/2) Epoch 20, batch 9500, loss[loss=0.2184, simple_loss=0.2784, pruned_loss=0.07919, over 13265.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2823, pruned_loss=0.08276, over 2578015.25 frames. ], batch size: 43, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:28:51,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=369865.8333333333, ans=0.125 2024-06-21 10:28:55,584 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.209e+02 2.389e+02 2.587e+02 3.204e+02, threshold=4.778e+02, percent-clipped=0.0 2024-06-21 10:29:10,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=369920.8333333333, ans=0.0 2024-06-21 10:29:10,806 INFO [train.py:1028] (1/2) Epoch 20, batch 9550, loss[loss=0.2018, simple_loss=0.2562, pruned_loss=0.0737, over 12932.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2813, pruned_loss=0.08236, over 2572700.98 frames. ], batch size: 39, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:29:20,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=369939.1666666667, ans=0.125 2024-06-21 10:29:21,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=369939.1666666667, ans=0.0 2024-06-21 10:29:31,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2024-06-21 10:29:36,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.47 vs. 
limit=15.0 2024-06-21 10:29:37,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=369975.8333333333, ans=0.0 2024-06-21 10:29:42,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=369994.1666666667, ans=0.125 2024-06-21 10:29:42,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369994.1666666667, ans=0.1 2024-06-21 10:29:47,831 INFO [train.py:1028] (1/2) Epoch 20, batch 9600, loss[loss=0.2494, simple_loss=0.291, pruned_loss=0.1039, over 10829.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2815, pruned_loss=0.08271, over 2570128.89 frames. ], batch size: 303, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:29:48,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.43 vs. limit=22.5 2024-06-21 10:29:50,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=370012.5, ans=0.125 2024-06-21 10:29:54,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=370012.5, ans=0.0 2024-06-21 10:30:05,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=370049.1666666667, ans=0.2 2024-06-21 10:30:06,817 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.95 vs. limit=15.0 2024-06-21 10:30:11,730 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.269e+02 2.410e+02 2.615e+02 3.633e+02, threshold=4.820e+02, percent-clipped=0.0 2024-06-21 10:30:27,365 INFO [train.py:1028] (1/2) Epoch 20, batch 9650, loss[loss=0.2125, simple_loss=0.2659, pruned_loss=0.07957, over 13103.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2811, pruned_loss=0.08285, over 2559733.37 frames. ], batch size: 132, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:30:39,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=370122.5, ans=0.0 2024-06-21 10:30:45,051 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.85 vs. limit=22.5 2024-06-21 10:30:46,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=370140.8333333333, ans=0.125 2024-06-21 10:30:57,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=370177.5, ans=0.0 2024-06-21 10:31:04,558 INFO [train.py:1028] (1/2) Epoch 20, batch 9700, loss[loss=0.2426, simple_loss=0.2907, pruned_loss=0.09728, over 13025.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.281, pruned_loss=0.08317, over 2554756.89 frames. ], batch size: 144, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:31:06,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=370195.8333333333, ans=0.125 2024-06-21 10:31:09,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.39 vs. 
limit=22.5 2024-06-21 10:31:12,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=370214.1666666667, ans=0.125 2024-06-21 10:31:27,545 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.204e+02 2.336e+02 2.569e+02 4.894e+02, threshold=4.672e+02, percent-clipped=1.0 2024-06-21 10:31:31,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.05 vs. limit=15.0 2024-06-21 10:31:38,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370269.1666666667, ans=0.1 2024-06-21 10:31:39,638 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.89 vs. limit=15.0 2024-06-21 10:31:42,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.20 vs. limit=6.0 2024-06-21 10:31:42,618 INFO [train.py:1028] (1/2) Epoch 20, batch 9750, loss[loss=0.2097, simple_loss=0.2644, pruned_loss=0.0775, over 13099.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2796, pruned_loss=0.08249, over 2550725.70 frames. ], batch size: 132, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:31:54,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=370305.8333333333, ans=0.025 2024-06-21 10:32:01,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=370324.1666666667, ans=0.1 2024-06-21 10:32:03,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=370324.1666666667, ans=0.0 2024-06-21 10:32:05,531 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.04 vs. limit=22.5 2024-06-21 10:32:09,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.52 vs. limit=5.0 2024-06-21 10:32:19,892 INFO [train.py:1028] (1/2) Epoch 20, batch 9800, loss[loss=0.1927, simple_loss=0.2525, pruned_loss=0.06644, over 12981.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2791, pruned_loss=0.08195, over 2545021.84 frames. 
], batch size: 39, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:32:40,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=370415.8333333333, ans=0.025 2024-06-21 10:32:42,452 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.137e+02 2.320e+02 2.495e+02 3.768e+02, threshold=4.640e+02, percent-clipped=0.0 2024-06-21 10:32:44,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=370434.1666666667, ans=0.125 2024-06-21 10:32:49,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=370434.1666666667, ans=0.025 2024-06-21 10:32:54,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=370452.5, ans=0.0 2024-06-21 10:32:58,011 INFO [train.py:1028] (1/2) Epoch 20, batch 9850, loss[loss=0.2265, simple_loss=0.2826, pruned_loss=0.08523, over 13003.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.279, pruned_loss=0.08188, over 2538229.40 frames. ], batch size: 102, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:33:11,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=370489.1666666667, ans=0.125 2024-06-21 10:33:16,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=370507.5, ans=0.125 2024-06-21 10:33:34,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=370544.1666666667, ans=0.0 2024-06-21 10:33:36,074 INFO [train.py:1028] (1/2) Epoch 20, batch 9900, loss[loss=0.2132, simple_loss=0.2662, pruned_loss=0.08014, over 12997.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2774, pruned_loss=0.08148, over 2531091.69 frames. ], batch size: 39, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:33:36,618 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.85 vs. limit=10.0 2024-06-21 10:33:57,085 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.195e+02 2.313e+02 2.531e+02 3.404e+02, threshold=4.625e+02, percent-clipped=0.0 2024-06-21 10:34:01,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=370617.5, ans=0.0 2024-06-21 10:34:14,564 INFO [train.py:1028] (1/2) Epoch 20, batch 9950, loss[loss=0.2148, simple_loss=0.2775, pruned_loss=0.07608, over 12643.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2766, pruned_loss=0.08157, over 2523344.04 frames. 
], batch size: 29, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:34:24,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=370672.5, ans=0.0 2024-06-21 10:34:29,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=370690.8333333333, ans=0.125 2024-06-21 10:34:32,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=370690.8333333333, ans=0.0 2024-06-21 10:34:50,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=370727.5, ans=0.0 2024-06-21 10:34:51,740 INFO [train.py:1028] (1/2) Epoch 20, batch 10000, loss[loss=0.2134, simple_loss=0.2741, pruned_loss=0.07633, over 12534.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2767, pruned_loss=0.08183, over 2483719.43 frames. ], batch size: 22, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:34:58,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370745.8333333333, ans=0.1 2024-06-21 10:34:58,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=370745.8333333333, ans=0.125 2024-06-21 10:35:01,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.54 vs. limit=22.5 2024-06-21 10:35:14,328 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.167e+02 2.318e+02 2.481e+02 3.372e+02, threshold=4.637e+02, percent-clipped=0.0 2024-06-21 10:35:28,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.96 vs. limit=15.0 2024-06-21 10:35:29,760 INFO [train.py:1028] (1/2) Epoch 20, batch 10050, loss[loss=0.2218, simple_loss=0.2813, pruned_loss=0.08111, over 12556.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2761, pruned_loss=0.08195, over 2442963.72 frames. ], batch size: 22, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:35:34,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=370837.5, ans=0.125 2024-06-21 10:35:37,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=370855.8333333333, ans=0.125 2024-06-21 10:35:42,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=370855.8333333333, ans=0.125 2024-06-21 10:36:06,644 INFO [train.py:1028] (1/2) Epoch 20, batch 10100, loss[loss=0.1745, simple_loss=0.2408, pruned_loss=0.05404, over 12027.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2749, pruned_loss=0.0812, over 2426510.47 frames. ], batch size: 17, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:38:39,315 INFO [train.py:1028] (1/2) Epoch 21, batch 0, loss[loss=0.193, simple_loss=0.2491, pruned_loss=0.06843, over 12886.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2491, pruned_loss=0.06843, over 12886.00 frames. ], batch size: 36, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:38:39,317 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 10:38:47,539 INFO [train.py:1060] (1/2) Epoch 21, validation: loss=0.1888, simple_loss=0.2532, pruned_loss=0.06218, over 351949.00 frames. 
2024-06-21 10:38:47,540 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 10:38:59,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=370978.6666666667, ans=0.0 2024-06-21 10:39:00,499 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.102e+02 2.260e+02 2.518e+02 3.443e+02, threshold=4.520e+02, percent-clipped=0.0 2024-06-21 10:39:15,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=371015.3333333333, ans=0.07 2024-06-21 10:39:18,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=371015.3333333333, ans=0.2 2024-06-21 10:39:24,095 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=15.0 2024-06-21 10:39:25,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371033.6666666667, ans=0.1 2024-06-21 10:39:30,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2024-06-21 10:39:30,783 INFO [train.py:1028] (1/2) Epoch 21, batch 50, loss[loss=0.2145, simple_loss=0.2683, pruned_loss=0.08032, over 12758.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2601, pruned_loss=0.07658, over 574231.19 frames. ], batch size: 29, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:39:31,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=371052.0, ans=0.2 2024-06-21 10:39:34,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=371052.0, ans=0.0 2024-06-21 10:39:45,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=371088.6666666667, ans=0.125 2024-06-21 10:39:52,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=371088.6666666667, ans=0.125 2024-06-21 10:40:02,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.54 vs. limit=22.5 2024-06-21 10:40:05,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.99 vs. limit=15.0 2024-06-21 10:40:08,408 INFO [train.py:1028] (1/2) Epoch 21, batch 100, loss[loss=0.2027, simple_loss=0.2659, pruned_loss=0.06974, over 13288.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2575, pruned_loss=0.07524, over 1017090.15 frames. 
], batch size: 46, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:40:17,103 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.045e+02 2.155e+02 2.378e+02 3.028e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-21 10:40:29,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=371180.3333333333, ans=0.2 2024-06-21 10:40:33,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=371198.6666666667, ans=0.0 2024-06-21 10:40:34,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=371198.6666666667, ans=0.2 2024-06-21 10:40:35,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=371198.6666666667, ans=0.1 2024-06-21 10:40:37,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=371198.6666666667, ans=0.07 2024-06-21 10:40:46,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371217.0, ans=0.1 2024-06-21 10:40:49,849 INFO [train.py:1028] (1/2) Epoch 21, batch 150, loss[loss=0.2142, simple_loss=0.2762, pruned_loss=0.07608, over 12741.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2563, pruned_loss=0.0734, over 1365265.76 frames. ], batch size: 29, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:40:53,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.82 vs. limit=15.0 2024-06-21 10:41:01,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=371253.6666666667, ans=0.0 2024-06-21 10:41:15,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=371290.3333333333, ans=0.0 2024-06-21 10:41:18,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=371290.3333333333, ans=0.09899494936611666 2024-06-21 10:41:19,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2024-06-21 10:41:21,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=371290.3333333333, ans=0.95 2024-06-21 10:41:26,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=371308.6666666667, ans=0.2 2024-06-21 10:41:28,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=371308.6666666667, ans=0.0 2024-06-21 10:41:31,098 INFO [train.py:1028] (1/2) Epoch 21, batch 200, loss[loss=0.2226, simple_loss=0.2668, pruned_loss=0.08913, over 12501.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2576, pruned_loss=0.07392, over 1634216.34 frames. 
], batch size: 202, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:41:40,007 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.036e+02 2.192e+02 2.419e+02 3.710e+02, threshold=4.385e+02, percent-clipped=0.0 2024-06-21 10:41:42,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=371345.3333333333, ans=0.025 2024-06-21 10:41:47,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371363.6666666667, ans=0.1 2024-06-21 10:41:56,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.42 vs. limit=22.5 2024-06-21 10:42:01,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=371400.3333333333, ans=0.125 2024-06-21 10:42:07,484 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:42:08,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371418.6666666667, ans=0.1 2024-06-21 10:42:08,765 INFO [train.py:1028] (1/2) Epoch 21, batch 250, loss[loss=0.1997, simple_loss=0.2461, pruned_loss=0.07663, over 13039.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2578, pruned_loss=0.07395, over 1846265.37 frames. ], batch size: 144, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:42:20,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=371437.0, ans=0.1 2024-06-21 10:42:30,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=371455.3333333333, ans=0.125 2024-06-21 10:42:41,992 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.72 vs. limit=6.0 2024-06-21 10:42:43,332 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.06 vs. limit=15.0 2024-06-21 10:42:47,561 INFO [train.py:1028] (1/2) Epoch 21, batch 300, loss[loss=0.1907, simple_loss=0.2372, pruned_loss=0.07213, over 13162.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2585, pruned_loss=0.0743, over 2008810.34 frames. 
], batch size: 112, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:42:59,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=371528.6666666667, ans=0.0 2024-06-21 10:43:00,418 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.089e+02 2.218e+02 2.407e+02 3.250e+02, threshold=4.437e+02, percent-clipped=0.0 2024-06-21 10:43:02,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=371528.6666666667, ans=0.0 2024-06-21 10:43:05,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=371528.6666666667, ans=0.125 2024-06-21 10:43:08,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=371547.0, ans=0.125 2024-06-21 10:43:09,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=371547.0, ans=0.015 2024-06-21 10:43:17,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371565.3333333333, ans=0.125 2024-06-21 10:43:18,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=371565.3333333333, ans=0.1 2024-06-21 10:43:24,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=371583.6666666667, ans=0.1 2024-06-21 10:43:32,250 INFO [train.py:1028] (1/2) Epoch 21, batch 350, loss[loss=0.2033, simple_loss=0.2641, pruned_loss=0.07122, over 12844.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2583, pruned_loss=0.07407, over 2138154.52 frames. ], batch size: 33, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:43:56,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=371657.0, ans=0.2 2024-06-21 10:44:05,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=371675.3333333333, ans=0.125 2024-06-21 10:44:11,465 INFO [train.py:1028] (1/2) Epoch 21, batch 400, loss[loss=0.1835, simple_loss=0.24, pruned_loss=0.06346, over 13247.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2586, pruned_loss=0.07387, over 2238879.97 frames. ], batch size: 63, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:44:20,913 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.074e+02 2.205e+02 2.382e+02 3.244e+02, threshold=4.410e+02, percent-clipped=0.0 2024-06-21 10:44:27,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=371730.3333333333, ans=0.0 2024-06-21 10:44:34,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=371748.6666666667, ans=0.125 2024-06-21 10:44:47,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=371767.0, ans=0.1 2024-06-21 10:44:49,991 INFO [train.py:1028] (1/2) Epoch 21, batch 450, loss[loss=0.1812, simple_loss=0.2465, pruned_loss=0.05799, over 13162.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2576, pruned_loss=0.0733, over 2313130.10 frames. 
], batch size: 67, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:45:01,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=371803.6666666667, ans=0.125 2024-06-21 10:45:01,907 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:45:09,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=371822.0, ans=0.125 2024-06-21 10:45:10,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=371822.0, ans=0.125 2024-06-21 10:45:19,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=371840.3333333333, ans=0.2 2024-06-21 10:45:23,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=371858.6666666667, ans=0.125 2024-06-21 10:45:28,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2024-06-21 10:45:29,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371858.6666666667, ans=0.1 2024-06-21 10:45:32,287 INFO [train.py:1028] (1/2) Epoch 21, batch 500, loss[loss=0.2105, simple_loss=0.2546, pruned_loss=0.08322, over 13201.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2576, pruned_loss=0.0731, over 2375331.36 frames. ], batch size: 121, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:45:34,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=371877.0, ans=0.1 2024-06-21 10:45:41,063 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.191e+02 2.322e+02 2.585e+02 3.301e+02, threshold=4.645e+02, percent-clipped=0.0 2024-06-21 10:45:52,283 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0 2024-06-21 10:46:01,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=371932.0, ans=0.2 2024-06-21 10:46:09,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=371950.3333333333, ans=0.125 2024-06-21 10:46:09,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=371950.3333333333, ans=0.0 2024-06-21 10:46:10,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=371950.3333333333, ans=0.125 2024-06-21 10:46:13,505 INFO [train.py:1028] (1/2) Epoch 21, batch 550, loss[loss=0.2095, simple_loss=0.2555, pruned_loss=0.08176, over 12953.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2574, pruned_loss=0.0731, over 2419937.69 frames. 
], batch size: 158, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:46:31,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=372005.3333333333, ans=0.0 2024-06-21 10:46:32,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=372005.3333333333, ans=0.0 2024-06-21 10:46:36,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=372023.6666666667, ans=0.05 2024-06-21 10:46:48,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=372042.0, ans=0.2 2024-06-21 10:46:49,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=372042.0, ans=0.0 2024-06-21 10:46:51,863 INFO [train.py:1028] (1/2) Epoch 21, batch 600, loss[loss=0.191, simple_loss=0.2394, pruned_loss=0.07135, over 13041.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2577, pruned_loss=0.07333, over 2457492.52 frames. ], batch size: 144, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:46:51,953 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:47:01,081 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.042e+02 2.175e+02 2.322e+02 3.019e+02, threshold=4.350e+02, percent-clipped=0.0 2024-06-21 10:47:15,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=372115.3333333333, ans=0.125 2024-06-21 10:47:20,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372115.3333333333, ans=0.1 2024-06-21 10:47:20,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=372115.3333333333, ans=0.0 2024-06-21 10:47:28,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372133.6666666667, ans=0.1 2024-06-21 10:47:31,109 INFO [train.py:1028] (1/2) Epoch 21, batch 650, loss[loss=0.1754, simple_loss=0.238, pruned_loss=0.05637, over 13205.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2572, pruned_loss=0.07268, over 2488695.21 frames. 
], batch size: 59, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:47:42,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372170.3333333333, ans=0.1 2024-06-21 10:47:43,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=372170.3333333333, ans=0.0 2024-06-21 10:47:57,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=372188.6666666667, ans=0.05 2024-06-21 10:48:00,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=372207.0, ans=0.0 2024-06-21 10:48:00,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=372207.0, ans=0.2 2024-06-21 10:48:08,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=372225.3333333333, ans=0.0 2024-06-21 10:48:09,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.26 vs. limit=22.5 2024-06-21 10:48:13,369 INFO [train.py:1028] (1/2) Epoch 21, batch 700, loss[loss=0.1958, simple_loss=0.2574, pruned_loss=0.06715, over 13280.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2567, pruned_loss=0.07247, over 2511161.96 frames. ], batch size: 46, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:48:22,400 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.047e+02 2.178e+02 2.374e+02 3.167e+02, threshold=4.357e+02, percent-clipped=0.0 2024-06-21 10:48:37,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=372280.3333333333, ans=0.07 2024-06-21 10:48:43,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=372298.6666666667, ans=0.0 2024-06-21 10:48:47,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=372317.0, ans=0.125 2024-06-21 10:48:52,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.97 vs. limit=22.5 2024-06-21 10:48:55,440 INFO [train.py:1028] (1/2) Epoch 21, batch 750, loss[loss=0.1893, simple_loss=0.2516, pruned_loss=0.06347, over 13322.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2567, pruned_loss=0.07237, over 2528432.66 frames. ], batch size: 63, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:49:04,023 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:49:12,622 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2024-06-21 10:49:21,725 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:49:24,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=372390.3333333333, ans=0.125 2024-06-21 10:49:28,318 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.90 vs. 
limit=15.0 2024-06-21 10:49:34,957 INFO [train.py:1028] (1/2) Epoch 21, batch 800, loss[loss=0.1926, simple_loss=0.2491, pruned_loss=0.06799, over 12996.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2576, pruned_loss=0.07298, over 2541351.95 frames. ], batch size: 36, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:49:44,227 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.067e+02 2.212e+02 2.434e+02 3.333e+02, threshold=4.425e+02, percent-clipped=0.0 2024-06-21 10:49:49,313 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0 2024-06-21 10:49:50,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=372463.6666666667, ans=0.07 2024-06-21 10:50:02,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=372482.0, ans=0.0 2024-06-21 10:50:05,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372500.3333333333, ans=0.1 2024-06-21 10:50:07,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2024-06-21 10:50:14,347 INFO [train.py:1028] (1/2) Epoch 21, batch 850, loss[loss=0.203, simple_loss=0.2588, pruned_loss=0.07359, over 13160.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2579, pruned_loss=0.0732, over 2552274.75 frames. ], batch size: 95, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:50:25,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=372537.0, ans=0.0 2024-06-21 10:50:28,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=372537.0, ans=0.0 2024-06-21 10:50:28,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=372537.0, ans=0.125 2024-06-21 10:50:34,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=372555.3333333333, ans=0.0 2024-06-21 10:50:35,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372555.3333333333, ans=0.1 2024-06-21 10:50:37,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=372555.3333333333, ans=0.125 2024-06-21 10:50:40,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=372573.6666666667, ans=0.125 2024-06-21 10:50:51,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=372592.0, ans=0.2 2024-06-21 10:50:53,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=372592.0, ans=0.0 2024-06-21 10:50:59,583 INFO [train.py:1028] (1/2) Epoch 21, batch 900, loss[loss=0.1923, simple_loss=0.2508, pruned_loss=0.06693, over 12852.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2575, pruned_loss=0.07313, over 2557380.50 frames. 
], batch size: 36, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:51:03,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=372610.3333333333, ans=0.125 2024-06-21 10:51:04,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=372610.3333333333, ans=0.05 2024-06-21 10:51:06,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=372628.6666666667, ans=0.125 2024-06-21 10:51:08,047 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.034e+02 2.156e+02 2.310e+02 3.373e+02, threshold=4.312e+02, percent-clipped=0.0 2024-06-21 10:51:13,791 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:51:23,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.78 vs. limit=15.0 2024-06-21 10:51:33,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=372683.6666666667, ans=0.1 2024-06-21 10:51:33,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=372683.6666666667, ans=0.025 2024-06-21 10:51:37,598 INFO [train.py:1028] (1/2) Epoch 21, batch 950, loss[loss=0.2143, simple_loss=0.2733, pruned_loss=0.0777, over 12875.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2577, pruned_loss=0.07329, over 2559624.58 frames. ], batch size: 39, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:51:42,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=372702.0, ans=0.125 2024-06-21 10:52:04,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0 2024-06-21 10:52:08,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=372775.3333333333, ans=0.0 2024-06-21 10:52:15,199 INFO [train.py:1028] (1/2) Epoch 21, batch 1000, loss[loss=0.2082, simple_loss=0.2673, pruned_loss=0.07458, over 13227.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2577, pruned_loss=0.07362, over 2561820.54 frames. ], batch size: 49, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:52:19,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.94 vs. limit=22.5 2024-06-21 10:52:24,409 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.113e+02 2.225e+02 2.411e+02 3.273e+02, threshold=4.449e+02, percent-clipped=0.0 2024-06-21 10:52:49,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.23 vs. 
limit=22.5 2024-06-21 10:52:52,893 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:52:55,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=372867.0, ans=0.125 2024-06-21 10:52:55,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=372867.0, ans=0.025 2024-06-21 10:52:57,328 INFO [train.py:1028] (1/2) Epoch 21, batch 1050, loss[loss=0.2006, simple_loss=0.2637, pruned_loss=0.06874, over 13128.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2579, pruned_loss=0.07345, over 2564973.23 frames. ], batch size: 77, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:53:11,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=372903.6666666667, ans=0.0 2024-06-21 10:53:32,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=15.0 2024-06-21 10:53:38,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=372977.0, ans=0.2 2024-06-21 10:53:39,238 INFO [train.py:1028] (1/2) Epoch 21, batch 1100, loss[loss=0.2033, simple_loss=0.2566, pruned_loss=0.07499, over 13283.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2584, pruned_loss=0.07382, over 2568829.29 frames. ], batch size: 52, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:53:48,553 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.024e+02 2.166e+02 2.326e+02 3.171e+02, threshold=4.332e+02, percent-clipped=0.0 2024-06-21 10:53:51,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=372995.3333333333, ans=0.0 2024-06-21 10:53:52,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=372995.3333333333, ans=0.025 2024-06-21 10:53:53,542 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.528e-02 2024-06-21 10:53:54,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=373013.6666666667, ans=0.0 2024-06-21 10:53:58,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=373013.6666666667, ans=0.125 2024-06-21 10:54:03,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=373032.0, ans=0.0 2024-06-21 10:54:06,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=373032.0, ans=0.0 2024-06-21 10:54:06,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=373032.0, ans=0.0 2024-06-21 10:54:18,095 INFO [train.py:1028] (1/2) Epoch 21, batch 1150, loss[loss=0.2126, simple_loss=0.2699, pruned_loss=0.07763, over 13283.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2588, pruned_loss=0.07406, over 2570982.30 frames. 
], batch size: 52, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:54:19,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=373068.6666666667, ans=0.125 2024-06-21 10:54:23,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=373068.6666666667, ans=0.0 2024-06-21 10:54:24,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=373068.6666666667, ans=0.125 2024-06-21 10:54:30,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=373087.0, ans=0.025 2024-06-21 10:54:34,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=373105.3333333333, ans=0.125 2024-06-21 10:54:43,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=373123.6666666667, ans=0.125 2024-06-21 10:54:44,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=373123.6666666667, ans=0.125 2024-06-21 10:54:56,239 INFO [train.py:1028] (1/2) Epoch 21, batch 1200, loss[loss=0.1844, simple_loss=0.2448, pruned_loss=0.06201, over 13138.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2587, pruned_loss=0.07412, over 2573730.25 frames. ], batch size: 77, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:54:56,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.58 vs. limit=6.0 2024-06-21 10:55:03,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2024-06-21 10:55:08,514 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.069e+02 2.224e+02 2.492e+02 3.349e+02, threshold=4.447e+02, percent-clipped=0.0 2024-06-21 10:55:10,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=373178.6666666667, ans=0.0 2024-06-21 10:55:16,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373197.0, ans=0.1 2024-06-21 10:55:23,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=373215.3333333333, ans=0.0 2024-06-21 10:55:25,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373215.3333333333, ans=0.1 2024-06-21 10:55:30,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=373215.3333333333, ans=0.07 2024-06-21 10:55:40,013 INFO [train.py:1028] (1/2) Epoch 21, batch 1250, loss[loss=0.1938, simple_loss=0.2446, pruned_loss=0.07148, over 13164.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2583, pruned_loss=0.0737, over 2582949.19 frames. 
], batch size: 112, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:55:47,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=373270.3333333333, ans=0.125 2024-06-21 10:55:49,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=373270.3333333333, ans=0.2 2024-06-21 10:55:53,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=373270.3333333333, ans=0.125 2024-06-21 10:55:59,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=15.0 2024-06-21 10:56:18,763 INFO [train.py:1028] (1/2) Epoch 21, batch 1300, loss[loss=0.2202, simple_loss=0.2681, pruned_loss=0.08621, over 12736.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.259, pruned_loss=0.07388, over 2582731.13 frames. ], batch size: 176, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:56:22,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2024-06-21 10:56:26,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=373362.0, ans=0.1 2024-06-21 10:56:27,315 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.060e+02 2.196e+02 2.424e+02 3.258e+02, threshold=4.391e+02, percent-clipped=0.0 2024-06-21 10:56:46,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=373398.6666666667, ans=0.0 2024-06-21 10:56:51,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=373417.0, ans=0.0 2024-06-21 10:56:56,681 INFO [train.py:1028] (1/2) Epoch 21, batch 1350, loss[loss=0.1813, simple_loss=0.2466, pruned_loss=0.058, over 13224.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2586, pruned_loss=0.07381, over 2585850.91 frames. ], batch size: 59, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:56:58,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=373435.3333333333, ans=0.0 2024-06-21 10:57:01,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=373435.3333333333, ans=0.0 2024-06-21 10:57:04,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=373453.6666666667, ans=0.0 2024-06-21 10:57:16,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=373472.0, ans=0.0 2024-06-21 10:57:32,103 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.80 vs. limit=15.0 2024-06-21 10:57:32,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=373508.6666666667, ans=0.125 2024-06-21 10:57:37,299 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.28 vs. 
limit=15.0 2024-06-21 10:57:38,434 INFO [train.py:1028] (1/2) Epoch 21, batch 1400, loss[loss=0.1851, simple_loss=0.2414, pruned_loss=0.06444, over 12346.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2583, pruned_loss=0.07381, over 2587692.66 frames. ], batch size: 25, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:57:42,159 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2024-06-21 10:57:47,728 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.099e+02 2.236e+02 2.384e+02 2.942e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-21 10:58:06,954 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:58:13,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=373600.3333333333, ans=0.125 2024-06-21 10:58:17,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=373600.3333333333, ans=0.0 2024-06-21 10:58:19,469 INFO [train.py:1028] (1/2) Epoch 21, batch 1450, loss[loss=0.2026, simple_loss=0.2491, pruned_loss=0.07801, over 13160.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2583, pruned_loss=0.07402, over 2587136.06 frames. ], batch size: 121, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:58:24,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=373618.6666666667, ans=0.2 2024-06-21 10:58:49,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=373673.6666666667, ans=0.2 2024-06-21 10:58:57,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=373710.3333333333, ans=0.125 2024-06-21 10:58:58,121 INFO [train.py:1028] (1/2) Epoch 21, batch 1500, loss[loss=0.2197, simple_loss=0.2707, pruned_loss=0.08437, over 13200.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2584, pruned_loss=0.07398, over 2589006.84 frames. ], batch size: 83, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:58:59,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=373710.3333333333, ans=0.125 2024-06-21 10:58:59,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=373710.3333333333, ans=0.125 2024-06-21 10:59:06,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=373728.6666666667, ans=0.125 2024-06-21 10:59:06,858 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.096e+02 2.231e+02 2.416e+02 2.924e+02, threshold=4.462e+02, percent-clipped=0.0 2024-06-21 10:59:16,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=373747.0, ans=0.125 2024-06-21 10:59:19,311 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.04 vs. 
limit=15.0 2024-06-21 10:59:27,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=373765.3333333333, ans=0.5 2024-06-21 10:59:31,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=373783.6666666667, ans=0.125 2024-06-21 10:59:33,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=373783.6666666667, ans=0.0 2024-06-21 10:59:34,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=373783.6666666667, ans=0.125 2024-06-21 10:59:36,317 INFO [train.py:1028] (1/2) Epoch 21, batch 1550, loss[loss=0.1995, simple_loss=0.2551, pruned_loss=0.07198, over 13062.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2585, pruned_loss=0.07398, over 2584410.67 frames. ], batch size: 102, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:59:52,616 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.11 vs. limit=15.0 2024-06-21 10:59:57,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.64 vs. limit=10.0 2024-06-21 11:00:03,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=373857.0, ans=0.125 2024-06-21 11:00:22,183 INFO [train.py:1028] (1/2) Epoch 21, batch 1600, loss[loss=0.1807, simple_loss=0.2392, pruned_loss=0.06111, over 13123.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2585, pruned_loss=0.07383, over 2580408.61 frames. ], batch size: 77, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 11:00:31,779 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.068e+02 2.220e+02 2.423e+02 3.644e+02, threshold=4.441e+02, percent-clipped=0.0 2024-06-21 11:00:54,614 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 11:01:01,741 INFO [train.py:1028] (1/2) Epoch 21, batch 1650, loss[loss=0.2242, simple_loss=0.2696, pruned_loss=0.08943, over 13149.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.258, pruned_loss=0.07374, over 2575971.65 frames. ], batch size: 95, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 11:01:01,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=373985.3333333333, ans=0.035 2024-06-21 11:01:25,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=374022.0, ans=0.125 2024-06-21 11:01:46,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=374058.6666666667, ans=0.125 2024-06-21 11:01:47,561 INFO [train.py:1028] (1/2) Epoch 21, batch 1700, loss[loss=0.1997, simple_loss=0.2671, pruned_loss=0.06613, over 12471.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2583, pruned_loss=0.07362, over 2580414.71 frames. 
], batch size: 25, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 11:01:56,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=374095.3333333333, ans=0.05 2024-06-21 11:01:57,502 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.105e+02 2.230e+02 2.399e+02 3.282e+02, threshold=4.461e+02, percent-clipped=0.0 2024-06-21 11:02:06,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=374113.6666666667, ans=0.1 2024-06-21 11:02:08,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=374113.6666666667, ans=0.125 2024-06-21 11:02:30,595 INFO [train.py:1028] (1/2) Epoch 21, batch 1750, loss[loss=0.1985, simple_loss=0.2629, pruned_loss=0.0671, over 12504.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2585, pruned_loss=0.07344, over 2581543.96 frames. ], batch size: 22, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:02:30,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=374168.6666666667, ans=0.125 2024-06-21 11:02:34,300 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.18 vs. limit=15.0 2024-06-21 11:02:35,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.15 vs. limit=15.0 2024-06-21 11:02:39,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=374187.0, ans=0.09899494936611666 2024-06-21 11:02:43,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=374187.0, ans=0.0 2024-06-21 11:02:46,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2024-06-21 11:02:55,861 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.44 vs. limit=15.0 2024-06-21 11:03:04,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=374242.0, ans=0.125 2024-06-21 11:03:08,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=374242.0, ans=0.2 2024-06-21 11:03:10,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=374242.0, ans=0.2 2024-06-21 11:03:12,295 INFO [train.py:1028] (1/2) Epoch 21, batch 1800, loss[loss=0.2006, simple_loss=0.2541, pruned_loss=0.07361, over 13252.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2584, pruned_loss=0.07345, over 2582256.33 frames. 
], batch size: 67, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:03:15,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=374260.3333333333, ans=0.0 2024-06-21 11:03:16,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=374260.3333333333, ans=10.0 2024-06-21 11:03:16,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=374260.3333333333, ans=0.125 2024-06-21 11:03:21,061 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.145e+02 2.352e+02 2.614e+02 3.267e+02, threshold=4.704e+02, percent-clipped=0.0 2024-06-21 11:03:21,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=374278.6666666667, ans=0.0 2024-06-21 11:03:25,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=374278.6666666667, ans=0.05 2024-06-21 11:03:39,921 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2024-06-21 11:03:50,161 INFO [train.py:1028] (1/2) Epoch 21, batch 1850, loss[loss=0.1962, simple_loss=0.2516, pruned_loss=0.07044, over 13217.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2585, pruned_loss=0.07329, over 2583119.77 frames. ], batch size: 83, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:03:50,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=374352.0, ans=0.2 2024-06-21 11:03:54,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=374352.0, ans=0.09899494936611666 2024-06-21 11:04:07,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=374388.6666666667, ans=0.0 2024-06-21 11:04:18,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=374407.0, ans=0.0 2024-06-21 11:04:23,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=374425.3333333333, ans=0.1 2024-06-21 11:04:28,875 INFO [train.py:1028] (1/2) Epoch 21, batch 1900, loss[loss=0.1944, simple_loss=0.2432, pruned_loss=0.07278, over 13180.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2573, pruned_loss=0.07297, over 2585465.96 frames. ], batch size: 95, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:04:32,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=374443.6666666667, ans=0.07 2024-06-21 11:04:42,227 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.101e+02 2.277e+02 2.472e+02 4.132e+02, threshold=4.555e+02, percent-clipped=0.0 2024-06-21 11:05:15,545 INFO [train.py:1028] (1/2) Epoch 21, batch 1950, loss[loss=0.1952, simple_loss=0.2604, pruned_loss=0.06498, over 13199.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2571, pruned_loss=0.07307, over 2591735.52 frames. 
], batch size: 52, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:05:16,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=374535.3333333333, ans=0.05 2024-06-21 11:05:34,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=374572.0, ans=0.0 2024-06-21 11:05:49,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=374608.6666666667, ans=0.125 2024-06-21 11:05:53,726 INFO [train.py:1028] (1/2) Epoch 21, batch 2000, loss[loss=0.1879, simple_loss=0.2514, pruned_loss=0.06217, over 12529.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2573, pruned_loss=0.07326, over 2587619.31 frames. ], batch size: 22, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:05:59,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=374627.0, ans=0.0 2024-06-21 11:06:03,044 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.033e+02 2.147e+02 2.345e+02 3.000e+02, threshold=4.295e+02, percent-clipped=0.0 2024-06-21 11:06:17,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2024-06-21 11:06:28,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=374700.3333333333, ans=0.125 2024-06-21 11:06:31,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=374700.3333333333, ans=0.125 2024-06-21 11:06:33,940 INFO [train.py:1028] (1/2) Epoch 21, batch 2050, loss[loss=0.2031, simple_loss=0.2593, pruned_loss=0.07339, over 12560.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2574, pruned_loss=0.07348, over 2583085.67 frames. ], batch size: 29, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:06:43,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=374737.0, ans=0.125 2024-06-21 11:06:50,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=374755.3333333333, ans=0.025 2024-06-21 11:06:52,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=374755.3333333333, ans=0.125 2024-06-21 11:07:10,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=374792.0, ans=0.125 2024-06-21 11:07:17,352 INFO [train.py:1028] (1/2) Epoch 21, batch 2100, loss[loss=0.1905, simple_loss=0.253, pruned_loss=0.064, over 13194.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2579, pruned_loss=0.07325, over 2585332.90 frames. 
], batch size: 59, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:07:26,968 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.064e+02 2.188e+02 2.367e+02 3.745e+02, threshold=4.376e+02, percent-clipped=0.0 2024-06-21 11:07:28,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=374828.6666666667, ans=0.0 2024-06-21 11:07:47,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=374865.3333333333, ans=0.125 2024-06-21 11:07:47,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=374865.3333333333, ans=0.125 2024-06-21 11:07:59,837 INFO [train.py:1028] (1/2) Epoch 21, batch 2150, loss[loss=0.2031, simple_loss=0.2666, pruned_loss=0.06981, over 13304.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.258, pruned_loss=0.07319, over 2588713.45 frames. ], batch size: 52, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:08:03,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=374902.0, ans=0.125 2024-06-21 11:08:07,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=15.0 2024-06-21 11:08:09,498 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2024-06-21 11:08:14,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.96 vs. limit=10.0 2024-06-21 11:08:23,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=374957.0, ans=0.2 2024-06-21 11:08:27,284 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.68 vs. limit=15.0 2024-06-21 11:08:34,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=374975.3333333333, ans=0.2 2024-06-21 11:08:38,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=374975.3333333333, ans=0.0 2024-06-21 11:08:40,006 INFO [train.py:1028] (1/2) Epoch 21, batch 2200, loss[loss=0.1879, simple_loss=0.2348, pruned_loss=0.07047, over 13226.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2585, pruned_loss=0.0736, over 2588704.55 frames. ], batch size: 83, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:08:41,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=374993.6666666667, ans=0.0 2024-06-21 11:08:44,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=374993.6666666667, ans=0.0 2024-06-21 11:08:46,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.96 vs. 
limit=15.0 2024-06-21 11:08:49,010 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.108e+02 2.260e+02 2.490e+02 3.168e+02, threshold=4.519e+02, percent-clipped=0.0 2024-06-21 11:08:49,980 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 11:08:53,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=375012.0, ans=0.2 2024-06-21 11:08:54,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2024-06-21 11:09:00,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.87 vs. limit=15.0 2024-06-21 11:09:18,903 INFO [train.py:1028] (1/2) Epoch 21, batch 2250, loss[loss=0.2101, simple_loss=0.2685, pruned_loss=0.07579, over 13266.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2585, pruned_loss=0.07354, over 2587064.57 frames. ], batch size: 63, lr: 2.78e-03, grad_scale: 128.0 2024-06-21 11:09:19,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.87 vs. limit=22.5 2024-06-21 11:09:27,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=375103.6666666667, ans=0.125 2024-06-21 11:09:29,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=375103.6666666667, ans=0.125 2024-06-21 11:09:32,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=375103.6666666667, ans=0.125 2024-06-21 11:09:34,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.75 vs. limit=10.0 2024-06-21 11:09:45,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=375140.3333333333, ans=0.0 2024-06-21 11:10:01,095 INFO [train.py:1028] (1/2) Epoch 21, batch 2300, loss[loss=0.2032, simple_loss=0.2691, pruned_loss=0.06867, over 12955.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2586, pruned_loss=0.07338, over 2581667.87 frames. ], batch size: 33, lr: 2.78e-03, grad_scale: 128.0 2024-06-21 11:10:07,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=375177.0, ans=0.0 2024-06-21 11:10:08,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=375195.3333333333, ans=0.125 2024-06-21 11:10:13,493 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.074e+02 2.185e+02 2.372e+02 3.265e+02, threshold=4.371e+02, percent-clipped=0.0 2024-06-21 11:10:26,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=375213.6666666667, ans=0.0 2024-06-21 11:10:28,719 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.04 vs. 
limit=15.0 2024-06-21 11:10:34,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=375250.3333333333, ans=0.125 2024-06-21 11:10:41,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=375250.3333333333, ans=0.1 2024-06-21 11:10:42,716 INFO [train.py:1028] (1/2) Epoch 21, batch 2350, loss[loss=0.1938, simple_loss=0.249, pruned_loss=0.06927, over 13193.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2582, pruned_loss=0.07331, over 2584649.99 frames. ], batch size: 67, lr: 2.78e-03, grad_scale: 128.0 2024-06-21 11:10:44,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=375268.6666666667, ans=0.125 2024-06-21 11:10:47,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=375268.6666666667, ans=0.0 2024-06-21 11:10:49,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=375287.0, ans=0.125 2024-06-21 11:10:51,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2024-06-21 11:10:53,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=375287.0, ans=0.035 2024-06-21 11:11:15,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=375342.0, ans=0.025 2024-06-21 11:11:19,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=375342.0, ans=0.125 2024-06-21 11:11:21,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=15.0 2024-06-21 11:11:21,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=375360.3333333333, ans=0.07 2024-06-21 11:11:22,489 INFO [train.py:1028] (1/2) Epoch 21, batch 2400, loss[loss=0.1963, simple_loss=0.252, pruned_loss=0.0703, over 13339.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.258, pruned_loss=0.0735, over 2587284.36 frames. ], batch size: 46, lr: 2.78e-03, grad_scale: 128.0 2024-06-21 11:11:26,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=375360.3333333333, ans=0.125 2024-06-21 11:11:27,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=375360.3333333333, ans=0.0 2024-06-21 11:11:31,502 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.061e+02 2.233e+02 2.357e+02 3.164e+02, threshold=4.467e+02, percent-clipped=0.0 2024-06-21 11:11:32,041 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. 
limit=15.0 2024-06-21 11:11:38,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=375397.0, ans=0.125 2024-06-21 11:11:58,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=375433.6666666667, ans=0.0 2024-06-21 11:12:02,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=375452.0, ans=0.125 2024-06-21 11:12:02,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=375452.0, ans=0.125 2024-06-21 11:12:03,299 INFO [train.py:1028] (1/2) Epoch 21, batch 2450, loss[loss=0.1992, simple_loss=0.2568, pruned_loss=0.07074, over 13220.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2577, pruned_loss=0.07393, over 2584316.20 frames. ], batch size: 63, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:12:08,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=375452.0, ans=0.1 2024-06-21 11:12:17,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=375470.3333333333, ans=0.2 2024-06-21 11:12:17,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=375470.3333333333, ans=0.0 2024-06-21 11:12:22,670 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 11:12:29,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=375507.0, ans=0.125 2024-06-21 11:12:31,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=375507.0, ans=0.04949747468305833 2024-06-21 11:12:36,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=375507.0, ans=0.1 2024-06-21 11:12:45,281 INFO [train.py:1028] (1/2) Epoch 21, batch 2500, loss[loss=0.1943, simple_loss=0.2444, pruned_loss=0.07205, over 13243.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2567, pruned_loss=0.07348, over 2587167.69 frames. ], batch size: 83, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:12:45,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.14 vs. limit=10.0 2024-06-21 11:12:47,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=375543.6666666667, ans=0.125 2024-06-21 11:12:53,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. 
limit=15.0 2024-06-21 11:12:55,372 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.055e+02 2.206e+02 2.433e+02 2.903e+02, threshold=4.411e+02, percent-clipped=0.0 2024-06-21 11:13:01,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=375580.3333333333, ans=0.125 2024-06-21 11:13:02,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=375580.3333333333, ans=0.0 2024-06-21 11:13:05,076 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.45 vs. limit=15.0 2024-06-21 11:13:05,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.11 vs. limit=22.5 2024-06-21 11:13:12,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=375598.6666666667, ans=0.125 2024-06-21 11:13:15,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=375617.0, ans=0.0 2024-06-21 11:13:22,459 INFO [train.py:1028] (1/2) Epoch 21, batch 2550, loss[loss=0.1964, simple_loss=0.2579, pruned_loss=0.06748, over 12617.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2556, pruned_loss=0.07311, over 2586003.03 frames. ], batch size: 22, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:13:27,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=375635.3333333333, ans=0.2 2024-06-21 11:13:36,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=375653.6666666667, ans=0.0 2024-06-21 11:13:36,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=375653.6666666667, ans=0.0 2024-06-21 11:13:50,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=375690.3333333333, ans=0.0 2024-06-21 11:13:50,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=375690.3333333333, ans=0.0 2024-06-21 11:13:51,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.03 vs. limit=15.0 2024-06-21 11:13:54,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=375690.3333333333, ans=0.0 2024-06-21 11:14:01,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=375708.6666666667, ans=0.1 2024-06-21 11:14:07,061 INFO [train.py:1028] (1/2) Epoch 21, batch 2600, loss[loss=0.1831, simple_loss=0.2355, pruned_loss=0.06536, over 13260.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2546, pruned_loss=0.07283, over 2586717.09 frames. 
], batch size: 52, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:14:17,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=375727.0, ans=0.0 2024-06-21 11:14:21,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=375727.0, ans=0.125 2024-06-21 11:14:22,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=375745.3333333333, ans=0.025 2024-06-21 11:14:26,387 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.061e+02 2.159e+02 2.335e+02 2.992e+02, threshold=4.318e+02, percent-clipped=0.0 2024-06-21 11:14:33,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0 2024-06-21 11:14:36,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=375763.6666666667, ans=0.025 2024-06-21 11:15:10,850 INFO [train.py:1028] (1/2) Epoch 21, batch 2650, loss[loss=0.1704, simple_loss=0.219, pruned_loss=0.06091, over 13028.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2531, pruned_loss=0.07229, over 2587443.39 frames. ], batch size: 144, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:15:14,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=375818.6666666667, ans=0.1 2024-06-21 11:15:14,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=375818.6666666667, ans=0.1 2024-06-21 11:15:18,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=375818.6666666667, ans=0.0 2024-06-21 11:15:26,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=375837.0, ans=0.1 2024-06-21 11:15:56,698 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.17 vs. limit=15.0 2024-06-21 11:15:57,764 INFO [train.py:1028] (1/2) Epoch 21, batch 2700, loss[loss=0.2052, simple_loss=0.2547, pruned_loss=0.07785, over 13242.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2518, pruned_loss=0.07203, over 2585181.00 frames. ], batch size: 89, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:15:57,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=375910.3333333333, ans=0.0 2024-06-21 11:15:59,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.12 vs. limit=22.5 2024-06-21 11:16:01,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=375910.3333333333, ans=0.2 2024-06-21 11:16:03,495 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 11:16:04,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.26 vs. 
limit=6.0 2024-06-21 11:16:08,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.26 vs. limit=22.5 2024-06-21 11:16:08,566 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.047e+02 2.185e+02 2.378e+02 3.062e+02, threshold=4.369e+02, percent-clipped=0.0 2024-06-21 11:16:14,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2024-06-21 11:16:16,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375947.0, ans=0.1 2024-06-21 11:16:16,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=375947.0, ans=0.125 2024-06-21 11:16:17,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=375947.0, ans=0.07 2024-06-21 11:16:45,561 INFO [train.py:1028] (1/2) Epoch 21, batch 2750, loss[loss=0.1834, simple_loss=0.2382, pruned_loss=0.06429, over 13211.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2507, pruned_loss=0.07159, over 2582086.96 frames. ], batch size: 43, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:17:09,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=376020.3333333333, ans=0.2 2024-06-21 11:17:44,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=376075.3333333333, ans=0.125 2024-06-21 11:17:50,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=8.0 2024-06-21 11:17:51,703 INFO [train.py:1028] (1/2) Epoch 21, batch 2800, loss[loss=0.2012, simple_loss=0.2434, pruned_loss=0.07953, over 11025.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2504, pruned_loss=0.07153, over 2580625.30 frames. ], batch size: 304, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:17:53,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=376093.6666666667, ans=0.2 2024-06-21 11:18:04,188 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.066e+02 2.250e+02 2.456e+02 4.507e+02, threshold=4.501e+02, percent-clipped=1.0 2024-06-21 11:18:07,581 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.76 vs. limit=10.0 2024-06-21 11:18:34,738 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 11:18:35,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376167.0, ans=0.1 2024-06-21 11:18:35,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=376167.0, ans=0.125 2024-06-21 11:18:42,537 INFO [train.py:1028] (1/2) Epoch 21, batch 2850, loss[loss=0.2028, simple_loss=0.2591, pruned_loss=0.0732, over 13297.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2496, pruned_loss=0.07141, over 2578844.68 frames. 
], batch size: 49, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:18:47,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=376185.3333333333, ans=0.0 2024-06-21 11:19:22,148 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.33 vs. limit=22.5 2024-06-21 11:19:26,134 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=15.0 2024-06-21 11:19:30,911 INFO [train.py:1028] (1/2) Epoch 21, batch 2900, loss[loss=0.1888, simple_loss=0.2443, pruned_loss=0.06667, over 13147.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2481, pruned_loss=0.07076, over 2585820.96 frames. ], batch size: 55, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:19:45,856 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 2.033e+02 2.161e+02 2.416e+02 3.249e+02, threshold=4.321e+02, percent-clipped=0.0 2024-06-21 11:19:48,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=376295.3333333333, ans=0.125 2024-06-21 11:19:59,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=376313.6666666667, ans=0.0 2024-06-21 11:20:09,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=376332.0, ans=0.0 2024-06-21 11:20:15,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=376332.0, ans=0.125 2024-06-21 11:20:27,829 INFO [train.py:1028] (1/2) Epoch 21, batch 2950, loss[loss=0.205, simple_loss=0.2505, pruned_loss=0.07977, over 13253.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2472, pruned_loss=0.07051, over 2578502.85 frames. ], batch size: 43, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:20:28,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=376368.6666666667, ans=0.0 2024-06-21 11:21:12,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=376423.6666666667, ans=0.125 2024-06-21 11:21:27,913 INFO [train.py:1028] (1/2) Epoch 21, batch 3000, loss[loss=0.1885, simple_loss=0.2445, pruned_loss=0.06627, over 13208.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2465, pruned_loss=0.07025, over 2576388.94 frames. ], batch size: 59, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:21:27,914 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 11:21:40,212 INFO [train.py:1060] (1/2) Epoch 21, validation: loss=0.1865, simple_loss=0.2506, pruned_loss=0.06124, over 351949.00 frames. 2024-06-21 11:21:40,213 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 11:21:43,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=376460.3333333333, ans=0.0 2024-06-21 11:21:54,016 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.016e+02 2.141e+02 2.320e+02 3.810e+02, threshold=4.281e+02, percent-clipped=0.0 2024-06-21 11:22:01,680 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.92 vs. 
limit=15.0 2024-06-21 11:22:06,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2024-06-21 11:22:22,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2024-06-21 11:22:33,177 INFO [train.py:1028] (1/2) Epoch 21, batch 3050, loss[loss=0.1915, simple_loss=0.2428, pruned_loss=0.0701, over 13346.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.246, pruned_loss=0.07027, over 2576658.09 frames. ], batch size: 46, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:22:34,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=376552.0, ans=0.125 2024-06-21 11:22:36,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2024-06-21 11:22:37,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=376552.0, ans=0.125 2024-06-21 11:22:42,846 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2024-06-21 11:23:23,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=376625.3333333333, ans=0.0 2024-06-21 11:23:28,242 INFO [train.py:1028] (1/2) Epoch 21, batch 3100, loss[loss=0.1853, simple_loss=0.232, pruned_loss=0.06934, over 13021.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2447, pruned_loss=0.06977, over 2578109.31 frames. ], batch size: 144, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:23:30,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376643.6666666667, ans=0.1 2024-06-21 11:23:40,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=376643.6666666667, ans=0.025 2024-06-21 11:23:41,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=376643.6666666667, ans=0.125 2024-06-21 11:23:43,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=376662.0, ans=0.0 2024-06-21 11:23:43,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=376662.0, ans=0.07 2024-06-21 11:23:47,672 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 2.042e+02 2.176e+02 2.465e+02 3.504e+02, threshold=4.352e+02, percent-clipped=0.0 2024-06-21 11:23:52,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=376662.0, ans=0.0 2024-06-21 11:24:19,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=376717.0, ans=0.2 2024-06-21 11:24:26,700 INFO [train.py:1028] (1/2) Epoch 21, batch 3150, loss[loss=0.1856, simple_loss=0.2343, pruned_loss=0.06841, over 12941.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2439, pruned_loss=0.06935, over 2581525.60 frames. 
], batch size: 158, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:24:35,494 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=12.0 2024-06-21 11:25:14,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=376808.6666666667, ans=0.125 2024-06-21 11:25:16,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.38 vs. limit=15.0 2024-06-21 11:25:17,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=376808.6666666667, ans=0.125 2024-06-21 11:25:20,004 INFO [train.py:1028] (1/2) Epoch 21, batch 3200, loss[loss=0.1809, simple_loss=0.2371, pruned_loss=0.06236, over 13133.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2434, pruned_loss=0.06918, over 2582779.29 frames. ], batch size: 55, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:25:33,965 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.040e+02 2.208e+02 2.423e+02 3.162e+02, threshold=4.417e+02, percent-clipped=0.0 2024-06-21 11:25:39,995 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.12 vs. limit=15.0 2024-06-21 11:25:54,136 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=15.0 2024-06-21 11:26:03,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=376900.3333333333, ans=0.05 2024-06-21 11:26:10,423 INFO [train.py:1028] (1/2) Epoch 21, batch 3250, loss[loss=0.1812, simple_loss=0.2398, pruned_loss=0.06128, over 13260.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2427, pruned_loss=0.069, over 2587604.52 frames. ], batch size: 72, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:26:47,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376955.3333333333, ans=0.1 2024-06-21 11:26:59,672 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.80 vs. limit=15.0 2024-06-21 11:27:14,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=376992.0, ans=0.125 2024-06-21 11:27:21,236 INFO [train.py:1028] (1/2) Epoch 21, batch 3300, loss[loss=0.201, simple_loss=0.2478, pruned_loss=0.07713, over 12765.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2421, pruned_loss=0.0687, over 2583381.42 frames. 
], batch size: 176, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:27:34,090 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 1.982e+02 2.127e+02 2.237e+02 3.220e+02, threshold=4.255e+02, percent-clipped=0.0 2024-06-21 11:27:46,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=377047.0, ans=0.125 2024-06-21 11:27:56,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=377083.6666666667, ans=0.1 2024-06-21 11:28:03,801 INFO [train.py:1028] (1/2) Epoch 21, batch 3350, loss[loss=0.1846, simple_loss=0.233, pruned_loss=0.06806, over 12971.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.242, pruned_loss=0.06885, over 2577348.88 frames. ], batch size: 158, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:28:40,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=377157.0, ans=0.2 2024-06-21 11:28:47,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.23 vs. limit=15.0 2024-06-21 11:28:51,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=377175.3333333333, ans=0.125 2024-06-21 11:28:56,003 INFO [train.py:1028] (1/2) Epoch 21, batch 3400, loss[loss=0.2147, simple_loss=0.2694, pruned_loss=0.07998, over 12433.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2419, pruned_loss=0.06922, over 2575386.55 frames. ], batch size: 22, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:28:57,715 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.76 vs. limit=10.0 2024-06-21 11:29:17,014 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 1.997e+02 2.093e+02 2.252e+02 3.299e+02, threshold=4.187e+02, percent-clipped=0.0 2024-06-21 11:30:03,716 INFO [train.py:1028] (1/2) Epoch 21, batch 3450, loss[loss=0.2093, simple_loss=0.261, pruned_loss=0.07881, over 12795.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.241, pruned_loss=0.06879, over 2575866.37 frames. ], batch size: 176, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:30:48,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=377358.6666666667, ans=0.0 2024-06-21 11:30:57,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.33 vs. limit=15.0 2024-06-21 11:30:58,276 INFO [train.py:1028] (1/2) Epoch 21, batch 3500, loss[loss=0.1885, simple_loss=0.2387, pruned_loss=0.06912, over 12952.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2403, pruned_loss=0.06819, over 2575748.23 frames. ], batch size: 33, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:31:12,306 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 1.995e+02 2.135e+02 2.298e+02 2.834e+02, threshold=4.270e+02, percent-clipped=0.0 2024-06-21 11:31:16,437 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.60 vs. 
limit=15.0 2024-06-21 11:31:24,748 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2024-06-21 11:31:27,560 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.512e-03 2024-06-21 11:31:30,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=377432.0, ans=0.125 2024-06-21 11:31:31,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=377432.0, ans=0.125 2024-06-21 11:31:33,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=377432.0, ans=0.125 2024-06-21 11:31:44,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=377450.3333333333, ans=0.0 2024-06-21 11:31:51,531 INFO [train.py:1028] (1/2) Epoch 21, batch 3550, loss[loss=0.1809, simple_loss=0.2326, pruned_loss=0.06459, over 13158.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2395, pruned_loss=0.06774, over 2576621.38 frames. ], batch size: 95, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:31:54,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=377468.6666666667, ans=0.04949747468305833 2024-06-21 11:32:10,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=377505.3333333333, ans=0.0 2024-06-21 11:32:26,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=377523.6666666667, ans=0.125 2024-06-21 11:32:46,101 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.05 vs. limit=6.0 2024-06-21 11:32:47,314 INFO [train.py:1028] (1/2) Epoch 21, batch 3600, loss[loss=0.2, simple_loss=0.2508, pruned_loss=0.07456, over 13312.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2393, pruned_loss=0.06777, over 2580260.24 frames. ], batch size: 49, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:33:02,463 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 2.011e+02 2.156e+02 2.362e+02 3.391e+02, threshold=4.311e+02, percent-clipped=0.0 2024-06-21 11:33:16,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=377597.0, ans=0.125 2024-06-21 11:33:26,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=377615.3333333333, ans=0.125 2024-06-21 11:33:46,632 INFO [train.py:1028] (1/2) Epoch 21, batch 3650, loss[loss=0.1783, simple_loss=0.2263, pruned_loss=0.06515, over 13037.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2388, pruned_loss=0.06723, over 2577736.14 frames. ], batch size: 102, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:33:47,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.24 vs. 
2024-06-21 11:33:56,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=377670.3333333333, ans=0.125
2024-06-21 11:34:04,737 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5
2024-06-21 11:34:10,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0
2024-06-21 11:34:10,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=377688.6666666667, ans=10.0
2024-06-21 11:34:20,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=377707.0, ans=0.125
2024-06-21 11:34:23,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=377707.0, ans=0.2
2024-06-21 11:34:25,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=377725.3333333333, ans=0.0
2024-06-21 11:34:35,476 INFO [train.py:1028] (1/2) Epoch 21, batch 3700, loss[loss=0.1854, simple_loss=0.2414, pruned_loss=0.06472, over 13262.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2382, pruned_loss=0.06698, over 2582613.30 frames. ], batch size: 72, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:34:35,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=377743.6666666667, ans=0.2
2024-06-21 11:34:40,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.14 vs. limit=22.5
2024-06-21 11:34:46,336 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 1.973e+02 2.105e+02 2.278e+02 3.267e+02, threshold=4.211e+02, percent-clipped=0.0
2024-06-21 11:34:52,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=377780.3333333333, ans=0.2
2024-06-21 11:35:01,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=377798.6666666667, ans=0.125
2024-06-21 11:35:15,044 INFO [train.py:1028] (1/2) Epoch 21, batch 3750, loss[loss=0.1991, simple_loss=0.2508, pruned_loss=0.07371, over 12938.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.238, pruned_loss=0.06687, over 2584814.78 frames. ], batch size: 23, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:35:22,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=377835.3333333333, ans=0.09899494936611666
2024-06-21 11:35:30,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=377853.6666666667, ans=0.125
2024-06-21 11:35:50,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.88 vs. limit=15.0
2024-06-21 11:35:59,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=377890.3333333333, ans=0.125
2024-06-21 11:36:21,244 INFO [train.py:1028] (1/2) Epoch 21, batch 3800, loss[loss=0.2001, simple_loss=0.25, pruned_loss=0.07511, over 13187.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2382, pruned_loss=0.06689, over 2582803.20 frames. ], batch size: 83, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:36:28,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=377927.0, ans=0.0
2024-06-21 11:36:34,630 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 1.971e+02 2.075e+02 2.219e+02 2.925e+02, threshold=4.151e+02, percent-clipped=0.0
2024-06-21 11:36:41,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=377963.6666666667, ans=0.2
2024-06-21 11:36:49,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=377963.6666666667, ans=0.0
2024-06-21 11:36:51,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=377982.0, ans=0.07
2024-06-21 11:36:58,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=377982.0, ans=0.125
2024-06-21 11:37:03,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=378000.3333333333, ans=0.125
2024-06-21 11:37:10,559 INFO [train.py:1028] (1/2) Epoch 21, batch 3850, loss[loss=0.2007, simple_loss=0.2457, pruned_loss=0.0779, over 13058.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2381, pruned_loss=0.0667, over 2584874.11 frames. ], batch size: 144, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:37:10,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=378018.6666666667, ans=0.2
2024-06-21 11:37:11,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=378018.6666666667, ans=0.0
2024-06-21 11:37:19,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=378018.6666666667, ans=0.025
2024-06-21 11:37:28,500 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.08 vs. limit=22.5
2024-06-21 11:37:31,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=378055.3333333333, ans=0.0
2024-06-21 11:37:50,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=378073.6666666667, ans=0.125
2024-06-21 11:37:52,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=378092.0, ans=0.025
2024-06-21 11:37:55,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=378092.0, ans=0.2
2024-06-21 11:38:03,131 INFO [train.py:1028] (1/2) Epoch 21, batch 3900, loss[loss=0.1961, simple_loss=0.2416, pruned_loss=0.07535, over 13205.00 frames. ], tot_loss[loss=0.186, simple_loss=0.238, pruned_loss=0.06695, over 2587804.64 frames. ], batch size: 83, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:38:04,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=378110.3333333333, ans=0.0
2024-06-21 11:38:16,215 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.783e+02 1.960e+02 2.140e+02 2.304e+02 3.178e+02, threshold=4.280e+02, percent-clipped=0.0
2024-06-21 11:38:16,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=378128.6666666667, ans=0.0
2024-06-21 11:38:22,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=378147.0, ans=0.125
2024-06-21 11:38:31,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=378147.0, ans=0.2
2024-06-21 11:38:41,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=378165.3333333333, ans=0.1
2024-06-21 11:38:44,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=378183.6666666667, ans=0.025
2024-06-21 11:38:50,701 INFO [train.py:1028] (1/2) Epoch 21, batch 3950, loss[loss=0.1858, simple_loss=0.2322, pruned_loss=0.0697, over 13075.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2378, pruned_loss=0.06692, over 2589846.04 frames. ], batch size: 132, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:38:51,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=378202.0, ans=0.2
2024-06-21 11:38:56,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.13 vs. limit=15.0
2024-06-21 11:38:59,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0
2024-06-21 11:39:31,682 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.79 vs. limit=15.0
2024-06-21 11:39:43,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=378257.0, ans=0.05
2024-06-21 11:39:58,581 INFO [train.py:1028] (1/2) Epoch 21, batch 4000, loss[loss=0.1965, simple_loss=0.2461, pruned_loss=0.07348, over 12942.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2375, pruned_loss=0.06699, over 2584981.21 frames. ], batch size: 39, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:39:59,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=378293.6666666667, ans=0.2
2024-06-21 11:40:11,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=378312.0, ans=0.125
2024-06-21 11:40:12,279 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 1.997e+02 2.118e+02 2.321e+02 2.856e+02, threshold=4.236e+02, percent-clipped=0.0
2024-06-21 11:40:17,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=378312.0, ans=0.025
2024-06-21 11:40:24,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=378330.3333333333, ans=0.125
2024-06-21 11:40:25,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=378330.3333333333, ans=0.125
2024-06-21 11:40:29,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378348.6666666667, ans=0.1
2024-06-21 11:40:31,668 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.21 vs. limit=15.0
2024-06-21 11:40:36,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378348.6666666667, ans=0.1
2024-06-21 11:40:39,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378367.0, ans=0.1
2024-06-21 11:40:48,496 INFO [train.py:1028] (1/2) Epoch 21, batch 4050, loss[loss=0.215, simple_loss=0.2459, pruned_loss=0.09202, over 11098.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2373, pruned_loss=0.06711, over 2582564.58 frames. ], batch size: 304, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:40:49,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=378385.3333333333, ans=0.0
2024-06-21 11:41:10,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=378422.0, ans=15.0
2024-06-21 11:41:14,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=378422.0, ans=0.125
2024-06-21 11:41:27,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=378458.6666666667, ans=0.0
2024-06-21 11:41:39,257 INFO [train.py:1028] (1/2) Epoch 21, batch 4100, loss[loss=0.1863, simple_loss=0.2277, pruned_loss=0.07246, over 13189.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2372, pruned_loss=0.06757, over 2579262.11 frames. ], batch size: 103, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:41:51,633 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.007e+02 2.113e+02 2.348e+02 3.014e+02, threshold=4.226e+02, percent-clipped=0.0
2024-06-21 11:42:05,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=378513.6666666667, ans=0.02
2024-06-21 11:42:13,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=378532.0, ans=0.0
2024-06-21 11:42:33,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=378568.6666666667, ans=0.09899494936611666
2024-06-21 11:42:33,844 INFO [train.py:1028] (1/2) Epoch 21, batch 4150, loss[loss=0.1906, simple_loss=0.2407, pruned_loss=0.0702, over 13126.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2375, pruned_loss=0.06767, over 2578607.61 frames. ], batch size: 55, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:42:38,567 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 11:42:46,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=378587.0, ans=0.125
2024-06-21 11:42:51,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=378587.0, ans=0.125
2024-06-21 11:42:57,489 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 11:43:13,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=378642.0, ans=0.025
2024-06-21 11:43:18,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=378642.0, ans=0.07
2024-06-21 11:43:21,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=378660.3333333333, ans=0.2
2024-06-21 11:43:22,130 INFO [train.py:1028] (1/2) Epoch 21, batch 4200, loss[loss=0.1906, simple_loss=0.2398, pruned_loss=0.07065, over 13102.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2369, pruned_loss=0.06735, over 2580911.06 frames. ], batch size: 103, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:43:22,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378660.3333333333, ans=0.1
2024-06-21 11:43:35,566 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 1.929e+02 2.064e+02 2.185e+02 2.833e+02, threshold=4.129e+02, percent-clipped=0.0
2024-06-21 11:43:47,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0
2024-06-21 11:44:01,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=378715.3333333333, ans=0.025
2024-06-21 11:44:13,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=378733.6666666667, ans=0.5
2024-06-21 11:44:15,209 INFO [train.py:1028] (1/2) Epoch 21, batch 4250, loss[loss=0.1748, simple_loss=0.2304, pruned_loss=0.0596, over 13311.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2361, pruned_loss=0.06689, over 2582225.55 frames. ], batch size: 46, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:44:15,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=378752.0, ans=0.05
2024-06-21 11:44:17,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=378752.0, ans=0.2
2024-06-21 11:44:21,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=378752.0, ans=0.125
2024-06-21 11:44:31,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=378770.3333333333, ans=0.2
2024-06-21 11:44:36,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378788.6666666667, ans=0.1
2024-06-21 11:44:43,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.78 vs. limit=15.0
2024-06-21 11:44:55,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378825.3333333333, ans=0.1
2024-06-21 11:44:55,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=378825.3333333333, ans=0.125
2024-06-21 11:44:56,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378825.3333333333, ans=0.1
2024-06-21 11:44:56,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.96 vs. limit=15.0
2024-06-21 11:45:00,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=12.0
2024-06-21 11:45:00,643 INFO [train.py:1028] (1/2) Epoch 21, batch 4300, loss[loss=0.1735, simple_loss=0.2252, pruned_loss=0.06087, over 13145.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2353, pruned_loss=0.06657, over 2581498.24 frames. ], batch size: 59, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:45:08,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=378843.6666666667, ans=0.2
2024-06-21 11:45:21,275 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.993e+02 2.074e+02 2.276e+02 4.053e+02, threshold=4.148e+02, percent-clipped=0.0
2024-06-21 11:45:27,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=378862.0, ans=0.2
2024-06-21 11:45:46,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=378898.6666666667, ans=0.125
2024-06-21 11:46:00,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.71 vs. limit=15.0
2024-06-21 11:46:06,552 INFO [train.py:1028] (1/2) Epoch 21, batch 4350, loss[loss=0.1844, simple_loss=0.2381, pruned_loss=0.06528, over 13205.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2352, pruned_loss=0.06659, over 2586188.70 frames. ], batch size: 59, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:46:08,262 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=15.0
2024-06-21 11:46:39,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=378990.3333333333, ans=0.125
2024-06-21 11:46:50,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=379008.6666666667, ans=0.035
2024-06-21 11:46:57,162 INFO [train.py:1028] (1/2) Epoch 21, batch 4400, loss[loss=0.1868, simple_loss=0.2252, pruned_loss=0.0742, over 13198.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2355, pruned_loss=0.0668, over 2585432.28 frames. ], batch size: 83, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:47:03,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=379027.0, ans=0.125
2024-06-21 11:47:03,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=379027.0, ans=0.2
2024-06-21 11:47:06,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.16 vs. limit=22.5
2024-06-21 11:47:09,804 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 2.025e+02 2.119e+02 2.257e+02 3.901e+02, threshold=4.239e+02, percent-clipped=0.0
2024-06-21 11:47:10,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=379045.3333333333, ans=0.125
2024-06-21 11:47:17,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=379063.6666666667, ans=0.125
2024-06-21 11:47:42,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=379100.3333333333, ans=0.0
2024-06-21 11:47:47,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=379100.3333333333, ans=0.0
2024-06-21 11:47:51,670 INFO [train.py:1028] (1/2) Epoch 21, batch 4450, loss[loss=0.1487, simple_loss=0.204, pruned_loss=0.04676, over 12922.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2359, pruned_loss=0.0672, over 2580108.08 frames. ], batch size: 33, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:48:01,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=379137.0, ans=0.0
2024-06-21 11:48:01,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=379137.0, ans=0.0
2024-06-21 11:48:02,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=379137.0, ans=0.125
2024-06-21 11:48:07,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=379137.0, ans=0.2
2024-06-21 11:48:11,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=379155.3333333333, ans=0.1
2024-06-21 11:48:11,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=379155.3333333333, ans=0.2
2024-06-21 11:48:16,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=379155.3333333333, ans=0.0
2024-06-21 11:48:18,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=379155.3333333333, ans=0.0
2024-06-21 11:48:54,355 INFO [train.py:1028] (1/2) Epoch 21, batch 4500, loss[loss=0.1953, simple_loss=0.2394, pruned_loss=0.07561, over 13266.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2352, pruned_loss=0.06672, over 2584944.02 frames. ], batch size: 89, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:49:04,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=379228.6666666667, ans=0.125
2024-06-21 11:49:07,967 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 1.953e+02 2.087e+02 2.290e+02 2.681e+02, threshold=4.175e+02, percent-clipped=0.0
2024-06-21 11:49:08,262 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 11:49:24,728 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0
2024-06-21 11:49:26,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.55 vs. limit=22.5
2024-06-21 11:49:28,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=379265.3333333333, ans=0.0
2024-06-21 11:49:37,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=379283.6666666667, ans=0.125
2024-06-21 11:49:38,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=379283.6666666667, ans=0.125
2024-06-21 11:49:45,359 INFO [train.py:1028] (1/2) Epoch 21, batch 4550, loss[loss=0.1658, simple_loss=0.2193, pruned_loss=0.05617, over 13284.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.235, pruned_loss=0.06658, over 2588287.67 frames. ], batch size: 52, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:49:46,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=379302.0, ans=0.125
2024-06-21 11:50:10,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=379338.6666666667, ans=0.07
2024-06-21 11:50:17,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=379357.0, ans=0.125
2024-06-21 11:50:25,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=379375.3333333333, ans=0.125
2024-06-21 11:50:28,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=379375.3333333333, ans=10.0
2024-06-21 11:50:32,378 INFO [train.py:1028] (1/2) Epoch 21, batch 4600, loss[loss=0.1913, simple_loss=0.2422, pruned_loss=0.07016, over 12528.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.235, pruned_loss=0.06643, over 2583020.62 frames. ], batch size: 202, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:50:43,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=379412.0, ans=0.0
2024-06-21 11:50:45,303 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.978e+02 2.104e+02 2.299e+02 3.681e+02, threshold=4.208e+02, percent-clipped=0.0
2024-06-21 11:50:47,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=379412.0, ans=0.125
2024-06-21 11:51:07,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379448.6666666667, ans=0.1
2024-06-21 11:51:20,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=379467.0, ans=0.125
2024-06-21 11:51:22,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=379485.3333333333, ans=0.125
2024-06-21 11:51:23,515 INFO [train.py:1028] (1/2) Epoch 21, batch 4650, loss[loss=0.1781, simple_loss=0.2234, pruned_loss=0.06637, over 13172.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.235, pruned_loss=0.06646, over 2586295.02 frames. ], batch size: 132, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:51:58,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=379522.0, ans=0.0
2024-06-21 11:52:03,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=379522.0, ans=0.125
2024-06-21 11:52:08,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=379540.3333333333, ans=0.0
2024-06-21 11:52:23,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=379558.6666666667, ans=0.125
2024-06-21 11:52:26,433 INFO [train.py:1028] (1/2) Epoch 21, batch 4700, loss[loss=0.194, simple_loss=0.2385, pruned_loss=0.07476, over 12301.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2349, pruned_loss=0.06658, over 2581665.63 frames. ], batch size: 25, lr: 2.76e-03, grad_scale: 128.0
2024-06-21 11:52:28,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=379577.0, ans=0.025
2024-06-21 11:52:36,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.95 vs. limit=15.0
2024-06-21 11:52:40,416 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.021e+02 2.119e+02 2.216e+02 3.354e+02, threshold=4.238e+02, percent-clipped=0.0
2024-06-21 11:52:43,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=379595.3333333333, ans=0.2
2024-06-21 11:52:45,489 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0
2024-06-21 11:52:55,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=379613.6666666667, ans=0.0
2024-06-21 11:52:57,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=379632.0, ans=0.1
2024-06-21 11:53:03,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0
2024-06-21 11:53:15,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=379650.3333333333, ans=0.05
2024-06-21 11:53:17,938 INFO [train.py:1028] (1/2) Epoch 21, batch 4750, loss[loss=0.1806, simple_loss=0.2307, pruned_loss=0.06527, over 12522.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2348, pruned_loss=0.06685, over 2578098.85 frames. ], batch size: 202, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:53:27,037 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.58 vs. limit=15.0
2024-06-21 11:53:32,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=379687.0, ans=0.2
2024-06-21 11:53:44,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=379705.3333333333, ans=0.125
2024-06-21 11:53:48,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0
2024-06-21 11:53:56,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=379723.6666666667, ans=0.0
2024-06-21 11:54:06,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=379742.0, ans=0.0
2024-06-21 11:54:06,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=379760.3333333333, ans=0.2
2024-06-21 11:54:07,397 INFO [train.py:1028] (1/2) Epoch 21, batch 4800, loss[loss=0.1542, simple_loss=0.2032, pruned_loss=0.05259, over 13308.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2345, pruned_loss=0.06656, over 2574727.25 frames. ], batch size: 63, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:54:22,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=379778.6666666667, ans=0.125
2024-06-21 11:54:22,380 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.75 vs. limit=15.0
2024-06-21 11:54:22,601 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.001e+02 2.155e+02 2.349e+02 3.104e+02, threshold=4.310e+02, percent-clipped=0.0
2024-06-21 11:54:28,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=379797.0, ans=0.125
2024-06-21 11:54:50,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=379815.3333333333, ans=0.0
2024-06-21 11:54:56,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=379833.6666666667, ans=0.2
2024-06-21 11:54:58,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.49 vs. limit=15.0
2024-06-21 11:55:04,017 INFO [train.py:1028] (1/2) Epoch 21, batch 4850, loss[loss=0.1669, simple_loss=0.2148, pruned_loss=0.05955, over 13244.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2346, pruned_loss=0.06666, over 2573448.39 frames. ], batch size: 89, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:55:04,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=379852.0, ans=0.05
2024-06-21 11:55:07,760 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.27 vs. limit=22.5
2024-06-21 11:55:32,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=379888.6666666667, ans=0.125
2024-06-21 11:55:45,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=379907.0, ans=0.015
2024-06-21 11:55:49,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=379925.3333333333, ans=0.0
2024-06-21 11:55:54,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=379925.3333333333, ans=0.0
2024-06-21 11:55:59,165 INFO [train.py:1028] (1/2) Epoch 21, batch 4900, loss[loss=0.1695, simple_loss=0.2281, pruned_loss=0.05543, over 13200.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2347, pruned_loss=0.06671, over 2574322.61 frames. ], batch size: 59, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:56:07,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=379962.0, ans=0.09899494936611666
2024-06-21 11:56:08,844 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.960e+02 2.125e+02 2.332e+02 3.281e+02, threshold=4.250e+02, percent-clipped=0.0
2024-06-21 11:56:27,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=380017.0, ans=0.0
2024-06-21 11:56:33,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=380017.0, ans=0.0
2024-06-21 11:56:34,659 INFO [train.py:1028] (1/2) Epoch 21, batch 4950, loss[loss=0.2105, simple_loss=0.2471, pruned_loss=0.08694, over 10983.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2345, pruned_loss=0.06688, over 2569479.18 frames. ], batch size: 303, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:56:48,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=380053.6666666667, ans=0.0
2024-06-21 11:56:53,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=380072.0, ans=0.0
2024-06-21 11:57:00,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=380072.0, ans=0.025
2024-06-21 11:57:31,190 INFO [train.py:1028] (1/2) Epoch 21, batch 5000, loss[loss=0.1737, simple_loss=0.2164, pruned_loss=0.06543, over 13180.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2344, pruned_loss=0.06658, over 2574053.48 frames. ], batch size: 95, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:57:48,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=380145.3333333333, ans=0.0
2024-06-21 11:57:48,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=380145.3333333333, ans=0.2
2024-06-21 11:57:50,090 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.68 vs. limit=6.0
2024-06-21 11:57:50,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=380145.3333333333, ans=0.09899494936611666
2024-06-21 11:57:53,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.39 vs. limit=15.0
2024-06-21 11:57:54,187 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.008e+02 2.122e+02 2.300e+02 4.136e+02, threshold=4.244e+02, percent-clipped=0.0
2024-06-21 11:57:55,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=380145.3333333333, ans=0.125
2024-06-21 11:57:58,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0
2024-06-21 11:58:00,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=380163.6666666667, ans=0.0
2024-06-21 11:58:03,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380163.6666666667, ans=0.1
2024-06-21 11:58:20,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=380182.0, ans=0.2
2024-06-21 11:58:32,040 INFO [train.py:1028] (1/2) Epoch 21, batch 5050, loss[loss=0.1917, simple_loss=0.2483, pruned_loss=0.06758, over 12952.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2344, pruned_loss=0.06624, over 2574782.38 frames. ], batch size: 36, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:58:33,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=380218.6666666667, ans=0.2
2024-06-21 11:58:42,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380237.0, ans=0.1
2024-06-21 11:58:44,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.93 vs. limit=6.0
2024-06-21 11:58:44,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=380237.0, ans=0.0
2024-06-21 11:58:48,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.05 vs. limit=15.0
2024-06-21 11:58:58,041 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5
2024-06-21 11:59:03,821 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.54 vs. limit=15.0
2024-06-21 11:59:05,392 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 11:59:19,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=380292.0, ans=0.125
2024-06-21 11:59:24,685 INFO [train.py:1028] (1/2) Epoch 21, batch 5100, loss[loss=0.1835, simple_loss=0.2479, pruned_loss=0.05962, over 12874.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.234, pruned_loss=0.06636, over 2571020.54 frames. ], batch size: 39, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:59:26,050 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.84 vs. limit=12.0
2024-06-21 11:59:39,467 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 1.969e+02 2.185e+02 2.408e+02 3.009e+02, threshold=4.369e+02, percent-clipped=0.0
2024-06-21 11:59:41,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=380328.6666666667, ans=0.125
2024-06-21 11:59:48,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=380347.0, ans=0.125
2024-06-21 12:00:01,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=380365.3333333333, ans=0.125
2024-06-21 12:00:11,161 INFO [train.py:1028] (1/2) Epoch 21, batch 5150, loss[loss=0.1689, simple_loss=0.2149, pruned_loss=0.06148, over 13084.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2339, pruned_loss=0.06653, over 2572879.88 frames. ], batch size: 132, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:00:14,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.40 vs. limit=22.5
2024-06-21 12:00:14,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=380402.0, ans=0.2
2024-06-21 12:00:30,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=380420.3333333333, ans=0.0
2024-06-21 12:00:31,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380420.3333333333, ans=0.1
2024-06-21 12:00:35,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=380420.3333333333, ans=0.1
2024-06-21 12:00:39,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=380438.6666666667, ans=0.125
2024-06-21 12:00:43,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380438.6666666667, ans=0.1
2024-06-21 12:00:58,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=380475.3333333333, ans=12.0
2024-06-21 12:01:02,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.38 vs. limit=15.0
2024-06-21 12:01:07,163 INFO [train.py:1028] (1/2) Epoch 21, batch 5200, loss[loss=0.1686, simple_loss=0.2186, pruned_loss=0.05933, over 13210.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2343, pruned_loss=0.06645, over 2575690.03 frames. ], batch size: 95, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:01:08,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=380493.6666666667, ans=0.1
2024-06-21 12:01:12,806 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.01 vs. limit=22.5
2024-06-21 12:01:15,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380493.6666666667, ans=0.1
2024-06-21 12:01:21,407 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 1.959e+02 2.069e+02 2.215e+02 3.282e+02, threshold=4.139e+02, percent-clipped=0.0
2024-06-21 12:01:50,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380567.0, ans=0.1
2024-06-21 12:01:59,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=380585.3333333333, ans=0.0
2024-06-21 12:02:00,284 INFO [train.py:1028] (1/2) Epoch 21, batch 5250, loss[loss=0.1811, simple_loss=0.2267, pruned_loss=0.0678, over 13284.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.234, pruned_loss=0.06648, over 2572503.06 frames. ], batch size: 52, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:02:04,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=380585.3333333333, ans=0.025
2024-06-21 12:02:17,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=380603.6666666667, ans=0.125
2024-06-21 12:02:25,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380622.0, ans=0.1
2024-06-21 12:02:41,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=380658.6666666667, ans=0.125
2024-06-21 12:02:46,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=380658.6666666667, ans=0.2
2024-06-21 12:02:47,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=380658.6666666667, ans=0.0
2024-06-21 12:02:51,739 INFO [train.py:1028] (1/2) Epoch 21, batch 5300, loss[loss=0.1728, simple_loss=0.2186, pruned_loss=0.06352, over 13062.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2331, pruned_loss=0.06597, over 2569281.62 frames. ], batch size: 144, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:03:05,830 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 1.999e+02 2.124e+02 2.266e+02 3.967e+02, threshold=4.247e+02, percent-clipped=0.0
2024-06-21 12:03:16,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.11 vs. limit=15.0
2024-06-21 12:03:16,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=380713.6666666667, ans=0.125
2024-06-21 12:03:17,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380713.6666666667, ans=0.1
2024-06-21 12:03:28,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=380713.6666666667, ans=0.0
2024-06-21 12:03:42,556 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5
2024-06-21 12:03:55,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=380750.3333333333, ans=0.125
2024-06-21 12:03:59,427 INFO [train.py:1028] (1/2) Epoch 21, batch 5350, loss[loss=0.1937, simple_loss=0.249, pruned_loss=0.06924, over 12118.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2335, pruned_loss=0.06608, over 2575805.20 frames. ], batch size: 17, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:04:44,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.34 vs. limit=15.0
2024-06-21 12:04:47,463 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 12:04:50,368 INFO [train.py:1028] (1/2) Epoch 21, batch 5400, loss[loss=0.1976, simple_loss=0.2405, pruned_loss=0.07731, over 12218.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2337, pruned_loss=0.06662, over 2568589.50 frames. ], batch size: 240, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:05:05,208 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.022e+02 2.138e+02 2.285e+02 3.160e+02, threshold=4.276e+02, percent-clipped=0.0
2024-06-21 12:05:08,122 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.91 vs. limit=6.0
2024-06-21 12:05:12,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.50 vs. limit=12.0
2024-06-21 12:05:17,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=380897.0, ans=0.125
2024-06-21 12:05:24,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=380915.3333333333, ans=0.125
2024-06-21 12:05:32,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=380933.6666666667, ans=0.0
2024-06-21 12:05:41,668 INFO [train.py:1028] (1/2) Epoch 21, batch 5450, loss[loss=0.184, simple_loss=0.2308, pruned_loss=0.06863, over 12756.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2338, pruned_loss=0.06647, over 2572763.06 frames. ], batch size: 26, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:05:48,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380952.0, ans=0.1
2024-06-21 12:06:01,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.33 vs. limit=22.5
2024-06-21 12:06:10,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=381007.0, ans=0.5
2024-06-21 12:06:13,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=381007.0, ans=0.125
2024-06-21 12:06:14,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=15.0
2024-06-21 12:06:15,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=381007.0, ans=0.125
2024-06-21 12:06:27,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.55 vs. limit=15.0
2024-06-21 12:06:35,777 INFO [train.py:1028] (1/2) Epoch 21, batch 5500, loss[loss=0.2175, simple_loss=0.2534, pruned_loss=0.09078, over 12240.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2336, pruned_loss=0.06633, over 2564663.39 frames. ], batch size: 241, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:06:49,361 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 1.947e+02 2.057e+02 2.209e+02 2.931e+02, threshold=4.113e+02, percent-clipped=0.0
2024-06-21 12:06:59,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=381080.3333333333, ans=0.2
2024-06-21 12:07:00,495 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.56 vs. limit=22.5
2024-06-21 12:07:05,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=381080.3333333333, ans=0.125
2024-06-21 12:07:10,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=381098.6666666667, ans=0.125
2024-06-21 12:07:12,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=381098.6666666667, ans=0.125
2024-06-21 12:07:13,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.90 vs. limit=22.5
2024-06-21 12:07:26,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=381117.0, ans=0.0
2024-06-21 12:07:27,947 INFO [train.py:1028] (1/2) Epoch 21, batch 5550, loss[loss=0.1666, simple_loss=0.2258, pruned_loss=0.05372, over 13289.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2327, pruned_loss=0.06569, over 2568482.61 frames. ], batch size: 43, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:07:31,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=381135.3333333333, ans=0.125
2024-06-21 12:07:45,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=381153.6666666667, ans=0.09899494936611666
2024-06-21 12:07:50,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=381172.0, ans=0.1
2024-06-21 12:08:03,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=381190.3333333333, ans=0.125
2024-06-21 12:08:06,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=381208.6666666667, ans=0.125
2024-06-21 12:08:14,087 INFO [train.py:1028] (1/2) Epoch 21, batch 5600, loss[loss=0.1681, simple_loss=0.2241, pruned_loss=0.05607, over 13260.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2323, pruned_loss=0.06535, over 2570041.43 frames. ], batch size: 89, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:08:27,059 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.967e+02 2.061e+02 2.231e+02 2.900e+02, threshold=4.122e+02, percent-clipped=0.0
2024-06-21 12:08:29,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=381245.3333333333, ans=0.0
2024-06-21 12:08:38,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=381263.6666666667, ans=0.125
2024-06-21 12:08:40,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=381263.6666666667, ans=0.0
2024-06-21 12:08:44,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=381282.0, ans=0.125
2024-06-21 12:08:45,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=381282.0, ans=0.125
2024-06-21 12:08:51,802 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 12:08:56,529 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0
2024-06-21 12:09:04,951 INFO [train.py:1028] (1/2) Epoch 21, batch 5650, loss[loss=0.2004, simple_loss=0.2475, pruned_loss=0.07668, over 12589.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.233, pruned_loss=0.06572, over 2574749.20 frames. ], batch size: 202, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:09:23,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=381337.0, ans=0.0
2024-06-21 12:10:18,625 INFO [train.py:1028] (1/2) Epoch 21, batch 5700, loss[loss=0.174, simple_loss=0.2245, pruned_loss=0.06177, over 13257.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2325, pruned_loss=0.06558, over 2579314.57 frames. ], batch size: 63, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:10:33,069 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 1.980e+02 2.122e+02 2.311e+02 3.396e+02, threshold=4.244e+02, percent-clipped=0.0
2024-06-21 12:10:41,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=381447.0, ans=0.125
2024-06-21 12:10:47,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=381447.0, ans=0.125
2024-06-21 12:10:50,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=381465.3333333333, ans=0.0
2024-06-21 12:11:02,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=381483.6666666667, ans=0.0
2024-06-21 12:11:06,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=381483.6666666667, ans=0.125
2024-06-21 12:11:11,722 INFO [train.py:1028] (1/2) Epoch 21, batch 5750, loss[loss=0.2047, simple_loss=0.2567, pruned_loss=0.07639, over 12740.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.234, pruned_loss=0.06622, over 2579741.78 frames. ], batch size: 176, lr: 2.76e-03, grad_scale: 64.0
], batch size: 176, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:11:38,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=381538.6666666667, ans=0.0 2024-06-21 12:11:56,416 INFO [train.py:1028] (1/2) Epoch 21, batch 5800, loss[loss=0.1798, simple_loss=0.232, pruned_loss=0.0638, over 12709.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2353, pruned_loss=0.06689, over 2578890.32 frames. ], batch size: 176, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:11:57,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=381593.6666666667, ans=0.125 2024-06-21 12:12:02,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=381593.6666666667, ans=0.125 2024-06-21 12:12:02,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.45 vs. limit=22.5 2024-06-21 12:12:07,139 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.004e+02 2.134e+02 2.278e+02 2.953e+02, threshold=4.268e+02, percent-clipped=0.0 2024-06-21 12:12:08,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=381612.0, ans=0.2 2024-06-21 12:12:19,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.01 vs. limit=15.0 2024-06-21 12:13:01,809 INFO [train.py:1028] (1/2) Epoch 21, batch 5850, loss[loss=0.1902, simple_loss=0.239, pruned_loss=0.07065, over 12609.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2367, pruned_loss=0.06725, over 2576942.50 frames. ], batch size: 202, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:13:12,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=381703.6666666667, ans=0.125 2024-06-21 12:13:29,922 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:13:38,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=381740.3333333333, ans=0.0 2024-06-21 12:13:47,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=381758.6666666667, ans=0.05 2024-06-21 12:13:51,649 INFO [train.py:1028] (1/2) Epoch 21, batch 5900, loss[loss=0.1965, simple_loss=0.2418, pruned_loss=0.07557, over 13083.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2385, pruned_loss=0.06779, over 2576204.22 frames. ], batch size: 121, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:14:00,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=381795.3333333333, ans=0.0 2024-06-21 12:14:03,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2024-06-21 12:14:03,597 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.032e+02 2.211e+02 2.378e+02 3.858e+02, threshold=4.421e+02, percent-clipped=0.0 2024-06-21 12:14:08,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.14 vs. 
limit=15.0 2024-06-21 12:14:11,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=381813.6666666667, ans=0.0 2024-06-21 12:14:12,115 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:14:27,213 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.25 vs. limit=15.0 2024-06-21 12:14:34,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=381850.3333333333, ans=0.1 2024-06-21 12:14:41,248 INFO [train.py:1028] (1/2) Epoch 21, batch 5950, loss[loss=0.1719, simple_loss=0.2211, pruned_loss=0.06134, over 13116.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2398, pruned_loss=0.06836, over 2581700.20 frames. ], batch size: 121, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:14:59,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=381905.3333333333, ans=0.125 2024-06-21 12:15:00,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=381905.3333333333, ans=0.95 2024-06-21 12:15:02,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.85 vs. limit=15.0 2024-06-21 12:15:04,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=381905.3333333333, ans=15.0 2024-06-21 12:15:14,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=381923.6666666667, ans=0.1 2024-06-21 12:15:15,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.35 vs. limit=10.0 2024-06-21 12:15:31,311 INFO [train.py:1028] (1/2) Epoch 21, batch 6000, loss[loss=0.2318, simple_loss=0.2731, pruned_loss=0.09526, over 12215.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2413, pruned_loss=0.06927, over 2574384.38 frames. ], batch size: 241, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:15:31,313 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 12:15:50,221 INFO [train.py:1060] (1/2) Epoch 21, validation: loss=0.1876, simple_loss=0.2514, pruned_loss=0.06191, over 351949.00 frames. 2024-06-21 12:15:50,222 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 12:15:51,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.88 vs. 
limit=15.0 2024-06-21 12:15:55,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=381960.3333333333, ans=0.125 2024-06-21 12:16:02,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=381978.6666666667, ans=15.0 2024-06-21 12:16:05,711 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.107e+02 2.296e+02 2.494e+02 3.406e+02, threshold=4.593e+02, percent-clipped=0.0 2024-06-21 12:16:27,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=382015.3333333333, ans=0.125 2024-06-21 12:16:35,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=382033.6666666667, ans=0.125 2024-06-21 12:16:38,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=382033.6666666667, ans=0.125 2024-06-21 12:16:42,755 INFO [train.py:1028] (1/2) Epoch 21, batch 6050, loss[loss=0.1862, simple_loss=0.2415, pruned_loss=0.0655, over 13368.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2427, pruned_loss=0.0696, over 2578579.75 frames. ], batch size: 40, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:16:43,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.84 vs. limit=15.0 2024-06-21 12:16:47,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=382052.0, ans=0.025 2024-06-21 12:17:17,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=382107.0, ans=0.125 2024-06-21 12:17:30,735 INFO [train.py:1028] (1/2) Epoch 21, batch 6100, loss[loss=0.1917, simple_loss=0.2401, pruned_loss=0.07166, over 13098.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2439, pruned_loss=0.06997, over 2580114.24 frames. ], batch size: 121, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:17:45,188 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.085e+02 2.221e+02 2.429e+02 3.595e+02, threshold=4.443e+02, percent-clipped=0.0 2024-06-21 12:18:21,959 INFO [train.py:1028] (1/2) Epoch 21, batch 6150, loss[loss=0.2094, simple_loss=0.2537, pruned_loss=0.08253, over 10858.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2452, pruned_loss=0.0706, over 2577315.40 frames. ], batch size: 303, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:18:51,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=382272.0, ans=0.125 2024-06-21 12:18:56,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=382290.3333333333, ans=0.125 2024-06-21 12:18:58,590 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.04 vs. 
limit=22.5 2024-06-21 12:19:01,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=382308.6666666667, ans=0.125 2024-06-21 12:19:07,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=382308.6666666667, ans=0.0 2024-06-21 12:19:09,098 INFO [train.py:1028] (1/2) Epoch 21, batch 6200, loss[loss=0.2226, simple_loss=0.2785, pruned_loss=0.08336, over 13259.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2465, pruned_loss=0.07098, over 2575861.50 frames. ], batch size: 89, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:19:14,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=382327.0, ans=0.0 2024-06-21 12:19:19,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.77 vs. limit=15.0 2024-06-21 12:19:21,977 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.153e+02 2.390e+02 2.627e+02 3.668e+02, threshold=4.780e+02, percent-clipped=0.0 2024-06-21 12:20:00,994 INFO [train.py:1028] (1/2) Epoch 21, batch 6250, loss[loss=0.1941, simple_loss=0.2434, pruned_loss=0.0724, over 13236.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2471, pruned_loss=0.07124, over 2568650.17 frames. ], batch size: 83, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:20:09,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=382418.6666666667, ans=0.2 2024-06-21 12:20:23,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2024-06-21 12:20:26,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.06 vs. limit=15.0 2024-06-21 12:20:27,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=382455.3333333333, ans=0.1 2024-06-21 12:20:39,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=382473.6666666667, ans=0.125 2024-06-21 12:20:47,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=382492.0, ans=0.0 2024-06-21 12:20:52,363 INFO [train.py:1028] (1/2) Epoch 21, batch 6300, loss[loss=0.199, simple_loss=0.2542, pruned_loss=0.07194, over 11502.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2488, pruned_loss=0.0719, over 2563529.75 frames. ], batch size: 16, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:21:06,484 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.108e+02 2.236e+02 2.563e+02 3.165e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-21 12:21:11,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=382547.0, ans=0.125 2024-06-21 12:21:12,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=382547.0, ans=0.0 2024-06-21 12:21:17,305 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.03 vs. 
limit=22.5 2024-06-21 12:21:41,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.72 vs. limit=10.0 2024-06-21 12:22:00,104 INFO [train.py:1028] (1/2) Epoch 21, batch 6350, loss[loss=0.22, simple_loss=0.2713, pruned_loss=0.08433, over 12562.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2504, pruned_loss=0.07225, over 2572242.65 frames. ], batch size: 202, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:22:18,294 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:22:21,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=382638.6666666667, ans=0.125 2024-06-21 12:22:25,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.05 vs. limit=15.0 2024-06-21 12:22:27,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=382638.6666666667, ans=0.1 2024-06-21 12:22:27,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2024-06-21 12:22:38,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=12.0 2024-06-21 12:22:48,780 INFO [train.py:1028] (1/2) Epoch 21, batch 6400, loss[loss=0.1862, simple_loss=0.2482, pruned_loss=0.06214, over 13164.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2529, pruned_loss=0.07325, over 2573674.66 frames. ], batch size: 67, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:23:03,414 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.216e+02 2.378e+02 2.690e+02 3.374e+02, threshold=4.755e+02, percent-clipped=0.0 2024-06-21 12:23:17,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=382730.3333333333, ans=0.125 2024-06-21 12:23:19,612 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.73 vs. limit=12.0 2024-06-21 12:23:31,417 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.73 vs. limit=15.0 2024-06-21 12:23:31,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=382767.0, ans=0.0 2024-06-21 12:23:36,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0 2024-06-21 12:23:39,509 INFO [train.py:1028] (1/2) Epoch 21, batch 6450, loss[loss=0.2367, simple_loss=0.2832, pruned_loss=0.09513, over 12507.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2538, pruned_loss=0.0735, over 2579920.93 frames. ], batch size: 202, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:23:52,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.34 vs. 
limit=15.0 2024-06-21 12:24:09,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=382840.3333333333, ans=0.95 2024-06-21 12:24:16,910 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:24:30,927 INFO [train.py:1028] (1/2) Epoch 21, batch 6500, loss[loss=0.2417, simple_loss=0.282, pruned_loss=0.1007, over 10750.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2561, pruned_loss=0.07388, over 2582819.48 frames. ], batch size: 303, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:24:42,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=382895.3333333333, ans=0.125 2024-06-21 12:24:44,405 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.148e+02 2.299e+02 2.526e+02 3.259e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-21 12:24:52,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=382913.6666666667, ans=0.125 2024-06-21 12:24:53,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=382913.6666666667, ans=0.0 2024-06-21 12:24:57,449 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:24:59,149 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.22 vs. limit=15.0 2024-06-21 12:25:15,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.71 vs. limit=10.0 2024-06-21 12:25:31,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=382950.3333333333, ans=0.05 2024-06-21 12:25:37,008 INFO [train.py:1028] (1/2) Epoch 21, batch 6550, loss[loss=0.1862, simple_loss=0.2481, pruned_loss=0.06217, over 12365.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2567, pruned_loss=0.07389, over 2587440.25 frames. ], batch size: 22, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:25:39,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=382968.6666666667, ans=0.2 2024-06-21 12:25:48,065 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.37 vs. 
limit=10.0 2024-06-21 12:25:53,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=382987.0, ans=0.0 2024-06-21 12:25:55,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=382987.0, ans=0.0 2024-06-21 12:25:56,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=383005.3333333333, ans=0.125 2024-06-21 12:26:15,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=383042.0, ans=0.025 2024-06-21 12:26:15,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=383042.0, ans=0.125 2024-06-21 12:26:27,445 INFO [train.py:1028] (1/2) Epoch 21, batch 6600, loss[loss=0.2014, simple_loss=0.2636, pruned_loss=0.06961, over 13065.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2567, pruned_loss=0.07377, over 2590664.11 frames. ], batch size: 71, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:26:39,443 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.101e+02 2.248e+02 2.521e+02 3.532e+02, threshold=4.497e+02, percent-clipped=0.0 2024-06-21 12:26:43,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=383078.6666666667, ans=0.0 2024-06-21 12:26:47,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.27 vs. limit=15.0 2024-06-21 12:26:48,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=383097.0, ans=0.2 2024-06-21 12:26:49,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=383097.0, ans=0.0 2024-06-21 12:27:06,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=383133.6666666667, ans=0.2 2024-06-21 12:27:09,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=383133.6666666667, ans=0.125 2024-06-21 12:27:11,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=383133.6666666667, ans=0.125 2024-06-21 12:27:12,939 INFO [train.py:1028] (1/2) Epoch 21, batch 6650, loss[loss=0.2285, simple_loss=0.2743, pruned_loss=0.09136, over 12961.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2581, pruned_loss=0.07432, over 2584687.82 frames. ], batch size: 158, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:27:20,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.32 vs. 
limit=15.0 2024-06-21 12:27:25,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383170.3333333333, ans=0.1 2024-06-21 12:27:33,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=383188.6666666667, ans=0.125 2024-06-21 12:28:00,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=383225.3333333333, ans=0.0 2024-06-21 12:28:01,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=383225.3333333333, ans=0.0 2024-06-21 12:28:03,525 INFO [train.py:1028] (1/2) Epoch 21, batch 6700, loss[loss=0.2181, simple_loss=0.265, pruned_loss=0.08558, over 12786.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2589, pruned_loss=0.07472, over 2583548.44 frames. ], batch size: 176, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:28:33,613 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.167e+02 2.292e+02 2.521e+02 3.060e+02, threshold=4.585e+02, percent-clipped=0.0 2024-06-21 12:28:39,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=383280.3333333333, ans=0.0 2024-06-21 12:28:43,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383280.3333333333, ans=0.1 2024-06-21 12:29:07,372 INFO [train.py:1028] (1/2) Epoch 21, batch 6750, loss[loss=0.2694, simple_loss=0.3086, pruned_loss=0.1151, over 12207.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2601, pruned_loss=0.07544, over 2576719.31 frames. ], batch size: 241, lr: 2.75e-03, grad_scale: 128.0 2024-06-21 12:29:10,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=383335.3333333333, ans=0.2 2024-06-21 12:29:10,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.15 vs. limit=15.0 2024-06-21 12:29:20,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=383353.6666666667, ans=0.125 2024-06-21 12:29:20,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=383353.6666666667, ans=0.125 2024-06-21 12:29:42,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=383390.3333333333, ans=0.125 2024-06-21 12:29:42,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=383390.3333333333, ans=0.125 2024-06-21 12:29:55,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=383408.6666666667, ans=0.0 2024-06-21 12:29:59,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=383427.0, ans=0.2 2024-06-21 12:29:59,843 INFO [train.py:1028] (1/2) Epoch 21, batch 6800, loss[loss=0.1862, simple_loss=0.2474, pruned_loss=0.06254, over 13236.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2613, pruned_loss=0.07566, over 2578305.29 frames. 
], batch size: 67, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:30:10,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383445.3333333333, ans=0.1 2024-06-21 12:30:14,494 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.177e+02 2.338e+02 2.571e+02 3.506e+02, threshold=4.676e+02, percent-clipped=0.0 2024-06-21 12:30:45,461 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.74 vs. limit=15.0 2024-06-21 12:30:49,528 INFO [train.py:1028] (1/2) Epoch 21, batch 6850, loss[loss=0.2431, simple_loss=0.3077, pruned_loss=0.08928, over 13233.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2626, pruned_loss=0.07588, over 2583065.64 frames. ], batch size: 63, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:31:01,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=383537.0, ans=0.125 2024-06-21 12:31:04,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=383537.0, ans=0.125 2024-06-21 12:31:06,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=383537.0, ans=0.015 2024-06-21 12:31:07,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.78 vs. limit=22.5 2024-06-21 12:31:12,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=383555.3333333333, ans=10.0 2024-06-21 12:31:17,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=383555.3333333333, ans=0.125 2024-06-21 12:31:31,408 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.94 vs. limit=15.0 2024-06-21 12:31:40,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=383592.0, ans=0.0 2024-06-21 12:31:51,809 INFO [train.py:1028] (1/2) Epoch 21, batch 6900, loss[loss=0.1979, simple_loss=0.2571, pruned_loss=0.06933, over 13111.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2635, pruned_loss=0.07617, over 2585060.89 frames. ], batch size: 48, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:31:52,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=383610.3333333333, ans=0.125 2024-06-21 12:31:53,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=383610.3333333333, ans=0.125 2024-06-21 12:32:03,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2024-06-21 12:32:06,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. 
limit=15.0 2024-06-21 12:32:07,544 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.153e+02 2.349e+02 2.609e+02 3.576e+02, threshold=4.698e+02, percent-clipped=0.0 2024-06-21 12:32:16,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=383647.0, ans=0.0 2024-06-21 12:32:17,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=383647.0, ans=0.2 2024-06-21 12:32:30,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=383683.6666666667, ans=0.125 2024-06-21 12:32:35,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=383683.6666666667, ans=0.07 2024-06-21 12:32:38,850 INFO [train.py:1028] (1/2) Epoch 21, batch 6950, loss[loss=0.2164, simple_loss=0.268, pruned_loss=0.08237, over 12032.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2631, pruned_loss=0.07598, over 2579797.96 frames. ], batch size: 17, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:32:48,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=383720.3333333333, ans=0.125 2024-06-21 12:33:00,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=383738.6666666667, ans=0.0 2024-06-21 12:33:12,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=383757.0, ans=0.125 2024-06-21 12:33:29,379 INFO [train.py:1028] (1/2) Epoch 21, batch 7000, loss[loss=0.2243, simple_loss=0.2732, pruned_loss=0.08771, over 12936.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2634, pruned_loss=0.07601, over 2576173.06 frames. ], batch size: 158, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:33:31,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=383793.6666666667, ans=0.125 2024-06-21 12:33:32,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.70 vs. limit=15.0 2024-06-21 12:33:42,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=383812.0, ans=0.125 2024-06-21 12:33:45,606 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.184e+02 2.336e+02 2.566e+02 3.383e+02, threshold=4.673e+02, percent-clipped=0.0 2024-06-21 12:34:17,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=383867.0, ans=0.1 2024-06-21 12:34:29,919 INFO [train.py:1028] (1/2) Epoch 21, batch 7050, loss[loss=0.2361, simple_loss=0.2875, pruned_loss=0.09235, over 12732.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2645, pruned_loss=0.07618, over 2583223.80 frames. 
], batch size: 176, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:34:49,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=383903.6666666667, ans=0.0 2024-06-21 12:34:51,482 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:34:55,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2024-06-21 12:35:02,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=12.0 2024-06-21 12:35:24,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2024-06-21 12:35:24,426 INFO [train.py:1028] (1/2) Epoch 21, batch 7100, loss[loss=0.2216, simple_loss=0.2828, pruned_loss=0.08023, over 13195.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2651, pruned_loss=0.07663, over 2575142.18 frames. ], batch size: 112, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:35:25,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=383977.0, ans=0.09899494936611666 2024-06-21 12:35:39,984 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.185e+02 2.359e+02 2.605e+02 3.350e+02, threshold=4.718e+02, percent-clipped=0.0 2024-06-21 12:35:41,387 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.32 vs. limit=6.0 2024-06-21 12:35:43,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=383995.3333333333, ans=0.2 2024-06-21 12:35:43,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384013.6666666667, ans=0.1 2024-06-21 12:36:16,001 INFO [train.py:1028] (1/2) Epoch 21, batch 7150, loss[loss=0.2438, simple_loss=0.2966, pruned_loss=0.09545, over 12504.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2656, pruned_loss=0.07684, over 2572512.82 frames. ], batch size: 202, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:36:44,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=384123.6666666667, ans=0.125 2024-06-21 12:36:45,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=384123.6666666667, ans=0.0 2024-06-21 12:37:00,102 INFO [train.py:1028] (1/2) Epoch 21, batch 7200, loss[loss=0.2281, simple_loss=0.2821, pruned_loss=0.08702, over 13186.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.267, pruned_loss=0.0774, over 2577407.66 frames. 
], batch size: 112, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:37:08,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=384160.3333333333, ans=22.5 2024-06-21 12:37:16,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=384178.6666666667, ans=0.125 2024-06-21 12:37:17,122 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.152e+02 2.281e+02 2.462e+02 3.178e+02, threshold=4.562e+02, percent-clipped=0.0 2024-06-21 12:37:18,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=384178.6666666667, ans=0.2 2024-06-21 12:37:22,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=384197.0, ans=0.05 2024-06-21 12:37:23,490 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.98 vs. limit=15.0 2024-06-21 12:37:31,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=384197.0, ans=0.125 2024-06-21 12:37:39,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.57 vs. limit=15.0 2024-06-21 12:37:57,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=384233.6666666667, ans=0.05 2024-06-21 12:38:02,405 INFO [train.py:1028] (1/2) Epoch 21, batch 7250, loss[loss=0.1907, simple_loss=0.2534, pruned_loss=0.06397, over 12929.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2676, pruned_loss=0.07723, over 2578650.10 frames. ], batch size: 36, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:38:03,059 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0 2024-06-21 12:38:15,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=384252.0, ans=0.0 2024-06-21 12:38:25,064 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.31 vs. limit=22.5 2024-06-21 12:38:53,109 INFO [train.py:1028] (1/2) Epoch 21, batch 7300, loss[loss=0.199, simple_loss=0.2576, pruned_loss=0.07017, over 12923.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2691, pruned_loss=0.07803, over 2579090.53 frames. 
], batch size: 36, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:39:01,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=384343.6666666667, ans=0.125 2024-06-21 12:39:02,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=384343.6666666667, ans=0.035 2024-06-21 12:39:09,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=384362.0, ans=0.125 2024-06-21 12:39:10,095 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.160e+02 2.331e+02 2.479e+02 3.218e+02, threshold=4.662e+02, percent-clipped=0.0 2024-06-21 12:39:20,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=384380.3333333333, ans=0.125 2024-06-21 12:39:47,981 INFO [train.py:1028] (1/2) Epoch 21, batch 7350, loss[loss=0.2352, simple_loss=0.2893, pruned_loss=0.09057, over 13289.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2697, pruned_loss=0.07806, over 2579859.88 frames. ], batch size: 46, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:39:57,465 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.928e-02 2024-06-21 12:40:00,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=384453.6666666667, ans=0.125 2024-06-21 12:40:03,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384453.6666666667, ans=0.1 2024-06-21 12:40:14,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=384472.0, ans=0.1 2024-06-21 12:40:18,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=384490.3333333333, ans=0.125 2024-06-21 12:40:19,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.39 vs. limit=22.5 2024-06-21 12:40:21,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=384490.3333333333, ans=0.125 2024-06-21 12:40:25,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.79 vs. limit=15.0 2024-06-21 12:40:31,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=384508.6666666667, ans=0.07 2024-06-21 12:40:37,642 INFO [train.py:1028] (1/2) Epoch 21, batch 7400, loss[loss=0.237, simple_loss=0.3124, pruned_loss=0.08076, over 13269.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2698, pruned_loss=0.07805, over 2585920.58 frames. 
], batch size: 63, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:40:42,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384527.0, ans=0.1 2024-06-21 12:40:53,494 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.156e+02 2.322e+02 2.547e+02 3.464e+02, threshold=4.644e+02, percent-clipped=0.0 2024-06-21 12:41:21,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=384582.0, ans=0.125 2024-06-21 12:41:27,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=384582.0, ans=0.125 2024-06-21 12:41:28,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=384582.0, ans=0.0 2024-06-21 12:41:38,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=384600.3333333333, ans=0.125 2024-06-21 12:41:45,925 INFO [train.py:1028] (1/2) Epoch 21, batch 7450, loss[loss=0.1914, simple_loss=0.2439, pruned_loss=0.06946, over 12475.00 frames. ], tot_loss[loss=0.213, simple_loss=0.27, pruned_loss=0.07796, over 2581138.35 frames. ], batch size: 29, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:42:06,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=384655.3333333333, ans=0.0 2024-06-21 12:42:18,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=384673.6666666667, ans=0.2 2024-06-21 12:42:24,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=384673.6666666667, ans=0.125 2024-06-21 12:42:36,894 INFO [train.py:1028] (1/2) Epoch 21, batch 7500, loss[loss=0.2185, simple_loss=0.261, pruned_loss=0.08799, over 10412.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2707, pruned_loss=0.07837, over 2578118.88 frames. ], batch size: 303, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:42:41,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.66 vs. limit=22.5 2024-06-21 12:42:49,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=384728.6666666667, ans=0.0 2024-06-21 12:42:52,473 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.176e+02 2.354e+02 2.539e+02 3.656e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 12:42:55,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=384728.6666666667, ans=0.125 2024-06-21 12:43:07,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.65 vs. limit=10.0 2024-06-21 12:43:27,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=384802.0, ans=0.2 2024-06-21 12:43:28,124 INFO [train.py:1028] (1/2) Epoch 21, batch 7550, loss[loss=0.2068, simple_loss=0.2638, pruned_loss=0.07493, over 12910.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2711, pruned_loss=0.07856, over 2578692.18 frames. 
], batch size: 158, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:43:31,883 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.15 vs. limit=15.0 2024-06-21 12:43:33,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=384802.0, ans=0.0 2024-06-21 12:43:40,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=384820.3333333333, ans=0.0 2024-06-21 12:43:42,959 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.09 vs. limit=22.5 2024-06-21 12:43:43,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.39 vs. limit=15.0 2024-06-21 12:43:45,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=384838.6666666667, ans=10.0 2024-06-21 12:43:51,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=384838.6666666667, ans=0.125 2024-06-21 12:44:06,027 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.39 vs. limit=15.0 2024-06-21 12:44:33,341 INFO [train.py:1028] (1/2) Epoch 21, batch 7600, loss[loss=0.2154, simple_loss=0.2742, pruned_loss=0.0783, over 13200.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2719, pruned_loss=0.0792, over 2576409.36 frames. ], batch size: 83, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:44:50,038 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.172e+02 2.401e+02 2.621e+02 3.887e+02, threshold=4.802e+02, percent-clipped=0.0 2024-06-21 12:45:16,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=384967.0, ans=0.1 2024-06-21 12:45:21,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.32 vs. limit=15.0 2024-06-21 12:45:28,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.30 vs. limit=15.0 2024-06-21 12:45:28,415 INFO [train.py:1028] (1/2) Epoch 21, batch 7650, loss[loss=0.2078, simple_loss=0.2705, pruned_loss=0.0725, over 12861.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2717, pruned_loss=0.0791, over 2571989.31 frames. ], batch size: 33, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:45:40,325 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=15.0 2024-06-21 12:45:40,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=385003.6666666667, ans=0.2 2024-06-21 12:45:42,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=385003.6666666667, ans=0.04949747468305833 2024-06-21 12:45:44,383 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.10 vs. 
limit=15.0 2024-06-21 12:45:50,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=385022.0, ans=0.125 2024-06-21 12:46:14,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=385058.6666666667, ans=0.125 2024-06-21 12:46:20,867 INFO [train.py:1028] (1/2) Epoch 21, batch 7700, loss[loss=0.2169, simple_loss=0.2773, pruned_loss=0.07821, over 13255.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2725, pruned_loss=0.07942, over 2568605.26 frames. ], batch size: 63, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:46:36,742 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.250e+02 2.473e+02 2.802e+02 4.511e+02, threshold=4.946e+02, percent-clipped=0.0 2024-06-21 12:46:43,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.88 vs. limit=22.5 2024-06-21 12:46:44,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=385113.6666666667, ans=15.0 2024-06-21 12:47:16,334 INFO [train.py:1028] (1/2) Epoch 21, batch 7750, loss[loss=0.2138, simple_loss=0.2748, pruned_loss=0.07638, over 13190.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2726, pruned_loss=0.07963, over 2572987.57 frames. ], batch size: 72, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:47:21,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=385168.6666666667, ans=0.2 2024-06-21 12:47:38,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=385187.0, ans=0.0 2024-06-21 12:47:44,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.47 vs. limit=12.0 2024-06-21 12:48:17,870 INFO [train.py:1028] (1/2) Epoch 21, batch 7800, loss[loss=0.2218, simple_loss=0.2663, pruned_loss=0.08872, over 13174.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2734, pruned_loss=0.0801, over 2578498.08 frames. ], batch size: 95, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:48:22,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=385260.3333333333, ans=0.125 2024-06-21 12:48:26,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=385260.3333333333, ans=0.0 2024-06-21 12:48:31,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=385278.6666666667, ans=0.125 2024-06-21 12:48:33,730 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.219e+02 2.381e+02 2.638e+02 3.851e+02, threshold=4.762e+02, percent-clipped=0.0 2024-06-21 12:48:47,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=385297.0, ans=0.0 2024-06-21 12:48:49,755 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=15.0 2024-06-21 12:48:55,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.06 vs. 
limit=15.0 2024-06-21 12:48:58,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.31 vs. limit=22.5 2024-06-21 12:49:08,363 INFO [train.py:1028] (1/2) Epoch 21, batch 7850, loss[loss=0.1829, simple_loss=0.2365, pruned_loss=0.06467, over 10918.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2742, pruned_loss=0.08063, over 2572068.22 frames. ], batch size: 16, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:49:14,655 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:49:23,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=385370.3333333333, ans=0.0 2024-06-21 12:49:26,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=385370.3333333333, ans=0.125 2024-06-21 12:49:29,891 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.80 vs. limit=15.0 2024-06-21 12:49:33,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=385388.6666666667, ans=0.125 2024-06-21 12:49:34,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2024-06-21 12:49:43,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=385407.0, ans=0.125 2024-06-21 12:49:49,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=385425.3333333333, ans=0.125 2024-06-21 12:49:58,059 INFO [train.py:1028] (1/2) Epoch 21, batch 7900, loss[loss=0.1815, simple_loss=0.2461, pruned_loss=0.05844, over 13131.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2745, pruned_loss=0.08081, over 2571918.67 frames. ], batch size: 77, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:50:16,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=385462.0, ans=0.0 2024-06-21 12:50:16,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=385462.0, ans=0.0 2024-06-21 12:50:17,943 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.264e+02 2.434e+02 2.713e+02 4.080e+02, threshold=4.868e+02, percent-clipped=0.0 2024-06-21 12:50:24,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=385480.3333333333, ans=0.2 2024-06-21 12:50:39,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=385480.3333333333, ans=0.025 2024-06-21 12:50:50,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=385498.6666666667, ans=0.1 2024-06-21 12:51:01,157 INFO [train.py:1028] (1/2) Epoch 21, batch 7950, loss[loss=0.2246, simple_loss=0.2692, pruned_loss=0.08995, over 10425.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2745, pruned_loss=0.0806, over 2575654.19 frames. 
], batch size: 303, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:51:04,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=385535.3333333333, ans=0.1 2024-06-21 12:51:11,079 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2024-06-21 12:51:14,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.20 vs. limit=15.0 2024-06-21 12:51:28,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=385572.0, ans=0.0 2024-06-21 12:51:46,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=385608.6666666667, ans=0.1 2024-06-21 12:51:48,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=385608.6666666667, ans=0.125 2024-06-21 12:51:51,248 INFO [train.py:1028] (1/2) Epoch 21, batch 8000, loss[loss=0.2, simple_loss=0.2584, pruned_loss=0.07087, over 12654.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2744, pruned_loss=0.08041, over 2571930.89 frames. ], batch size: 29, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:51:55,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=385627.0, ans=0.125 2024-06-21 12:52:02,746 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.214e+02 2.447e+02 2.667e+02 3.806e+02, threshold=4.894e+02, percent-clipped=0.0 2024-06-21 12:52:08,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=385663.6666666667, ans=0.0 2024-06-21 12:52:08,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=385663.6666666667, ans=0.1 2024-06-21 12:52:15,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=385663.6666666667, ans=0.0 2024-06-21 12:52:36,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=385700.3333333333, ans=0.0 2024-06-21 12:52:41,628 INFO [train.py:1028] (1/2) Epoch 21, batch 8050, loss[loss=0.2123, simple_loss=0.2715, pruned_loss=0.07655, over 13218.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2737, pruned_loss=0.08, over 2572208.04 frames. ], batch size: 83, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:52:53,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=385737.0, ans=0.1 2024-06-21 12:52:58,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=385737.0, ans=0.0 2024-06-21 12:53:17,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=385755.3333333333, ans=0.125 2024-06-21 12:53:20,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.90 vs. 
limit=6.0 2024-06-21 12:53:30,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=385792.0, ans=0.2 2024-06-21 12:53:40,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=385792.0, ans=0.0 2024-06-21 12:53:44,310 INFO [train.py:1028] (1/2) Epoch 21, batch 8100, loss[loss=0.2202, simple_loss=0.2742, pruned_loss=0.08308, over 13208.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2747, pruned_loss=0.08031, over 2576179.10 frames. ], batch size: 112, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:53:57,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=385828.6666666667, ans=0.015 2024-06-21 12:53:59,827 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.198e+02 2.340e+02 2.543e+02 3.469e+02, threshold=4.680e+02, percent-clipped=0.0 2024-06-21 12:54:02,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=385828.6666666667, ans=0.1 2024-06-21 12:54:39,462 INFO [train.py:1028] (1/2) Epoch 21, batch 8150, loss[loss=0.1965, simple_loss=0.2491, pruned_loss=0.07191, over 13090.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2747, pruned_loss=0.0799, over 2579646.24 frames. ], batch size: 121, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:54:45,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=385902.0, ans=0.2 2024-06-21 12:54:56,553 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.01 vs. limit=15.0 2024-06-21 12:55:07,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=385938.6666666667, ans=0.125 2024-06-21 12:55:08,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2024-06-21 12:55:15,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=385957.0, ans=0.125 2024-06-21 12:55:30,685 INFO [train.py:1028] (1/2) Epoch 21, batch 8200, loss[loss=0.2187, simple_loss=0.2749, pruned_loss=0.08127, over 13187.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2753, pruned_loss=0.0799, over 2582914.14 frames. ], batch size: 112, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:55:45,996 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.203e+02 2.319e+02 2.485e+02 3.518e+02, threshold=4.638e+02, percent-clipped=0.0 2024-06-21 12:56:04,865 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=6.0 2024-06-21 12:56:28,008 INFO [train.py:1028] (1/2) Epoch 21, batch 8250, loss[loss=0.2289, simple_loss=0.2932, pruned_loss=0.08233, over 13252.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2756, pruned_loss=0.07985, over 2582664.76 frames. 
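
The "[scaling.py:1023] Whitening: ... metric=M vs. limit=L" entries monitor how far a module's output is from having a white (isotropic) covariance; the limit itself is scheduled, which is why 6.0, 12.0, 15.0 and 22.5 all appear. The statistic below is one plausible form of such a metric, equal to 1.0 for perfectly white features and growing as energy concentrates in fewer directions; the exact computation in scaling.py, and the corrective gradient presumably applied when the limit is exceeded, are assumptions here.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels). For each channel group, measure the
        # anisotropy of the covariance C via  n * trace(C @ C) / trace(C)**2,
        # which is 1.0 when C is a multiple of the identity ("white") and
        # approaches n when all the energy lies in a single direction.
        num_frames, num_channels = x.shape
        n = num_channels // num_groups
        worst = 0.0
        for g in range(num_groups):
            xg = x[:, g * n:(g + 1) * n]
            xg = xg - xg.mean(dim=0)
            cov = (xg.T @ xg) / num_frames
            tr = torch.diagonal(cov).sum()
            tr_sq = (cov * cov).sum()   # equals trace(C @ C) for symmetric C
            worst = max(worst, (n * tr_sq / (tr * tr + 1e-20)).item())
        return worst

    x_white = torch.randn(1000, 384)                       # metric close to 1
    x_rank1 = torch.randn(1000, 1) * torch.randn(1, 384)   # metric close to 384
    print(whitening_metric(x_white), whitening_metric(x_rank1))

Under this reading the entries in this stretch (e.g. metric=5.62 vs. limit=6.0, metric=8.84 vs. limit=15.0) are within bounds; exceedances do occur elsewhere in the log.
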
], batch size: 52, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:56:30,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=386085.3333333333, ans=0.125 2024-06-21 12:57:02,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=386122.0, ans=0.125 2024-06-21 12:57:04,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.02 vs. limit=22.5 2024-06-21 12:57:05,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=386140.3333333333, ans=0.05 2024-06-21 12:57:06,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=386140.3333333333, ans=0.125 2024-06-21 12:57:10,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=15.0 2024-06-21 12:57:21,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=386158.6666666667, ans=0.09899494936611666 2024-06-21 12:57:24,046 INFO [train.py:1028] (1/2) Epoch 21, batch 8300, loss[loss=0.2237, simple_loss=0.2805, pruned_loss=0.08341, over 13077.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2749, pruned_loss=0.07963, over 2579438.04 frames. ], batch size: 102, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:57:27,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386177.0, ans=0.1 2024-06-21 12:57:39,073 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.195e+02 2.336e+02 2.498e+02 2.937e+02, threshold=4.671e+02, percent-clipped=0.0 2024-06-21 12:57:51,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=386213.6666666667, ans=0.0 2024-06-21 12:58:09,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.40 vs. limit=12.0 2024-06-21 12:58:13,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=386250.3333333333, ans=0.2 2024-06-21 12:58:14,973 INFO [train.py:1028] (1/2) Epoch 21, batch 8350, loss[loss=0.2371, simple_loss=0.293, pruned_loss=0.09057, over 13170.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2752, pruned_loss=0.07953, over 2579975.39 frames. 
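
The "[optim.py:487] Clipping_scale=2.0, grad-norm quartiles ..." warnings list five statistics of recently observed gradient norms (reading as min, 25%, median, 75%, max) plus a clipping threshold. In these entries the threshold is consistently twice the middle value, e.g. 2.319e+02 x 2.0 = 4.638e+02 just above, so the clipping rule appears to be threshold = clipping_scale x median; percent-clipped=0.0 means no recent batch exceeded it. A sketch of that scheme, with the history length and norm computation as assumptions:

    from collections import deque
    import torch

    class AdaptiveGradClipper:
        # Clips at clipping_scale * median of recent gradient norms, the rule
        # suggested by the logged quartiles and thresholds.
        def __init__(self, clipping_scale=2.0, history=128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=history)
            self.clipped = 0
            self.total = 0

        def clip_(self, parameters):
            params = [p for p in parameters if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            self.norms.append(norm.item())
            t = torch.tensor(list(self.norms))
            q = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * q[2].item()
            self.total += 1
            if norm.item() > threshold:
                self.clipped += 1
                for p in params:
                    p.grad.mul_(threshold / norm.item())
            print(f"grad-norm quartiles {q[0]:.3e} {q[1]:.3e} {q[2]:.3e} "
                  f"{q[3]:.3e} {q[4]:.3e}, threshold={threshold:.3e}, "
                  f"percent-clipped={100.0 * self.clipped / self.total:.1f}")
            return norm
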
], batch size: 112, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:58:20,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=386268.6666666667, ans=0.125 2024-06-21 12:58:28,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=386287.0, ans=0.125 2024-06-21 12:58:38,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=386305.3333333333, ans=0.035 2024-06-21 12:58:39,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=386323.6666666667, ans=0.125 2024-06-21 12:58:43,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=386323.6666666667, ans=0.125 2024-06-21 12:58:51,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=386342.0, ans=0.125 2024-06-21 12:58:54,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=386342.0, ans=0.125 2024-06-21 12:59:03,853 INFO [train.py:1028] (1/2) Epoch 21, batch 8400, loss[loss=0.1857, simple_loss=0.2427, pruned_loss=0.06434, over 12930.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2754, pruned_loss=0.07966, over 2575775.90 frames. ], batch size: 39, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 12:59:05,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.69 vs. limit=15.0 2024-06-21 12:59:05,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=386360.3333333333, ans=0.0 2024-06-21 12:59:11,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.68 vs. limit=12.0 2024-06-21 12:59:17,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=386378.6666666667, ans=0.125 2024-06-21 12:59:20,293 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.196e+02 2.335e+02 2.513e+02 3.037e+02, threshold=4.669e+02, percent-clipped=0.0 2024-06-21 12:59:21,209 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2024-06-21 12:59:26,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=386397.0, ans=0.07 2024-06-21 12:59:39,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=386397.0, ans=0.0 2024-06-21 13:00:06,755 INFO [train.py:1028] (1/2) Epoch 21, batch 8450, loss[loss=0.2229, simple_loss=0.2825, pruned_loss=0.08167, over 13157.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.276, pruned_loss=0.07975, over 2577401.63 frames. 
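
Note the grad_scale value in the batch summaries: it is 64.0 through batch 8350 and 32.0 from batch 8400 onward (further down it dips to 16.0 around batch 9150 and recovers to 32.0 by batch 9200). This halving-and-regrowing is the signature of fp16 dynamic loss scaling: the scaler backs off after a step whose gradients overflowed and grows again after a stretch of clean steps. A sketch using the standard torch.cuda.amp API; icefall's own scaler cadence need not match the torch defaults shown.

    import torch

    model = torch.nn.Linear(80, 500).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler(
        init_scale=64.0, backoff_factor=0.5,
        growth_factor=2.0, growth_interval=2000,
    )
    for step in range(4):
        x = torch.randn(16, 80, device="cuda")
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(x).pow(2).mean()
        opt.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(opt)   # skipped internally if the grads overflowed
        scaler.update()    # halves the scale after an overflow (64 -> 32)
        print(f"batch {step}, grad_scale: {scaler.get_scale()}")
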
], batch size: 112, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:00:21,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=386470.3333333333, ans=0.2 2024-06-21 13:00:22,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=386470.3333333333, ans=0.025 2024-06-21 13:00:32,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2024-06-21 13:00:39,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=386507.0, ans=0.125 2024-06-21 13:00:56,215 INFO [train.py:1028] (1/2) Epoch 21, batch 8500, loss[loss=0.2194, simple_loss=0.2768, pruned_loss=0.08101, over 12628.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.277, pruned_loss=0.0801, over 2577176.33 frames. ], batch size: 29, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:01:14,225 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.200e+02 2.357e+02 2.557e+02 3.307e+02, threshold=4.713e+02, percent-clipped=0.0 2024-06-21 13:01:14,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.69 vs. limit=12.0 2024-06-21 13:01:30,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=386598.6666666667, ans=0.125 2024-06-21 13:01:39,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=386598.6666666667, ans=0.125 2024-06-21 13:01:47,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=386617.0, ans=0.0 2024-06-21 13:01:52,023 INFO [train.py:1028] (1/2) Epoch 21, batch 8550, loss[loss=0.2227, simple_loss=0.283, pruned_loss=0.08125, over 12692.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2766, pruned_loss=0.07986, over 2575932.82 frames. ], batch size: 22, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:02:33,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=386708.6666666667, ans=0.125 2024-06-21 13:02:50,052 INFO [train.py:1028] (1/2) Epoch 21, batch 8600, loss[loss=0.2099, simple_loss=0.2633, pruned_loss=0.07825, over 13132.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.277, pruned_loss=0.07989, over 2573964.21 frames. ], batch size: 121, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:02:51,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386727.0, ans=0.1 2024-06-21 13:02:56,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=386727.0, ans=0.125 2024-06-21 13:03:01,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.32 vs. 
limit=15.0 2024-06-21 13:03:05,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=386745.3333333333, ans=0.07 2024-06-21 13:03:06,631 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.186e+02 2.374e+02 2.608e+02 3.640e+02, threshold=4.749e+02, percent-clipped=0.0 2024-06-21 13:03:08,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.68 vs. limit=10.0 2024-06-21 13:03:09,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386745.3333333333, ans=0.1 2024-06-21 13:03:09,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=386745.3333333333, ans=15.0 2024-06-21 13:03:10,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=386763.6666666667, ans=0.0 2024-06-21 13:03:10,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2024-06-21 13:03:12,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=386763.6666666667, ans=0.125 2024-06-21 13:03:22,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=386763.6666666667, ans=0.0 2024-06-21 13:03:41,223 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.94 vs. limit=15.0 2024-06-21 13:03:44,631 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.26 vs. limit=6.0 2024-06-21 13:03:50,444 INFO [train.py:1028] (1/2) Epoch 21, batch 8650, loss[loss=0.2278, simple_loss=0.2859, pruned_loss=0.08489, over 13022.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2777, pruned_loss=0.08001, over 2577128.05 frames. ], batch size: 102, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:03:52,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=386818.6666666667, ans=0.125 2024-06-21 13:03:55,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=386818.6666666667, ans=0.025 2024-06-21 13:03:55,934 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=8.09 vs. 
limit=8.0 2024-06-21 13:04:08,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=386837.0, ans=0.1 2024-06-21 13:04:25,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=386892.0, ans=0.2 2024-06-21 13:04:26,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=386892.0, ans=0.1 2024-06-21 13:04:28,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=386892.0, ans=0.125 2024-06-21 13:04:30,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=386892.0, ans=0.0 2024-06-21 13:04:31,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.97 vs. limit=15.0 2024-06-21 13:04:33,974 INFO [train.py:1028] (1/2) Epoch 21, batch 8700, loss[loss=0.2209, simple_loss=0.2882, pruned_loss=0.07677, over 13198.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2786, pruned_loss=0.08077, over 2573920.39 frames. ], batch size: 59, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:04:43,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=386928.6666666667, ans=0.125 2024-06-21 13:04:49,315 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.225e+02 2.459e+02 2.761e+02 4.390e+02, threshold=4.917e+02, percent-clipped=0.0 2024-06-21 13:04:59,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=386947.0, ans=0.0 2024-06-21 13:05:09,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2024-06-21 13:05:13,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2024-06-21 13:05:28,201 INFO [train.py:1028] (1/2) Epoch 21, batch 8750, loss[loss=0.2208, simple_loss=0.2742, pruned_loss=0.08373, over 13098.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2782, pruned_loss=0.08064, over 2569198.36 frames. ], batch size: 121, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:05:32,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=387002.0, ans=0.125 2024-06-21 13:06:06,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=387057.0, ans=0.125 2024-06-21 13:06:08,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. 
limit=15.0 2024-06-21 13:06:09,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=387057.0, ans=0.125 2024-06-21 13:06:09,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=387057.0, ans=0.125 2024-06-21 13:06:25,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=387075.3333333333, ans=0.2 2024-06-21 13:06:32,578 INFO [train.py:1028] (1/2) Epoch 21, batch 8800, loss[loss=0.2168, simple_loss=0.2771, pruned_loss=0.07828, over 13204.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2789, pruned_loss=0.08131, over 2574220.27 frames. ], batch size: 72, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:06:47,529 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.180e+02 2.301e+02 2.489e+02 3.222e+02, threshold=4.602e+02, percent-clipped=0.0 2024-06-21 13:06:58,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387148.6666666667, ans=0.1 2024-06-21 13:07:18,113 INFO [train.py:1028] (1/2) Epoch 21, batch 8850, loss[loss=0.2422, simple_loss=0.2959, pruned_loss=0.09426, over 12630.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.279, pruned_loss=0.08177, over 2561466.01 frames. ], batch size: 202, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:07:18,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=387185.3333333333, ans=0.09899494936611666 2024-06-21 13:07:27,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=387185.3333333333, ans=0.0 2024-06-21 13:07:28,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=387203.6666666667, ans=0.04949747468305833 2024-06-21 13:07:33,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.40 vs. limit=15.0 2024-06-21 13:07:42,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=387222.0, ans=0.0 2024-06-21 13:07:49,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=387240.3333333333, ans=0.125 2024-06-21 13:08:12,676 INFO [train.py:1028] (1/2) Epoch 21, batch 8900, loss[loss=0.2362, simple_loss=0.2963, pruned_loss=0.08802, over 12937.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2796, pruned_loss=0.08184, over 2560716.02 frames. ], batch size: 33, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:08:25,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2024-06-21 13:08:30,056 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.224e+02 2.323e+02 2.535e+02 3.265e+02, threshold=4.646e+02, percent-clipped=0.0 2024-06-21 13:09:04,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387350.3333333333, ans=0.1 2024-06-21 13:09:11,572 INFO [train.py:1028] (1/2) Epoch 21, batch 8950, loss[loss=0.2365, simple_loss=0.2955, pruned_loss=0.08878, over 12597.00 frames. 
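
The many "...balancer....prob" and "...balancer....min_positive" entries belong to modules that keep per-channel activation statistics in a healthy range: with the scheduled probability "prob" they adjust gradients so that, for example, the fraction of positive activations per channel stays above "min_positive". The sketch below captures that idea in a deliberately simplified form (identity forward, a constant nudge in backward); the real constraint set and gradient rule in scaling.py are richer and are not reproduced here.

    import random
    import torch

    class BalancerSketch(torch.autograd.Function):
        # Identity in forward; in backward, adds a small extra gradient to
        # channels whose fraction of positive activations is outside
        # [min_positive, max_positive].
        @staticmethod
        def forward(ctx, x, min_positive, max_positive, scale):
            ctx.save_for_backward(x)
            ctx.cfg = (min_positive, max_positive, scale)
            return x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            min_pos, max_pos, scale = ctx.cfg
            # fraction of positive values per channel (channels-last layout)
            frac_pos = (x > 0).float().mean(dim=tuple(range(x.dim() - 1)))
            push_up = (frac_pos < min_pos).float()     # too few positives
            push_down = (frac_pos > max_pos).float()   # too many positives
            extra = (push_down - push_up) * scale * grad_out.abs().mean()
            return grad_out + extra, None, None, None

    def balancer(x, prob=0.125, min_positive=0.05, max_positive=0.95, scale=0.01):
        # applied stochastically; prob corresponds to the scheduled
        # "...balancer.prob" values (ans=0.125) in the log
        if x.requires_grad and random.random() < prob:
            return BalancerSketch.apply(x, min_positive, max_positive, scale)
        return x
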
], tot_loss[loss=0.2216, simple_loss=0.2799, pruned_loss=0.08163, over 2561622.93 frames. ], batch size: 202, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:09:27,318 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0 2024-06-21 13:09:29,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=387387.0, ans=0.0 2024-06-21 13:09:31,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=387387.0, ans=0.2 2024-06-21 13:09:40,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=15.0 2024-06-21 13:09:44,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=387405.3333333333, ans=0.2 2024-06-21 13:09:48,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=387405.3333333333, ans=0.09899494936611666 2024-06-21 13:09:55,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387423.6666666667, ans=0.1 2024-06-21 13:10:12,638 INFO [train.py:1028] (1/2) Epoch 21, batch 9000, loss[loss=0.2134, simple_loss=0.2778, pruned_loss=0.07448, over 13324.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2794, pruned_loss=0.08091, over 2567625.35 frames. ], batch size: 46, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:10:12,641 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 13:10:24,708 INFO [train.py:1060] (1/2) Epoch 21, validation: loss=0.1874, simple_loss=0.2512, pruned_loss=0.06183, over 351949.00 frames. 2024-06-21 13:10:24,708 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 13:10:35,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=387478.6666666667, ans=0.125 2024-06-21 13:10:37,288 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.186e+02 2.357e+02 2.472e+02 3.835e+02, threshold=4.713e+02, percent-clipped=0.0 2024-06-21 13:10:38,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=387478.6666666667, ans=0.125 2024-06-21 13:10:41,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.69 vs. limit=12.0 2024-06-21 13:10:43,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=387497.0, ans=0.0 2024-06-21 13:10:52,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=387515.3333333333, ans=0.125 2024-06-21 13:10:57,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=387533.6666666667, ans=0.2 2024-06-21 13:11:08,534 INFO [train.py:1028] (1/2) Epoch 21, batch 9050, loss[loss=0.1811, simple_loss=0.2472, pruned_loss=0.05754, over 11383.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2797, pruned_loss=0.08109, over 2567890.91 frames. 
], batch size: 16, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:11:11,815 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2024-06-21 13:11:28,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=387588.6666666667, ans=0.0 2024-06-21 13:11:36,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=387588.6666666667, ans=0.125 2024-06-21 13:11:39,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=387607.0, ans=0.125 2024-06-21 13:11:59,871 INFO [train.py:1028] (1/2) Epoch 21, batch 9100, loss[loss=0.2102, simple_loss=0.2684, pruned_loss=0.07598, over 13280.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2793, pruned_loss=0.08095, over 2568192.66 frames. ], batch size: 72, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:12:16,351 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.210e+02 2.354e+02 2.538e+02 3.319e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 13:12:33,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=387698.6666666667, ans=0.0 2024-06-21 13:12:34,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=387698.6666666667, ans=0.0 2024-06-21 13:12:37,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387717.0, ans=0.1 2024-06-21 13:12:43,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=387717.0, ans=0.0 2024-06-21 13:12:47,335 INFO [train.py:1028] (1/2) Epoch 21, batch 9150, loss[loss=0.2421, simple_loss=0.3079, pruned_loss=0.08815, over 13155.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2796, pruned_loss=0.08125, over 2571209.13 frames. ], batch size: 77, lr: 2.74e-03, grad_scale: 16.0 2024-06-21 13:12:54,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=387735.3333333333, ans=0.0 2024-06-21 13:12:56,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387753.6666666667, ans=0.1 2024-06-21 13:13:19,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=387790.3333333333, ans=0.0 2024-06-21 13:13:21,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2024-06-21 13:13:23,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=387790.3333333333, ans=0.0 2024-06-21 13:13:41,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=387808.6666666667, ans=0.025 2024-06-21 13:13:45,486 INFO [train.py:1028] (1/2) Epoch 21, batch 9200, loss[loss=0.2021, simple_loss=0.261, pruned_loss=0.07162, over 12975.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2793, pruned_loss=0.08078, over 2573994.91 frames. 
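
Each batch summary reports two losses: "loss[... over ~1e4 frames]" for the current batch and "tot_loss[... over ~2.57e6 frames]" for a running aggregate. The frame totals hold nearly steady mid-epoch and (as the epoch 22 entries further down show) rebuild from near zero after a reset, which is the behaviour of an exponentially decayed, frame-weighted sum. A sketch follows; the decay constant 1 - 1/200 is an assumption inferred from the log itself, since steady-state totals (~2.57e6 frames) divided by typical batch sizes (~1.3e4 frames) give a window of roughly 200 batches.

    class RunningLoss:
        def __init__(self, decay=1.0 - 1.0 / 200):
            self.decay = decay
            self.loss_sum = 0.0   # decayed sum of (per-frame loss * frames)
            self.frames = 0.0     # decayed sum of frame counts

        def update(self, batch_loss, batch_frames):
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames
            return self.loss_sum / self.frames, self.frames

    tracker = RunningLoss()
    for _ in range(2000):
        loss, frames = tracker.update(batch_loss=0.22, batch_frames=12860.0)
    print(f"tot_loss[loss={loss:.4f}, ..., over {frames:.2f} frames.]")
    # frames converges to 12860 / (1 - decay) = 2.572e6, the scale of the
    # "over 2572068.22 frames" totals seen throughout epoch 21
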
], batch size: 36, lr: 2.74e-03, grad_scale: 32.0 2024-06-21 13:13:48,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0 2024-06-21 13:13:57,473 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:13:57,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=387845.3333333333, ans=0.125 2024-06-21 13:14:00,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=387845.3333333333, ans=0.125 2024-06-21 13:14:02,816 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.182e+02 2.300e+02 2.486e+02 3.614e+02, threshold=4.600e+02, percent-clipped=0.0 2024-06-21 13:14:23,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=387900.3333333333, ans=0.125 2024-06-21 13:14:30,581 INFO [train.py:1028] (1/2) Epoch 21, batch 9250, loss[loss=0.2268, simple_loss=0.2814, pruned_loss=0.08613, over 13220.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2792, pruned_loss=0.08062, over 2575471.13 frames. ], batch size: 67, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:14:33,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387918.6666666667, ans=0.1 2024-06-21 13:14:40,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=387937.0, ans=0.125 2024-06-21 13:14:43,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=387937.0, ans=0.125 2024-06-21 13:14:47,185 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.98 vs. limit=22.5 2024-06-21 13:14:47,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=387955.3333333333, ans=0.0 2024-06-21 13:14:59,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=387973.6666666667, ans=0.125 2024-06-21 13:15:04,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=387973.6666666667, ans=0.07 2024-06-21 13:15:10,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=387992.0, ans=0.0 2024-06-21 13:15:15,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388010.3333333333, ans=0.1 2024-06-21 13:15:15,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=388010.3333333333, ans=0.125 2024-06-21 13:15:15,886 INFO [train.py:1028] (1/2) Epoch 21, batch 9300, loss[loss=0.1987, simple_loss=0.2629, pruned_loss=0.06729, over 12950.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2793, pruned_loss=0.08035, over 2573062.88 frames. 
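
The "[scaling.py:1119] WithLoss: name=...self_attn_weights, loss-sum=0.000e+00" entries track an auxiliary penalty attached to the attention weights; loss-sum=0.000e+00 indicates the penalty is currently inactive, i.e. the monitored quantity is within bounds. Below is a generic sketch of attaching a logged auxiliary loss to an activation without changing its forward value; the specific penalty (weights capped at 0.95) is hypothetical, not what scaling.py computes.

    import torch

    class WithAuxLoss(torch.autograd.Function):
        # Forward identity; backward adds the gradient of an auxiliary
        # penalty computed on the activation and records its value, which
        # is what a "WithLoss: ... loss-sum=..." line could then report.
        last_loss_sum = 0.0

        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                # hypothetical penalty: attention weights should stay <= 0.95
                penalty = torch.relu(xd - 0.95).sum()
                WithAuxLoss.last_loss_sum = penalty.item()
                (aux_grad,) = torch.autograd.grad(penalty, xd)
            return grad_out + aux_grad

    attn = (torch.rand(4, 8, 100, 100) * 0.5).requires_grad_(True)
    WithAuxLoss.apply(attn).sum().backward()
    print(f"WithLoss: name=self_attn_weights, "
          f"loss-sum={WithAuxLoss.last_loss_sum:.3e}")
    # every weight is <= 0.5 here, so loss-sum=0.000e+00, as in the log
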
], batch size: 39, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:15:26,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=388028.6666666667, ans=0.125 2024-06-21 13:15:30,976 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.211e+02 2.421e+02 2.633e+02 3.486e+02, threshold=4.842e+02, percent-clipped=0.0 2024-06-21 13:15:41,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=388047.0, ans=0.04949747468305833 2024-06-21 13:15:41,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=388047.0, ans=0.0 2024-06-21 13:15:42,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=388065.3333333333, ans=0.1 2024-06-21 13:15:45,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=388065.3333333333, ans=0.125 2024-06-21 13:15:46,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=388065.3333333333, ans=0.125 2024-06-21 13:15:47,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=388065.3333333333, ans=0.125 2024-06-21 13:15:50,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=388065.3333333333, ans=0.125 2024-06-21 13:16:03,129 INFO [train.py:1028] (1/2) Epoch 21, batch 9350, loss[loss=0.2135, simple_loss=0.2722, pruned_loss=0.07744, over 12637.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2793, pruned_loss=0.0804, over 2569158.61 frames. ], batch size: 22, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:16:07,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=388102.0, ans=0.125 2024-06-21 13:16:13,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=388120.3333333333, ans=0.0 2024-06-21 13:16:22,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=388138.6666666667, ans=0.0 2024-06-21 13:16:42,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=388175.3333333333, ans=0.125 2024-06-21 13:16:47,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=388175.3333333333, ans=0.0 2024-06-21 13:16:49,252 INFO [train.py:1028] (1/2) Epoch 21, batch 9400, loss[loss=0.2189, simple_loss=0.2793, pruned_loss=0.0793, over 13220.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2789, pruned_loss=0.08044, over 2568150.23 frames. ], batch size: 52, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:16:52,107 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:17:01,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.21 vs. 
limit=12.0 2024-06-21 13:17:04,802 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.205e+02 2.325e+02 2.536e+02 3.081e+02, threshold=4.649e+02, percent-clipped=0.0 2024-06-21 13:17:11,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=388230.3333333333, ans=0.2 2024-06-21 13:17:17,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=388248.6666666667, ans=0.125 2024-06-21 13:17:20,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388248.6666666667, ans=0.1 2024-06-21 13:17:25,273 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.03 vs. limit=22.5 2024-06-21 13:17:25,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=388248.6666666667, ans=0.125 2024-06-21 13:17:37,565 INFO [train.py:1028] (1/2) Epoch 21, batch 9450, loss[loss=0.2175, simple_loss=0.2886, pruned_loss=0.07317, over 12454.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2796, pruned_loss=0.08068, over 2568264.97 frames. ], batch size: 22, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:18:00,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=388322.0, ans=0.125 2024-06-21 13:18:18,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=388358.6666666667, ans=0.125 2024-06-21 13:18:19,858 INFO [train.py:1028] (1/2) Epoch 21, batch 9500, loss[loss=0.2281, simple_loss=0.2846, pruned_loss=0.08579, over 13254.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2787, pruned_loss=0.08, over 2577276.70 frames. ], batch size: 43, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:18:25,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=388377.0, ans=0.0 2024-06-21 13:18:31,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=388395.3333333333, ans=0.125 2024-06-21 13:18:32,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=388395.3333333333, ans=0.125 2024-06-21 13:18:35,511 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.195e+02 2.363e+02 2.468e+02 3.468e+02, threshold=4.726e+02, percent-clipped=0.0 2024-06-21 13:18:51,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388432.0, ans=0.1 2024-06-21 13:19:07,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2024-06-21 13:19:07,311 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2024-06-21 13:19:07,873 INFO [train.py:1028] (1/2) Epoch 21, batch 9550, loss[loss=0.2082, simple_loss=0.2713, pruned_loss=0.07252, over 12890.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2789, pruned_loss=0.08018, over 2574013.11 frames. 
], batch size: 39, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:19:09,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=388468.6666666667, ans=0.125 2024-06-21 13:19:15,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=388468.6666666667, ans=0.125 2024-06-21 13:19:21,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=388487.0, ans=0.2 2024-06-21 13:19:22,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=388487.0, ans=0.125 2024-06-21 13:19:35,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=388523.6666666667, ans=0.125 2024-06-21 13:19:43,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=388523.6666666667, ans=0.0 2024-06-21 13:19:56,007 INFO [train.py:1028] (1/2) Epoch 21, batch 9600, loss[loss=0.2401, simple_loss=0.282, pruned_loss=0.09913, over 10438.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.278, pruned_loss=0.07977, over 2572601.07 frames. ], batch size: 304, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:19:56,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388560.3333333333, ans=0.1 2024-06-21 13:19:58,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=388560.3333333333, ans=0.0 2024-06-21 13:20:02,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.59 vs. limit=15.0 2024-06-21 13:20:03,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=388578.6666666667, ans=0.125 2024-06-21 13:20:07,267 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.221e+02 2.371e+02 2.590e+02 3.110e+02, threshold=4.743e+02, percent-clipped=0.0 2024-06-21 13:20:10,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.88 vs. limit=15.0 2024-06-21 13:20:34,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=388633.6666666667, ans=0.0 2024-06-21 13:20:38,423 INFO [train.py:1028] (1/2) Epoch 21, batch 9650, loss[loss=0.209, simple_loss=0.2602, pruned_loss=0.07891, over 13114.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2787, pruned_loss=0.0807, over 2563320.23 frames. ], batch size: 132, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:21:08,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=388688.6666666667, ans=0.07 2024-06-21 13:21:21,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=388725.3333333333, ans=0.125 2024-06-21 13:21:24,046 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:21:33,797 INFO [train.py:1028] (1/2) Epoch 21, batch 9700, loss[loss=0.2129, simple_loss=0.27, pruned_loss=0.07788, over 13007.00 frames. 
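
The batch_count driving the schedules is not the raw batch index: between batch 7850 (batch_count 385370.33) and batch 9600 (388560.33) the counter advances 3190 over 1750 steps, about 1.82 per step. That is consistent with a counter scaled by the actual batch duration relative to a reference duration, so that schedules advance in proportion to data consumed rather than optimizer steps; the formula and the parameter values below (about 547 s of audio per batch across 2 ranks, 600 s reference) are assumptions that reproduce the observed rate.

    def advance_batch_count(batch_count, batch_seconds, world_size, ref_duration):
        # schedules advance by (audio seconds processed) / (reference seconds)
        return batch_count + batch_seconds * world_size / ref_duration

    bc = 385370.3333333333   # logged at epoch 21, batch 7850
    for _ in range(1750):    # advance to batch 9600
        bc = advance_batch_count(bc, batch_seconds=547.0,
                                 world_size=2, ref_duration=600.0)
    print(bc)                # ~388561, in line with the logged 388560.33
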
], tot_loss[loss=0.2196, simple_loss=0.2781, pruned_loss=0.08056, over 2557066.45 frames. ], batch size: 144, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:21:42,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.91 vs. limit=6.0 2024-06-21 13:21:47,706 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.215e+02 2.393e+02 2.605e+02 3.985e+02, threshold=4.787e+02, percent-clipped=0.0 2024-06-21 13:22:01,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=388798.6666666667, ans=0.125 2024-06-21 13:22:11,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=388817.0, ans=0.125 2024-06-21 13:22:15,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=388835.3333333333, ans=0.0 2024-06-21 13:22:16,304 INFO [train.py:1028] (1/2) Epoch 21, batch 9750, loss[loss=0.2068, simple_loss=0.2618, pruned_loss=0.07585, over 13115.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2776, pruned_loss=0.08031, over 2553320.25 frames. ], batch size: 132, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:22:26,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=388853.6666666667, ans=0.0 2024-06-21 13:22:39,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=388872.0, ans=0.125 2024-06-21 13:22:58,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=388908.6666666667, ans=0.2 2024-06-21 13:23:00,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=388927.0, ans=0.125 2024-06-21 13:23:00,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=388927.0, ans=0.05 2024-06-21 13:23:01,585 INFO [train.py:1028] (1/2) Epoch 21, batch 9800, loss[loss=0.2076, simple_loss=0.2633, pruned_loss=0.07593, over 12908.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2773, pruned_loss=0.07978, over 2544558.18 frames. ], batch size: 39, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:23:04,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=388927.0, ans=0.0 2024-06-21 13:23:07,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=388927.0, ans=0.125 2024-06-21 13:23:16,764 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.228e+02 2.421e+02 2.637e+02 3.419e+02, threshold=4.842e+02, percent-clipped=0.0 2024-06-21 13:23:24,491 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.83 vs. 
limit=15.0 2024-06-21 13:23:28,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388982.0, ans=0.1 2024-06-21 13:23:33,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=388982.0, ans=0.125 2024-06-21 13:23:40,263 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:23:45,861 INFO [train.py:1028] (1/2) Epoch 21, batch 9850, loss[loss=0.216, simple_loss=0.2714, pruned_loss=0.08036, over 12996.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2771, pruned_loss=0.07978, over 2537536.67 frames. ], batch size: 102, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:23:55,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=15.0 2024-06-21 13:23:58,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.49 vs. limit=12.0 2024-06-21 13:24:04,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=389055.3333333333, ans=0.0 2024-06-21 13:24:04,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=389055.3333333333, ans=0.125 2024-06-21 13:24:23,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=389092.0, ans=0.2 2024-06-21 13:24:24,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=389092.0, ans=0.025 2024-06-21 13:24:27,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389092.0, ans=0.1 2024-06-21 13:24:29,826 INFO [train.py:1028] (1/2) Epoch 21, batch 9900, loss[loss=0.2249, simple_loss=0.2877, pruned_loss=0.08108, over 12976.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2765, pruned_loss=0.08, over 2531046.88 frames. ], batch size: 39, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:24:44,379 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.281e+02 2.442e+02 2.681e+02 3.633e+02, threshold=4.884e+02, percent-clipped=0.0 2024-06-21 13:24:44,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=389128.6666666667, ans=0.0 2024-06-21 13:24:54,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=389165.3333333333, ans=0.125 2024-06-21 13:24:55,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.23 vs. 
limit=12.0 2024-06-21 13:24:59,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=389165.3333333333, ans=0.025 2024-06-21 13:25:09,331 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:25:13,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=389202.0, ans=0.2 2024-06-21 13:25:13,864 INFO [train.py:1028] (1/2) Epoch 21, batch 9950, loss[loss=0.2327, simple_loss=0.2823, pruned_loss=0.09153, over 12801.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2757, pruned_loss=0.08029, over 2523411.12 frames. ], batch size: 29, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:25:43,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=389257.0, ans=0.125 2024-06-21 13:25:44,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=389257.0, ans=0.0 2024-06-21 13:25:50,963 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.15 vs. limit=15.0 2024-06-21 13:26:00,077 INFO [train.py:1028] (1/2) Epoch 21, batch 10000, loss[loss=0.2007, simple_loss=0.2663, pruned_loss=0.06759, over 12555.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2755, pruned_loss=0.08055, over 2485859.53 frames. ], batch size: 22, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:26:01,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=389293.6666666667, ans=0.0 2024-06-21 13:26:06,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=389293.6666666667, ans=0.0 2024-06-21 13:26:10,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=389312.0, ans=0.0 2024-06-21 13:26:11,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2024-06-21 13:26:12,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.46 vs. limit=22.5 2024-06-21 13:26:13,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=389312.0, ans=0.035 2024-06-21 13:26:17,405 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.201e+02 2.333e+02 2.565e+02 3.575e+02, threshold=4.666e+02, percent-clipped=0.0 2024-06-21 13:26:22,740 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2024-06-21 13:26:31,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=389348.6666666667, ans=15.0 2024-06-21 13:26:38,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=389348.6666666667, ans=22.5 2024-06-21 13:26:47,648 INFO [train.py:1028] (1/2) Epoch 21, batch 10050, loss[loss=0.2219, simple_loss=0.2835, pruned_loss=0.08015, over 12776.00 frames. 
], tot_loss[loss=0.2199, simple_loss=0.2762, pruned_loss=0.08178, over 2444057.71 frames. ], batch size: 22, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:26:48,795 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2024-06-21 13:26:49,002 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.32 vs. limit=15.0 2024-06-21 13:26:56,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=389403.6666666667, ans=0.125 2024-06-21 13:27:00,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=389403.6666666667, ans=0.0 2024-06-21 13:27:02,850 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.97 vs. limit=22.5 2024-06-21 13:27:06,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=389422.0, ans=0.125 2024-06-21 13:27:31,100 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=27.08 vs. limit=22.5 2024-06-21 13:27:33,802 INFO [train.py:1028] (1/2) Epoch 21, batch 10100, loss[loss=0.1849, simple_loss=0.2485, pruned_loss=0.06064, over 11974.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2756, pruned_loss=0.08153, over 2425160.60 frames. ], batch size: 18, lr: 2.73e-03, grad_scale: 32.0 2024-06-21 13:27:36,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=389477.0, ans=0.125 2024-06-21 13:30:39,814 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.148e+02 2.288e+02 2.479e+02 3.366e+02, threshold=4.576e+02, percent-clipped=0.0 2024-06-21 13:30:39,866 INFO [train.py:1028] (1/2) Epoch 22, batch 0, loss[loss=0.1848, simple_loss=0.242, pruned_loss=0.06379, over 12997.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.242, pruned_loss=0.06379, over 12997.00 frames. ], batch size: 36, lr: 2.67e-03, grad_scale: 32.0 2024-06-21 13:30:39,867 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 13:30:56,623 INFO [train.py:1060] (1/2) Epoch 22, validation: loss=0.1886, simple_loss=0.2528, pruned_loss=0.06221, over 351949.00 frames. 
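
At 13:30:39 epoch 22 begins: batch numbering restarts at 0, validation runs immediately, the learning rate steps down (2.73e-03 to 2.67e-03, then 2.66e-03 by batch 100, an epoch-level decay on top of the per-batch decay), and the tot_loss window is rebuilt from scratch. The growing "over N frames" totals in the batch 50 through 350 summaries that follow track the decayed-window arithmetic from the running-average sketch earlier almost exactly, which supports the inferred decay constant; the frames-per-batch figure remains an assumption.

    # Assuming decay = 1 - 1/200 and ~12860 frames per batch (both inferred
    # above), the decayed frame total after k batches of a fresh epoch is
    #   f * (1 - decay**k) / (1 - decay)
    f, decay = 12860.0, 1.0 - 1.0 / 200
    for k in (50, 100, 150, 200, 250, 300, 350):
        print(k, round(f * (1 - decay**k) / (1 - decay)))
    # -> ~5.70e5, 1.01e6, 1.36e6, 1.63e6, 1.84e6, 2.00e6, 2.13e6, closely
    # tracking the logged totals 574088, 1017232, 1364323, 1634310,
    # 1845443, 2009594, 2139027 in the entries below
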
2024-06-21 13:30:56,624 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 13:31:00,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=389508.1666666667, ans=0.0 2024-06-21 13:31:03,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=389508.1666666667, ans=0.0 2024-06-21 13:31:07,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=389526.5, ans=0.025 2024-06-21 13:31:20,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=389544.8333333333, ans=0.025 2024-06-21 13:31:31,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=389563.1666666667, ans=0.2 2024-06-21 13:31:35,876 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.28 vs. limit=6.0 2024-06-21 13:31:47,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=389581.5, ans=0.2 2024-06-21 13:31:47,368 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.66 vs. limit=22.5 2024-06-21 13:31:52,369 INFO [train.py:1028] (1/2) Epoch 22, batch 50, loss[loss=0.2139, simple_loss=0.2725, pruned_loss=0.07762, over 12632.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2573, pruned_loss=0.07395, over 574087.99 frames. ], batch size: 29, lr: 2.67e-03, grad_scale: 32.0 2024-06-21 13:31:54,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=389599.8333333333, ans=0.125 2024-06-21 13:31:58,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389599.8333333333, ans=0.1 2024-06-21 13:32:03,797 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.65 vs. limit=15.0 2024-06-21 13:32:13,705 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.57 vs. limit=15.0 2024-06-21 13:32:18,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=389654.8333333333, ans=0.0 2024-06-21 13:32:34,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=389673.1666666667, ans=0.125 2024-06-21 13:32:36,429 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.127e+02 2.240e+02 2.485e+02 3.030e+02, threshold=4.480e+02, percent-clipped=0.0 2024-06-21 13:32:36,463 INFO [train.py:1028] (1/2) Epoch 22, batch 100, loss[loss=0.1943, simple_loss=0.261, pruned_loss=0.06383, over 13278.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.257, pruned_loss=0.07343, over 1017231.61 frames. 
], batch size: 46, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:32:42,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=389691.5, ans=0.025 2024-06-21 13:32:43,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=389691.5, ans=0.2 2024-06-21 13:32:49,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0 2024-06-21 13:32:51,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=389709.8333333333, ans=0.0 2024-06-21 13:32:56,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=389728.1666666667, ans=0.125 2024-06-21 13:33:02,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=389728.1666666667, ans=0.125 2024-06-21 13:33:16,023 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.61 vs. limit=22.5 2024-06-21 13:33:17,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=389764.8333333333, ans=0.125 2024-06-21 13:33:32,193 INFO [train.py:1028] (1/2) Epoch 22, batch 150, loss[loss=0.2052, simple_loss=0.258, pruned_loss=0.07615, over 12619.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.256, pruned_loss=0.07226, over 1364323.48 frames. ], batch size: 29, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:33:47,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=15.0 2024-06-21 13:34:01,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=389838.1666666667, ans=0.125 2024-06-21 13:34:15,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=389856.5, ans=0.1 2024-06-21 13:34:19,649 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.054e+02 2.155e+02 2.372e+02 3.572e+02, threshold=4.309e+02, percent-clipped=0.0 2024-06-21 13:34:19,705 INFO [train.py:1028] (1/2) Epoch 22, batch 200, loss[loss=0.2249, simple_loss=0.2704, pruned_loss=0.08975, over 12590.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2557, pruned_loss=0.07208, over 1634309.86 frames. ], batch size: 202, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:34:26,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389874.8333333333, ans=0.1 2024-06-21 13:34:32,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=389893.1666666667, ans=0.2 2024-06-21 13:34:43,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=389911.5, ans=0.125 2024-06-21 13:34:56,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. 
limit=15.0 2024-06-21 13:34:56,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=389929.8333333333, ans=0.2 2024-06-21 13:35:03,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=389929.8333333333, ans=0.125 2024-06-21 13:35:07,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=389948.1666666667, ans=0.2 2024-06-21 13:35:14,296 INFO [train.py:1028] (1/2) Epoch 22, batch 250, loss[loss=0.195, simple_loss=0.2452, pruned_loss=0.07237, over 13046.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2551, pruned_loss=0.07144, over 1845443.48 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:35:17,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=389966.5, ans=0.125 2024-06-21 13:35:18,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=389966.5, ans=0.125 2024-06-21 13:35:27,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2024-06-21 13:35:44,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=390021.5, ans=0.0 2024-06-21 13:35:57,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390039.8333333333, ans=0.1 2024-06-21 13:35:59,557 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.072e+02 2.229e+02 2.443e+02 2.878e+02, threshold=4.458e+02, percent-clipped=0.0 2024-06-21 13:35:59,618 INFO [train.py:1028] (1/2) Epoch 22, batch 300, loss[loss=0.1835, simple_loss=0.2394, pruned_loss=0.06379, over 13210.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2561, pruned_loss=0.07217, over 2009594.44 frames. ], batch size: 112, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:36:04,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=390058.1666666667, ans=0.07 2024-06-21 13:36:12,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=390076.5, ans=0.1 2024-06-21 13:36:35,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=390113.1666666667, ans=0.025 2024-06-21 13:36:35,528 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.49 vs. limit=15.0 2024-06-21 13:36:39,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=390113.1666666667, ans=0.125 2024-06-21 13:36:51,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=390131.5, ans=0.125 2024-06-21 13:36:54,472 INFO [train.py:1028] (1/2) Epoch 22, batch 350, loss[loss=0.1896, simple_loss=0.257, pruned_loss=0.06112, over 12887.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.256, pruned_loss=0.07233, over 2139026.94 frames. 
], batch size: 33, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:36:54,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=390149.8333333333, ans=0.0 2024-06-21 13:36:55,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=390149.8333333333, ans=0.125 2024-06-21 13:37:09,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390168.1666666667, ans=0.1 2024-06-21 13:37:40,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=390223.1666666667, ans=0.125 2024-06-21 13:37:42,633 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.056e+02 2.261e+02 2.519e+02 3.296e+02, threshold=4.522e+02, percent-clipped=0.0 2024-06-21 13:37:42,692 INFO [train.py:1028] (1/2) Epoch 22, batch 400, loss[loss=0.1921, simple_loss=0.2474, pruned_loss=0.06833, over 13267.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2562, pruned_loss=0.07216, over 2239230.87 frames. ], batch size: 63, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:37:45,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=390241.5, ans=15.0 2024-06-21 13:37:47,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=390241.5, ans=0.2 2024-06-21 13:37:57,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2024-06-21 13:37:58,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=390278.1666666667, ans=0.125 2024-06-21 13:37:59,633 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:38:09,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=390296.5, ans=0.125 2024-06-21 13:38:11,353 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:38:23,762 INFO [train.py:1028] (1/2) Epoch 22, batch 450, loss[loss=0.2062, simple_loss=0.2711, pruned_loss=0.07063, over 13228.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2563, pruned_loss=0.07197, over 2312345.51 frames. ], batch size: 67, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:38:27,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.68 vs. limit=15.0 2024-06-21 13:38:34,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=390351.5, ans=0.2 2024-06-21 13:38:38,836 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.55 vs. 
limit=15.0 2024-06-21 13:38:44,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=390369.8333333333, ans=0.125 2024-06-21 13:38:47,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=390388.1666666667, ans=0.0 2024-06-21 13:38:56,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=390406.5, ans=0.125 2024-06-21 13:38:56,870 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2024-06-21 13:39:05,851 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.012e+02 2.134e+02 2.327e+02 2.700e+02, threshold=4.268e+02, percent-clipped=0.0 2024-06-21 13:39:05,887 INFO [train.py:1028] (1/2) Epoch 22, batch 500, loss[loss=0.2012, simple_loss=0.2562, pruned_loss=0.07313, over 13053.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2561, pruned_loss=0.07171, over 2375000.14 frames. ], batch size: 121, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:39:17,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=390424.8333333333, ans=0.0 2024-06-21 13:39:26,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=390443.1666666667, ans=0.125 2024-06-21 13:39:36,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=390461.5, ans=0.2 2024-06-21 13:40:00,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390498.1666666667, ans=0.1 2024-06-21 13:40:01,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=390498.1666666667, ans=0.2 2024-06-21 13:40:03,252 INFO [train.py:1028] (1/2) Epoch 22, batch 550, loss[loss=0.1951, simple_loss=0.2531, pruned_loss=0.0686, over 12909.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2556, pruned_loss=0.07144, over 2420903.82 frames. ], batch size: 158, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:40:37,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.76 vs. limit=15.0 2024-06-21 13:40:48,733 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.089e+02 2.224e+02 2.432e+02 3.650e+02, threshold=4.448e+02, percent-clipped=0.0 2024-06-21 13:40:48,764 INFO [train.py:1028] (1/2) Epoch 22, batch 600, loss[loss=0.1738, simple_loss=0.2245, pruned_loss=0.06161, over 13043.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2554, pruned_loss=0.07139, over 2459485.07 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:40:52,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. 
limit=6.0 2024-06-21 13:41:11,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=390644.8333333333, ans=0.125 2024-06-21 13:41:15,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=390644.8333333333, ans=0.025 2024-06-21 13:41:19,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=390663.1666666667, ans=0.0 2024-06-21 13:41:20,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=390663.1666666667, ans=0.125 2024-06-21 13:41:24,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=390663.1666666667, ans=0.125 2024-06-21 13:41:25,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=390663.1666666667, ans=0.2 2024-06-21 13:41:38,791 INFO [train.py:1028] (1/2) Epoch 22, batch 650, loss[loss=0.203, simple_loss=0.2608, pruned_loss=0.07254, over 13226.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2552, pruned_loss=0.07073, over 2490524.59 frames. ], batch size: 59, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:42:07,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=390754.8333333333, ans=0.07 2024-06-21 13:42:27,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390773.1666666667, ans=0.1 2024-06-21 13:42:30,345 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.050e+02 2.149e+02 2.277e+02 2.831e+02, threshold=4.298e+02, percent-clipped=0.0 2024-06-21 13:42:30,390 INFO [train.py:1028] (1/2) Epoch 22, batch 700, loss[loss=0.1938, simple_loss=0.2481, pruned_loss=0.0697, over 13310.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2554, pruned_loss=0.07113, over 2513150.53 frames. ], batch size: 46, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:42:38,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=390809.8333333333, ans=0.125 2024-06-21 13:42:45,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=390828.1666666667, ans=0.0 2024-06-21 13:42:52,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=390828.1666666667, ans=0.125 2024-06-21 13:43:08,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0 2024-06-21 13:43:16,224 INFO [train.py:1028] (1/2) Epoch 22, batch 750, loss[loss=0.1924, simple_loss=0.2598, pruned_loss=0.06245, over 13249.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2551, pruned_loss=0.0707, over 2527413.21 frames. 
], batch size: 63, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:43:20,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=390883.1666666667, ans=0.0 2024-06-21 13:43:21,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=390883.1666666667, ans=0.0 2024-06-21 13:43:29,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=390901.5, ans=0.125 2024-06-21 13:43:47,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.75 vs. limit=12.0 2024-06-21 13:43:50,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390919.8333333333, ans=0.1 2024-06-21 13:43:59,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=390938.1666666667, ans=0.025 2024-06-21 13:44:06,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=390956.5, ans=0.125 2024-06-21 13:44:07,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=390956.5, ans=0.125 2024-06-21 13:44:12,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=390974.8333333333, ans=0.0 2024-06-21 13:44:13,679 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.060e+02 2.153e+02 2.252e+02 2.846e+02, threshold=4.306e+02, percent-clipped=0.0 2024-06-21 13:44:13,722 INFO [train.py:1028] (1/2) Epoch 22, batch 800, loss[loss=0.1913, simple_loss=0.2545, pruned_loss=0.06407, over 12961.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2558, pruned_loss=0.07128, over 2539502.51 frames. ], batch size: 36, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:44:20,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=390974.8333333333, ans=0.125 2024-06-21 13:44:28,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=390993.1666666667, ans=0.125 2024-06-21 13:44:52,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=391048.1666666667, ans=0.125 2024-06-21 13:44:59,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2024-06-21 13:45:01,552 INFO [train.py:1028] (1/2) Epoch 22, batch 850, loss[loss=0.1974, simple_loss=0.2518, pruned_loss=0.07151, over 13128.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2563, pruned_loss=0.07142, over 2550194.73 frames. 
], batch size: 95, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:45:04,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=391066.5, ans=0.125 2024-06-21 13:45:37,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=391121.5, ans=0.0 2024-06-21 13:45:41,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=391121.5, ans=0.2 2024-06-21 13:45:51,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-21 13:45:56,637 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.091e+02 2.185e+02 2.273e+02 3.063e+02, threshold=4.369e+02, percent-clipped=0.0 2024-06-21 13:45:56,670 INFO [train.py:1028] (1/2) Epoch 22, batch 900, loss[loss=0.1696, simple_loss=0.2344, pruned_loss=0.05239, over 12888.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2563, pruned_loss=0.07166, over 2555843.39 frames. ], batch size: 36, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:46:15,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2024-06-21 13:46:17,743 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2024-06-21 13:46:45,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=391231.5, ans=0.125 2024-06-21 13:46:51,906 INFO [train.py:1028] (1/2) Epoch 22, batch 950, loss[loss=0.2132, simple_loss=0.2729, pruned_loss=0.07673, over 12872.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2564, pruned_loss=0.0714, over 2559445.88 frames. ], batch size: 39, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:47:00,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.79 vs. limit=22.5 2024-06-21 13:47:01,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391268.1666666667, ans=0.1 2024-06-21 13:47:09,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=391286.5, ans=0.125 2024-06-21 13:47:18,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=391304.8333333333, ans=0.0 2024-06-21 13:47:23,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.55 vs. limit=15.0 2024-06-21 13:47:24,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=391304.8333333333, ans=15.0 2024-06-21 13:47:28,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391323.1666666667, ans=0.1 2024-06-21 13:47:28,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.12 vs. 
limit=15.0 2024-06-21 13:47:34,748 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.127e+02 2.289e+02 2.519e+02 3.615e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 13:47:34,798 INFO [train.py:1028] (1/2) Epoch 22, batch 1000, loss[loss=0.1951, simple_loss=0.2518, pruned_loss=0.0692, over 13330.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2561, pruned_loss=0.0716, over 2561572.46 frames. ], batch size: 49, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:47:37,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391341.5, ans=0.1 2024-06-21 13:47:44,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=391359.8333333333, ans=0.125 2024-06-21 13:47:44,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.08 vs. limit=10.0 2024-06-21 13:47:51,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=391359.8333333333, ans=0.2 2024-06-21 13:48:03,409 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2024-06-21 13:48:16,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=391414.8333333333, ans=0.2 2024-06-21 13:48:22,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=391414.8333333333, ans=0.125 2024-06-21 13:48:30,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391433.1666666667, ans=0.1 2024-06-21 13:48:31,783 INFO [train.py:1028] (1/2) Epoch 22, batch 1050, loss[loss=0.1864, simple_loss=0.2455, pruned_loss=0.06359, over 13133.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2569, pruned_loss=0.0722, over 2564776.00 frames. ], batch size: 77, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:48:40,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=391433.1666666667, ans=0.0 2024-06-21 13:48:47,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=391451.5, ans=0.025 2024-06-21 13:48:55,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=391469.8333333333, ans=0.125 2024-06-21 13:48:55,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2024-06-21 13:48:56,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=391469.8333333333, ans=0.025 2024-06-21 13:49:05,719 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.73 vs. 
limit=15.0 2024-06-21 13:49:13,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=391506.5, ans=0.125 2024-06-21 13:49:19,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=391506.5, ans=0.2 2024-06-21 13:49:19,727 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.12 vs. limit=12.0 2024-06-21 13:49:22,080 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.066e+02 2.175e+02 2.354e+02 2.870e+02, threshold=4.350e+02, percent-clipped=0.0 2024-06-21 13:49:22,119 INFO [train.py:1028] (1/2) Epoch 22, batch 1100, loss[loss=0.2023, simple_loss=0.2631, pruned_loss=0.07074, over 13327.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2567, pruned_loss=0.07196, over 2570279.73 frames. ], batch size: 52, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:49:46,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=391543.1666666667, ans=0.0 2024-06-21 13:49:56,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=391561.5, ans=0.125 2024-06-21 13:50:12,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=391598.1666666667, ans=0.125 2024-06-21 13:50:18,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=391598.1666666667, ans=0.125 2024-06-21 13:50:20,738 INFO [train.py:1028] (1/2) Epoch 22, batch 1150, loss[loss=0.2081, simple_loss=0.2618, pruned_loss=0.07722, over 13292.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2571, pruned_loss=0.07218, over 2571050.98 frames. ], batch size: 52, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:50:31,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=391634.8333333333, ans=0.125 2024-06-21 13:50:45,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391671.5, ans=0.1 2024-06-21 13:51:01,027 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.121e+02 2.256e+02 2.420e+02 3.038e+02, threshold=4.512e+02, percent-clipped=0.0 2024-06-21 13:51:01,072 INFO [train.py:1028] (1/2) Epoch 22, batch 1200, loss[loss=0.19, simple_loss=0.2458, pruned_loss=0.0671, over 13225.00 frames. ], tot_loss[loss=0.201, simple_loss=0.257, pruned_loss=0.07249, over 2573790.77 frames. ], batch size: 77, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:51:02,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=391708.1666666667, ans=0.125 2024-06-21 13:51:04,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=391708.1666666667, ans=0.2 2024-06-21 13:51:08,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391708.1666666667, ans=0.1 2024-06-21 13:51:14,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.76 vs. 
limit=10.0 2024-06-21 13:51:14,585 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:51:14,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=391726.5, ans=0.025 2024-06-21 13:51:28,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=391744.8333333333, ans=0.125 2024-06-21 13:51:57,720 INFO [train.py:1028] (1/2) Epoch 22, batch 1250, loss[loss=0.2101, simple_loss=0.2608, pruned_loss=0.07968, over 13170.00 frames. ], tot_loss[loss=0.201, simple_loss=0.257, pruned_loss=0.07253, over 2583380.26 frames. ], batch size: 112, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:52:03,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=391799.8333333333, ans=0.035 2024-06-21 13:52:13,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=391818.1666666667, ans=0.1 2024-06-21 13:52:50,162 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.093e+02 2.186e+02 2.401e+02 4.792e+02, threshold=4.372e+02, percent-clipped=1.0 2024-06-21 13:52:50,196 INFO [train.py:1028] (1/2) Epoch 22, batch 1300, loss[loss=0.2159, simple_loss=0.2639, pruned_loss=0.08394, over 12713.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2573, pruned_loss=0.07243, over 2583522.15 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:52:53,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=391891.5, ans=0.125 2024-06-21 13:52:56,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.81 vs. limit=6.0 2024-06-21 13:52:57,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391891.5, ans=0.1 2024-06-21 13:52:57,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=391891.5, ans=0.95 2024-06-21 13:53:36,715 INFO [train.py:1028] (1/2) Epoch 22, batch 1350, loss[loss=0.2099, simple_loss=0.268, pruned_loss=0.07591, over 13254.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2566, pruned_loss=0.07203, over 2585188.90 frames. ], batch size: 59, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:53:41,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=391983.1666666667, ans=0.0 2024-06-21 13:53:47,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=392001.5, ans=15.0 2024-06-21 13:54:07,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=392038.1666666667, ans=0.1 2024-06-21 13:54:27,877 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.106e+02 2.264e+02 2.573e+02 3.379e+02, threshold=4.528e+02, percent-clipped=0.0 2024-06-21 13:54:27,913 INFO [train.py:1028] (1/2) Epoch 22, batch 1400, loss[loss=0.2168, simple_loss=0.2725, pruned_loss=0.08057, over 12396.00 frames. 
], tot_loss[loss=0.2003, simple_loss=0.2564, pruned_loss=0.07206, over 2586382.05 frames. ], batch size: 25, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:54:29,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.81 vs. limit=12.0 2024-06-21 13:54:55,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=392111.5, ans=0.2 2024-06-21 13:55:21,045 INFO [train.py:1028] (1/2) Epoch 22, batch 1450, loss[loss=0.2001, simple_loss=0.2571, pruned_loss=0.07154, over 13133.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2564, pruned_loss=0.072, over 2586403.86 frames. ], batch size: 121, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:55:27,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=392166.5, ans=0.125 2024-06-21 13:55:41,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=392184.8333333333, ans=0.0 2024-06-21 13:56:08,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=392239.8333333333, ans=0.125 2024-06-21 13:56:16,061 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.099e+02 2.190e+02 2.333e+02 3.382e+02, threshold=4.381e+02, percent-clipped=0.0 2024-06-21 13:56:16,095 INFO [train.py:1028] (1/2) Epoch 22, batch 1500, loss[loss=0.1994, simple_loss=0.2567, pruned_loss=0.07105, over 13183.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2558, pruned_loss=0.07186, over 2589084.61 frames. ], batch size: 83, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:56:16,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=392258.1666666667, ans=0.125 2024-06-21 13:56:17,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=392258.1666666667, ans=0.0 2024-06-21 13:56:17,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=392258.1666666667, ans=0.125 2024-06-21 13:56:37,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=392294.8333333333, ans=0.0 2024-06-21 13:57:00,667 INFO [train.py:1028] (1/2) Epoch 22, batch 1550, loss[loss=0.197, simple_loss=0.2462, pruned_loss=0.07391, over 13140.00 frames. ], tot_loss[loss=0.2, simple_loss=0.256, pruned_loss=0.07203, over 2583949.58 frames. ], batch size: 103, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:57:00,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392349.8333333333, ans=0.1 2024-06-21 13:57:08,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392349.8333333333, ans=0.1 2024-06-21 13:57:17,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=392368.1666666667, ans=0.2 2024-06-21 13:57:20,462 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.81 vs. 
limit=15.0 2024-06-21 13:57:28,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=392386.5, ans=0.125 2024-06-21 13:57:30,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=392386.5, ans=0.125 2024-06-21 13:57:31,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=392386.5, ans=0.025 2024-06-21 13:57:48,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=392423.1666666667, ans=0.0 2024-06-21 13:57:54,455 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.118e+02 2.220e+02 2.401e+02 3.384e+02, threshold=4.440e+02, percent-clipped=0.0 2024-06-21 13:57:54,489 INFO [train.py:1028] (1/2) Epoch 22, batch 1600, loss[loss=0.2033, simple_loss=0.2675, pruned_loss=0.06958, over 13161.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2558, pruned_loss=0.0718, over 2579935.17 frames. ], batch size: 77, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:57:54,637 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:58:34,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=392496.5, ans=0.0 2024-06-21 13:58:41,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=392514.8333333333, ans=0.025 2024-06-21 13:58:43,470 INFO [train.py:1028] (1/2) Epoch 22, batch 1650, loss[loss=0.1881, simple_loss=0.2417, pruned_loss=0.06732, over 13123.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2565, pruned_loss=0.07236, over 2575953.93 frames. ], batch size: 95, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:58:50,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=392533.1666666667, ans=0.2 2024-06-21 13:59:01,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2024-06-21 13:59:07,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392569.8333333333, ans=0.1 2024-06-21 13:59:08,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=392569.8333333333, ans=0.0 2024-06-21 13:59:10,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=392569.8333333333, ans=0.2 2024-06-21 13:59:24,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=392606.5, ans=0.125 2024-06-21 13:59:36,107 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.067e+02 2.186e+02 2.405e+02 3.278e+02, threshold=4.372e+02, percent-clipped=0.0 2024-06-21 13:59:36,140 INFO [train.py:1028] (1/2) Epoch 22, batch 1700, loss[loss=0.1992, simple_loss=0.2718, pruned_loss=0.06325, over 12445.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2561, pruned_loss=0.07179, over 2581076.35 frames. 
], batch size: 25, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 13:59:36,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392624.8333333333, ans=0.1 2024-06-21 13:59:48,353 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2024-06-21 13:59:51,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=392643.1666666667, ans=0.125 2024-06-21 14:00:09,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=392679.8333333333, ans=0.125 2024-06-21 14:00:14,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=392679.8333333333, ans=0.125 2024-06-21 14:00:18,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=392698.1666666667, ans=0.125 2024-06-21 14:00:21,031 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2024-06-21 14:00:26,557 INFO [train.py:1028] (1/2) Epoch 22, batch 1750, loss[loss=0.2071, simple_loss=0.2706, pruned_loss=0.07184, over 12675.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2565, pruned_loss=0.07183, over 2582438.65 frames. ], batch size: 22, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:00:34,089 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=22.5 2024-06-21 14:01:06,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=392789.8333333333, ans=0.125 2024-06-21 14:01:13,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=392789.8333333333, ans=0.125 2024-06-21 14:01:15,145 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.104e+02 2.227e+02 2.447e+02 3.112e+02, threshold=4.454e+02, percent-clipped=0.0 2024-06-21 14:01:15,188 INFO [train.py:1028] (1/2) Epoch 22, batch 1800, loss[loss=0.216, simple_loss=0.2775, pruned_loss=0.07722, over 13266.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2568, pruned_loss=0.07207, over 2582715.23 frames. ], batch size: 67, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:01:39,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=392844.8333333333, ans=0.125 2024-06-21 14:01:48,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=392844.8333333333, ans=0.125 2024-06-21 14:01:56,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=392863.1666666667, ans=0.0 2024-06-21 14:02:09,141 INFO [train.py:1028] (1/2) Epoch 22, batch 1850, loss[loss=0.2149, simple_loss=0.2629, pruned_loss=0.08346, over 13201.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2567, pruned_loss=0.07212, over 2583380.33 frames. 
], batch size: 83, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:02:10,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.46 vs. limit=15.0 2024-06-21 14:02:30,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=392936.5, ans=0.125 2024-06-21 14:02:35,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=392936.5, ans=0.2 2024-06-21 14:02:57,380 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.103e+02 2.236e+02 2.430e+02 3.214e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 14:02:57,414 INFO [train.py:1028] (1/2) Epoch 22, batch 1900, loss[loss=0.2028, simple_loss=0.2615, pruned_loss=0.07206, over 13147.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2567, pruned_loss=0.07207, over 2585560.31 frames. ], batch size: 95, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:03:01,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=392991.5, ans=0.2 2024-06-21 14:03:02,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=392991.5, ans=0.0 2024-06-21 14:03:04,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=392991.5, ans=0.125 2024-06-21 14:03:12,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=393009.8333333333, ans=0.125 2024-06-21 14:03:12,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0 2024-06-21 14:03:31,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.90 vs. limit=10.0 2024-06-21 14:03:32,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2024-06-21 14:03:35,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=393046.5, ans=0.125 2024-06-21 14:03:49,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=393064.8333333333, ans=0.09899494936611666 2024-06-21 14:03:51,438 INFO [train.py:1028] (1/2) Epoch 22, batch 1950, loss[loss=0.1964, simple_loss=0.2625, pruned_loss=0.06513, over 13272.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2564, pruned_loss=0.07213, over 2590985.74 frames. 
], batch size: 52, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:04:08,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=393101.5, ans=0.0 2024-06-21 14:04:09,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=393101.5, ans=0.125 2024-06-21 14:04:25,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393119.8333333333, ans=0.1 2024-06-21 14:04:30,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=393138.1666666667, ans=0.125 2024-06-21 14:04:37,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=393138.1666666667, ans=0.2 2024-06-21 14:04:39,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393156.5, ans=0.1 2024-06-21 14:04:46,964 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.073e+02 2.224e+02 2.384e+02 3.007e+02, threshold=4.448e+02, percent-clipped=0.0 2024-06-21 14:04:47,001 INFO [train.py:1028] (1/2) Epoch 22, batch 2000, loss[loss=0.1962, simple_loss=0.2544, pruned_loss=0.06901, over 12706.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2563, pruned_loss=0.0722, over 2587153.94 frames. ], batch size: 22, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:04:51,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=393174.8333333333, ans=0.025 2024-06-21 14:04:58,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=393193.1666666667, ans=0.125 2024-06-21 14:05:10,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=393211.5, ans=0.125 2024-06-21 14:05:12,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=393211.5, ans=0.125 2024-06-21 14:05:16,986 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.11 vs. limit=22.5 2024-06-21 14:05:26,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=393248.1666666667, ans=0.0 2024-06-21 14:05:33,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=393248.1666666667, ans=0.0 2024-06-21 14:05:36,644 INFO [train.py:1028] (1/2) Epoch 22, batch 2050, loss[loss=0.1961, simple_loss=0.2597, pruned_loss=0.06625, over 12672.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2567, pruned_loss=0.07269, over 2582792.68 frames. ], batch size: 29, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:05:41,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=393266.5, ans=0.09899494936611666 2024-06-21 14:05:46,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=393284.8333333333, ans=0.2 2024-06-21 14:05:57,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.47 vs. 
limit=15.0 2024-06-21 14:06:01,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=393303.1666666667, ans=0.0 2024-06-21 14:06:02,103 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=15.0 2024-06-21 14:06:06,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=393321.5, ans=0.2 2024-06-21 14:06:19,562 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.47 vs. limit=15.0 2024-06-21 14:06:28,575 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.144e+02 2.329e+02 2.514e+02 3.732e+02, threshold=4.657e+02, percent-clipped=0.0 2024-06-21 14:06:28,613 INFO [train.py:1028] (1/2) Epoch 22, batch 2100, loss[loss=0.2019, simple_loss=0.2657, pruned_loss=0.06905, over 13237.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.257, pruned_loss=0.07234, over 2586302.11 frames. ], batch size: 59, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:06:35,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=15.0 2024-06-21 14:06:37,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=393376.5, ans=0.125 2024-06-21 14:06:43,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=393376.5, ans=0.0 2024-06-21 14:06:50,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=393394.8333333333, ans=0.125 2024-06-21 14:07:06,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=393431.5, ans=0.125 2024-06-21 14:07:07,368 INFO [train.py:1028] (1/2) Epoch 22, batch 2150, loss[loss=0.1879, simple_loss=0.2525, pruned_loss=0.06165, over 13244.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2569, pruned_loss=0.07196, over 2588268.09 frames. ], batch size: 52, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:07:10,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=393449.8333333333, ans=0.125 2024-06-21 14:07:12,383 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:07:35,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=393486.5, ans=0.09899494936611666 2024-06-21 14:07:42,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=393504.8333333333, ans=0.125 2024-06-21 14:07:44,110 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.69 vs. 
limit=15.0 2024-06-21 14:07:53,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=393523.1666666667, ans=0.09899494936611666 2024-06-21 14:08:00,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=393523.1666666667, ans=0.125 2024-06-21 14:08:03,256 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.046e+02 2.163e+02 2.307e+02 2.802e+02, threshold=4.327e+02, percent-clipped=0.0 2024-06-21 14:08:03,313 INFO [train.py:1028] (1/2) Epoch 22, batch 2200, loss[loss=0.2151, simple_loss=0.2725, pruned_loss=0.07885, over 13210.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2572, pruned_loss=0.0721, over 2588362.02 frames. ], batch size: 83, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:08:07,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=393541.5, ans=0.125 2024-06-21 14:08:19,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.66 vs. limit=22.5 2024-06-21 14:08:30,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.14 vs. limit=22.5 2024-06-21 14:08:42,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=393596.5, ans=0.0 2024-06-21 14:08:48,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=393614.8333333333, ans=0.0 2024-06-21 14:08:51,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=393614.8333333333, ans=0.1 2024-06-21 14:08:55,763 INFO [train.py:1028] (1/2) Epoch 22, batch 2250, loss[loss=0.2104, simple_loss=0.2751, pruned_loss=0.07282, over 13258.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2566, pruned_loss=0.07157, over 2587972.28 frames. ], batch size: 63, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:09:02,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=393633.1666666667, ans=0.5 2024-06-21 14:09:05,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.50 vs. limit=15.0 2024-06-21 14:09:50,206 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.39 vs. limit=22.5 2024-06-21 14:09:52,273 INFO [train.py:1028] (1/2) Epoch 22, batch 2300, loss[loss=0.191, simple_loss=0.2507, pruned_loss=0.06567, over 12950.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2565, pruned_loss=0.07127, over 2582030.85 frames. 
], batch size: 33, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:09:53,218 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.169e+02 2.348e+02 2.620e+02 3.934e+02, threshold=4.696e+02, percent-clipped=0.0 2024-06-21 14:09:56,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=393724.8333333333, ans=0.09899494936611666 2024-06-21 14:10:23,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=393779.8333333333, ans=0.125 2024-06-21 14:10:25,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.14 vs. limit=22.5 2024-06-21 14:10:27,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=393779.8333333333, ans=0.2 2024-06-21 14:10:30,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=393779.8333333333, ans=0.125 2024-06-21 14:10:46,810 INFO [train.py:1028] (1/2) Epoch 22, batch 2350, loss[loss=0.183, simple_loss=0.2392, pruned_loss=0.06342, over 13213.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2569, pruned_loss=0.07154, over 2585861.26 frames. ], batch size: 67, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:10:50,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393816.5, ans=0.1 2024-06-21 14:11:16,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=393871.5, ans=0.0 2024-06-21 14:11:20,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=393871.5, ans=0.0 2024-06-21 14:11:36,438 INFO [train.py:1028] (1/2) Epoch 22, batch 2400, loss[loss=0.1979, simple_loss=0.2588, pruned_loss=0.06852, over 13273.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2564, pruned_loss=0.07164, over 2588652.92 frames. ], batch size: 46, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:11:37,493 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.120e+02 2.247e+02 2.430e+02 3.498e+02, threshold=4.493e+02, percent-clipped=0.0 2024-06-21 14:11:44,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2024-06-21 14:11:44,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=393908.1666666667, ans=0.125 2024-06-21 14:11:49,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=393926.5, ans=0.025 2024-06-21 14:12:01,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=393944.8333333333, ans=0.1 2024-06-21 14:12:19,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=393981.5, ans=0.0 2024-06-21 14:12:25,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.32 vs. 
2024-06-21 14:12:26,657 INFO [train.py:1028] (1/2) Epoch 22, batch 2450, loss[loss=0.182, simple_loss=0.2405, pruned_loss=0.06172, over 13288.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2564, pruned_loss=0.07222, over 2584837.94 frames. ], batch size: 63, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:12:34,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=393999.8333333333, ans=0.0
2024-06-21 14:12:47,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.71 vs. limit=15.0
2024-06-21 14:13:17,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394073.1666666667, ans=0.1
2024-06-21 14:13:27,800 INFO [train.py:1028] (1/2) Epoch 22, batch 2500, loss[loss=0.2231, simple_loss=0.2746, pruned_loss=0.08577, over 13192.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2553, pruned_loss=0.07205, over 2586983.20 frames. ], batch size: 83, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:13:28,621 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.121e+02 2.247e+02 2.474e+02 3.241e+02, threshold=4.495e+02, percent-clipped=0.0
2024-06-21 14:13:28,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=394091.5, ans=0.0
2024-06-21 14:13:33,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=394091.5, ans=0.0
2024-06-21 14:13:44,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=394109.8333333333, ans=0.0
2024-06-21 14:14:05,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.75 vs. limit=15.0
2024-06-21 14:14:12,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=394164.8333333333, ans=0.125
2024-06-21 14:14:14,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394164.8333333333, ans=0.1
2024-06-21 14:14:17,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=394164.8333333333, ans=15.0
2024-06-21 14:14:19,431 INFO [train.py:1028] (1/2) Epoch 22, batch 2550, loss[loss=0.2035, simple_loss=0.2655, pruned_loss=0.07075, over 12679.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2543, pruned_loss=0.07165, over 2586225.72 frames. ], batch size: 22, lr: 2.65e-03, grad_scale: 32.0
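
The per-batch loss[...] and running tot_loss[...] fields report three numbers: the combined loss, the simple (trivial-joiner) transducer loss, and the pruned transducer loss. In these lines they satisfy loss = 0.5 * simple_loss + pruned_loss exactly (for the batch 2550 entry just above, 0.5 * 0.2655 + 0.07075 = 0.2035), so the combination can be written as below; treating 0.5 as this run's simple-loss scale is an inference from the printed numbers.

```python
def combine_transducer_losses(simple_loss: float,
                              pruned_loss: float,
                              simple_loss_scale: float = 0.5) -> float:
    # the smoothed "simple" loss stabilizes training; the pruned loss,
    # computed only over the k2-pruned region of the lattice, is the
    # main objective
    return simple_loss_scale * simple_loss + pruned_loss

# checks against the printed values (batch 2500 and batch 2550 above)
assert abs(combine_transducer_losses(0.2746, 0.08577) - 0.2231) < 1e-4
assert abs(combine_transducer_losses(0.2655, 0.07075) - 0.2035) < 1e-4
```
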
2024-06-21 14:14:21,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394183.1666666667, ans=0.1
2024-06-21 14:14:42,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=394219.8333333333, ans=0.025
2024-06-21 14:14:46,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=394238.1666666667, ans=0.0
2024-06-21 14:15:05,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=394274.8333333333, ans=0.0
2024-06-21 14:15:06,090 INFO [train.py:1028] (1/2) Epoch 22, batch 2600, loss[loss=0.2038, simple_loss=0.2625, pruned_loss=0.07253, over 13248.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2527, pruned_loss=0.07106, over 2585939.27 frames. ], batch size: 52, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:15:07,163 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.095e+02 2.221e+02 2.432e+02 3.036e+02, threshold=4.442e+02, percent-clipped=0.0
2024-06-21 14:15:12,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=394274.8333333333, ans=0.05
2024-06-21 14:15:13,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=394274.8333333333, ans=0.0
2024-06-21 14:15:19,946 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.33 vs. limit=10.0
2024-06-21 14:15:35,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394311.5, ans=0.1
2024-06-21 14:15:35,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=394311.5, ans=0.1
2024-06-21 14:15:36,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=394311.5, ans=0.125
2024-06-21 14:15:43,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=394329.8333333333, ans=0.0
2024-06-21 14:16:02,866 INFO [train.py:1028] (1/2) Epoch 22, batch 2650, loss[loss=0.2095, simple_loss=0.2649, pruned_loss=0.07701, over 13072.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2513, pruned_loss=0.07077, over 2587039.10 frames. ], batch size: 144, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:16:07,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=394366.5, ans=0.0
2024-06-21 14:16:21,494 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 14:16:37,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=394421.5, ans=0.0
2024-06-21 14:16:54,966 INFO [train.py:1028] (1/2) Epoch 22, batch 2700, loss[loss=0.1762, simple_loss=0.2319, pruned_loss=0.06022, over 13237.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.25, pruned_loss=0.0705, over 2585328.31 frames. ], batch size: 89, lr: 2.65e-03, grad_scale: 32.0
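
The Whitening lines from scaling.py:1023 fire when a module's output covariance drifts too far from isotropic: a scalar metric is compared against a per-module limit (e.g. metric=6.33 vs. limit=10.0 above). A hedged sketch of one standard whiteness statistic follows; icefall's exact estimator may differ, so the function below is illustrative only.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Returns >= 1.0; equals 1.0 iff the
    per-group channel covariance is a multiple of the identity ("white")."""
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
    cov = torch.matmul(x.transpose(1, 2), x) / n   # per-group covariance
    eigs = torch.linalg.eigvalsh(cov)              # real, symmetric input
    # mean of squared eigenvalues over squared mean eigenvalue, per group
    metric = (eigs ** 2).mean(dim=1) / eigs.mean(dim=1) ** 2
    return metric.mean().item()

x = torch.randn(1000, 384)
# modestly above 1.0 for random features; structured outputs score higher
print(whitening_metric(x))
```
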
2024-06-21 14:16:55,965 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.072e+02 2.234e+02 2.406e+02 3.537e+02, threshold=4.468e+02, percent-clipped=0.0
2024-06-21 14:17:07,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=394476.5, ans=0.125
2024-06-21 14:17:11,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=394476.5, ans=0.035
2024-06-21 14:17:12,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=394476.5, ans=0.1
2024-06-21 14:17:24,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=394494.8333333333, ans=0.125
2024-06-21 14:17:29,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=394513.1666666667, ans=0.125
2024-06-21 14:17:39,429 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.98 vs. limit=12.0
2024-06-21 14:17:43,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=394531.5, ans=0.5
2024-06-21 14:17:44,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=394531.5, ans=0.04949747468305833
2024-06-21 14:17:47,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=394549.8333333333, ans=0.125
2024-06-21 14:17:48,583 INFO [train.py:1028] (1/2) Epoch 22, batch 2750, loss[loss=0.1935, simple_loss=0.2474, pruned_loss=0.06984, over 13268.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2486, pruned_loss=0.06964, over 2580870.13 frames. ], batch size: 43, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:17:49,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=394549.8333333333, ans=0.95
2024-06-21 14:17:50,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=394549.8333333333, ans=0.125
2024-06-21 14:17:56,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=394549.8333333333, ans=0.025
2024-06-21 14:17:57,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=394549.8333333333, ans=0.95
2024-06-21 14:18:15,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=394586.5, ans=0.125
2024-06-21 14:18:20,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=394586.5, ans=0.0
2024-06-21 14:18:37,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=394604.8333333333, ans=0.0
2024-06-21 14:18:42,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394623.1666666667, ans=0.1
2024-06-21 14:18:42,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=394623.1666666667, ans=0.125
2024-06-21 14:18:50,166 INFO [train.py:1028] (1/2) Epoch 22, batch 2800, loss[loss=0.2158, simple_loss=0.2589, pruned_loss=0.08629, over 10696.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2483, pruned_loss=0.06995, over 2578687.89 frames. ], batch size: 303, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:18:51,208 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.043e+02 2.186e+02 2.388e+02 3.843e+02, threshold=4.371e+02, percent-clipped=0.0
2024-06-21 14:18:54,740 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.79 vs. limit=15.0
2024-06-21 14:19:01,148 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.57 vs. limit=10.0
2024-06-21 14:19:15,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=22.5
2024-06-21 14:19:25,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=394696.5, ans=0.025
2024-06-21 14:19:27,676 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 14:19:38,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0
2024-06-21 14:19:40,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=394714.8333333333, ans=0.125
2024-06-21 14:19:45,082 INFO [train.py:1028] (1/2) Epoch 22, batch 2850, loss[loss=0.2059, simple_loss=0.2649, pruned_loss=0.07346, over 13297.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2478, pruned_loss=0.06988, over 2577609.48 frames. ], batch size: 49, lr: 2.65e-03, grad_scale: 32.0
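
Each tot_loss[...] above is not the loss of a single batch: it is aggregated over many recent batches, weighted by their frame counts (hence the "over 2577609.48 frames."). A minimal cumulative version of that bookkeeping is sketched below; icefall's tracker additionally decays old batches out of the running window, which is omitted here for simplicity.

```python
class LossTracker:
    """Frame-weighted running average of per-batch losses."""
    def __init__(self):
        self.frames = 0.0
        self.weighted_loss = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        # each batch contributes in proportion to its number of frames
        self.frames += num_frames
        self.weighted_loss += loss * num_frames

    @property
    def tot_loss(self) -> float:
        return self.weighted_loss / max(self.frames, 1.0)
```
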
2024-06-21 14:19:49,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5
2024-06-21 14:19:51,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=394733.1666666667, ans=0.025
2024-06-21 14:19:53,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=394751.5, ans=0.125
2024-06-21 14:19:54,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=394751.5, ans=0.2
2024-06-21 14:20:06,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=394769.8333333333, ans=0.125
2024-06-21 14:20:09,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=394769.8333333333, ans=0.125
2024-06-21 14:20:24,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0
2024-06-21 14:20:28,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=394824.8333333333, ans=0.125
2024-06-21 14:20:29,837 INFO [train.py:1028] (1/2) Epoch 22, batch 2900, loss[loss=0.1936, simple_loss=0.2472, pruned_loss=0.07001, over 13171.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2462, pruned_loss=0.06941, over 2585434.63 frames. ], batch size: 55, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:20:31,177 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.070e+02 2.182e+02 2.430e+02 3.200e+02, threshold=4.363e+02, percent-clipped=0.0
2024-06-21 14:20:37,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=394824.8333333333, ans=0.125
2024-06-21 14:20:53,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=394861.5, ans=0.125
2024-06-21 14:21:19,850 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.07 vs. limit=22.5
2024-06-21 14:21:23,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=394898.1666666667, ans=0.025
2024-06-21 14:21:28,616 INFO [train.py:1028] (1/2) Epoch 22, batch 2950, loss[loss=0.1949, simple_loss=0.2501, pruned_loss=0.06986, over 13271.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2458, pruned_loss=0.0693, over 2581997.86 frames. ], batch size: 43, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:21:33,112 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 14:21:56,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=394953.1666666667, ans=0.0
2024-06-21 14:22:01,585 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 14:22:22,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=394989.8333333333, ans=0.0
2024-06-21 14:22:28,103 INFO [train.py:1028] (1/2) Epoch 22, batch 3000, loss[loss=0.1794, simple_loss=0.2379, pruned_loss=0.06047, over 13187.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2451, pruned_loss=0.06886, over 2579263.44 frames. ], batch size: 59, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:22:28,104 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 14:22:36,027 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.3670, 4.8802, 5.1449, 5.1190], device='cuda:1')
2024-06-21 14:22:39,680 INFO [train.py:1060] (1/2) Epoch 22, validation: loss=0.187, simple_loss=0.2507, pruned_loss=0.06171, over 351949.00 frames.
2024-06-21 14:22:39,682 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 14:22:40,685 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.063e+02 2.181e+02 2.421e+02 3.145e+02, threshold=4.362e+02, percent-clipped=0.0
2024-06-21 14:22:41,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=395008.1666666667, ans=0.125
2024-06-21 14:22:54,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=395026.5, ans=0.125
2024-06-21 14:23:01,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=395044.8333333333, ans=0.2
2024-06-21 14:23:18,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=395063.1666666667, ans=0.0
2024-06-21 14:23:29,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.14 vs. limit=22.5
2024-06-21 14:23:32,044 INFO [train.py:1028] (1/2) Epoch 22, batch 3050, loss[loss=0.1798, simple_loss=0.2368, pruned_loss=0.06136, over 13302.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2444, pruned_loss=0.06898, over 2578568.57 frames. ], batch size: 46, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:23:36,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=395099.8333333333, ans=0.125
2024-06-21 14:23:37,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=395099.8333333333, ans=0.0
2024-06-21 14:23:46,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.08 vs. limit=12.0
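
During the validation pass above, zipformer.py:1858 prints attn_weights_entropy, one value per attention head (four heads here). A hedged sketch of that diagnostic follows: the Shannon entropy of each head's attention distribution, averaged over query positions; the tensor layout is an assumption for illustration. High values mean diffuse attention, low values mean peaky attention.

```python
import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_heads, num_queries, num_keys), rows already softmaxed."""
    ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (heads, queries)
    return ent.mean(dim=-1)                           # one value per head

attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(attn))  # a 4-element tensor, one entry per head
```
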
2024-06-21 14:23:51,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=395136.5, ans=0.125
2024-06-21 14:23:53,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=395136.5, ans=10.0
2024-06-21 14:24:02,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395154.8333333333, ans=0.1
2024-06-21 14:24:17,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=395191.5, ans=0.0
2024-06-21 14:24:18,560 INFO [train.py:1028] (1/2) Epoch 22, batch 3100, loss[loss=0.2015, simple_loss=0.2528, pruned_loss=0.07517, over 13022.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2434, pruned_loss=0.06837, over 2579227.66 frames. ], batch size: 144, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:24:19,484 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.067e+02 2.202e+02 2.371e+02 3.498e+02, threshold=4.404e+02, percent-clipped=0.0
2024-06-21 14:24:19,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=395191.5, ans=0.125
2024-06-21 14:24:32,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=395191.5, ans=0.2
2024-06-21 14:24:41,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=395209.8333333333, ans=0.125
2024-06-21 14:24:57,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=395246.5, ans=0.125
2024-06-21 14:24:59,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.43 vs. limit=6.0
2024-06-21 14:25:05,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=395264.8333333333, ans=0.125
2024-06-21 14:25:12,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.03 vs. limit=15.0
2024-06-21 14:25:20,481 INFO [train.py:1028] (1/2) Epoch 22, batch 3150, loss[loss=0.1971, simple_loss=0.2432, pruned_loss=0.07556, over 12940.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2425, pruned_loss=0.06804, over 2581614.30 frames. ], batch size: 158, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:25:22,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=395283.1666666667, ans=0.2
2024-06-21 14:25:27,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=395283.1666666667, ans=0.0
2024-06-21 14:25:42,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=395319.8333333333, ans=10.0
2024-06-21 14:25:45,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=395319.8333333333, ans=0.05
2024-06-21 14:25:59,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=395356.5, ans=0.2
2024-06-21 14:26:00,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=395356.5, ans=0.125
2024-06-21 14:26:05,831 INFO [train.py:1028] (1/2) Epoch 22, batch 3200, loss[loss=0.2014, simple_loss=0.2561, pruned_loss=0.0733, over 13096.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2419, pruned_loss=0.06765, over 2583314.98 frames. ], batch size: 55, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:26:06,462 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.050e+02 2.200e+02 2.430e+02 3.427e+02, threshold=4.401e+02, percent-clipped=0.0
2024-06-21 14:26:20,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=395411.5, ans=0.0
2024-06-21 14:26:22,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.92 vs. limit=15.0
2024-06-21 14:26:24,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=395429.8333333333, ans=0.025
2024-06-21 14:26:28,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=395429.8333333333, ans=0.125
2024-06-21 14:26:28,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=395429.8333333333, ans=0.125
2024-06-21 14:26:29,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=395429.8333333333, ans=0.2
2024-06-21 14:26:34,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=395448.1666666667, ans=0.0
2024-06-21 14:26:37,213 INFO [train.py:1028] (1/2) Epoch 22, batch 3250, loss[loss=0.1716, simple_loss=0.2254, pruned_loss=0.05885, over 13278.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2415, pruned_loss=0.06764, over 2586797.84 frames. ], batch size: 72, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:26:38,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=395466.5, ans=0.125
2024-06-21 14:26:43,568 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 14:27:01,739 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=15.0
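
The scaling.py:1119 WithLoss lines report an auxiliary loss attached to the self-attention weights; loss-sum=0.000e+00 whenever the penalty is inactive. The sketch below shows one way such a hook can work, passing activations through unchanged while injecting an extra gradient in the backward pass; the mean-square penalty and the scale handling are assumptions for illustration, not icefall's actual regularizer.

```python
import torch

class WithAuxLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale: float):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return x  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # gradient of an illustrative penalty: scale * mean(x ** 2)
        aux_grad = ctx.scale * 2.0 * x / x.numel()
        return grad_out + aux_grad, None

x = torch.randn(8, 16, requires_grad=True)
y = WithAuxLoss.apply(x, 0.0)  # scale 0.0 -> the loss sum stays 0, as logged
y.sum().backward()
```
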
2024-06-21 14:27:02,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.05 vs. limit=15.0
2024-06-21 14:27:04,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=15.0
2024-06-21 14:27:11,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=395539.8333333333, ans=0.0
2024-06-21 14:27:12,577 INFO [train.py:1028] (1/2) Epoch 22, batch 3300, loss[loss=0.186, simple_loss=0.2385, pruned_loss=0.06676, over 12920.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2414, pruned_loss=0.06746, over 2583173.72 frames. ], batch size: 177, lr: 2.65e-03, grad_scale: 32.0
2024-06-21 14:27:13,156 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.023e+02 2.169e+02 2.334e+02 3.159e+02, threshold=4.339e+02, percent-clipped=0.0
2024-06-21 14:27:13,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395558.1666666667, ans=0.1
2024-06-21 14:27:33,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.38 vs. limit=22.5
2024-06-21 14:27:46,360 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0
2024-06-21 14:27:47,428 INFO [train.py:1028] (1/2) Epoch 22, batch 3350, loss[loss=0.1844, simple_loss=0.2394, pruned_loss=0.06468, over 12977.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.241, pruned_loss=0.06774, over 2577796.15 frames. ], batch size: 158, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:27:50,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=395649.8333333333, ans=0.125
2024-06-21 14:27:57,819 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=22.5
2024-06-21 14:27:58,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=395668.1666666667, ans=0.2
2024-06-21 14:28:03,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=395686.5, ans=0.0
2024-06-21 14:28:08,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=395704.8333333333, ans=0.125
2024-06-21 14:28:09,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=15.0
2024-06-21 14:28:20,101 INFO [train.py:1028] (1/2) Epoch 22, batch 3400, loss[loss=0.1932, simple_loss=0.2495, pruned_loss=0.06843, over 12553.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2408, pruned_loss=0.06815, over 2575988.02 frames. ], batch size: 22, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:28:20,754 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.063e+02 2.251e+02 2.541e+02 3.891e+02, threshold=4.503e+02, percent-clipped=0.0
2024-06-21 14:28:23,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=395741.5, ans=0.2
2024-06-21 14:28:23,850 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0
2024-06-21 14:28:33,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=395778.1666666667, ans=0.0
2024-06-21 14:28:34,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.97 vs. limit=15.0
2024-06-21 14:28:52,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=395814.8333333333, ans=0.0
2024-06-21 14:28:52,837 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.55 vs. limit=15.0
2024-06-21 14:28:55,680 INFO [train.py:1028] (1/2) Epoch 22, batch 3450, loss[loss=0.2073, simple_loss=0.2533, pruned_loss=0.08068, over 12786.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2404, pruned_loss=0.06784, over 2577694.68 frames. ], batch size: 177, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:29:21,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.43 vs. limit=22.5
2024-06-21 14:29:26,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=395906.5, ans=0.0
2024-06-21 14:29:29,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=395906.5, ans=0.125
2024-06-21 14:29:31,130 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.70 vs. limit=22.5
2024-06-21 14:29:31,945 INFO [train.py:1028] (1/2) Epoch 22, batch 3500, loss[loss=0.197, simple_loss=0.2497, pruned_loss=0.07219, over 12941.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.24, pruned_loss=0.06755, over 2576801.79 frames. ], batch size: 33, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:29:32,662 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.053e+02 2.178e+02 2.386e+02 2.848e+02, threshold=4.356e+02, percent-clipped=0.0
2024-06-21 14:29:56,736 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.55 vs. limit=10.0
2024-06-21 14:30:10,780 INFO [train.py:1028] (1/2) Epoch 22, batch 3550, loss[loss=0.1781, simple_loss=0.2323, pruned_loss=0.06198, over 13126.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2394, pruned_loss=0.06708, over 2577496.39 frames. ], batch size: 95, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:30:13,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=396016.5, ans=0.125
2024-06-21 14:30:16,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.87 vs. limit=22.5
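
Many of the scheduled values above (min_positive, max_positive, min_abs, max_abs, prob) belong to Balancer modules, which keep per-channel activation statistics inside a target range. A hedged, diagnostic-only sketch of those statistics follows; the real module instead adds small corrective gradients, applied with probability `prob`, and the default ranges shown here are illustrative.

```python
import torch

def balancer_report(x: torch.Tensor,
                    min_positive=0.05, max_positive=0.95,
                    min_abs=0.2, max_abs=100.0):
    """x: (num_frames, num_channels). Flags channels outside the ranges."""
    frac_pos = (x > 0).float().mean(dim=0)  # fraction of positive values
    mean_abs = x.abs().mean(dim=0)          # mean absolute activation
    bad = ((frac_pos < min_positive) | (frac_pos > max_positive)
           | (mean_abs < min_abs) | (mean_abs > max_abs))
    return frac_pos, mean_abs, bad

x = torch.randn(1000, 256)
_, _, bad = balancer_report(x)
print(int(bad.sum()), "channels outside the balancer ranges")
```
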
2024-06-21 14:30:17,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=396034.8333333333, ans=0.125
2024-06-21 14:30:17,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=396034.8333333333, ans=0.1
2024-06-21 14:30:17,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.15 vs. limit=15.0
2024-06-21 14:30:17,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=396034.8333333333, ans=0.2
2024-06-21 14:30:21,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=396034.8333333333, ans=0.125
2024-06-21 14:30:27,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=396053.1666666667, ans=0.0
2024-06-21 14:30:31,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=396071.5, ans=0.0
2024-06-21 14:30:32,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396071.5, ans=0.0
2024-06-21 14:30:37,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.69 vs. limit=12.0
2024-06-21 14:30:42,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=396089.8333333333, ans=0.0
2024-06-21 14:30:44,114 INFO [train.py:1028] (1/2) Epoch 22, batch 3600, loss[loss=0.167, simple_loss=0.2209, pruned_loss=0.05653, over 13251.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.239, pruned_loss=0.06708, over 2580704.16 frames. ], batch size: 49, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:30:44,916 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.052e+02 2.156e+02 2.454e+02 3.434e+02, threshold=4.313e+02, percent-clipped=0.0
2024-06-21 14:30:53,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=396126.5, ans=0.0
2024-06-21 14:30:59,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396126.5, ans=0.1
2024-06-21 14:31:05,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=396144.8333333333, ans=0.1
2024-06-21 14:31:09,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=396163.1666666667, ans=0.125
2024-06-21 14:31:21,079 INFO [train.py:1028] (1/2) Epoch 22, batch 3650, loss[loss=0.1837, simple_loss=0.2385, pruned_loss=0.06448, over 13040.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2393, pruned_loss=0.06702, over 2580150.52 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:31:21,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=396199.8333333333, ans=0.125
2024-06-21 14:31:35,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=396218.1666666667, ans=0.125
2024-06-21 14:31:37,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0
2024-06-21 14:31:45,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=396254.8333333333, ans=0.125
2024-06-21 14:31:57,790 INFO [train.py:1028] (1/2) Epoch 22, batch 3700, loss[loss=0.1745, simple_loss=0.2347, pruned_loss=0.0571, over 13263.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2382, pruned_loss=0.06658, over 2585182.90 frames. ], batch size: 72, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:31:58,408 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.021e+02 2.178e+02 2.348e+02 3.318e+02, threshold=4.356e+02, percent-clipped=0.0
2024-06-21 14:32:17,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=396346.5, ans=0.0
2024-06-21 14:32:23,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=396364.8333333333, ans=0.1
2024-06-21 14:32:31,081 INFO [train.py:1028] (1/2) Epoch 22, batch 3750, loss[loss=0.2331, simple_loss=0.2949, pruned_loss=0.08566, over 12539.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2376, pruned_loss=0.06617, over 2587061.57 frames. ], batch size: 22, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:32:39,314 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0
2024-06-21 14:32:44,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.58 vs. limit=12.0
2024-06-21 14:32:44,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=396419.8333333333, ans=0.2
2024-06-21 14:33:03,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=396474.8333333333, ans=0.0
2024-06-21 14:33:04,187 INFO [train.py:1028] (1/2) Epoch 22, batch 3800, loss[loss=0.1844, simple_loss=0.2327, pruned_loss=0.06806, over 13252.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2378, pruned_loss=0.06614, over 2585577.87 frames. ], batch size: 83, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:33:04,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.021e+02 2.176e+02 2.352e+02 3.040e+02, threshold=4.353e+02, percent-clipped=0.0
2024-06-21 14:33:14,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=396474.8333333333, ans=0.0
2024-06-21 14:33:31,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=396529.8333333333, ans=0.1
2024-06-21 14:33:39,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=396548.1666666667, ans=0.0
2024-06-21 14:33:40,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=396548.1666666667, ans=0.025
2024-06-21 14:33:43,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=396548.1666666667, ans=0.0
2024-06-21 14:33:44,972 INFO [train.py:1028] (1/2) Epoch 22, batch 3850, loss[loss=0.1848, simple_loss=0.238, pruned_loss=0.06584, over 12986.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2372, pruned_loss=0.06582, over 2584548.74 frames. ], batch size: 144, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:33:45,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.89 vs. limit=22.5
2024-06-21 14:33:48,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.64 vs. limit=15.0
2024-06-21 14:33:49,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0
2024-06-21 14:33:50,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0
2024-06-21 14:33:52,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=396584.8333333333, ans=0.1
2024-06-21 14:34:02,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=396603.1666666667, ans=0.125
2024-06-21 14:34:02,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=396603.1666666667, ans=0.125
2024-06-21 14:34:05,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=396621.5, ans=0.125
2024-06-21 14:34:09,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=396621.5, ans=0.125
2024-06-21 14:34:13,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=396639.8333333333, ans=0.025
2024-06-21 14:34:17,267 INFO [train.py:1028] (1/2) Epoch 22, batch 3900, loss[loss=0.1852, simple_loss=0.2383, pruned_loss=0.0661, over 13182.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2368, pruned_loss=0.06572, over 2587684.36 frames. ], batch size: 83, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:34:17,841 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.024e+02 2.128e+02 2.346e+02 3.309e+02, threshold=4.255e+02, percent-clipped=0.0
2024-06-21 14:34:18,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396658.1666666667, ans=0.1
2024-06-21 14:34:22,808 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.04 vs. limit=22.5
2024-06-21 14:34:23,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=396676.5, ans=0.0
2024-06-21 14:34:27,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=396676.5, ans=0.025
2024-06-21 14:34:35,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=396694.8333333333, ans=0.125
2024-06-21 14:34:44,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.05 vs. limit=15.0
2024-06-21 14:34:44,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=396731.5, ans=0.125
2024-06-21 14:34:47,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0
2024-06-21 14:34:49,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=396749.8333333333, ans=0.125
2024-06-21 14:34:49,815 INFO [train.py:1028] (1/2) Epoch 22, batch 3950, loss[loss=0.1883, simple_loss=0.237, pruned_loss=0.06983, over 13122.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2357, pruned_loss=0.06498, over 2589184.82 frames. ], batch size: 132, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:34:51,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=396749.8333333333, ans=0.05
2024-06-21 14:34:54,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396749.8333333333, ans=0.0
2024-06-21 14:34:56,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=396768.1666666667, ans=0.07
2024-06-21 14:34:56,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=396768.1666666667, ans=0.125
2024-06-21 14:34:57,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=396768.1666666667, ans=0.125
2024-06-21 14:35:00,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=396768.1666666667, ans=0.125
2024-06-21 14:35:04,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=396786.5, ans=0.015
2024-06-21 14:35:07,250 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0
2024-06-21 14:35:08,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=396804.8333333333, ans=0.2
2024-06-21 14:35:15,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=396804.8333333333, ans=0.04949747468305833
2024-06-21 14:35:17,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396804.8333333333, ans=0.1
2024-06-21 14:35:17,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.45 vs. limit=15.0
2024-06-21 14:35:17,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396804.8333333333, ans=0.1
2024-06-21 14:35:20,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=396823.1666666667, ans=0.025
2024-06-21 14:35:24,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=396823.1666666667, ans=0.0
2024-06-21 14:35:25,424 INFO [train.py:1028] (1/2) Epoch 22, batch 4000, loss[loss=0.1993, simple_loss=0.2636, pruned_loss=0.06747, over 12938.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2358, pruned_loss=0.06519, over 2584782.64 frames. ], batch size: 39, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:35:26,127 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 1.984e+02 2.134e+02 2.275e+02 3.227e+02, threshold=4.269e+02, percent-clipped=0.0
2024-06-21 14:35:41,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=396859.8333333333, ans=0.125
2024-06-21 14:35:42,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.74 vs. limit=15.0
2024-06-21 14:35:57,496 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.21 vs. limit=22.5
2024-06-21 14:36:02,758 INFO [train.py:1028] (1/2) Epoch 22, batch 4050, loss[loss=0.2029, simple_loss=0.2442, pruned_loss=0.08079, over 11042.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2353, pruned_loss=0.06525, over 2583011.27 frames. ], batch size: 304, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:36:08,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=396933.1666666667, ans=0.125
2024-06-21 14:36:14,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=396951.5, ans=0.125
2024-06-21 14:36:21,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=396969.8333333333, ans=0.125
2024-06-21 14:36:21,899 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.97 vs. limit=15.0
2024-06-21 14:36:26,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=396988.1666666667, ans=0.0
2024-06-21 14:36:35,754 INFO [train.py:1028] (1/2) Epoch 22, batch 4100, loss[loss=0.1791, simple_loss=0.2269, pruned_loss=0.06566, over 13052.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2352, pruned_loss=0.06529, over 2579167.93 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:36:36,342 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 2.025e+02 2.145e+02 2.317e+02 2.980e+02, threshold=4.289e+02, percent-clipped=0.0
2024-06-21 14:36:39,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397024.8333333333, ans=0.1
2024-06-21 14:36:45,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=397043.1666666667, ans=10.0
2024-06-21 14:36:52,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=397061.5, ans=0.125
2024-06-21 14:36:56,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.90 vs. limit=15.0
2024-06-21 14:36:57,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397079.8333333333, ans=0.1
2024-06-21 14:36:59,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397079.8333333333, ans=0.1
2024-06-21 14:37:01,655 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=12.0
2024-06-21 14:37:08,978 INFO [train.py:1028] (1/2) Epoch 22, batch 4150, loss[loss=0.1694, simple_loss=0.2215, pruned_loss=0.05866, over 13114.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2349, pruned_loss=0.06505, over 2577045.69 frames. ], batch size: 55, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:37:10,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=397116.5, ans=0.125
2024-06-21 14:37:13,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397116.5, ans=0.1
2024-06-21 14:37:25,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=397134.8333333333, ans=0.125
2024-06-21 14:37:25,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=397134.8333333333, ans=0.125
2024-06-21 14:37:32,991 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.44 vs. limit=15.0
2024-06-21 14:37:51,432 INFO [train.py:1028] (1/2) Epoch 22, batch 4200, loss[loss=0.1811, simple_loss=0.2261, pruned_loss=0.06807, over 13031.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.2343, pruned_loss=0.06508, over 2579522.90 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:37:51,941 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.33 vs. limit=15.0
2024-06-21 14:37:52,074 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 1.971e+02 2.084e+02 2.260e+02 3.127e+02, threshold=4.167e+02, percent-clipped=0.0
2024-06-21 14:38:06,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.89 vs. limit=22.5
2024-06-21 14:38:09,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=397244.8333333333, ans=0.125
2024-06-21 14:38:13,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=397263.1666666667, ans=0.035
2024-06-21 14:38:13,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=397263.1666666667, ans=0.125
2024-06-21 14:38:20,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=397281.5, ans=0.125
2024-06-21 14:38:24,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=397299.8333333333, ans=0.0
2024-06-21 14:38:24,632 INFO [train.py:1028] (1/2) Epoch 22, batch 4250, loss[loss=0.1872, simple_loss=0.2485, pruned_loss=0.06301, over 13291.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2345, pruned_loss=0.06519, over 2581399.61 frames. ], batch size: 46, lr: 2.64e-03, grad_scale: 32.0
2024-06-21 14:38:24,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=397299.8333333333, ans=0.125
2024-06-21 14:38:46,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=397354.8333333333, ans=0.1
2024-06-21 14:38:57,999 INFO [train.py:1028] (1/2) Epoch 22, batch 4300, loss[loss=0.1713, simple_loss=0.2259, pruned_loss=0.05835, over 13157.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2341, pruned_loss=0.06505, over 2582262.10 frames. ], batch size: 59, lr: 2.64e-03, grad_scale: 64.0
2024-06-21 14:38:58,594 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.018e+02 2.157e+02 2.357e+02 3.339e+02, threshold=4.314e+02, percent-clipped=0.0
2024-06-21 14:38:58,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=397391.5, ans=0.5
2024-06-21 14:39:12,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.48 vs. limit=15.0
2024-06-21 14:39:31,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=397464.8333333333, ans=0.025
2024-06-21 14:39:33,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=397464.8333333333, ans=0.125
2024-06-21 14:39:36,693 INFO [train.py:1028] (1/2) Epoch 22, batch 4350, loss[loss=0.1727, simple_loss=0.2356, pruned_loss=0.05487, over 13172.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.234, pruned_loss=0.06509, over 2586243.86 frames. ], batch size: 59, lr: 2.64e-03, grad_scale: 64.0
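
The grad_scale field in the training lines is the dynamic loss scale used for fp16 training: it halves after a step with non-finite gradients (it dropped from 64.0 to 32.0 near batch 2300 above) and doubles back after a long run of clean steps (64.0 again by batch 4300, consistent with a growth interval of 2000 steps). A minimal sketch of the same mechanism using PyTorch's GradScaler follows; model, criterion, and the batch keys are placeholders, and the growth interval is an assumption.

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=64.0,
                                   growth_factor=2.0,
                                   backoff_factor=0.5,
                                   growth_interval=2000)

def train_step(model, optimizer, batch, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales; skips the step on overflow
    scaler.update()                # halve on overflow, grow when stable
    return scaler.get_scale()      # the value reported as grad_scale
```
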
2024-06-21 14:39:38,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=397483.1666666667, ans=0.125
2024-06-21 14:39:46,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=397501.5, ans=0.025
2024-06-21 14:39:56,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=397519.8333333333, ans=0.025
2024-06-21 14:40:07,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397556.5, ans=0.1
2024-06-21 14:40:13,170 INFO [train.py:1028] (1/2) Epoch 22, batch 4400, loss[loss=0.1761, simple_loss=0.2287, pruned_loss=0.06174, over 13241.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2341, pruned_loss=0.065, over 2586097.46 frames. ], batch size: 83, lr: 2.64e-03, grad_scale: 64.0
2024-06-21 14:40:13,774 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.992e+02 2.108e+02 2.311e+02 3.009e+02, threshold=4.215e+02, percent-clipped=0.0
2024-06-21 14:40:16,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.95 vs. limit=15.0
2024-06-21 14:40:23,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=397593.1666666667, ans=0.1
2024-06-21 14:40:34,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=397629.8333333333, ans=0.0
2024-06-21 14:40:34,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.80 vs. limit=15.0
2024-06-21 14:40:35,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397629.8333333333, ans=0.1
2024-06-21 14:40:36,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.63 vs. limit=22.5
2024-06-21 14:40:39,385 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=8.0
2024-06-21 14:40:43,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=397648.1666666667, ans=0.07
2024-06-21 14:40:46,599 INFO [train.py:1028] (1/2) Epoch 22, batch 4450, loss[loss=0.173, simple_loss=0.2309, pruned_loss=0.05752, over 12981.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.234, pruned_loss=0.06521, over 2580785.68 frames. ], batch size: 33, lr: 2.64e-03, grad_scale: 64.0
2024-06-21 14:40:47,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.48 vs. limit=15.0
limit=15.0 2024-06-21 14:40:49,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=397666.5, ans=0.125 2024-06-21 14:40:50,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=397666.5, ans=0.125 2024-06-21 14:40:53,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=397684.8333333333, ans=0.0 2024-06-21 14:40:55,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=397684.8333333333, ans=10.0 2024-06-21 14:41:02,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=397703.1666666667, ans=0.125 2024-06-21 14:41:12,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=397739.8333333333, ans=0.125 2024-06-21 14:41:19,230 INFO [train.py:1028] (1/2) Epoch 22, batch 4500, loss[loss=0.1669, simple_loss=0.2215, pruned_loss=0.05618, over 13263.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.233, pruned_loss=0.0648, over 2585505.91 frames. ], batch size: 89, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:41:19,836 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.022e+02 2.162e+02 2.341e+02 3.125e+02, threshold=4.323e+02, percent-clipped=0.0 2024-06-21 14:41:20,317 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.89 vs. limit=12.0 2024-06-21 14:41:23,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=397758.1666666667, ans=0.025 2024-06-21 14:41:33,252 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:41:35,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.51 vs. limit=15.0 2024-06-21 14:41:49,516 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.39 vs. limit=22.5 2024-06-21 14:41:56,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397831.5, ans=0.1 2024-06-21 14:41:59,256 INFO [train.py:1028] (1/2) Epoch 22, batch 4550, loss[loss=0.1571, simple_loss=0.2137, pruned_loss=0.05029, over 13308.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2324, pruned_loss=0.0644, over 2588755.55 frames. 
], batch size: 52, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:42:12,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=397886.5, ans=0.125 2024-06-21 14:42:12,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397886.5, ans=0.125 2024-06-21 14:42:14,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=397886.5, ans=0.0 2024-06-21 14:42:14,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=397886.5, ans=0.125 2024-06-21 14:42:26,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=397923.1666666667, ans=0.125 2024-06-21 14:42:31,678 INFO [train.py:1028] (1/2) Epoch 22, batch 4600, loss[loss=0.2015, simple_loss=0.2475, pruned_loss=0.07772, over 12528.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2334, pruned_loss=0.06484, over 2586139.36 frames. ], batch size: 202, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:42:32,331 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.055e+02 2.218e+02 2.375e+02 3.149e+02, threshold=4.435e+02, percent-clipped=0.0 2024-06-21 14:42:32,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=397941.5, ans=0.125 2024-06-21 14:42:41,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=397959.8333333333, ans=0.0 2024-06-21 14:42:48,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=397978.1666666667, ans=0.025 2024-06-21 14:42:58,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=398014.8333333333, ans=0.07 2024-06-21 14:43:02,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.14 vs. limit=22.5 2024-06-21 14:43:04,225 INFO [train.py:1028] (1/2) Epoch 22, batch 4650, loss[loss=0.1748, simple_loss=0.2157, pruned_loss=0.06694, over 13121.00 frames. ], tot_loss[loss=0.1814, simple_loss=0.2331, pruned_loss=0.06485, over 2589027.59 frames. ], batch size: 132, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:43:31,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=398088.1666666667, ans=0.125 2024-06-21 14:43:32,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=398088.1666666667, ans=0.025 2024-06-21 14:43:41,268 INFO [train.py:1028] (1/2) Epoch 22, batch 4700, loss[loss=0.1794, simple_loss=0.2383, pruned_loss=0.06022, over 12808.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2335, pruned_loss=0.06503, over 2584328.34 frames. ], batch size: 26, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:43:41,941 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.008e+02 2.163e+02 2.336e+02 3.155e+02, threshold=4.326e+02, percent-clipped=0.0 2024-06-21 14:44:02,578 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.96 vs. 
limit=10.0 2024-06-21 14:44:05,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=398179.8333333333, ans=0.09899494936611666 2024-06-21 14:44:08,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=398179.8333333333, ans=0.125 2024-06-21 14:44:16,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=398198.1666666667, ans=0.025 2024-06-21 14:44:17,603 INFO [train.py:1028] (1/2) Epoch 22, batch 4750, loss[loss=0.1843, simple_loss=0.2329, pruned_loss=0.06782, over 12544.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2331, pruned_loss=0.0649, over 2581372.80 frames. ], batch size: 202, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:44:24,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2024-06-21 14:44:28,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398234.8333333333, ans=0.1 2024-06-21 14:44:29,818 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2024-06-21 14:44:30,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=398253.1666666667, ans=0.0 2024-06-21 14:44:50,213 INFO [train.py:1028] (1/2) Epoch 22, batch 4800, loss[loss=0.1799, simple_loss=0.234, pruned_loss=0.06287, over 13270.00 frames. ], tot_loss[loss=0.1814, simple_loss=0.233, pruned_loss=0.06492, over 2577656.62 frames. ], batch size: 63, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:44:50,896 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.024e+02 2.117e+02 2.256e+02 2.983e+02, threshold=4.234e+02, percent-clipped=0.0 2024-06-21 14:44:53,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=398308.1666666667, ans=0.125 2024-06-21 14:44:56,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=398326.5, ans=0.125 2024-06-21 14:45:00,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=398326.5, ans=0.125 2024-06-21 14:45:08,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=398344.8333333333, ans=0.2 2024-06-21 14:45:25,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=398399.8333333333, ans=0.125 2024-06-21 14:45:26,220 INFO [train.py:1028] (1/2) Epoch 22, batch 4850, loss[loss=0.1625, simple_loss=0.2098, pruned_loss=0.05756, over 13191.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2327, pruned_loss=0.06463, over 2575955.24 frames. 
], batch size: 89, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:45:27,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=398399.8333333333, ans=0.0 2024-06-21 14:45:45,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=398436.5, ans=0.125 2024-06-21 14:45:51,324 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:45:53,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=398454.8333333333, ans=0.125 2024-06-21 14:46:05,833 INFO [train.py:1028] (1/2) Epoch 22, batch 4900, loss[loss=0.1718, simple_loss=0.2322, pruned_loss=0.05575, over 13168.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2324, pruned_loss=0.06468, over 2576675.62 frames. ], batch size: 59, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:46:06,435 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 1.994e+02 2.150e+02 2.264e+02 3.021e+02, threshold=4.300e+02, percent-clipped=0.0 2024-06-21 14:46:10,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=398491.5, ans=0.0 2024-06-21 14:46:14,846 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:46:18,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=398528.1666666667, ans=0.125 2024-06-21 14:46:20,141 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:46:22,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=398528.1666666667, ans=0.0 2024-06-21 14:46:23,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398528.1666666667, ans=0.1 2024-06-21 14:46:29,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=398546.5, ans=0.05 2024-06-21 14:46:35,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=398564.8333333333, ans=0.2 2024-06-21 14:46:37,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=398564.8333333333, ans=0.125 2024-06-21 14:46:39,312 INFO [train.py:1028] (1/2) Epoch 22, batch 4950, loss[loss=0.1858, simple_loss=0.2226, pruned_loss=0.07452, over 11070.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2328, pruned_loss=0.06506, over 2570577.27 frames. 
], batch size: 304, lr: 2.63e-03, grad_scale: 64.0 2024-06-21 14:46:49,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=398601.5, ans=0.1 2024-06-21 14:46:57,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=398619.8333333333, ans=0.2 2024-06-21 14:47:05,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398656.5, ans=0.1 2024-06-21 14:47:06,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2024-06-21 14:47:09,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.27 vs. limit=15.0 2024-06-21 14:47:11,555 INFO [train.py:1028] (1/2) Epoch 22, batch 5000, loss[loss=0.1695, simple_loss=0.214, pruned_loss=0.06248, over 13160.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2324, pruned_loss=0.0648, over 2574598.41 frames. ], batch size: 95, lr: 2.63e-03, grad_scale: 64.0 2024-06-21 14:47:11,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=398674.8333333333, ans=0.125 2024-06-21 14:47:12,145 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 2.033e+02 2.212e+02 2.427e+02 3.168e+02, threshold=4.423e+02, percent-clipped=0.0 2024-06-21 14:47:12,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=398674.8333333333, ans=0.07 2024-06-21 14:47:21,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.98 vs. limit=15.0 2024-06-21 14:47:31,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=398711.5, ans=0.125 2024-06-21 14:47:40,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=398729.8333333333, ans=0.0 2024-06-21 14:47:47,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=398748.1666666667, ans=0.125 2024-06-21 14:47:48,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=398748.1666666667, ans=0.1 2024-06-21 14:47:52,090 INFO [train.py:1028] (1/2) Epoch 22, batch 5050, loss[loss=0.1877, simple_loss=0.2418, pruned_loss=0.06683, over 12944.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2322, pruned_loss=0.06437, over 2571668.74 frames. ], batch size: 36, lr: 2.63e-03, grad_scale: 64.0 2024-06-21 14:47:53,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.60 vs. 
limit=10.0 2024-06-21 14:48:01,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398784.8333333333, ans=0.1 2024-06-21 14:48:14,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=398821.5, ans=0.0 2024-06-21 14:48:14,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=7.87 vs. limit=12.0 2024-06-21 14:48:21,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=398839.8333333333, ans=0.125 2024-06-21 14:48:24,949 INFO [train.py:1028] (1/2) Epoch 22, batch 5100, loss[loss=0.1856, simple_loss=0.2407, pruned_loss=0.06527, over 13228.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.2326, pruned_loss=0.065, over 2568512.76 frames. ], batch size: 40, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:48:26,353 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 1.998e+02 2.133e+02 2.340e+02 3.001e+02, threshold=4.265e+02, percent-clipped=0.0 2024-06-21 14:48:37,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398876.5, ans=0.1 2024-06-21 14:48:38,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398894.8333333333, ans=0.125 2024-06-21 14:48:48,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=398913.1666666667, ans=0.125 2024-06-21 14:48:57,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=398949.8333333333, ans=0.125 2024-06-21 14:48:58,425 INFO [train.py:1028] (1/2) Epoch 22, batch 5150, loss[loss=0.1857, simple_loss=0.2235, pruned_loss=0.07396, over 13115.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2329, pruned_loss=0.06553, over 2571702.87 frames. ], batch size: 132, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:49:11,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398986.5, ans=0.1 2024-06-21 14:49:13,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398986.5, ans=0.1 2024-06-21 14:49:13,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.55 vs. limit=6.0 2024-06-21 14:49:13,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=398986.5, ans=0.1 2024-06-21 14:49:20,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=399004.8333333333, ans=0.0 2024-06-21 14:49:30,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=399023.1666666667, ans=0.2 2024-06-21 14:49:32,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. 
limit=15.0 2024-06-21 14:49:34,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=399041.5, ans=0.0 2024-06-21 14:49:34,955 INFO [train.py:1028] (1/2) Epoch 22, batch 5200, loss[loss=0.1833, simple_loss=0.2309, pruned_loss=0.0679, over 13160.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2325, pruned_loss=0.06532, over 2574528.35 frames. ], batch size: 95, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:49:36,276 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.022e+02 2.167e+02 2.270e+02 3.344e+02, threshold=4.334e+02, percent-clipped=0.0 2024-06-21 14:49:46,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=399059.8333333333, ans=0.1 2024-06-21 14:49:50,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=399078.1666666667, ans=0.125 2024-06-21 14:50:00,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=399096.5, ans=0.2 2024-06-21 14:50:04,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=399096.5, ans=0.125 2024-06-21 14:50:08,802 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.11 vs. limit=15.0 2024-06-21 14:50:09,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=399114.8333333333, ans=0.125 2024-06-21 14:50:10,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.01 vs. limit=12.0 2024-06-21 14:50:11,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=399133.1666666667, ans=0.125 2024-06-21 14:50:11,898 INFO [train.py:1028] (1/2) Epoch 22, batch 5250, loss[loss=0.1685, simple_loss=0.2275, pruned_loss=0.0548, over 13236.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2328, pruned_loss=0.06514, over 2569675.27 frames. ], batch size: 52, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:50:22,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=399151.5, ans=0.0 2024-06-21 14:50:25,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=399169.8333333333, ans=0.125 2024-06-21 14:50:38,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=399206.5, ans=0.125 2024-06-21 14:50:45,260 INFO [train.py:1028] (1/2) Epoch 22, batch 5300, loss[loss=0.1952, simple_loss=0.2394, pruned_loss=0.07551, over 13056.00 frames. ], tot_loss[loss=0.1811, simple_loss=0.2326, pruned_loss=0.06483, over 2566535.79 frames. ], batch size: 144, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:50:46,493 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.015e+02 2.116e+02 2.248e+02 2.806e+02, threshold=4.232e+02, percent-clipped=0.0 2024-06-21 14:50:47,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.14 vs. 
limit=10.0 2024-06-21 14:50:50,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=399224.8333333333, ans=0.0 2024-06-21 14:50:57,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.54 vs. limit=10.0 2024-06-21 14:51:09,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=399279.8333333333, ans=0.0 2024-06-21 14:51:17,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=399298.1666666667, ans=0.1 2024-06-21 14:51:18,652 INFO [train.py:1028] (1/2) Epoch 22, batch 5350, loss[loss=0.2003, simple_loss=0.2522, pruned_loss=0.07423, over 12109.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2319, pruned_loss=0.06464, over 2574090.38 frames. ], batch size: 17, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:51:25,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=399316.5, ans=0.125 2024-06-21 14:51:58,140 INFO [train.py:1028] (1/2) Epoch 22, batch 5400, loss[loss=0.1814, simple_loss=0.2269, pruned_loss=0.06795, over 12285.00 frames. ], tot_loss[loss=0.1811, simple_loss=0.232, pruned_loss=0.06506, over 2566973.72 frames. ], batch size: 241, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:52:00,120 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.009e+02 2.150e+02 2.300e+02 3.036e+02, threshold=4.299e+02, percent-clipped=0.0 2024-06-21 14:52:01,313 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.66 vs. limit=12.0 2024-06-21 14:52:07,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=399426.5, ans=0.1 2024-06-21 14:52:09,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399426.5, ans=0.1 2024-06-21 14:52:13,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.54 vs. limit=15.0 2024-06-21 14:52:21,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.57 vs. limit=10.0 2024-06-21 14:52:31,953 INFO [train.py:1028] (1/2) Epoch 22, batch 5450, loss[loss=0.2084, simple_loss=0.2606, pruned_loss=0.07804, over 12415.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2327, pruned_loss=0.0655, over 2569522.00 frames. ], batch size: 25, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:52:35,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=399499.8333333333, ans=0.125 2024-06-21 14:52:37,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.13 vs. limit=6.0 2024-06-21 14:52:40,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.13 vs. 
limit=15.0 2024-06-21 14:52:56,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=399554.8333333333, ans=0.125 2024-06-21 14:53:00,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.53 vs. limit=15.0 2024-06-21 14:53:03,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.18 vs. limit=22.5 2024-06-21 14:53:05,130 INFO [train.py:1028] (1/2) Epoch 22, batch 5500, loss[loss=0.2054, simple_loss=0.2465, pruned_loss=0.08215, over 12154.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2326, pruned_loss=0.06547, over 2563637.97 frames. ], batch size: 240, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:53:06,885 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.045e+02 2.140e+02 2.287e+02 3.017e+02, threshold=4.281e+02, percent-clipped=0.0 2024-06-21 14:53:08,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=12.0 2024-06-21 14:53:08,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=399591.5, ans=0.2 2024-06-21 14:53:12,028 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.62 vs. limit=15.0 2024-06-21 14:53:14,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=399609.8333333333, ans=0.125 2024-06-21 14:53:16,545 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.96 vs. limit=12.0 2024-06-21 14:53:46,278 INFO [train.py:1028] (1/2) Epoch 22, batch 5550, loss[loss=0.189, simple_loss=0.2468, pruned_loss=0.06563, over 13276.00 frames. ], tot_loss[loss=0.1804, simple_loss=0.2317, pruned_loss=0.0646, over 2568047.34 frames. ], batch size: 43, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:53:49,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=399683.1666666667, ans=0.125 2024-06-21 14:53:59,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=399719.8333333333, ans=0.025 2024-06-21 14:54:00,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=399719.8333333333, ans=0.125 2024-06-21 14:54:04,718 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:54:05,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=399738.1666666667, ans=0.125 2024-06-21 14:54:18,019 INFO [train.py:1028] (1/2) Epoch 22, batch 5600, loss[loss=0.1837, simple_loss=0.2371, pruned_loss=0.06518, over 13212.00 frames. ], tot_loss[loss=0.1799, simple_loss=0.2311, pruned_loss=0.06433, over 2569828.08 frames. 
], batch size: 89, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:54:20,014 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 1.966e+02 2.142e+02 2.327e+02 5.044e+02, threshold=4.284e+02, percent-clipped=1.0 2024-06-21 14:54:20,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=399774.8333333333, ans=0.125 2024-06-21 14:54:28,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2024-06-21 14:54:30,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=399793.1666666667, ans=0.125 2024-06-21 14:54:30,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=399793.1666666667, ans=0.025 2024-06-21 14:54:39,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=399829.8333333333, ans=0.125 2024-06-21 14:54:39,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=399829.8333333333, ans=0.125 2024-06-21 14:54:43,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=399829.8333333333, ans=0.125 2024-06-21 14:54:47,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.65 vs. limit=12.0 2024-06-21 14:54:49,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=399848.1666666667, ans=0.125 2024-06-21 14:54:51,103 INFO [train.py:1028] (1/2) Epoch 22, batch 5650, loss[loss=0.1948, simple_loss=0.2411, pruned_loss=0.0742, over 12550.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2307, pruned_loss=0.06395, over 2574616.81 frames. ], batch size: 202, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:55:19,578 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2024-06-21 14:55:26,884 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.94 vs. limit=15.0 2024-06-21 14:55:27,097 INFO [train.py:1028] (1/2) Epoch 22, batch 5700, loss[loss=0.2008, simple_loss=0.2506, pruned_loss=0.0755, over 13278.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2309, pruned_loss=0.06424, over 2578829.65 frames. 
], batch size: 63, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:55:29,032 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 1.975e+02 2.072e+02 2.214e+02 2.791e+02, threshold=4.145e+02, percent-clipped=0.0 2024-06-21 14:55:35,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=399958.1666666667, ans=0.0 2024-06-21 14:55:38,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=399976.5, ans=0.125 2024-06-21 14:55:38,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=399976.5, ans=0.125 2024-06-21 14:55:41,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=15.0 2024-06-21 14:55:44,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=399994.8333333333, ans=0.125 2024-06-21 14:55:46,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=399994.8333333333, ans=0.0 2024-06-21 14:55:50,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=400013.1666666667, ans=0.05 2024-06-21 14:55:54,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=400013.1666666667, ans=0.0 2024-06-21 14:56:03,209 INFO [train.py:1028] (1/2) Epoch 22, batch 5750, loss[loss=0.206, simple_loss=0.2493, pruned_loss=0.08132, over 12761.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.2322, pruned_loss=0.06461, over 2579318.74 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:56:05,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=400049.8333333333, ans=0.0 2024-06-21 14:56:08,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=400068.1666666667, ans=0.125 2024-06-21 14:56:17,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=400086.5, ans=0.0 2024-06-21 14:56:22,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=400104.8333333333, ans=0.125 2024-06-21 14:56:26,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400104.8333333333, ans=0.1 2024-06-21 14:56:27,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.30 vs. limit=15.0 2024-06-21 14:56:29,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400123.1666666667, ans=0.1 2024-06-21 14:56:31,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=400123.1666666667, ans=0.125 2024-06-21 14:56:35,171 INFO [train.py:1028] (1/2) Epoch 22, batch 5800, loss[loss=0.1945, simple_loss=0.2399, pruned_loss=0.07453, over 12670.00 frames. 
], tot_loss[loss=0.1823, simple_loss=0.2336, pruned_loss=0.06552, over 2578827.19 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:56:37,140 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.130e+02 2.282e+02 2.432e+02 3.124e+02, threshold=4.564e+02, percent-clipped=0.0 2024-06-21 14:56:45,415 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.96 vs. limit=22.5 2024-06-21 14:56:46,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=400159.8333333333, ans=0.09899494936611666 2024-06-21 14:56:51,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2024-06-21 14:56:59,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=400196.5, ans=0.07 2024-06-21 14:57:07,475 INFO [train.py:1028] (1/2) Epoch 22, batch 5850, loss[loss=0.2106, simple_loss=0.2565, pruned_loss=0.08232, over 12500.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2351, pruned_loss=0.06598, over 2578455.53 frames. ], batch size: 202, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:57:10,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=400233.1666666667, ans=0.125 2024-06-21 14:57:10,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=400233.1666666667, ans=0.0 2024-06-21 14:57:12,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=400233.1666666667, ans=0.125 2024-06-21 14:57:22,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2024-06-21 14:57:33,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=400269.8333333333, ans=0.125 2024-06-21 14:57:45,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=400306.5, ans=0.125 2024-06-21 14:57:46,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=400306.5, ans=0.125 2024-06-21 14:57:48,787 INFO [train.py:1028] (1/2) Epoch 22, batch 5900, loss[loss=0.1896, simple_loss=0.2346, pruned_loss=0.07226, over 13108.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2368, pruned_loss=0.06649, over 2578052.01 frames. 
], batch size: 121, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:57:49,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=400324.8333333333, ans=0.125 2024-06-21 14:57:50,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=400324.8333333333, ans=0.0 2024-06-21 14:57:50,941 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 2.054e+02 2.189e+02 2.439e+02 3.591e+02, threshold=4.378e+02, percent-clipped=0.0 2024-06-21 14:57:53,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=400324.8333333333, ans=0.125 2024-06-21 14:57:55,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=400343.1666666667, ans=0.05 2024-06-21 14:58:01,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2024-06-21 14:58:02,014 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:58:07,110 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=22.5 2024-06-21 14:58:07,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=400361.5, ans=0.0 2024-06-21 14:58:08,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=400379.8333333333, ans=0.2 2024-06-21 14:58:10,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=400379.8333333333, ans=0.1 2024-06-21 14:58:12,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=400379.8333333333, ans=0.125 2024-06-21 14:58:19,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=400398.1666666667, ans=0.125 2024-06-21 14:58:20,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=400398.1666666667, ans=0.125 2024-06-21 14:58:21,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=400416.5, ans=0.125 2024-06-21 14:58:22,210 INFO [train.py:1028] (1/2) Epoch 22, batch 5950, loss[loss=0.1791, simple_loss=0.2307, pruned_loss=0.06378, over 13090.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2387, pruned_loss=0.06703, over 2581931.91 frames. ], batch size: 121, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:58:32,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400434.8333333333, ans=0.1 2024-06-21 14:58:49,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=400489.8333333333, ans=0.2 2024-06-21 14:58:55,376 INFO [train.py:1028] (1/2) Epoch 22, batch 6000, loss[loss=0.2226, simple_loss=0.2659, pruned_loss=0.08967, over 12255.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2398, pruned_loss=0.06732, over 2575695.70 frames. 
], batch size: 241, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:58:55,376 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 14:59:03,338 INFO [train.py:1060] (1/2) Epoch 22, validation: loss=0.1876, simple_loss=0.251, pruned_loss=0.06212, over 351949.00 frames. 2024-06-21 14:59:03,339 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 14:59:03,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=400508.1666666667, ans=0.125 2024-06-21 14:59:05,444 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.073e+02 2.237e+02 2.446e+02 3.016e+02, threshold=4.475e+02, percent-clipped=0.0 2024-06-21 14:59:14,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=400526.5, ans=0.0 2024-06-21 14:59:41,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=400581.5, ans=0.125 2024-06-21 14:59:43,570 INFO [train.py:1028] (1/2) Epoch 22, batch 6050, loss[loss=0.1782, simple_loss=0.2406, pruned_loss=0.05787, over 13219.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2411, pruned_loss=0.06752, over 2578956.42 frames. ], batch size: 40, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:59:55,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400618.1666666667, ans=0.1 2024-06-21 15:00:05,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=400654.8333333333, ans=0.0 2024-06-21 15:00:06,944 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2024-06-21 15:00:16,962 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.28 vs. limit=15.0 2024-06-21 15:00:17,165 INFO [train.py:1028] (1/2) Epoch 22, batch 6100, loss[loss=0.1734, simple_loss=0.22, pruned_loss=0.06342, over 13115.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2424, pruned_loss=0.06789, over 2581893.95 frames. ], batch size: 121, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:00:19,135 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.054e+02 2.165e+02 2.342e+02 3.200e+02, threshold=4.330e+02, percent-clipped=0.0 2024-06-21 15:00:22,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=400691.5, ans=0.125 2024-06-21 15:00:25,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=400709.8333333333, ans=0.2 2024-06-21 15:00:30,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=400709.8333333333, ans=0.125 2024-06-21 15:00:37,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=400746.5, ans=0.2 2024-06-21 15:00:43,309 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.71 vs. 
limit=15.0 2024-06-21 15:00:50,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=400764.8333333333, ans=0.0 2024-06-21 15:00:52,304 INFO [train.py:1028] (1/2) Epoch 22, batch 6150, loss[loss=0.209, simple_loss=0.2522, pruned_loss=0.08292, over 10740.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2435, pruned_loss=0.06823, over 2579669.80 frames. ], batch size: 304, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:01:12,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=400838.1666666667, ans=0.0 2024-06-21 15:01:14,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=400838.1666666667, ans=0.1 2024-06-21 15:01:25,281 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=15.0 2024-06-21 15:01:31,028 INFO [train.py:1028] (1/2) Epoch 22, batch 6200, loss[loss=0.2208, simple_loss=0.2716, pruned_loss=0.08499, over 13249.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2445, pruned_loss=0.06887, over 2577050.43 frames. ], batch size: 89, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:01:38,288 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.235e+02 2.433e+02 2.785e+02 4.406e+02, threshold=4.866e+02, percent-clipped=1.0 2024-06-21 15:01:45,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400893.1666666667, ans=0.1 2024-06-21 15:01:49,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.75 vs. limit=15.0 2024-06-21 15:02:04,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=400948.1666666667, ans=0.125 2024-06-21 15:02:08,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=400948.1666666667, ans=0.2 2024-06-21 15:02:09,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=400948.1666666667, ans=6.0 2024-06-21 15:02:11,403 INFO [train.py:1028] (1/2) Epoch 22, batch 6250, loss[loss=0.1819, simple_loss=0.2388, pruned_loss=0.06247, over 13211.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2461, pruned_loss=0.06965, over 2569639.42 frames. ], batch size: 83, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:02:26,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=401003.1666666667, ans=0.125 2024-06-21 15:02:30,229 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.35 vs. limit=15.0 2024-06-21 15:02:38,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=401039.8333333333, ans=0.1 2024-06-21 15:02:44,505 INFO [train.py:1028] (1/2) Epoch 22, batch 6300, loss[loss=0.2114, simple_loss=0.2616, pruned_loss=0.08058, over 11361.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2478, pruned_loss=0.07038, over 2564847.34 frames. 
], batch size: 16, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:02:45,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=401058.1666666667, ans=0.0 2024-06-21 15:02:46,503 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.233e+02 2.386e+02 2.704e+02 4.213e+02, threshold=4.771e+02, percent-clipped=0.0 2024-06-21 15:02:47,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.39 vs. limit=22.5 2024-06-21 15:02:56,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=401076.5, ans=0.0 2024-06-21 15:02:56,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=401076.5, ans=0.2 2024-06-21 15:03:02,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=401094.8333333333, ans=0.125 2024-06-21 15:03:18,020 INFO [train.py:1028] (1/2) Epoch 22, batch 6350, loss[loss=0.2299, simple_loss=0.2795, pruned_loss=0.09015, over 12578.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2491, pruned_loss=0.07045, over 2574182.05 frames. ], batch size: 202, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:03:26,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.94 vs. limit=15.0 2024-06-21 15:03:26,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=401168.1666666667, ans=15.0 2024-06-21 15:03:30,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=401186.5, ans=0.2 2024-06-21 15:03:32,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=401186.5, ans=0.125 2024-06-21 15:03:39,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=401186.5, ans=0.125 2024-06-21 15:03:49,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=401204.8333333333, ans=0.2 2024-06-21 15:03:50,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=401223.1666666667, ans=0.0 2024-06-21 15:03:57,496 INFO [train.py:1028] (1/2) Epoch 22, batch 6400, loss[loss=0.1633, simple_loss=0.2259, pruned_loss=0.0504, over 13236.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2509, pruned_loss=0.07122, over 2575822.25 frames. ], batch size: 67, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:03:59,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.47 vs. limit=6.0 2024-06-21 15:03:59,542 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.114e+02 2.250e+02 2.485e+02 3.994e+02, threshold=4.500e+02, percent-clipped=0.0 2024-06-21 15:04:01,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=401241.5, ans=0.125 2024-06-21 15:04:04,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.34 vs. 
limit=12.0 2024-06-21 15:04:29,999 INFO [train.py:1028] (1/2) Epoch 22, batch 6450, loss[loss=0.2347, simple_loss=0.2787, pruned_loss=0.09533, over 12503.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2524, pruned_loss=0.07192, over 2581345.99 frames. ], batch size: 202, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:04:33,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=401333.1666666667, ans=0.125 2024-06-21 15:04:40,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. limit=6.0 2024-06-21 15:04:43,740 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2024-06-21 15:04:45,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=401369.8333333333, ans=0.125 2024-06-21 15:04:49,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=401388.1666666667, ans=0.125 2024-06-21 15:04:53,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=401388.1666666667, ans=0.125 2024-06-21 15:04:56,921 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=15.0 2024-06-21 15:05:02,397 INFO [train.py:1028] (1/2) Epoch 22, batch 6500, loss[loss=0.2127, simple_loss=0.2598, pruned_loss=0.08276, over 10779.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2535, pruned_loss=0.0721, over 2583928.03 frames. ], batch size: 303, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:05:04,304 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.171e+02 2.322e+02 2.518e+02 3.300e+02, threshold=4.645e+02, percent-clipped=0.0 2024-06-21 15:05:05,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=401424.8333333333, ans=0.0 2024-06-21 15:05:09,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=401443.1666666667, ans=0.125 2024-06-21 15:05:10,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=401443.1666666667, ans=0.125 2024-06-21 15:05:12,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=401443.1666666667, ans=0.125 2024-06-21 15:05:16,588 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:05:24,437 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.34 vs. limit=15.0 2024-06-21 15:05:26,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=401479.8333333333, ans=0.125 2024-06-21 15:05:29,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.42 vs. 
limit=12.0 2024-06-21 15:05:31,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401498.1666666667, ans=0.1 2024-06-21 15:05:33,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=401516.5, ans=0.125 2024-06-21 15:05:34,341 INFO [train.py:1028] (1/2) Epoch 22, batch 6550, loss[loss=0.191, simple_loss=0.2623, pruned_loss=0.05986, over 12702.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2546, pruned_loss=0.07207, over 2588292.88 frames. ], batch size: 22, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:05:34,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=401516.5, ans=0.0 2024-06-21 15:05:38,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=401516.5, ans=0.0 2024-06-21 15:05:47,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=401534.8333333333, ans=0.09899494936611666 2024-06-21 15:05:48,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=401534.8333333333, ans=0.0 2024-06-21 15:05:56,219 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.37 vs. limit=6.0 2024-06-21 15:05:56,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=15.0 2024-06-21 15:05:59,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=401571.5, ans=0.2 2024-06-21 15:06:11,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=401608.1666666667, ans=0.1 2024-06-21 15:06:12,168 INFO [train.py:1028] (1/2) Epoch 22, batch 6600, loss[loss=0.1881, simple_loss=0.2524, pruned_loss=0.06188, over 13274.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2557, pruned_loss=0.07264, over 2591250.64 frames. ], batch size: 72, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:06:14,238 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.212e+02 2.350e+02 2.504e+02 3.124e+02, threshold=4.701e+02, percent-clipped=0.0 2024-06-21 15:06:32,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=401663.1666666667, ans=10.0 2024-06-21 15:06:39,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=401681.5, ans=0.125 2024-06-21 15:06:44,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=401699.8333333333, ans=0.125 2024-06-21 15:06:44,789 INFO [train.py:1028] (1/2) Epoch 22, batch 6650, loss[loss=0.2308, simple_loss=0.2825, pruned_loss=0.08953, over 12932.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2571, pruned_loss=0.07304, over 2586310.63 frames. ], batch size: 158, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:06:50,761 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. 
limit=15.0 2024-06-21 15:06:53,361 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.50 vs. limit=15.0 2024-06-21 15:06:54,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=401718.1666666667, ans=0.2 2024-06-21 15:06:55,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=401718.1666666667, ans=0.0 2024-06-21 15:07:04,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=401754.8333333333, ans=0.0 2024-06-21 15:07:08,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=401754.8333333333, ans=0.5 2024-06-21 15:07:17,638 INFO [train.py:1028] (1/2) Epoch 22, batch 6700, loss[loss=0.2265, simple_loss=0.2767, pruned_loss=0.08816, over 12782.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.258, pruned_loss=0.07331, over 2585496.10 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:07:19,462 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.237e+02 2.396e+02 2.624e+02 3.925e+02, threshold=4.792e+02, percent-clipped=0.0 2024-06-21 15:07:27,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.05 vs. limit=22.5 2024-06-21 15:07:48,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=401846.5, ans=0.0 2024-06-21 15:07:52,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=401864.8333333333, ans=0.0 2024-06-21 15:07:54,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=401864.8333333333, ans=0.025 2024-06-21 15:07:54,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401864.8333333333, ans=0.1 2024-06-21 15:07:55,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=401864.8333333333, ans=0.2 2024-06-21 15:07:57,631 INFO [train.py:1028] (1/2) Epoch 22, batch 6750, loss[loss=0.2529, simple_loss=0.3041, pruned_loss=0.1008, over 12271.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.259, pruned_loss=0.07403, over 2579022.84 frames. 
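Note on the recurring `train.py:1028` batch summaries above: each reports the current batch's losses over that batch's frame count (`loss[... over 12271.00 frames.]`) next to `tot_loss[...]`, an average normalized by the frames accumulated across recent batches (here roughly 2.58M frames). A minimal frame-weighted tracker in that spirit is sketched below; the class name and the absence of any windowing or decay are assumptions, not icefall's actual bookkeeping.

```python
# Minimal sketch of a frame-weighted loss tracker (assumed design,
# not icefall's actual metrics tracking).
from collections import defaultdict


class FrameWeightedTracker:
    """Accumulates per-batch losses weighted by frame count."""

    def __init__(self):
        self.sums = defaultdict(float)  # loss-name -> sum(loss * frames)
        self.frames = 0.0               # total frames seen

    def update(self, losses, num_frames):
        for name, value in losses.items():
            self.sums[name] += value * num_frames
        self.frames += num_frames

    def averages(self):
        return {name: s / self.frames for name, s in self.sums.items()}


tracker = FrameWeightedTracker()
tracker.update({"loss": 0.2529, "simple_loss": 0.3041}, num_frames=12271)
tracker.update({"loss": 0.1986, "simple_loss": 0.2604}, num_frames=13215)
print(tracker.averages())  # frame-weighted averages, analogous to tot_loss[...]
```

Weighting by frames rather than by batch keeps long and short utterances from being averaged on equal footing, which matters with duration-bucketed batches whose sizes range here from 16 to 303 cuts.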
], batch size: 241, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:08:05,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=401901.5, ans=0.05 2024-06-21 15:08:19,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=401938.1666666667, ans=0.125 2024-06-21 15:08:22,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=401956.5, ans=0.125 2024-06-21 15:08:23,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=401956.5, ans=0.0 2024-06-21 15:08:23,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=401956.5, ans=0.0 2024-06-21 15:08:25,399 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.39 vs. limit=15.0 2024-06-21 15:08:29,657 INFO [train.py:1028] (1/2) Epoch 22, batch 6800, loss[loss=0.1986, simple_loss=0.2604, pruned_loss=0.06839, over 13215.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.26, pruned_loss=0.07413, over 2580880.11 frames. ], batch size: 67, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:08:31,785 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.213e+02 2.403e+02 2.710e+02 4.229e+02, threshold=4.805e+02, percent-clipped=0.0 2024-06-21 15:08:34,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=401974.8333333333, ans=0.125 2024-06-21 15:08:47,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=402011.5, ans=0.0 2024-06-21 15:08:48,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.02 vs. limit=15.0 2024-06-21 15:08:59,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402048.1666666667, ans=0.1 2024-06-21 15:09:00,505 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. limit=10.0 2024-06-21 15:09:02,679 INFO [train.py:1028] (1/2) Epoch 22, batch 6850, loss[loss=0.2222, simple_loss=0.2872, pruned_loss=0.07855, over 13270.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2601, pruned_loss=0.0737, over 2584222.48 frames. ], batch size: 63, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:09:07,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=402066.5, ans=0.125 2024-06-21 15:09:07,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402066.5, ans=0.1 2024-06-21 15:09:08,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.19 vs. 
limit=12.0 2024-06-21 15:09:09,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=402084.8333333333, ans=10.0 2024-06-21 15:09:11,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=402084.8333333333, ans=0.2 2024-06-21 15:09:12,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=402084.8333333333, ans=0.125 2024-06-21 15:09:13,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.05 vs. limit=15.0 2024-06-21 15:09:16,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=402103.1666666667, ans=0.125 2024-06-21 15:09:16,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=402103.1666666667, ans=0.125 2024-06-21 15:09:20,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2024-06-21 15:09:34,774 INFO [train.py:1028] (1/2) Epoch 22, batch 6900, loss[loss=0.2168, simple_loss=0.2804, pruned_loss=0.0766, over 13321.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.261, pruned_loss=0.07404, over 2586132.10 frames. ], batch size: 49, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:09:34,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402158.1666666667, ans=0.1 2024-06-21 15:09:41,825 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.217e+02 2.472e+02 2.679e+02 3.885e+02, threshold=4.944e+02, percent-clipped=0.0 2024-06-21 15:09:45,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=402158.1666666667, ans=0.125 2024-06-21 15:09:47,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=15.0 2024-06-21 15:09:51,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2024-06-21 15:09:55,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=402176.5, ans=0.0 2024-06-21 15:09:56,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=402194.8333333333, ans=0.125 2024-06-21 15:09:58,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=402194.8333333333, ans=0.0 2024-06-21 15:10:02,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=402213.1666666667, ans=0.0 2024-06-21 15:10:12,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402231.5, ans=0.1 2024-06-21 15:10:14,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. 
limit=15.0 2024-06-21 15:10:16,873 INFO [train.py:1028] (1/2) Epoch 22, batch 6950, loss[loss=0.183, simple_loss=0.2432, pruned_loss=0.06142, over 11900.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2612, pruned_loss=0.07383, over 2580022.15 frames. ], batch size: 17, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:10:20,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=402249.8333333333, ans=0.0 2024-06-21 15:10:25,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=402268.1666666667, ans=0.125 2024-06-21 15:10:27,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402268.1666666667, ans=0.1 2024-06-21 15:10:30,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=402286.5, ans=0.0 2024-06-21 15:10:30,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2024-06-21 15:10:32,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.89 vs. limit=15.0 2024-06-21 15:10:35,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=402286.5, ans=0.0 2024-06-21 15:10:41,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=402304.8333333333, ans=0.0 2024-06-21 15:10:46,648 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.70 vs. limit=22.5 2024-06-21 15:10:49,624 INFO [train.py:1028] (1/2) Epoch 22, batch 7000, loss[loss=0.2171, simple_loss=0.2668, pruned_loss=0.08366, over 12930.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2615, pruned_loss=0.07385, over 2576932.88 frames. ], batch size: 158, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:10:51,528 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.184e+02 2.366e+02 2.623e+02 4.675e+02, threshold=4.731e+02, percent-clipped=0.0 2024-06-21 15:10:52,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=402341.5, ans=0.125 2024-06-21 15:10:57,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.62 vs. limit=15.0 2024-06-21 15:11:10,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=402396.5, ans=0.0 2024-06-21 15:11:21,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=402414.8333333333, ans=0.125 2024-06-21 15:11:23,606 INFO [train.py:1028] (1/2) Epoch 22, batch 7050, loss[loss=0.2241, simple_loss=0.2744, pruned_loss=0.0869, over 12799.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2626, pruned_loss=0.07417, over 2584370.54 frames. 
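The `optim.py:487` warnings print five grad-norm quantiles (min, 25%, median, 75%, max) over a recent window together with the active clipping threshold and the share of recently clipped batches. Throughout this section the threshold sits at almost exactly twice the logged median (e.g. 2.366e+02 vs. threshold=4.731e+02), consistent with `Clipping_scale=2.0`. The sketch below illustrates that policy; the window size and the global-norm bookkeeping are assumptions, not the optimizer's exact implementation.

```python
# Sketch of median-based gradient clipping: threshold ~= clipping_scale * median
# of recent global grad norms (window size and norm details are assumed).
import statistics
from collections import deque

import torch


class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent global grad norms

    def clip_(self, parameters):
        grads = [p.grad for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)
        threshold = self.clipping_scale * statistics.median(self.norms)
        if norm > threshold:
            for g in grads:
                g.mul_(threshold / norm)  # rescale gradients in place
        return norm, threshold
```

A norm-relative threshold like this adapts automatically as gradients shrink over training, which is why the logged thresholds drift from ~1.2e+04 at batch 0 down to ~5e+02 by epoch 22 while `percent-clipped` stays near zero.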
], batch size: 176, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:11:27,646 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:11:30,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=402451.5, ans=0.125 2024-06-21 15:11:32,242 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:11:33,162 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=12.0 2024-06-21 15:11:34,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=402451.5, ans=0.04949747468305833 2024-06-21 15:11:42,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.43 vs. limit=22.5 2024-06-21 15:11:51,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=402488.1666666667, ans=0.0 2024-06-21 15:12:02,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=402506.5, ans=0.0 2024-06-21 15:12:03,183 INFO [train.py:1028] (1/2) Epoch 22, batch 7100, loss[loss=0.222, simple_loss=0.2787, pruned_loss=0.08268, over 13144.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2629, pruned_loss=0.07467, over 2576749.06 frames. ], batch size: 112, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:12:05,144 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.161e+02 2.310e+02 2.495e+02 4.017e+02, threshold=4.621e+02, percent-clipped=0.0 2024-06-21 15:12:12,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=402543.1666666667, ans=0.0 2024-06-21 15:12:21,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=402561.5, ans=0.125 2024-06-21 15:12:28,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=402579.8333333333, ans=12.0 2024-06-21 15:12:29,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.50 vs. limit=12.0 2024-06-21 15:12:29,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402598.1666666667, ans=0.1 2024-06-21 15:12:36,531 INFO [train.py:1028] (1/2) Epoch 22, batch 7150, loss[loss=0.2342, simple_loss=0.2843, pruned_loss=0.09202, over 12487.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2637, pruned_loss=0.07493, over 2573464.68 frames. 
], batch size: 202, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:12:41,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=402616.5, ans=0.0 2024-06-21 15:12:46,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=402634.8333333333, ans=0.0 2024-06-21 15:12:48,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=402634.8333333333, ans=0.07 2024-06-21 15:12:53,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=402653.1666666667, ans=0.125 2024-06-21 15:12:53,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=402653.1666666667, ans=0.2 2024-06-21 15:12:54,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=402653.1666666667, ans=0.125 2024-06-21 15:12:55,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=402671.5, ans=0.2 2024-06-21 15:13:00,237 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=402671.5, ans=0.2 2024-06-21 15:13:00,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=402671.5, ans=0.025 2024-06-21 15:13:08,515 INFO [train.py:1028] (1/2) Epoch 22, batch 7200, loss[loss=0.2331, simple_loss=0.2932, pruned_loss=0.08649, over 13203.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2649, pruned_loss=0.07544, over 2578455.97 frames. ], batch size: 112, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:13:10,472 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.152e+02 2.335e+02 2.606e+02 3.795e+02, threshold=4.669e+02, percent-clipped=0.0 2024-06-21 15:13:11,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=402708.1666666667, ans=0.0 2024-06-21 15:13:26,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=402744.8333333333, ans=0.125 2024-06-21 15:13:28,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402763.1666666667, ans=0.1 2024-06-21 15:13:41,282 INFO [train.py:1028] (1/2) Epoch 22, batch 7250, loss[loss=0.2011, simple_loss=0.2663, pruned_loss=0.06797, over 13207.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2656, pruned_loss=0.07576, over 2579205.51 frames. ], batch size: 37, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:13:48,098 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.45 vs. limit=15.0 2024-06-21 15:13:52,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=402818.1666666667, ans=0.05 2024-06-21 15:13:59,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=402836.5, ans=0.125 2024-06-21 15:14:18,672 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.29 vs. 
limit=15.0 2024-06-21 15:14:20,811 INFO [train.py:1028] (1/2) Epoch 22, batch 7300, loss[loss=0.1837, simple_loss=0.2441, pruned_loss=0.06172, over 12957.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2667, pruned_loss=0.07609, over 2578765.09 frames. ], batch size: 36, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:14:22,761 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.349e+02 2.581e+02 2.877e+02 4.145e+02, threshold=5.162e+02, percent-clipped=0.0 2024-06-21 15:14:52,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2024-06-21 15:14:53,181 INFO [train.py:1028] (1/2) Epoch 22, batch 7350, loss[loss=0.2065, simple_loss=0.2705, pruned_loss=0.07124, over 13305.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2675, pruned_loss=0.07648, over 2580249.12 frames. ], batch size: 46, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:15:08,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=403019.8333333333, ans=0.125 2024-06-21 15:15:12,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403038.1666666667, ans=0.1 2024-06-21 15:15:13,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=403038.1666666667, ans=0.125 2024-06-21 15:15:26,139 INFO [train.py:1028] (1/2) Epoch 22, batch 7400, loss[loss=0.2106, simple_loss=0.2704, pruned_loss=0.07538, over 13253.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2683, pruned_loss=0.07687, over 2585811.90 frames. ], batch size: 63, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:15:28,127 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.244e+02 2.421e+02 2.720e+02 3.518e+02, threshold=4.842e+02, percent-clipped=0.0 2024-06-21 15:15:35,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=403093.1666666667, ans=0.125 2024-06-21 15:15:48,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=403129.8333333333, ans=0.125 2024-06-21 15:15:59,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=403148.1666666667, ans=0.125 2024-06-21 15:16:03,831 INFO [train.py:1028] (1/2) Epoch 22, batch 7450, loss[loss=0.1813, simple_loss=0.2396, pruned_loss=0.06152, over 12634.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2674, pruned_loss=0.07629, over 2579287.16 frames. ], batch size: 29, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:16:08,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=403166.5, ans=0.125 2024-06-21 15:16:09,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.97 vs. 
limit=15.0 2024-06-21 15:16:14,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=403184.8333333333, ans=0.125 2024-06-21 15:16:20,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=403203.1666666667, ans=0.0 2024-06-21 15:16:35,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=403239.8333333333, ans=0.125 2024-06-21 15:16:40,955 INFO [train.py:1028] (1/2) Epoch 22, batch 7500, loss[loss=0.2104, simple_loss=0.2636, pruned_loss=0.0786, over 10445.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2682, pruned_loss=0.07663, over 2576586.63 frames. ], batch size: 303, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:16:41,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=15.0 2024-06-21 15:16:42,902 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.238e+02 2.423e+02 2.635e+02 3.666e+02, threshold=4.846e+02, percent-clipped=0.0 2024-06-21 15:16:47,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=403276.5, ans=0.2 2024-06-21 15:16:49,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=403276.5, ans=0.0 2024-06-21 15:16:54,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=403294.8333333333, ans=0.025 2024-06-21 15:16:54,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.67 vs. limit=10.0 2024-06-21 15:17:00,248 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.64 vs. limit=15.0 2024-06-21 15:17:04,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403313.1666666667, ans=0.1 2024-06-21 15:17:19,191 INFO [train.py:1028] (1/2) Epoch 22, batch 7550, loss[loss=0.2139, simple_loss=0.2644, pruned_loss=0.08173, over 12927.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2689, pruned_loss=0.07698, over 2576086.12 frames. ], batch size: 158, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:17:20,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=403349.8333333333, ans=0.125 2024-06-21 15:17:22,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=403349.8333333333, ans=0.035 2024-06-21 15:17:26,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=403368.1666666667, ans=0.0 2024-06-21 15:17:37,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=403386.5, ans=0.125 2024-06-21 15:17:51,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=403423.1666666667, ans=0.09899494936611666 2024-06-21 15:17:52,500 INFO [train.py:1028] (1/2) Epoch 22, batch 7600, loss[loss=0.2051, simple_loss=0.2595, pruned_loss=0.07538, over 13167.00 frames. 
], tot_loss[loss=0.2114, simple_loss=0.2689, pruned_loss=0.0769, over 2575892.20 frames. ], batch size: 83, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:17:54,595 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.194e+02 2.369e+02 2.611e+02 4.041e+02, threshold=4.737e+02, percent-clipped=0.0 2024-06-21 15:17:58,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=403441.5, ans=0.125 2024-06-21 15:18:01,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=403459.8333333333, ans=0.125 2024-06-21 15:18:06,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=403478.1666666667, ans=0.125 2024-06-21 15:18:17,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=403478.1666666667, ans=0.0 2024-06-21 15:18:19,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=403496.5, ans=0.0 2024-06-21 15:18:21,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=403496.5, ans=0.04949747468305833 2024-06-21 15:18:32,891 INFO [train.py:1028] (1/2) Epoch 22, batch 7650, loss[loss=0.1984, simple_loss=0.259, pruned_loss=0.06893, over 12969.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2697, pruned_loss=0.07725, over 2571978.12 frames. ], batch size: 33, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:18:35,612 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:18:40,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=403551.5, ans=0.125 2024-06-21 15:18:50,158 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.65 vs. limit=15.0 2024-06-21 15:18:50,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=403569.8333333333, ans=0.2 2024-06-21 15:19:06,288 INFO [train.py:1028] (1/2) Epoch 22, batch 7700, loss[loss=0.2131, simple_loss=0.2801, pruned_loss=0.07301, over 13267.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2706, pruned_loss=0.07755, over 2568524.22 frames. 
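The `scaling.py:1023` "Whitening" entries compare a measured `metric` against a `limit` for a named activation. The metric gauges how far the per-group covariance of the features is from a scaled identity; when it exceeds the limit, the module nudges gradients toward whiter features. One standard uniformity measure, used below as an assumed stand-in for the exact logged quantity, is mean(λ²)/mean(λ)² over the covariance eigenvalues, which equals 1.0 for perfectly white features and grows with anisotropy.

```python
# Sketch of a whitening metric: eigenvalue non-uniformity of the feature
# covariance (assumed formulation; the logged metric may differ in detail).
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Returns 1.0 for white features."""
    x = x - x.mean(dim=0)                     # center per channel
    metrics = []
    for g in x.chunk(num_groups, dim=1):      # split channels into groups
        cov = (g.T @ g) / g.shape[0]          # per-group covariance
        eigs = torch.linalg.eigvalsh(cov)     # real eigenvalues, ascending
        metrics.append((eigs.pow(2).mean() / eigs.mean().pow(2)).item())
    return max(metrics)


x = torch.randn(1000, 192) * torch.linspace(0.5, 2.0, 192)  # uneven scales
print(whitening_metric(x))  # > 1, growing with covariance anisotropy
```

Entries are logged whether or not the limit is exceeded (e.g. metric=5.01 vs. limit=6.0 above), so these lines are diagnostics, not violations per se.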
], batch size: 63, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:19:07,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=403624.8333333333, ans=0.05 2024-06-21 15:19:08,344 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.221e+02 2.397e+02 2.603e+02 3.385e+02, threshold=4.794e+02, percent-clipped=0.0 2024-06-21 15:19:10,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=403624.8333333333, ans=0.0 2024-06-21 15:19:30,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=403679.8333333333, ans=0.0 2024-06-21 15:19:31,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=403679.8333333333, ans=0.125 2024-06-21 15:19:39,435 INFO [train.py:1028] (1/2) Epoch 22, batch 7750, loss[loss=0.2003, simple_loss=0.2667, pruned_loss=0.06698, over 13209.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2713, pruned_loss=0.07815, over 2573057.84 frames. ], batch size: 72, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:19:54,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=403753.1666666667, ans=0.125 2024-06-21 15:20:19,151 INFO [train.py:1028] (1/2) Epoch 22, batch 7800, loss[loss=0.2192, simple_loss=0.2775, pruned_loss=0.08047, over 13150.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2715, pruned_loss=0.07784, over 2578103.03 frames. ], batch size: 95, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:20:21,079 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.304e+02 2.551e+02 2.790e+02 3.705e+02, threshold=5.101e+02, percent-clipped=0.0 2024-06-21 15:20:29,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=403826.5, ans=0.07 2024-06-21 15:20:31,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=403826.5, ans=10.0 2024-06-21 15:20:32,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=403844.8333333333, ans=0.125 2024-06-21 15:20:41,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=403863.1666666667, ans=0.0 2024-06-21 15:20:52,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=403899.8333333333, ans=0.0 2024-06-21 15:20:52,951 INFO [train.py:1028] (1/2) Epoch 22, batch 7850, loss[loss=0.1846, simple_loss=0.2454, pruned_loss=0.0619, over 11022.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2725, pruned_loss=0.07851, over 2571848.88 frames. ], batch size: 16, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:20:55,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.93 vs. limit=22.5 2024-06-21 15:21:03,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.59 vs. 
limit=22.5 2024-06-21 15:21:16,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=403954.8333333333, ans=0.2 2024-06-21 15:21:17,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=403954.8333333333, ans=0.125 2024-06-21 15:21:25,504 INFO [train.py:1028] (1/2) Epoch 22, batch 7900, loss[loss=0.1901, simple_loss=0.2599, pruned_loss=0.06015, over 13177.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.273, pruned_loss=0.07867, over 2570386.68 frames. ], batch size: 77, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:21:27,618 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.330e+02 2.467e+02 2.824e+02 4.195e+02, threshold=4.933e+02, percent-clipped=0.0 2024-06-21 15:21:31,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=403991.5, ans=0.0 2024-06-21 15:21:43,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=404028.1666666667, ans=0.0 2024-06-21 15:21:44,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=404028.1666666667, ans=0.1 2024-06-21 15:21:44,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=404046.5, ans=0.125 2024-06-21 15:22:06,432 INFO [train.py:1028] (1/2) Epoch 22, batch 7950, loss[loss=0.1982, simple_loss=0.2532, pruned_loss=0.0716, over 10517.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2736, pruned_loss=0.07879, over 2574283.02 frames. ], batch size: 303, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:22:06,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=404083.1666666667, ans=0.07 2024-06-21 15:22:10,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=404083.1666666667, ans=0.0 2024-06-21 15:22:25,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=404119.8333333333, ans=0.0 2024-06-21 15:22:28,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=404138.1666666667, ans=0.0 2024-06-21 15:22:30,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=404138.1666666667, ans=0.125 2024-06-21 15:22:30,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=404138.1666666667, ans=0.125 2024-06-21 15:22:33,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.12 vs. limit=22.5 2024-06-21 15:22:33,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=404156.5, ans=0.125 2024-06-21 15:22:39,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=404174.8333333333, ans=0.0 2024-06-21 15:22:39,596 INFO [train.py:1028] (1/2) Epoch 22, batch 8000, loss[loss=0.2059, simple_loss=0.2601, pruned_loss=0.0758, over 12750.00 frames. 
], tot_loss[loss=0.216, simple_loss=0.2741, pruned_loss=0.07892, over 2572788.09 frames. ], batch size: 29, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:22:41,642 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.266e+02 2.478e+02 2.713e+02 3.698e+02, threshold=4.957e+02, percent-clipped=0.0 2024-06-21 15:22:44,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=404174.8333333333, ans=0.2 2024-06-21 15:22:46,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404193.1666666667, ans=0.1 2024-06-21 15:22:51,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=404193.1666666667, ans=0.0 2024-06-21 15:22:53,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=404211.5, ans=0.125 2024-06-21 15:22:59,775 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.06 vs. limit=12.0 2024-06-21 15:23:01,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=404229.8333333333, ans=0.2 2024-06-21 15:23:09,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=404248.1666666667, ans=0.025 2024-06-21 15:23:12,951 INFO [train.py:1028] (1/2) Epoch 22, batch 8050, loss[loss=0.2212, simple_loss=0.278, pruned_loss=0.08215, over 13216.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2734, pruned_loss=0.07866, over 2572852.19 frames. ], batch size: 83, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:23:14,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.18 vs. limit=15.0 2024-06-21 15:23:16,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404266.5, ans=0.1 2024-06-21 15:23:19,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=404284.8333333333, ans=0.05 2024-06-21 15:23:20,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=404284.8333333333, ans=0.1 2024-06-21 15:23:26,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=404303.1666666667, ans=0.125 2024-06-21 15:23:35,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=404321.5, ans=0.125 2024-06-21 15:23:36,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.20 vs. limit=15.0 2024-06-21 15:23:42,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=404339.8333333333, ans=10.0 2024-06-21 15:23:44,960 INFO [train.py:1028] (1/2) Epoch 22, batch 8100, loss[loss=0.2168, simple_loss=0.2751, pruned_loss=0.07922, over 13144.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2739, pruned_loss=0.07872, over 2576750.69 frames. 
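The `lr` field decays very slowly across this stretch of epoch 22 (2.63e-03 down to 2.61e-03), the signature of the scheduler's inverse-power decay in both batch count and epoch. The sketch below is an Eden-style formula using the run's `base_lr=0.035`, `lr_batches=7500`, `lr_epochs=3.5`; the real scheduler may apply additional factors (e.g. duration-based rescaling tied to `ref_duration`), so do not expect it to reproduce the logged values digit for digit.

```python
# Sketch of an Eden-style learning-rate schedule (assumed formula;
# constants taken from the run configuration).
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    batch_factor = ((batch / lr_batches) ** 2 + 1) ** -0.25
    epoch_factor = ((epoch / lr_epochs) ** 2 + 1) ** -0.25
    return base_lr * batch_factor * epoch_factor


# Roughly where this log sits: epoch 22, ~402k training batches.
print(eden_lr(0.035, batch=402_000, epoch=22.0))
```

The quartic-root dependence on both counters is what makes the rate nearly flat this late in training while still decaying steadily early on.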
], batch size: 112, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:23:46,905 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.260e+02 2.365e+02 2.604e+02 3.308e+02, threshold=4.729e+02, percent-clipped=0.0 2024-06-21 15:23:52,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=404376.5, ans=0.125 2024-06-21 15:23:52,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=404376.5, ans=0.1 2024-06-21 15:23:55,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=404376.5, ans=0.125 2024-06-21 15:23:57,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404394.8333333333, ans=0.1 2024-06-21 15:24:15,549 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.22 vs. limit=10.0 2024-06-21 15:24:17,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=404413.1666666667, ans=0.04949747468305833 2024-06-21 15:24:18,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.65 vs. limit=22.5 2024-06-21 15:24:20,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=404431.5, ans=0.025 2024-06-21 15:24:23,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=404431.5, ans=0.125 2024-06-21 15:24:27,422 INFO [train.py:1028] (1/2) Epoch 22, batch 8150, loss[loss=0.2187, simple_loss=0.2703, pruned_loss=0.08353, over 13047.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2737, pruned_loss=0.07809, over 2579803.02 frames. ], batch size: 121, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:24:33,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=404468.1666666667, ans=0.125 2024-06-21 15:24:35,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=404468.1666666667, ans=0.125 2024-06-21 15:24:36,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=404468.1666666667, ans=0.125 2024-06-21 15:24:42,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=404486.5, ans=0.025 2024-06-21 15:24:46,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2024-06-21 15:24:48,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=404504.8333333333, ans=0.0 2024-06-21 15:24:58,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=404523.1666666667, ans=0.0 2024-06-21 15:24:58,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.01 vs. 
limit=6.0 2024-06-21 15:24:59,721 INFO [train.py:1028] (1/2) Epoch 22, batch 8200, loss[loss=0.2368, simple_loss=0.2863, pruned_loss=0.09368, over 13148.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2737, pruned_loss=0.07831, over 2583202.75 frames. ], batch size: 112, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:25:01,502 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.260e+02 2.446e+02 2.726e+02 3.209e+02, threshold=4.892e+02, percent-clipped=0.0 2024-06-21 15:25:10,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.53 vs. limit=22.5 2024-06-21 15:25:10,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=404559.8333333333, ans=0.2 2024-06-21 15:25:14,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=404578.1666666667, ans=0.125 2024-06-21 15:25:20,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=404596.5, ans=0.125 2024-06-21 15:25:23,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=404596.5, ans=0.125 2024-06-21 15:25:26,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=404614.8333333333, ans=0.0 2024-06-21 15:25:32,919 INFO [train.py:1028] (1/2) Epoch 22, batch 8250, loss[loss=0.2077, simple_loss=0.2749, pruned_loss=0.07021, over 13254.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2739, pruned_loss=0.07838, over 2582574.15 frames. ], batch size: 52, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:25:40,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=404651.5, ans=0.125 2024-06-21 15:25:43,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=404651.5, ans=0.0 2024-06-21 15:25:44,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2024-06-21 15:25:54,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.94 vs. limit=15.0 2024-06-21 15:26:01,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=404706.5, ans=0.0 2024-06-21 15:26:04,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=404706.5, ans=0.0 2024-06-21 15:26:08,048 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.90 vs. limit=15.0 2024-06-21 15:26:08,301 INFO [train.py:1028] (1/2) Epoch 22, batch 8300, loss[loss=0.2208, simple_loss=0.2749, pruned_loss=0.08337, over 13024.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2734, pruned_loss=0.07819, over 2579360.90 frames. 
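`grad_scale` in the batch summaries is the dynamic loss-scaling factor used with `use_fp16=True`; note that it steps from 32.0 to 64.0 at batch 7350 in this section, the usual behavior of a scaler that doubles after a long run of overflow-free steps and halves when gradients overflow. A minimal sketch of that policy follows (the constants are assumptions; PyTorch's `torch.cuda.amp.GradScaler` is the real implementation here).

```python
# Sketch of dynamic loss scaling: double after N clean steps, halve on
# overflow (constants assumed; mirrors GradScaler's growth/backoff policy).
class LossScaleSketch:
    def __init__(self, init_scale: float = 16.0, growth_interval: int = 2000,
                 growth: float = 2.0, backoff: float = 0.5):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.growth = growth
        self.backoff = backoff
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale *= self.backoff   # overflow: back off, restart count
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth  # long clean run: grow the scale
                self._good_steps = 0
```

A larger scale keeps small fp16 gradients away from underflow, so the upward step at batch 7350 simply means thousands of consecutive steps completed without inf/nan gradients.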
], batch size: 102, lr: 2.61e-03, grad_scale: 64.0 2024-06-21 15:26:13,456 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.206e+02 2.332e+02 2.465e+02 3.147e+02, threshold=4.664e+02, percent-clipped=0.0 2024-06-21 15:26:16,990 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=22.5 2024-06-21 15:26:29,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2024-06-21 15:26:31,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=404779.8333333333, ans=0.125 2024-06-21 15:26:38,130 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.23 vs. limit=15.0 2024-06-21 15:26:44,462 INFO [train.py:1028] (1/2) Epoch 22, batch 8350, loss[loss=0.2229, simple_loss=0.273, pruned_loss=0.08639, over 13187.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2739, pruned_loss=0.07816, over 2579465.66 frames. ], batch size: 112, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:26:52,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=404834.8333333333, ans=0.0 2024-06-21 15:26:58,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=404853.1666666667, ans=0.04949747468305833 2024-06-21 15:27:00,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=404853.1666666667, ans=0.0 2024-06-21 15:27:18,194 INFO [train.py:1028] (1/2) Epoch 22, batch 8400, loss[loss=0.205, simple_loss=0.2655, pruned_loss=0.07221, over 13006.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2742, pruned_loss=0.07822, over 2575761.93 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:27:18,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=404908.1666666667, ans=0.125 2024-06-21 15:27:20,795 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.276e+02 2.418e+02 2.610e+02 3.662e+02, threshold=4.835e+02, percent-clipped=0.0 2024-06-21 15:27:20,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=404908.1666666667, ans=0.125 2024-06-21 15:27:26,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=404926.5, ans=0.125 2024-06-21 15:27:28,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=404926.5, ans=0.0 2024-06-21 15:27:34,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.63 vs. 
limit=15.0 2024-06-21 15:27:35,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=404944.8333333333, ans=0.2 2024-06-21 15:27:37,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404963.1666666667, ans=0.1 2024-06-21 15:27:37,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=404963.1666666667, ans=0.1 2024-06-21 15:27:38,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=404963.1666666667, ans=0.125 2024-06-21 15:27:38,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=404963.1666666667, ans=0.025 2024-06-21 15:27:45,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=404981.5, ans=0.125 2024-06-21 15:27:45,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=404981.5, ans=0.0 2024-06-21 15:27:49,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=404981.5, ans=0.0 2024-06-21 15:27:50,793 INFO [train.py:1028] (1/2) Epoch 22, batch 8450, loss[loss=0.2161, simple_loss=0.2823, pruned_loss=0.07493, over 13107.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2752, pruned_loss=0.07874, over 2578947.30 frames. ], batch size: 112, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:28:13,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=405036.5, ans=0.0 2024-06-21 15:28:15,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=405036.5, ans=0.0 2024-06-21 15:28:19,510 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2024-06-21 15:28:21,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=405054.8333333333, ans=0.035 2024-06-21 15:28:28,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=405073.1666666667, ans=0.125 2024-06-21 15:28:29,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=405073.1666666667, ans=0.125 2024-06-21 15:28:31,868 INFO [train.py:1028] (1/2) Epoch 22, batch 8500, loss[loss=0.1931, simple_loss=0.2607, pruned_loss=0.06271, over 12666.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2761, pruned_loss=0.07936, over 2577311.32 frames. 
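The many `balancer*.prob`, `min_positive`, `max_abs` and `min_abs` values being scheduled above belong to activation balancers: modules that, with some probability per batch, apply a small gradient correction when channel statistics drift outside target ranges (for instance, the fraction of positive activations falling below `min_positive`, or the mean magnitude exceeding `max_abs`). The sketch below shows only the measurement side, with assumed names and thresholds; the real balancer applies its correction through a custom autograd function rather than an explicit loss.

```python
# Sketch: measuring the channel statistics an activation balancer constrains
# (names and thresholds are illustrative).
import torch


def balancer_violations(x: torch.Tensor, min_positive: float = 0.05,
                        max_abs: float = 10.0):
    """x: (num_frames, num_channels). Returns per-channel constraint masks."""
    frac_positive = (x > 0).float().mean(dim=0)  # fraction of positive values
    mean_abs = x.abs().mean(dim=0)               # mean magnitude per channel
    return {
        "too_few_positive": frac_positive < min_positive,
        "too_large": mean_abs > max_abs,
    }


x = torch.randn(500, 256)
v = balancer_violations(x)
print(v["too_few_positive"].sum().item(), v["too_large"].sum().item())
```

Because the correction is applied stochastically (`prob`, typically 0.125 here) and only on offending channels, it regularizes activations without a measurable term in the printed losses.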
], batch size: 29, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:28:34,350 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.245e+02 2.423e+02 2.671e+02 3.802e+02, threshold=4.845e+02, percent-clipped=0.0 2024-06-21 15:28:36,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=405091.5, ans=0.125 2024-06-21 15:28:38,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=405109.8333333333, ans=0.2 2024-06-21 15:28:42,058 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.16 vs. limit=22.5 2024-06-21 15:28:45,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=405128.1666666667, ans=0.125 2024-06-21 15:28:47,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=405128.1666666667, ans=0.125 2024-06-21 15:28:47,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=405128.1666666667, ans=0.2 2024-06-21 15:28:59,802 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=15.0 2024-06-21 15:29:05,494 INFO [train.py:1028] (1/2) Epoch 22, batch 8550, loss[loss=0.2019, simple_loss=0.273, pruned_loss=0.06541, over 12673.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2753, pruned_loss=0.07882, over 2576211.92 frames. ], batch size: 22, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:29:21,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=405219.8333333333, ans=0.1 2024-06-21 15:29:37,867 INFO [train.py:1028] (1/2) Epoch 22, batch 8600, loss[loss=0.2196, simple_loss=0.27, pruned_loss=0.08459, over 13104.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2759, pruned_loss=0.07897, over 2572475.25 frames. ], batch size: 121, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:29:40,479 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.241e+02 2.376e+02 2.574e+02 3.486e+02, threshold=4.753e+02, percent-clipped=0.0 2024-06-21 15:29:49,840 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2024-06-21 15:30:14,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=405348.1666666667, ans=0.125 2024-06-21 15:30:15,925 INFO [train.py:1028] (1/2) Epoch 22, batch 8650, loss[loss=0.2108, simple_loss=0.266, pruned_loss=0.07777, over 13046.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2764, pruned_loss=0.07876, over 2575383.06 frames. 
], batch size: 102, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:30:16,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=405366.5, ans=0.2 2024-06-21 15:30:18,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=405366.5, ans=0.5 2024-06-21 15:30:19,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=405366.5, ans=0.0 2024-06-21 15:30:30,303 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.55 vs. limit=15.0 2024-06-21 15:30:30,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=405384.8333333333, ans=0.0 2024-06-21 15:30:31,071 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2024-06-21 15:30:32,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=405403.1666666667, ans=0.125 2024-06-21 15:30:39,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=405421.5, ans=0.0 2024-06-21 15:30:47,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=405439.8333333333, ans=0.025 2024-06-21 15:30:54,099 INFO [train.py:1028] (1/2) Epoch 22, batch 8700, loss[loss=0.2129, simple_loss=0.2768, pruned_loss=0.07445, over 13215.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2766, pruned_loss=0.0793, over 2572107.12 frames. ], batch size: 59, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:30:54,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=405458.1666666667, ans=0.125 2024-06-21 15:30:55,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=405458.1666666667, ans=0.125 2024-06-21 15:30:56,834 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.283e+02 2.430e+02 2.624e+02 3.622e+02, threshold=4.860e+02, percent-clipped=0.0 2024-06-21 15:30:57,372 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.41 vs. 
limit=15.0 2024-06-21 15:31:06,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=405476.5, ans=0.125 2024-06-21 15:31:06,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=405476.5, ans=0.0 2024-06-21 15:31:14,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405513.1666666667, ans=0.1 2024-06-21 15:31:18,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=405513.1666666667, ans=0.05 2024-06-21 15:31:22,217 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:31:28,948 INFO [train.py:1028] (1/2) Epoch 22, batch 8750, loss[loss=0.2168, simple_loss=0.2761, pruned_loss=0.07877, over 13059.00 frames. ], tot_loss[loss=0.217, simple_loss=0.276, pruned_loss=0.07899, over 2567376.83 frames. ], batch size: 121, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:31:34,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2024-06-21 15:31:37,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.94 vs. limit=15.0 2024-06-21 15:31:39,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2024-06-21 15:31:40,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=405568.1666666667, ans=0.0 2024-06-21 15:31:47,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=405586.5, ans=0.0 2024-06-21 15:32:06,795 INFO [train.py:1028] (1/2) Epoch 22, batch 8800, loss[loss=0.2204, simple_loss=0.2859, pruned_loss=0.07752, over 13024.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2766, pruned_loss=0.07925, over 2572202.83 frames. 
], batch size: 71, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:32:08,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=405641.5, ans=0.125 2024-06-21 15:32:09,359 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.221e+02 2.365e+02 2.503e+02 3.280e+02, threshold=4.730e+02, percent-clipped=0.0 2024-06-21 15:32:09,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=405641.5, ans=0.025 2024-06-21 15:32:09,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=405641.5, ans=0.125 2024-06-21 15:32:21,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=405659.8333333333, ans=0.125 2024-06-21 15:32:35,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=405696.5, ans=0.2 2024-06-21 15:32:45,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=405733.1666666667, ans=0.125 2024-06-21 15:32:45,672 INFO [train.py:1028] (1/2) Epoch 22, batch 8850, loss[loss=0.2634, simple_loss=0.3128, pruned_loss=0.107, over 12525.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2764, pruned_loss=0.07937, over 2562057.99 frames. ], batch size: 202, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:32:46,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=405733.1666666667, ans=0.125 2024-06-21 15:32:46,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.08 vs. limit=15.0 2024-06-21 15:32:47,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=405733.1666666667, ans=0.0 2024-06-21 15:33:05,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.55 vs. limit=22.5 2024-06-21 15:33:07,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=405788.1666666667, ans=0.2 2024-06-21 15:33:15,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=405806.5, ans=10.0 2024-06-21 15:33:17,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=405806.5, ans=0.0 2024-06-21 15:33:19,067 INFO [train.py:1028] (1/2) Epoch 22, batch 8900, loss[loss=0.2464, simple_loss=0.2998, pruned_loss=0.09647, over 13009.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2774, pruned_loss=0.08003, over 2560666.24 frames. 
], batch size: 33, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:33:21,727 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.305e+02 2.488e+02 2.717e+02 3.445e+02, threshold=4.976e+02, percent-clipped=0.0 2024-06-21 15:33:24,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=405824.8333333333, ans=0.125 2024-06-21 15:33:45,040 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.33 vs. limit=10.0 2024-06-21 15:33:52,866 INFO [train.py:1028] (1/2) Epoch 22, batch 8950, loss[loss=0.2412, simple_loss=0.2963, pruned_loss=0.09309, over 12550.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2773, pruned_loss=0.07996, over 2560346.48 frames. ], batch size: 202, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:33:52,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=405916.5, ans=0.125 2024-06-21 15:33:54,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=405916.5, ans=0.0 2024-06-21 15:34:33,724 INFO [train.py:1028] (1/2) Epoch 22, batch 9000, loss[loss=0.2028, simple_loss=0.2679, pruned_loss=0.06889, over 13271.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2773, pruned_loss=0.07953, over 2565625.67 frames. ], batch size: 46, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:34:33,724 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 15:34:41,938 INFO [train.py:1060] (1/2) Epoch 22, validation: loss=0.1872, simple_loss=0.2511, pruned_loss=0.06169, over 351949.00 frames. 2024-06-21 15:34:41,939 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 15:34:42,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=406008.1666666667, ans=0.125 2024-06-21 15:34:44,815 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.219e+02 2.364e+02 2.565e+02 3.217e+02, threshold=4.728e+02, percent-clipped=0.0 2024-06-21 15:34:52,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=406026.5, ans=0.1 2024-06-21 15:35:01,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=406063.1666666667, ans=0.0 2024-06-21 15:35:03,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=406063.1666666667, ans=0.125 2024-06-21 15:35:09,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=406081.5, ans=0.1 2024-06-21 15:35:11,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=406081.5, ans=0.125 2024-06-21 15:35:13,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=406081.5, ans=0.125 2024-06-21 15:35:14,543 INFO [train.py:1028] (1/2) Epoch 22, batch 9050, loss[loss=0.2046, simple_loss=0.2599, pruned_loss=0.07464, over 11394.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.278, pruned_loss=0.07967, over 2565574.73 frames. 
], batch size: 17, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:35:18,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.46 vs. limit=6.0 2024-06-21 15:35:21,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=406118.1666666667, ans=0.0 2024-06-21 15:35:29,330 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.04 vs. limit=22.5 2024-06-21 15:35:37,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=406154.8333333333, ans=0.1 2024-06-21 15:35:44,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=406173.1666666667, ans=0.0 2024-06-21 15:35:47,208 INFO [train.py:1028] (1/2) Epoch 22, batch 9100, loss[loss=0.211, simple_loss=0.2726, pruned_loss=0.07475, over 13248.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2789, pruned_loss=0.07986, over 2568653.21 frames. ], batch size: 72, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:35:49,747 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.308e+02 2.428e+02 2.622e+02 3.376e+02, threshold=4.856e+02, percent-clipped=0.0 2024-06-21 15:35:51,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.71 vs. limit=15.0 2024-06-21 15:35:56,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=406209.8333333333, ans=0.0 2024-06-21 15:35:58,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=406209.8333333333, ans=0.125 2024-06-21 15:36:04,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=12.0 2024-06-21 15:36:07,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=406246.5, ans=0.1 2024-06-21 15:36:09,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=406246.5, ans=12.0 2024-06-21 15:36:16,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. limit=6.0 2024-06-21 15:36:19,226 INFO [train.py:1028] (1/2) Epoch 22, batch 9150, loss[loss=0.2073, simple_loss=0.2724, pruned_loss=0.07116, over 13118.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2787, pruned_loss=0.07989, over 2570142.86 frames. ], batch size: 77, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:36:25,350 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.26 vs. 
limit=10.0 2024-06-21 15:36:26,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=406301.5, ans=0.0 2024-06-21 15:36:32,114 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:36:48,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=406356.5, ans=0.025 2024-06-21 15:36:51,213 INFO [train.py:1028] (1/2) Epoch 22, batch 9200, loss[loss=0.2105, simple_loss=0.2724, pruned_loss=0.0743, over 13037.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2781, pruned_loss=0.0793, over 2572692.92 frames. ], batch size: 36, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:36:51,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=406374.8333333333, ans=0.025 2024-06-21 15:36:53,670 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.275e+02 2.411e+02 2.679e+02 3.309e+02, threshold=4.822e+02, percent-clipped=0.0 2024-06-21 15:36:54,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=406374.8333333333, ans=10.0 2024-06-21 15:36:59,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.48 vs. limit=10.0 2024-06-21 15:37:00,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=406393.1666666667, ans=0.2 2024-06-21 15:37:03,968 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.26 vs. limit=22.5 2024-06-21 15:37:10,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=406429.8333333333, ans=0.125 2024-06-21 15:37:23,035 INFO [train.py:1028] (1/2) Epoch 22, batch 9250, loss[loss=0.2161, simple_loss=0.2786, pruned_loss=0.07673, over 13232.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2787, pruned_loss=0.0795, over 2575136.01 frames. ], batch size: 67, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:37:26,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=406466.5, ans=0.2 2024-06-21 15:37:29,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=406484.8333333333, ans=0.2 2024-06-21 15:37:30,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=406484.8333333333, ans=0.0 2024-06-21 15:37:31,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=406484.8333333333, ans=0.125 2024-06-21 15:37:39,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.83 vs. 
limit=15.0 2024-06-21 15:37:41,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=406503.1666666667, ans=0.125 2024-06-21 15:37:53,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=406539.8333333333, ans=0.0 2024-06-21 15:37:58,251 INFO [train.py:1028] (1/2) Epoch 22, batch 9300, loss[loss=0.212, simple_loss=0.2725, pruned_loss=0.07571, over 12957.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2786, pruned_loss=0.07935, over 2571126.93 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:38:00,744 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.226e+02 2.441e+02 2.618e+02 3.305e+02, threshold=4.883e+02, percent-clipped=0.0 2024-06-21 15:38:16,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=406594.8333333333, ans=0.0 2024-06-21 15:38:29,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.41 vs. limit=15.0 2024-06-21 15:38:32,443 INFO [train.py:1028] (1/2) Epoch 22, batch 9350, loss[loss=0.2019, simple_loss=0.27, pruned_loss=0.06684, over 12481.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2787, pruned_loss=0.0796, over 2568046.84 frames. ], batch size: 22, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:38:35,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=406649.8333333333, ans=0.0 2024-06-21 15:38:42,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=406668.1666666667, ans=0.125 2024-06-21 15:38:44,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=406686.5, ans=0.0 2024-06-21 15:38:45,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2024-06-21 15:39:04,016 INFO [train.py:1028] (1/2) Epoch 22, batch 9400, loss[loss=0.218, simple_loss=0.2843, pruned_loss=0.07585, over 13267.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2781, pruned_loss=0.07945, over 2568176.96 frames. 
], batch size: 52, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:39:04,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=406741.5, ans=0.07 2024-06-21 15:39:06,372 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.243e+02 2.355e+02 2.634e+02 3.526e+02, threshold=4.710e+02, percent-clipped=0.0 2024-06-21 15:39:13,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=406759.8333333333, ans=0.125 2024-06-21 15:39:16,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=406778.1666666667, ans=0.125 2024-06-21 15:39:16,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=406778.1666666667, ans=0.0 2024-06-21 15:39:24,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=406796.5, ans=0.07 2024-06-21 15:39:25,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=406796.5, ans=0.125 2024-06-21 15:39:31,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.30 vs. limit=15.0 2024-06-21 15:39:35,142 INFO [train.py:1028] (1/2) Epoch 22, batch 9450, loss[loss=0.2184, simple_loss=0.2781, pruned_loss=0.07935, over 12551.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2782, pruned_loss=0.07973, over 2568673.07 frames. ], batch size: 22, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:39:45,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=406851.5, ans=0.5 2024-06-21 15:39:48,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.16 vs. limit=10.0 2024-06-21 15:39:56,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=406888.1666666667, ans=0.2 2024-06-21 15:39:57,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=406888.1666666667, ans=0.1 2024-06-21 15:40:05,858 INFO [train.py:1028] (1/2) Epoch 22, batch 9500, loss[loss=0.2229, simple_loss=0.2863, pruned_loss=0.0798, over 13256.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.278, pruned_loss=0.07929, over 2577867.56 frames. 
], batch size: 43, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:40:08,124 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.247e+02 2.353e+02 2.587e+02 4.190e+02, threshold=4.706e+02, percent-clipped=0.0 2024-06-21 15:40:12,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=406943.1666666667, ans=0.125 2024-06-21 15:40:13,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=406943.1666666667, ans=0.0 2024-06-21 15:40:18,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=406961.5, ans=0.0 2024-06-21 15:40:24,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2024-06-21 15:40:26,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=406979.8333333333, ans=0.2 2024-06-21 15:40:32,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=406998.1666666667, ans=0.0 2024-06-21 15:40:36,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=406998.1666666667, ans=0.125 2024-06-21 15:40:39,228 INFO [train.py:1028] (1/2) Epoch 22, batch 9550, loss[loss=0.2044, simple_loss=0.269, pruned_loss=0.06991, over 12920.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2777, pruned_loss=0.07915, over 2573776.45 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:40:42,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.98 vs. limit=10.0 2024-06-21 15:40:47,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=407034.8333333333, ans=0.0 2024-06-21 15:41:08,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407089.8333333333, ans=0.1 2024-06-21 15:41:11,998 INFO [train.py:1028] (1/2) Epoch 22, batch 9600, loss[loss=0.2455, simple_loss=0.2904, pruned_loss=0.1003, over 10328.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2773, pruned_loss=0.07911, over 2571861.89 frames. ], batch size: 304, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:41:12,340 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.13 vs. limit=15.0 2024-06-21 15:41:13,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=407108.1666666667, ans=0.07 2024-06-21 15:41:14,538 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.223e+02 2.369e+02 2.583e+02 3.273e+02, threshold=4.738e+02, percent-clipped=0.0 2024-06-21 15:41:16,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.42 vs. 
limit=15.0 2024-06-21 15:41:17,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=407126.5, ans=0.125 2024-06-21 15:41:23,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=407144.8333333333, ans=0.2 2024-06-21 15:41:23,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2024-06-21 15:41:24,665 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.0 2024-06-21 15:41:35,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=407163.1666666667, ans=0.125 2024-06-21 15:41:37,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=407181.5, ans=0.125 2024-06-21 15:41:40,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=407181.5, ans=0.0 2024-06-21 15:41:40,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=22.5 2024-06-21 15:41:42,465 INFO [train.py:1028] (1/2) Epoch 22, batch 9650, loss[loss=0.2132, simple_loss=0.2699, pruned_loss=0.07823, over 13125.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2772, pruned_loss=0.07968, over 2561110.48 frames. ], batch size: 132, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:41:53,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407218.1666666667, ans=0.1 2024-06-21 15:42:05,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407254.8333333333, ans=0.1 2024-06-21 15:42:08,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407273.1666666667, ans=0.1 2024-06-21 15:42:10,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407273.1666666667, ans=0.1 2024-06-21 15:42:13,009 INFO [train.py:1028] (1/2) Epoch 22, batch 9700, loss[loss=0.2081, simple_loss=0.2604, pruned_loss=0.07786, over 13115.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2771, pruned_loss=0.0798, over 2556871.35 frames. 
], batch size: 144, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:42:14,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=407291.5, ans=0.04949747468305833 2024-06-21 15:42:15,251 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.308e+02 2.437e+02 2.668e+02 3.352e+02, threshold=4.874e+02, percent-clipped=0.0 2024-06-21 15:42:16,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407291.5, ans=0.1 2024-06-21 15:42:19,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=407309.8333333333, ans=0.125 2024-06-21 15:42:23,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=407309.8333333333, ans=0.125 2024-06-21 15:42:23,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=407309.8333333333, ans=0.125 2024-06-21 15:42:27,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=407328.1666666667, ans=0.125 2024-06-21 15:42:31,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.44 vs. limit=22.5 2024-06-21 15:42:36,894 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.63 vs. limit=6.0 2024-06-21 15:42:45,434 INFO [train.py:1028] (1/2) Epoch 22, batch 9750, loss[loss=0.2161, simple_loss=0.2695, pruned_loss=0.08137, over 13145.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2761, pruned_loss=0.0789, over 2552690.47 frames. ], batch size: 132, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:42:48,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407383.1666666667, ans=0.1 2024-06-21 15:42:59,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.28 vs. limit=15.0 2024-06-21 15:43:13,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=407456.5, ans=0.125 2024-06-21 15:43:17,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=407474.8333333333, ans=0.0 2024-06-21 15:43:18,085 INFO [train.py:1028] (1/2) Epoch 22, batch 9800, loss[loss=0.2035, simple_loss=0.2639, pruned_loss=0.07155, over 12948.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2754, pruned_loss=0.07832, over 2545380.84 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:43:19,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=407474.8333333333, ans=0.1 2024-06-21 15:43:20,492 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.224e+02 2.364e+02 2.590e+02 3.586e+02, threshold=4.729e+02, percent-clipped=0.0 2024-06-21 15:43:28,023 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.44 vs. 
limit=15.0 2024-06-21 15:43:39,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=407529.8333333333, ans=0.125 2024-06-21 15:43:48,671 INFO [train.py:1028] (1/2) Epoch 22, batch 9850, loss[loss=0.2244, simple_loss=0.285, pruned_loss=0.08197, over 13051.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2751, pruned_loss=0.07827, over 2538088.29 frames. ], batch size: 102, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:43:51,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=407566.5, ans=0.125 2024-06-21 15:44:04,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=407603.1666666667, ans=0.125 2024-06-21 15:44:07,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0 2024-06-21 15:44:07,347 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.422e+00 2024-06-21 15:44:08,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=407621.5, ans=0.125 2024-06-21 15:44:10,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=407621.5, ans=0.125 2024-06-21 15:44:11,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.74 vs. limit=15.0 2024-06-21 15:44:19,290 INFO [train.py:1028] (1/2) Epoch 22, batch 9900, loss[loss=0.2156, simple_loss=0.2713, pruned_loss=0.07994, over 12944.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2745, pruned_loss=0.07841, over 2531932.06 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:44:20,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=407658.1666666667, ans=0.125 2024-06-21 15:44:22,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.175e+02 2.297e+02 2.531e+02 3.263e+02, threshold=4.594e+02, percent-clipped=0.0 2024-06-21 15:44:39,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=407713.1666666667, ans=0.2 2024-06-21 15:44:41,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.66 vs. limit=10.0 2024-06-21 15:44:43,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.73 vs. limit=15.0 2024-06-21 15:44:44,862 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2024-06-21 15:44:51,350 INFO [train.py:1028] (1/2) Epoch 22, batch 9950, loss[loss=0.2108, simple_loss=0.266, pruned_loss=0.07778, over 12884.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2738, pruned_loss=0.07843, over 2525835.45 frames. 
], batch size: 29, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:44:54,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=407749.8333333333, ans=0.125 2024-06-21 15:45:04,768 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:45:06,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407786.5, ans=0.1 2024-06-21 15:45:09,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=407804.8333333333, ans=0.2 2024-06-21 15:45:12,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=407804.8333333333, ans=0.125 2024-06-21 15:45:23,029 INFO [train.py:1028] (1/2) Epoch 22, batch 10000, loss[loss=0.2125, simple_loss=0.2761, pruned_loss=0.07448, over 12724.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2742, pruned_loss=0.07881, over 2487715.33 frames. ], batch size: 22, lr: 2.60e-03, grad_scale: 32.0 2024-06-21 15:45:25,460 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.266e+02 2.442e+02 2.655e+02 3.714e+02, threshold=4.884e+02, percent-clipped=0.0 2024-06-21 15:45:33,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.14 vs. limit=10.0 2024-06-21 15:45:45,122 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:45:51,843 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=407914.8333333333, ans=0.125 2024-06-21 15:45:54,834 INFO [train.py:1028] (1/2) Epoch 22, batch 10050, loss[loss=0.202, simple_loss=0.2661, pruned_loss=0.06898, over 12437.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2742, pruned_loss=0.07926, over 2443725.43 frames. ], batch size: 22, lr: 2.60e-03, grad_scale: 32.0 2024-06-21 15:46:12,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=407988.1666666667, ans=0.0 2024-06-21 15:46:13,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=407988.1666666667, ans=0.0 2024-06-21 15:46:13,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=407988.1666666667, ans=0.025 2024-06-21 15:46:17,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=408006.5, ans=0.125 2024-06-21 15:46:24,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=408024.8333333333, ans=0.025 2024-06-21 15:46:24,717 INFO [train.py:1028] (1/2) Epoch 22, batch 10100, loss[loss=0.2153, simple_loss=0.2744, pruned_loss=0.07816, over 11345.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2736, pruned_loss=0.07882, over 2424554.98 frames. 
], batch size: 17, lr: 2.60e-03, grad_scale: 32.0 2024-06-21 15:46:27,144 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.270e+02 2.484e+02 2.752e+02 5.288e+02, threshold=4.968e+02, percent-clipped=1.0 2024-06-21 15:46:27,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=408024.8333333333, ans=0.0 2024-06-21 15:46:29,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=408024.8333333333, ans=0.125 2024-06-21 15:46:31,854 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.74 vs. limit=6.0 2024-06-21 15:48:44,225 INFO [train.py:1028] (1/2) Epoch 23, batch 0, loss[loss=0.1972, simple_loss=0.2522, pruned_loss=0.07113, over 12953.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2522, pruned_loss=0.07113, over 12953.00 frames. ], batch size: 36, lr: 2.55e-03, grad_scale: 32.0 2024-06-21 15:48:44,226 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 15:48:51,153 INFO [train.py:1060] (1/2) Epoch 23, validation: loss=0.1885, simple_loss=0.2525, pruned_loss=0.06224, over 351949.00 frames. 2024-06-21 15:48:51,154 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 15:48:53,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=408057.8333333333, ans=0.07 2024-06-21 15:49:10,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=408094.5, ans=0.2 2024-06-21 15:49:25,106 INFO [train.py:1028] (1/2) Epoch 23, batch 50, loss[loss=0.2044, simple_loss=0.2631, pruned_loss=0.07287, over 12631.00 frames. ], tot_loss[loss=0.2, simple_loss=0.256, pruned_loss=0.07196, over 574585.14 frames. ], batch size: 29, lr: 2.55e-03, grad_scale: 32.0 2024-06-21 15:49:25,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=408149.5, ans=0.125 2024-06-21 15:49:29,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=408149.5, ans=0.0 2024-06-21 15:49:34,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=408167.8333333333, ans=0.125 2024-06-21 15:49:47,673 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.117e+02 2.231e+02 2.405e+02 4.538e+02, threshold=4.462e+02, percent-clipped=0.0 2024-06-21 15:50:01,818 INFO [train.py:1028] (1/2) Epoch 23, batch 100, loss[loss=0.1862, simple_loss=0.2517, pruned_loss=0.06039, over 13321.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2554, pruned_loss=0.07162, over 1017585.38 frames. ], batch size: 46, lr: 2.55e-03, grad_scale: 32.0 2024-06-21 15:50:02,322 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.29 vs. 
limit=15.0 2024-06-21 15:50:04,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=408241.1666666667, ans=0.0 2024-06-21 15:50:05,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=408241.1666666667, ans=0.125 2024-06-21 15:50:10,128 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.09 vs. limit=15.0 2024-06-21 15:50:29,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=408296.1666666667, ans=0.125 2024-06-21 15:50:31,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=408314.5, ans=0.0 2024-06-21 15:50:36,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2024-06-21 15:50:37,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=408332.8333333333, ans=0.125 2024-06-21 15:50:37,669 INFO [train.py:1028] (1/2) Epoch 23, batch 150, loss[loss=0.1861, simple_loss=0.2551, pruned_loss=0.0586, over 12699.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2537, pruned_loss=0.06996, over 1367213.08 frames. ], batch size: 29, lr: 2.55e-03, grad_scale: 32.0 2024-06-21 15:50:39,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2024-06-21 15:50:46,244 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.586e-03 2024-06-21 15:50:46,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.34 vs. limit=10.0 2024-06-21 15:50:56,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=408387.8333333333, ans=0.2 2024-06-21 15:50:56,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=408387.8333333333, ans=0.0 2024-06-21 15:51:00,840 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.142e+02 2.276e+02 2.640e+02 4.088e+02, threshold=4.553e+02, percent-clipped=0.0 2024-06-21 15:51:05,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=408406.1666666667, ans=0.0 2024-06-21 15:51:10,057 INFO [train.py:1028] (1/2) Epoch 23, batch 200, loss[loss=0.2053, simple_loss=0.2575, pruned_loss=0.07656, over 12590.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2542, pruned_loss=0.07031, over 1637220.44 frames. 
], batch size: 202, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:51:17,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=408442.8333333333, ans=0.125 2024-06-21 15:51:22,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=408461.1666666667, ans=0.0 2024-06-21 15:51:23,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=408461.1666666667, ans=0.0 2024-06-21 15:51:24,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2024-06-21 15:51:25,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.45 vs. limit=15.0 2024-06-21 15:51:31,274 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-21 15:51:31,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408479.5, ans=0.1 2024-06-21 15:51:33,068 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=12.0 2024-06-21 15:51:39,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=408497.8333333333, ans=0.0 2024-06-21 15:51:42,126 INFO [train.py:1028] (1/2) Epoch 23, batch 250, loss[loss=0.1954, simple_loss=0.2462, pruned_loss=0.07231, over 12978.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2538, pruned_loss=0.07007, over 1847751.25 frames. ], batch size: 144, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:51:46,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=408516.1666666667, ans=0.125 2024-06-21 15:51:55,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=408552.8333333333, ans=0.2 2024-06-21 15:51:56,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=408552.8333333333, ans=0.1 2024-06-21 15:52:07,927 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:52:08,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.104e+02 2.235e+02 2.394e+02 2.888e+02, threshold=4.469e+02, percent-clipped=0.0 2024-06-21 15:52:15,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=15.0 2024-06-21 15:52:22,671 INFO [train.py:1028] (1/2) Epoch 23, batch 300, loss[loss=0.2142, simple_loss=0.2574, pruned_loss=0.08545, over 13138.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2549, pruned_loss=0.07059, over 2010284.87 frames. 
], batch size: 112, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:52:34,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=408626.1666666667, ans=10.0 2024-06-21 15:52:41,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=408662.8333333333, ans=0.0 2024-06-21 15:52:47,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=408681.1666666667, ans=0.2 2024-06-21 15:52:54,074 INFO [train.py:1028] (1/2) Epoch 23, batch 350, loss[loss=0.1984, simple_loss=0.2554, pruned_loss=0.07073, over 13004.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2552, pruned_loss=0.07061, over 2140046.49 frames. ], batch size: 33, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:53:08,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=408736.1666666667, ans=0.05 2024-06-21 15:53:08,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=408736.1666666667, ans=0.125 2024-06-21 15:53:11,555 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2024-06-21 15:53:16,999 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.137e+02 2.255e+02 2.406e+02 3.009e+02, threshold=4.510e+02, percent-clipped=0.0 2024-06-21 15:53:19,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=408772.8333333333, ans=0.0 2024-06-21 15:53:19,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=408772.8333333333, ans=0.5 2024-06-21 15:53:21,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=408772.8333333333, ans=0.0 2024-06-21 15:53:25,774 INFO [train.py:1028] (1/2) Epoch 23, batch 400, loss[loss=0.197, simple_loss=0.2597, pruned_loss=0.06711, over 13264.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.255, pruned_loss=0.07043, over 2239742.15 frames. ], batch size: 63, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:53:26,194 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.80 vs. limit=15.0 2024-06-21 15:53:27,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=408791.1666666667, ans=0.0 2024-06-21 15:53:41,435 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.86 vs. limit=15.0 2024-06-21 15:53:55,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.84 vs. limit=6.0 2024-06-21 15:53:57,747 INFO [train.py:1028] (1/2) Epoch 23, batch 450, loss[loss=0.1915, simple_loss=0.2556, pruned_loss=0.06374, over 13265.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2548, pruned_loss=0.07033, over 2313594.67 frames. 
], batch size: 67, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:53:59,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=408882.8333333333, ans=0.0 2024-06-21 15:54:10,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=408901.1666666667, ans=0.125 2024-06-21 15:54:10,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=408901.1666666667, ans=0.125 2024-06-21 15:54:14,779 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.81 vs. limit=22.5 2024-06-21 15:54:18,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=408919.5, ans=0.025 2024-06-21 15:54:23,849 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.132e+02 2.260e+02 2.448e+02 3.944e+02, threshold=4.520e+02, percent-clipped=0.0 2024-06-21 15:54:25,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.07 vs. limit=22.5 2024-06-21 15:54:34,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=408956.1666666667, ans=0.025 2024-06-21 15:54:35,934 INFO [train.py:1028] (1/2) Epoch 23, batch 500, loss[loss=0.1773, simple_loss=0.2314, pruned_loss=0.06155, over 13194.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2545, pruned_loss=0.0697, over 2376106.68 frames. ], batch size: 121, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:54:42,311 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.67 vs. limit=5.0 2024-06-21 15:54:47,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=408992.8333333333, ans=0.2 2024-06-21 15:54:53,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=409011.1666666667, ans=0.125 2024-06-21 15:55:08,137 INFO [train.py:1028] (1/2) Epoch 23, batch 550, loss[loss=0.1977, simple_loss=0.2476, pruned_loss=0.07386, over 12900.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2535, pruned_loss=0.06914, over 2421314.10 frames. ], batch size: 158, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:55:08,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=409066.1666666667, ans=0.025 2024-06-21 15:55:08,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.91 vs. 
limit=6.0 2024-06-21 15:55:11,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=409066.1666666667, ans=0.0 2024-06-21 15:55:12,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=409066.1666666667, ans=0.125 2024-06-21 15:55:13,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=409066.1666666667, ans=0.125 2024-06-21 15:55:15,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409084.5, ans=0.1 2024-06-21 15:55:17,619 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.29 vs. limit=10.0 2024-06-21 15:55:18,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=409084.5, ans=0.025 2024-06-21 15:55:24,419 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.87 vs. limit=10.0 2024-06-21 15:55:28,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=409121.1666666667, ans=0.0 2024-06-21 15:55:30,916 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.127e+02 2.243e+02 2.449e+02 3.168e+02, threshold=4.485e+02, percent-clipped=0.0 2024-06-21 15:55:33,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=409139.5, ans=0.5 2024-06-21 15:55:37,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=409139.5, ans=0.125 2024-06-21 15:55:39,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=409157.8333333333, ans=0.07 2024-06-21 15:55:39,758 INFO [train.py:1028] (1/2) Epoch 23, batch 600, loss[loss=0.1859, simple_loss=0.2325, pruned_loss=0.0697, over 13035.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2535, pruned_loss=0.06947, over 2458819.16 frames. ], batch size: 144, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:56:14,435 INFO [train.py:1028] (1/2) Epoch 23, batch 650, loss[loss=0.2174, simple_loss=0.2766, pruned_loss=0.07908, over 13212.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2544, pruned_loss=0.06951, over 2489785.53 frames. 
], batch size: 59, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:56:23,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=409267.8333333333, ans=0.125 2024-06-21 15:56:28,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=409286.1666666667, ans=0.025 2024-06-21 15:56:29,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=409286.1666666667, ans=0.2 2024-06-21 15:56:33,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=409304.5, ans=0.2 2024-06-21 15:56:39,869 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.168e+02 2.337e+02 2.527e+02 3.744e+02, threshold=4.674e+02, percent-clipped=0.0 2024-06-21 15:56:47,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=409322.8333333333, ans=0.2 2024-06-21 15:56:48,523 INFO [train.py:1028] (1/2) Epoch 23, batch 700, loss[loss=0.1843, simple_loss=0.2491, pruned_loss=0.05977, over 13305.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2546, pruned_loss=0.06983, over 2512553.75 frames. ], batch size: 46, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:56:53,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=409341.1666666667, ans=0.0 2024-06-21 15:56:53,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=409341.1666666667, ans=0.2 2024-06-21 15:57:04,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=409377.8333333333, ans=0.025 2024-06-21 15:57:17,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409414.5, ans=0.1 2024-06-21 15:57:19,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=409414.5, ans=0.0 2024-06-21 15:57:20,461 INFO [train.py:1028] (1/2) Epoch 23, batch 750, loss[loss=0.1774, simple_loss=0.2454, pruned_loss=0.0547, over 13296.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.255, pruned_loss=0.06994, over 2527723.63 frames. ], batch size: 63, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:57:31,115 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.39 vs. limit=15.0 2024-06-21 15:57:33,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=409469.5, ans=0.125 2024-06-21 15:57:42,940 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.152e+02 2.285e+02 2.475e+02 3.135e+02, threshold=4.570e+02, percent-clipped=0.0 2024-06-21 15:57:51,949 INFO [train.py:1028] (1/2) Epoch 23, batch 800, loss[loss=0.1662, simple_loss=0.226, pruned_loss=0.0532, over 13002.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2544, pruned_loss=0.0698, over 2540557.17 frames. ], batch size: 36, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:57:56,172 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.13 vs. 
limit=10.0 2024-06-21 15:58:07,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=409561.1666666667, ans=0.2 2024-06-21 15:58:09,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=409561.1666666667, ans=0.125 2024-06-21 15:58:19,607 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:58:27,672 INFO [train.py:1028] (1/2) Epoch 23, batch 850, loss[loss=0.1842, simple_loss=0.243, pruned_loss=0.06269, over 13192.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2537, pruned_loss=0.06949, over 2551383.85 frames. ], batch size: 95, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:58:38,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=409634.5, ans=0.125 2024-06-21 15:58:51,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=409671.1666666667, ans=0.125 2024-06-21 15:58:54,188 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.820e+02 2.149e+02 2.289e+02 2.463e+02 3.648e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 15:58:58,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409689.5, ans=0.1 2024-06-21 15:59:03,467 INFO [train.py:1028] (1/2) Epoch 23, batch 900, loss[loss=0.1876, simple_loss=0.2492, pruned_loss=0.06306, over 12846.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2537, pruned_loss=0.06965, over 2556004.18 frames. ], batch size: 36, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:59:07,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=409707.8333333333, ans=0.125 2024-06-21 15:59:09,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=409726.1666666667, ans=0.125 2024-06-21 15:59:09,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=409726.1666666667, ans=0.125 2024-06-21 15:59:12,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=409726.1666666667, ans=0.125 2024-06-21 15:59:21,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409744.5, ans=0.1 2024-06-21 15:59:22,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=409762.8333333333, ans=0.125 2024-06-21 15:59:36,239 INFO [train.py:1028] (1/2) Epoch 23, batch 950, loss[loss=0.1876, simple_loss=0.2548, pruned_loss=0.06024, over 13002.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2541, pruned_loss=0.06977, over 2559643.04 frames. ], batch size: 39, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:59:49,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. 
limit=15.0 2024-06-21 15:59:53,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=409836.1666666667, ans=0.125 2024-06-21 15:59:59,038 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 2.132e+02 2.283e+02 2.449e+02 2.805e+02, threshold=4.565e+02, percent-clipped=0.0 2024-06-21 16:00:02,006 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.96 vs. limit=22.5 2024-06-21 16:00:10,741 INFO [train.py:1028] (1/2) Epoch 23, batch 1000, loss[loss=0.199, simple_loss=0.2583, pruned_loss=0.06983, over 13024.00 frames. ], tot_loss[loss=0.197, simple_loss=0.254, pruned_loss=0.07005, over 2561346.36 frames. ], batch size: 48, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:00:17,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=409891.1666666667, ans=0.025 2024-06-21 16:00:18,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=409909.5, ans=0.125 2024-06-21 16:00:18,659 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.22 vs. limit=15.0 2024-06-21 16:00:24,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=409927.8333333333, ans=0.0 2024-06-21 16:00:33,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=409946.1666666667, ans=0.125 2024-06-21 16:00:38,879 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:00:39,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=409964.5, ans=0.0 2024-06-21 16:00:47,742 INFO [train.py:1028] (1/2) Epoch 23, batch 1050, loss[loss=0.199, simple_loss=0.2614, pruned_loss=0.0683, over 13164.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2544, pruned_loss=0.07017, over 2564515.69 frames. ], batch size: 77, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:01:02,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410019.5, ans=0.1 2024-06-21 16:01:06,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=410037.8333333333, ans=0.125 2024-06-21 16:01:10,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=410037.8333333333, ans=0.125 2024-06-21 16:01:11,523 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.103e+02 2.287e+02 2.506e+02 3.300e+02, threshold=4.574e+02, percent-clipped=0.0 2024-06-21 16:01:11,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410037.8333333333, ans=0.1 2024-06-21 16:01:17,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=410056.1666666667, ans=0.125 2024-06-21 16:01:20,974 INFO [train.py:1028] (1/2) Epoch 23, batch 1100, loss[loss=0.2026, simple_loss=0.2645, pruned_loss=0.07039, over 13211.00 frames. 
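Most of the entries tagged [scaling.py:214] report a ScheduledFloat: a hyper-parameter (dropout_p, skip_rate, prob, scale_min, const_attention_rate, ...) whose value `ans` is looked up from the current batch_count rather than fixed. A minimal sketch of such a schedule as piecewise-linear interpolation over batch count; the breakpoints below are illustrative only, not the ones used in this run:

```python
# Sketch of a batch-count-indexed float schedule, assuming piecewise-linear
# interpolation between (batch_count, value) breakpoints. Breakpoints below
# are illustrative.
import bisect

class ScheduledFloatSketch:
    def __init__(self, *points):
        # points: ((batch_count, value), ...), sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(409084.5))  # -> 0.1, flat this late in training
```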
], tot_loss[loss=0.1971, simple_loss=0.2542, pruned_loss=0.06995, over 2569652.14 frames. ], batch size: 52, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:01:21,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=410074.5, ans=0.125 2024-06-21 16:01:24,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410074.5, ans=0.1 2024-06-21 16:01:25,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=410074.5, ans=0.125 2024-06-21 16:01:36,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410111.1666666667, ans=0.1 2024-06-21 16:01:43,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=410129.5, ans=0.0 2024-06-21 16:01:46,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410129.5, ans=0.1 2024-06-21 16:01:46,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=410129.5, ans=0.0 2024-06-21 16:01:47,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=410147.8333333333, ans=0.125 2024-06-21 16:01:48,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=410147.8333333333, ans=0.125 2024-06-21 16:01:48,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=410147.8333333333, ans=0.0 2024-06-21 16:01:50,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=410147.8333333333, ans=0.0 2024-06-21 16:01:54,284 INFO [train.py:1028] (1/2) Epoch 23, batch 1150, loss[loss=0.206, simple_loss=0.2684, pruned_loss=0.07177, over 13247.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2546, pruned_loss=0.07012, over 2571892.26 frames. ], batch size: 52, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:01:55,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2024-06-21 16:01:56,891 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.76 vs. limit=22.5 2024-06-21 16:01:58,718 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.05 vs. limit=22.5 2024-06-21 16:02:01,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=410184.5, ans=0.125 2024-06-21 16:02:20,227 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.150e+02 2.238e+02 2.488e+02 3.313e+02, threshold=4.476e+02, percent-clipped=0.0 2024-06-21 16:02:24,039 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.13 vs. 
limit=15.0 2024-06-21 16:02:28,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=410239.5, ans=0.025 2024-06-21 16:02:29,567 INFO [train.py:1028] (1/2) Epoch 23, batch 1200, loss[loss=0.1833, simple_loss=0.249, pruned_loss=0.0588, over 13139.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2543, pruned_loss=0.07004, over 2573329.09 frames. ], batch size: 77, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:02:55,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=410312.8333333333, ans=0.125 2024-06-21 16:02:58,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410331.1666666667, ans=0.1 2024-06-21 16:03:02,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=410331.1666666667, ans=0.125 2024-06-21 16:03:03,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=410331.1666666667, ans=0.0 2024-06-21 16:03:04,964 INFO [train.py:1028] (1/2) Epoch 23, batch 1250, loss[loss=0.2022, simple_loss=0.2553, pruned_loss=0.07451, over 13158.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2543, pruned_loss=0.06994, over 2583265.78 frames. ], batch size: 112, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:03:13,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.13 vs. limit=15.0 2024-06-21 16:03:16,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=410367.8333333333, ans=0.125 2024-06-21 16:03:19,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=410386.1666666667, ans=0.125 2024-06-21 16:03:28,490 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.089e+02 2.211e+02 2.343e+02 3.024e+02, threshold=4.422e+02, percent-clipped=0.0 2024-06-21 16:03:37,730 INFO [train.py:1028] (1/2) Epoch 23, batch 1300, loss[loss=0.2191, simple_loss=0.2656, pruned_loss=0.08628, over 12806.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2548, pruned_loss=0.07023, over 2583217.90 frames. ], batch size: 177, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:03:38,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=15.0 2024-06-21 16:03:39,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=410441.1666666667, ans=0.0 2024-06-21 16:03:43,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=410459.5, ans=0.125 2024-06-21 16:03:44,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=410459.5, ans=0.2 2024-06-21 16:03:55,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=410477.8333333333, ans=0.025 2024-06-21 16:04:01,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=410496.1666666667, ans=0.0 2024-06-21 16:04:09,869 INFO [train.py:1028] (1/2) Epoch 23, batch 1350, loss[loss=0.2144, simple_loss=0.2703, pruned_loss=0.07928, over 13221.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2552, pruned_loss=0.07009, over 2585634.24 frames. ], batch size: 59, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:04:20,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=410551.1666666667, ans=10.0 2024-06-21 16:04:25,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=410569.5, ans=0.125 2024-06-21 16:04:31,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=410569.5, ans=0.125 2024-06-21 16:04:37,430 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.138e+02 2.264e+02 2.397e+02 2.852e+02, threshold=4.528e+02, percent-clipped=0.0 2024-06-21 16:04:41,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=410606.1666666667, ans=0.0 2024-06-21 16:04:48,583 INFO [train.py:1028] (1/2) Epoch 23, batch 1400, loss[loss=0.2052, simple_loss=0.2607, pruned_loss=0.07489, over 12826.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2552, pruned_loss=0.07011, over 2587202.89 frames. ], batch size: 26, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:05:01,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=6.0 2024-06-21 16:05:09,359 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.44 vs. 
limit=15.0 2024-06-21 16:05:11,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=410661.1666666667, ans=0.125 2024-06-21 16:05:16,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=410679.5, ans=0.0 2024-06-21 16:05:19,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=410697.8333333333, ans=15.0 2024-06-21 16:05:24,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=410697.8333333333, ans=22.5 2024-06-21 16:05:26,784 INFO [train.py:1028] (1/2) Epoch 23, batch 1450, loss[loss=0.1829, simple_loss=0.2368, pruned_loss=0.06454, over 13149.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2553, pruned_loss=0.07037, over 2586569.93 frames. ], batch size: 121, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:05:34,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=410734.5, ans=0.035 2024-06-21 16:05:35,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410734.5, ans=0.1 2024-06-21 16:05:46,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.70 vs. limit=15.0 2024-06-21 16:05:50,443 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.166e+02 2.315e+02 2.437e+02 3.068e+02, threshold=4.630e+02, percent-clipped=0.0 2024-06-21 16:05:50,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=410771.1666666667, ans=0.125 2024-06-21 16:05:55,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=410789.5, ans=0.0 2024-06-21 16:05:58,852 INFO [train.py:1028] (1/2) Epoch 23, batch 1500, loss[loss=0.2051, simple_loss=0.2607, pruned_loss=0.07474, over 13192.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2553, pruned_loss=0.07083, over 2588927.58 frames. ], batch size: 83, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:06:05,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=410826.1666666667, ans=0.125 2024-06-21 16:06:14,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=410844.5, ans=0.0 2024-06-21 16:06:23,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=410862.8333333333, ans=0.0 2024-06-21 16:06:25,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=410862.8333333333, ans=0.125 2024-06-21 16:06:25,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=410862.8333333333, ans=0.05 2024-06-21 16:06:34,114 INFO [train.py:1028] (1/2) Epoch 23, batch 1550, loss[loss=0.2243, simple_loss=0.2706, pruned_loss=0.08893, over 13034.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2553, pruned_loss=0.071, over 2583859.33 frames. 
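The [scaling.py:1023] "Whitening" entries report a measured whitening metric against a per-module limit; values above the limit trigger a corrective gradient penalty that pushes the layer's features back toward a roughly isotropic covariance. Loosely, the metric is ~1 when the feature covariance is proportional to the identity and grows with anisotropy. A simplified sketch of such a metric, as a paraphrase of the idea rather than the exact production formula:

```python
# Simplified whitening metric: ~1.0 when the covariance of the features is
# proportional to the identity, larger as it becomes anisotropic. This is a
# paraphrase of the idea behind "metric=... vs. limit=...", not the exact code.
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels), zero-mean assumed for simplicity
    n, c = x.shape
    cov = (x.T @ x) / n                      # (c, c) covariance estimate
    mean_diag = cov.diagonal().mean()        # average per-channel variance
    mean_sq = (cov ** 2).mean() * c          # scaled mean squared entry
    return mean_sq / (mean_diag ** 2 + 1e-20)

white = torch.randn(10000, 192)                  # i.i.d. features: metric ~ 1
skewed = white * torch.linspace(0.1, 3.0, 192)   # anisotropic: metric >> 1
print(whitening_metric(white), whitening_metric(skewed))
```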
], batch size: 102, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:06:38,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=410899.5, ans=0.1 2024-06-21 16:06:39,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=22.5 2024-06-21 16:06:42,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2024-06-21 16:06:48,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=410917.8333333333, ans=0.125 2024-06-21 16:06:59,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=410954.5, ans=0.125 2024-06-21 16:07:02,016 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.216e+02 2.384e+02 2.553e+02 3.642e+02, threshold=4.768e+02, percent-clipped=0.0 2024-06-21 16:07:09,434 INFO [train.py:1028] (1/2) Epoch 23, batch 1600, loss[loss=0.186, simple_loss=0.2528, pruned_loss=0.05963, over 13144.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2555, pruned_loss=0.07085, over 2579397.26 frames. ], batch size: 77, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:07:25,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=411027.8333333333, ans=0.0 2024-06-21 16:07:30,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.65 vs. limit=12.0 2024-06-21 16:07:40,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=411064.5, ans=0.2 2024-06-21 16:07:42,353 INFO [train.py:1028] (1/2) Epoch 23, batch 1650, loss[loss=0.1984, simple_loss=0.2496, pruned_loss=0.07361, over 13199.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2556, pruned_loss=0.07108, over 2575868.72 frames. ], batch size: 95, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:07:49,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=411101.1666666667, ans=0.125 2024-06-21 16:07:57,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=411119.5, ans=0.125 2024-06-21 16:07:59,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=411119.5, ans=0.125 2024-06-21 16:08:01,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=411137.8333333333, ans=0.0 2024-06-21 16:08:07,093 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.123e+02 2.253e+02 2.398e+02 2.932e+02, threshold=4.506e+02, percent-clipped=0.0 2024-06-21 16:08:16,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=411156.1666666667, ans=0.09899494936611666 2024-06-21 16:08:19,318 INFO [train.py:1028] (1/2) Epoch 23, batch 1700, loss[loss=0.2099, simple_loss=0.2647, pruned_loss=0.07759, over 12759.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2561, pruned_loss=0.0709, over 2581300.44 frames. 
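The recurring WARNING from [optim.py:487] is periodic diagnostics rather than an error: it prints five statistics of recently observed gradient norms (apparently min, the three quartiles, and max), the clipping threshold derived from them, and the fraction of batches actually clipped (0.0% throughout this section). In every logged instance the threshold equals Clipping_scale times the middle value (e.g. 2.0 * 2.384e+02 = 4.768e+02 just above), so the threshold tracks the median of recent norms. A minimal sketch in that spirit; the window size and exact bookkeeping are assumptions:

```python
# Sketch of adaptive gradient clipping driven by recent grad-norm statistics:
# keep a buffer of norms, set threshold = clipping_scale * median, and scale
# gradients down only when the current norm exceeds it. The window size is
# an assumption; the log shows only the statistics, threshold, and
# percent-clipped.
from collections import deque
import statistics
import torch

class NormClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.clipped = 0
        self.seen = 0

    def __call__(self, params) -> None:
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)
        threshold = self.scale * statistics.median(self.norms)
        self.seen += 1
        if norm > threshold:
            self.clipped += 1
            for g in grads:
                g.mul_(threshold / norm)  # shrink so the norm hits threshold

    def percent_clipped(self) -> float:
        return 100.0 * self.clipped / max(self.seen, 1)
```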
], batch size: 26, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:08:25,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=411192.8333333333, ans=0.125 2024-06-21 16:08:34,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=411211.1666666667, ans=0.04949747468305833 2024-06-21 16:08:38,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=411229.5, ans=10.0 2024-06-21 16:08:39,925 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.26 vs. limit=15.0 2024-06-21 16:08:44,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=411247.8333333333, ans=0.125 2024-06-21 16:08:47,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2024-06-21 16:08:52,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=411247.8333333333, ans=0.0 2024-06-21 16:08:54,151 INFO [train.py:1028] (1/2) Epoch 23, batch 1750, loss[loss=0.1778, simple_loss=0.2428, pruned_loss=0.05637, over 12614.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2558, pruned_loss=0.07075, over 2582198.80 frames. ], batch size: 22, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:08:54,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=411266.1666666667, ans=0.05 2024-06-21 16:09:04,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=411284.5, ans=0.025 2024-06-21 16:09:18,524 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.118e+02 2.248e+02 2.419e+02 3.038e+02, threshold=4.496e+02, percent-clipped=0.0 2024-06-21 16:09:20,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=411339.5, ans=0.125 2024-06-21 16:09:26,112 INFO [train.py:1028] (1/2) Epoch 23, batch 1800, loss[loss=0.1987, simple_loss=0.2581, pruned_loss=0.06964, over 13253.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2557, pruned_loss=0.07083, over 2582530.01 frames. ], batch size: 67, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:09:30,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=411357.8333333333, ans=0.2 2024-06-21 16:09:30,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=411357.8333333333, ans=0.125 2024-06-21 16:09:36,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. 
limit=6.0 2024-06-21 16:09:36,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=411376.1666666667, ans=0.1 2024-06-21 16:09:45,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=411412.8333333333, ans=0.0 2024-06-21 16:09:50,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=411412.8333333333, ans=0.0 2024-06-21 16:09:52,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=411431.1666666667, ans=0.125 2024-06-21 16:09:54,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.89 vs. limit=15.0 2024-06-21 16:09:58,569 INFO [train.py:1028] (1/2) Epoch 23, batch 1850, loss[loss=0.1864, simple_loss=0.239, pruned_loss=0.06689, over 13168.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2555, pruned_loss=0.07062, over 2583920.55 frames. ], batch size: 83, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:10:04,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.05 vs. limit=15.0 2024-06-21 16:10:24,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=411504.5, ans=0.0 2024-06-21 16:10:26,469 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.138e+02 2.267e+02 2.439e+02 2.868e+02, threshold=4.535e+02, percent-clipped=0.0 2024-06-21 16:10:28,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=411522.8333333333, ans=0.05 2024-06-21 16:10:31,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=411522.8333333333, ans=0.07 2024-06-21 16:10:34,500 INFO [train.py:1028] (1/2) Epoch 23, batch 1900, loss[loss=0.1957, simple_loss=0.2515, pruned_loss=0.06993, over 13109.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2546, pruned_loss=0.07064, over 2586279.82 frames. ], batch size: 95, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:10:48,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=411559.5, ans=0.05 2024-06-21 16:10:55,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=411577.8333333333, ans=0.125 2024-06-21 16:11:05,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=411614.5, ans=0.0 2024-06-21 16:11:08,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=411614.5, ans=0.125 2024-06-21 16:11:09,995 INFO [train.py:1028] (1/2) Epoch 23, batch 1950, loss[loss=0.2033, simple_loss=0.2599, pruned_loss=0.07336, over 13251.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2543, pruned_loss=0.07081, over 2591394.02 frames. 
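The batch summaries also carry a grad_scale field, which drops from 64.0 to 32.0 around batch 1400 in this section and then holds. That is the dynamic loss scale of fp16 mixed-precision training: the scale is halved when a scaled gradient overflows and otherwise grown slowly, and a 64.0 -> 32.0 step is consistent with one overflow event. A minimal sketch using PyTorch's stock GradScaler; the init_scale below is illustrative, since this run's scaler state was carried over from earlier training:

```python
# Sketch of the fp16 dynamic loss scaling behind the logged grad_scale values.
# A halving such as 64.0 -> 32.0 is what GradScaler does after detecting an
# inf/nan in the scaled gradients; init_scale here is illustrative.
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=64.0)
model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2.54e-3)

for _ in range(2):
    x = torch.randn(16, 80, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads, skips step on overflow
    scaler.update()                 # halves scale on overflow, else grows it
    print("grad_scale:", scaler.get_scale())
```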
], batch size: 52, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:11:34,614 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.173e+02 2.354e+02 2.588e+02 3.801e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 16:11:39,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=411706.1666666667, ans=0.125 2024-06-21 16:11:42,739 INFO [train.py:1028] (1/2) Epoch 23, batch 2000, loss[loss=0.2065, simple_loss=0.2703, pruned_loss=0.07131, over 12588.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2541, pruned_loss=0.07076, over 2586973.37 frames. ], batch size: 22, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:11:52,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=411742.8333333333, ans=0.125 2024-06-21 16:11:58,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=411761.1666666667, ans=0.125 2024-06-21 16:12:04,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=411779.5, ans=0.125 2024-06-21 16:12:19,537 INFO [train.py:1028] (1/2) Epoch 23, batch 2050, loss[loss=0.1937, simple_loss=0.251, pruned_loss=0.06824, over 12500.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2547, pruned_loss=0.07101, over 2582609.30 frames. ], batch size: 29, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:12:23,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=411816.1666666667, ans=0.125 2024-06-21 16:12:23,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=411816.1666666667, ans=0.1 2024-06-21 16:12:26,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=411834.5, ans=0.125 2024-06-21 16:12:37,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=411852.8333333333, ans=0.125 2024-06-21 16:12:39,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.54 vs. limit=15.0 2024-06-21 16:12:44,758 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.165e+02 2.276e+02 2.490e+02 3.428e+02, threshold=4.553e+02, percent-clipped=0.0 2024-06-21 16:12:46,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=411889.5, ans=0.1 2024-06-21 16:12:52,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=411889.5, ans=0.0 2024-06-21 16:12:55,508 INFO [train.py:1028] (1/2) Epoch 23, batch 2100, loss[loss=0.2007, simple_loss=0.2594, pruned_loss=0.07101, over 13191.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2552, pruned_loss=0.07065, over 2585444.05 frames. 
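Many of the scheduled values above belong to activation balancers (balancer1.prob, balancer.min_positive, balancer1.max_abs, balancer2.min_abs, ...): modules that watch activation statistics and, with probability prob, nudge gradients so the fraction of positive activations and the typical magnitude stay inside a target range. A simplified sketch of the monitoring half of that idea; the gradient-correction machinery is omitted and the bounds below are illustrative:

```python
# Simplified sketch of what a balancer monitors: the fraction of positive
# activations and the mean absolute value, checked against configured bounds.
# Real balancers additionally inject small gradient corrections with some
# probability `prob`; that part is omitted here.
import torch

def balancer_stats(x: torch.Tensor,
                   min_positive: float = 0.05,
                   max_abs: float = 10.0) -> dict:
    frac_positive = (x > 0).float().mean().item()
    mean_abs = x.abs().mean().item()
    return {
        "frac_positive": frac_positive,
        "mean_abs": mean_abs,
        "positive_ok": frac_positive >= min_positive,
        "abs_ok": mean_abs <= max_abs,
    }

print(balancer_stats(torch.randn(1000, 256)))
```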
], batch size: 59, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:13:02,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=411926.1666666667, ans=0.5 2024-06-21 16:13:25,327 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.04 vs. limit=15.0 2024-06-21 16:13:27,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=411981.1666666667, ans=0.125 2024-06-21 16:13:27,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=411981.1666666667, ans=0.0 2024-06-21 16:13:28,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.78 vs. limit=15.0 2024-06-21 16:13:28,861 INFO [train.py:1028] (1/2) Epoch 23, batch 2150, loss[loss=0.1975, simple_loss=0.2564, pruned_loss=0.06934, over 13293.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.255, pruned_loss=0.07057, over 2588965.52 frames. ], batch size: 52, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:13:51,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=412054.5, ans=0.5 2024-06-21 16:13:53,416 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.208e+02 2.368e+02 2.567e+02 3.186e+02, threshold=4.736e+02, percent-clipped=0.0 2024-06-21 16:13:54,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=412072.8333333333, ans=0.125 2024-06-21 16:13:57,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2024-06-21 16:14:00,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412072.8333333333, ans=0.1 2024-06-21 16:14:00,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=412072.8333333333, ans=0.2 2024-06-21 16:14:01,344 INFO [train.py:1028] (1/2) Epoch 23, batch 2200, loss[loss=0.2135, simple_loss=0.2656, pruned_loss=0.08071, over 13212.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.255, pruned_loss=0.07063, over 2589685.24 frames. ], batch size: 83, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:14:08,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=15.0 2024-06-21 16:14:23,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=412146.1666666667, ans=0.125 2024-06-21 16:14:25,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=412146.1666666667, ans=0.125 2024-06-21 16:14:29,719 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.97 vs. 
limit=15.0 2024-06-21 16:14:32,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=412164.5, ans=0.125 2024-06-21 16:14:36,489 INFO [train.py:1028] (1/2) Epoch 23, batch 2250, loss[loss=0.1971, simple_loss=0.2553, pruned_loss=0.0695, over 13254.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2549, pruned_loss=0.07051, over 2587280.28 frames. ], batch size: 63, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:14:51,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=412219.5, ans=0.125 2024-06-21 16:15:03,733 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.191e+02 2.347e+02 2.542e+02 2.948e+02, threshold=4.694e+02, percent-clipped=0.0 2024-06-21 16:15:11,920 INFO [train.py:1028] (1/2) Epoch 23, batch 2300, loss[loss=0.1802, simple_loss=0.2408, pruned_loss=0.05981, over 12968.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2551, pruned_loss=0.07019, over 2580854.55 frames. ], batch size: 33, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:15:12,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=412274.5, ans=0.025 2024-06-21 16:15:19,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=412292.8333333333, ans=0.2 2024-06-21 16:15:34,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2024-06-21 16:15:34,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=412329.5, ans=0.07 2024-06-21 16:15:37,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2024-06-21 16:15:44,354 INFO [train.py:1028] (1/2) Epoch 23, batch 2350, loss[loss=0.1964, simple_loss=0.2587, pruned_loss=0.06712, over 13185.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.255, pruned_loss=0.0703, over 2584405.96 frames. ], batch size: 67, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:15:48,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412366.1666666667, ans=0.1 2024-06-21 16:15:58,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=412402.8333333333, ans=0.0 2024-06-21 16:16:03,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=412421.1666666667, ans=0.07 2024-06-21 16:16:07,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=412421.1666666667, ans=0.07 2024-06-21 16:16:09,428 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.138e+02 2.256e+02 2.402e+02 3.027e+02, threshold=4.511e+02, percent-clipped=0.0 2024-06-21 16:16:20,124 INFO [train.py:1028] (1/2) Epoch 23, batch 2400, loss[loss=0.1948, simple_loss=0.2571, pruned_loss=0.06627, over 13320.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2546, pruned_loss=0.07051, over 2586443.88 frames. 
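Several schedules above govern bypass modules (bypass.skip_rate, bypass.scale_min, bypass_mid.scale_min): each layer's output is mixed back with its input through a learned scale whose lower clamp (scale_min, logged at 0.2 here) is itself scheduled, and during training whole layers can be skipped with probability skip_rate. A minimal sketch of that residual-bypass idea; the module internals are illustrative, not the production implementation:

```python
# Sketch of a bypass/skip wrapper: out = x + scale * f(x), with the learned
# per-channel scale clamped to [scale_min, 1.0], and the whole layer
# stochastically skipped with probability skip_rate during training.
# Internals are illustrative.
import torch

class BypassSketch(torch.nn.Module):
    def __init__(self, layer: torch.nn.Module, dim: int,
                 scale_min: float = 0.2, skip_rate: float = 0.0):
        super().__init__()
        self.layer = layer
        self.scale = torch.nn.Parameter(torch.full((dim,), 0.5))
        self.scale_min = scale_min
        self.skip_rate = skip_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.skip_rate:
            return x  # layer skipped entirely this step
        scale = self.scale.clamp(self.scale_min, 1.0)
        return x + scale * self.layer(x)

block = BypassSketch(torch.nn.Linear(256, 256), dim=256,
                     scale_min=0.2, skip_rate=0.07)
y = block(torch.randn(8, 256))
```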
], batch size: 46, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:16:24,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=412457.8333333333, ans=0.125 2024-06-21 16:16:25,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=412457.8333333333, ans=0.1 2024-06-21 16:16:26,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=412476.1666666667, ans=0.125 2024-06-21 16:16:30,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=412476.1666666667, ans=0.125 2024-06-21 16:16:32,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=412494.5, ans=0.125 2024-06-21 16:16:32,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=412494.5, ans=15.0 2024-06-21 16:16:49,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=412531.1666666667, ans=0.07 2024-06-21 16:16:49,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=412531.1666666667, ans=0.125 2024-06-21 16:16:50,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412531.1666666667, ans=0.1 2024-06-21 16:16:50,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=412531.1666666667, ans=0.125 2024-06-21 16:16:55,991 INFO [train.py:1028] (1/2) Epoch 23, batch 2450, loss[loss=0.1895, simple_loss=0.244, pruned_loss=0.06747, over 13272.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2538, pruned_loss=0.07041, over 2582772.80 frames. ], batch size: 63, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:17:03,840 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:17:10,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=412586.1666666667, ans=0.125 2024-06-21 16:17:20,672 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.161e+02 2.269e+02 2.540e+02 3.446e+02, threshold=4.537e+02, percent-clipped=0.0 2024-06-21 16:17:28,549 INFO [train.py:1028] (1/2) Epoch 23, batch 2500, loss[loss=0.1946, simple_loss=0.2514, pruned_loss=0.06888, over 13185.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2534, pruned_loss=0.06983, over 2585997.67 frames. 
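In each batch summary, the first loss[...] block covers only the current batch (roughly 13k frames), while tot_loss[...] aggregates over ~2.5M frames, which is why it moves so slowly. A sketch of frame-weighted aggregation consistent with those numbers; the real tracker's periodic decay/reset policy is not visible in the log and is omitted here:

```python
# Sketch of a frame-weighted running loss like the tot_loss[...] entries:
# each batch contributes proportionally to its number of frames. The real
# tracker also decays/resets periodically, which is omitted.
class RunningLoss:
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tot = RunningLoss()
tot.update(0.1948, 13320.0)   # one logged batch (batch 2400 above)
print(f"tot_loss={tot.value:.4f} over {tot.frames:.2f} frames")
```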
], batch size: 83, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:17:29,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=412641.1666666667, ans=0.09899494936611666 2024-06-21 16:17:38,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=412659.5, ans=0.125 2024-06-21 16:17:38,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=412659.5, ans=0.1 2024-06-21 16:17:40,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412677.8333333333, ans=0.1 2024-06-21 16:17:46,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=412677.8333333333, ans=0.125 2024-06-21 16:17:46,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2024-06-21 16:17:48,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=412696.1666666667, ans=0.0 2024-06-21 16:17:48,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=412696.1666666667, ans=0.125 2024-06-21 16:18:00,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=412732.8333333333, ans=0.2 2024-06-21 16:18:00,543 INFO [train.py:1028] (1/2) Epoch 23, batch 2550, loss[loss=0.2153, simple_loss=0.2812, pruned_loss=0.07472, over 12404.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2527, pruned_loss=0.0696, over 2586484.44 frames. ], batch size: 22, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:18:01,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=412732.8333333333, ans=0.0 2024-06-21 16:18:27,684 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.132e+02 2.240e+02 2.447e+02 2.935e+02, threshold=4.480e+02, percent-clipped=0.0 2024-06-21 16:18:33,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=412806.1666666667, ans=0.125 2024-06-21 16:18:38,021 INFO [train.py:1028] (1/2) Epoch 23, batch 2600, loss[loss=0.1892, simple_loss=0.2501, pruned_loss=0.06416, over 13260.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2515, pruned_loss=0.06935, over 2586373.59 frames. ], batch size: 52, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:18:38,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=412824.5, ans=0.125 2024-06-21 16:18:43,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=412842.8333333333, ans=0.125 2024-06-21 16:18:44,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=412842.8333333333, ans=0.1 2024-06-21 16:18:47,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.50 vs. 
limit=6.0 2024-06-21 16:18:51,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=412861.1666666667, ans=0.125 2024-06-21 16:18:59,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0 2024-06-21 16:19:02,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=412879.5, ans=0.0 2024-06-21 16:19:04,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=412897.8333333333, ans=0.125 2024-06-21 16:19:10,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=412916.1666666667, ans=0.04949747468305833 2024-06-21 16:19:10,558 INFO [train.py:1028] (1/2) Epoch 23, batch 2650, loss[loss=0.1744, simple_loss=0.2253, pruned_loss=0.06178, over 13058.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2499, pruned_loss=0.06889, over 2586748.44 frames. ], batch size: 144, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:19:17,637 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.551e+00 2024-06-21 16:19:25,022 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=8.0 2024-06-21 16:19:34,894 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.089e+02 2.242e+02 2.444e+02 2.931e+02, threshold=4.484e+02, percent-clipped=0.0 2024-06-21 16:19:34,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412971.1666666667, ans=0.1 2024-06-21 16:19:35,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=412989.5, ans=0.05 2024-06-21 16:19:42,947 INFO [train.py:1028] (1/2) Epoch 23, batch 2700, loss[loss=0.1925, simple_loss=0.2438, pruned_loss=0.07064, over 13302.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2487, pruned_loss=0.06879, over 2584417.45 frames. ], batch size: 89, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:19:50,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.04 vs. limit=15.0 2024-06-21 16:19:51,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2024-06-21 16:19:51,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=413026.1666666667, ans=0.0 2024-06-21 16:19:52,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.96 vs. limit=22.5 2024-06-21 16:19:59,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.01 vs. 
limit=15.0 2024-06-21 16:20:00,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=413044.5, ans=0.125 2024-06-21 16:20:05,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=413044.5, ans=0.1 2024-06-21 16:20:07,741 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.51 vs. limit=10.0 2024-06-21 16:20:19,889 INFO [train.py:1028] (1/2) Epoch 23, batch 2750, loss[loss=0.2022, simple_loss=0.255, pruned_loss=0.07472, over 13231.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.247, pruned_loss=0.06782, over 2581397.46 frames. ], batch size: 43, lr: 2.53e-03, grad_scale: 16.0 2024-06-21 16:20:27,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=413117.8333333333, ans=15.0 2024-06-21 16:20:34,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=413117.8333333333, ans=0.125 2024-06-21 16:20:44,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=413154.5, ans=0.0 2024-06-21 16:20:49,125 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.079e+02 2.196e+02 2.353e+02 2.927e+02, threshold=4.392e+02, percent-clipped=0.0 2024-06-21 16:20:50,297 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.23 vs. limit=10.0 2024-06-21 16:20:51,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=413172.8333333333, ans=0.0 2024-06-21 16:20:56,528 INFO [train.py:1028] (1/2) Epoch 23, batch 2800, loss[loss=0.2011, simple_loss=0.2464, pruned_loss=0.07788, over 10796.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2466, pruned_loss=0.06789, over 2578683.74 frames. ], batch size: 303, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:21:05,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=413209.5, ans=0.05 2024-06-21 16:21:09,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=413227.8333333333, ans=0.0 2024-06-21 16:21:09,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=413227.8333333333, ans=0.1 2024-06-21 16:21:18,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=413246.1666666667, ans=0.035 2024-06-21 16:21:28,705 INFO [train.py:1028] (1/2) Epoch 23, batch 2850, loss[loss=0.2059, simple_loss=0.263, pruned_loss=0.07443, over 13280.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2458, pruned_loss=0.06797, over 2577080.43 frames. ], batch size: 49, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:21:31,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=413282.8333333333, ans=0.125 2024-06-21 16:21:31,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. 
limit=15.0 2024-06-21 16:21:32,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0 2024-06-21 16:21:33,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=6.39 vs. limit=12.0 2024-06-21 16:21:33,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=413282.8333333333, ans=0.2 2024-06-21 16:21:35,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413301.1666666667, ans=0.1 2024-06-21 16:21:35,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=413301.1666666667, ans=0.125 2024-06-21 16:21:44,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=413319.5, ans=0.2 2024-06-21 16:21:51,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=413337.8333333333, ans=0.0 2024-06-21 16:21:56,708 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.097e+02 2.224e+02 2.385e+02 2.857e+02, threshold=4.449e+02, percent-clipped=0.0 2024-06-21 16:21:57,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=413356.1666666667, ans=15.0 2024-06-21 16:22:00,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=413356.1666666667, ans=0.0 2024-06-21 16:22:03,712 INFO [train.py:1028] (1/2) Epoch 23, batch 2900, loss[loss=0.1783, simple_loss=0.2336, pruned_loss=0.06152, over 13078.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2449, pruned_loss=0.06771, over 2585166.84 frames. ], batch size: 55, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:22:10,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.63 vs. limit=15.0 2024-06-21 16:22:13,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413392.8333333333, ans=0.1 2024-06-21 16:22:20,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413411.1666666667, ans=0.1 2024-06-21 16:22:26,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=413429.5, ans=0.125 2024-06-21 16:22:31,656 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=22.5 2024-06-21 16:22:37,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=413447.8333333333, ans=0.0 2024-06-21 16:22:37,455 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:22:39,896 INFO [train.py:1028] (1/2) Epoch 23, batch 2950, loss[loss=0.1748, simple_loss=0.2306, pruned_loss=0.05947, over 13234.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2446, pruned_loss=0.06755, over 2579326.31 frames. 
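At batch 3000, just below, training pauses to compute a validation loss over a fixed dev set (351949.00 frames) before resuming, and also reports peak memory. A minimal sketch of such a periodic validation pass; compute_loss and valid_dl are assumed stand-ins for this run's actual loss function and dev dataloader:

```python
# Sketch of the periodic validation pass behind the "Computing validation
# loss" / "validation: loss=..." lines. `compute_loss` and `valid_dl` are
# assumed stand-ins, not names taken from this codebase.
import logging
import torch

def validate(model: torch.nn.Module, valid_dl, compute_loss) -> float:
    was_training = model.training
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            loss_sum += loss.item() * num_frames
            frames += num_frames
    if was_training:
        model.train()  # resume training mode after the dev pass
    avg = loss_sum / max(frames, 1.0)
    logging.info(f"validation: loss={avg:.4f}, over {frames:.2f} frames.")
    return avg
```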
], batch size: 43, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:22:53,648 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2024-06-21 16:22:56,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=413502.8333333333, ans=0.025 2024-06-21 16:23:05,979 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.067e+02 2.205e+02 2.441e+02 3.726e+02, threshold=4.411e+02, percent-clipped=0.0 2024-06-21 16:23:08,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=413539.5, ans=0.125 2024-06-21 16:23:13,430 INFO [train.py:1028] (1/2) Epoch 23, batch 3000, loss[loss=0.1837, simple_loss=0.2436, pruned_loss=0.06194, over 13187.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2433, pruned_loss=0.06696, over 2577010.03 frames. ], batch size: 59, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:23:13,431 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 16:23:21,495 INFO [train.py:1060] (1/2) Epoch 23, validation: loss=0.1874, simple_loss=0.2508, pruned_loss=0.06199, over 351949.00 frames. 2024-06-21 16:23:21,496 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 16:23:32,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=413576.1666666667, ans=0.125 2024-06-21 16:23:41,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=413612.8333333333, ans=0.125 2024-06-21 16:23:47,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413612.8333333333, ans=0.1 2024-06-21 16:23:53,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=413631.1666666667, ans=0.125 2024-06-21 16:23:57,420 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.77 vs. limit=22.5 2024-06-21 16:23:58,346 INFO [train.py:1028] (1/2) Epoch 23, batch 3050, loss[loss=0.1748, simple_loss=0.232, pruned_loss=0.05874, over 13263.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2424, pruned_loss=0.06696, over 2576809.44 frames. ], batch size: 46, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:23:58,812 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2024-06-21 16:23:59,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=413649.5, ans=0.125 2024-06-21 16:24:12,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=413686.1666666667, ans=0.125 2024-06-21 16:24:21,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.06 vs. 
2024-06-21 16:24:28,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=413704.5, ans=0.125
2024-06-21 16:24:29,192 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.067e+02 2.193e+02 2.332e+02 2.841e+02, threshold=4.385e+02, percent-clipped=0.0
2024-06-21 16:24:35,770 INFO [train.py:1028] (1/2) Epoch 23, batch 3100, loss[loss=0.1829, simple_loss=0.2287, pruned_loss=0.06852, over 12997.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2414, pruned_loss=0.0666, over 2578314.86 frames. ], batch size: 144, lr: 2.53e-03, grad_scale: 16.0
2024-06-21 16:24:46,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=413759.5, ans=0.05
2024-06-21 16:24:47,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=413759.5, ans=0.09899494936611666
2024-06-21 16:24:48,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413759.5, ans=0.1
2024-06-21 16:24:49,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=413777.8333333333, ans=0.0
2024-06-21 16:24:54,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.89 vs. limit=22.5
2024-06-21 16:24:59,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=413796.1666666667, ans=0.07
2024-06-21 16:25:01,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=413796.1666666667, ans=0.0
2024-06-21 16:25:01,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=413796.1666666667, ans=0.125
2024-06-21 16:25:05,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=413814.5, ans=0.125
2024-06-21 16:25:06,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=413814.5, ans=0.0
2024-06-21 16:25:07,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=413814.5, ans=0.1
2024-06-21 16:25:09,069 INFO [train.py:1028] (1/2) Epoch 23, batch 3150, loss[loss=0.2052, simple_loss=0.2569, pruned_loss=0.07672, over 12968.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2414, pruned_loss=0.0668, over 2580409.44 frames. ], batch size: 158, lr: 2.53e-03, grad_scale: 16.0
2024-06-21 16:25:11,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=413832.8333333333, ans=0.05
2024-06-21 16:25:14,659 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:25:21,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=413869.5, ans=0.1
2024-06-21 16:25:35,661 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.083e+02 2.230e+02 2.420e+02 3.144e+02, threshold=4.460e+02, percent-clipped=0.0
2024-06-21 16:25:42,147 INFO [train.py:1028] (1/2) Epoch 23, batch 3200, loss[loss=0.1739, simple_loss=0.2355, pruned_loss=0.05618, over 13082.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2409, pruned_loss=0.06623, over 2580911.44 frames. ], batch size: 55, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:25:52,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=413942.8333333333, ans=0.025
2024-06-21 16:26:00,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=413961.1666666667, ans=0.025
2024-06-21 16:26:09,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=413979.5, ans=0.0
2024-06-21 16:26:14,298 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=6.0
2024-06-21 16:26:17,051 INFO [train.py:1028] (1/2) Epoch 23, batch 3250, loss[loss=0.178, simple_loss=0.239, pruned_loss=0.05853, over 13221.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2406, pruned_loss=0.06644, over 2584332.72 frames. ], batch size: 72, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:26:19,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=414016.1666666667, ans=0.0
2024-06-21 16:26:36,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414052.8333333333, ans=0.1
2024-06-21 16:26:38,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=414052.8333333333, ans=0.125
2024-06-21 16:26:39,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=414071.1666666667, ans=0.125
2024-06-21 16:26:46,425 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.055e+02 2.163e+02 2.294e+02 3.373e+02, threshold=4.326e+02, percent-clipped=0.0
2024-06-21 16:26:47,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=414089.5, ans=0.0
2024-06-21 16:26:47,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=414089.5, ans=0.025
2024-06-21 16:26:53,138 INFO [train.py:1028] (1/2) Epoch 23, batch 3300, loss[loss=0.1897, simple_loss=0.2376, pruned_loss=0.07084, over 12730.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2402, pruned_loss=0.06636, over 2582695.51 frames. ], batch size: 176, lr: 2.53e-03, grad_scale: 32.0
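In each optim.py:487 WARNING, the reported threshold equals Clipping_scale times the middle quartile to within rounding: 2.0 * 2.230e+02 = 4.460e+02 for the entry above, and the same relation holds for the neighbouring entries. A sketch of deriving such a threshold from a window of recent gradient norms; this illustrates the logged relation and is not the optimizer's actual code:

    import torch

    # Sketch: percentiles of recent per-step gradient norms and a clipping
    # threshold following the logged relation threshold = clipping_scale * median.
    def clipping_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
        q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # scale times the median
        percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
        return q, threshold, percent_clipped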
2024-06-21 16:26:53,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=414107.8333333333, ans=0.125
2024-06-21 16:27:04,105 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.79 vs. limit=22.5
2024-06-21 16:27:15,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=414162.8333333333, ans=0.125
2024-06-21 16:27:16,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=414162.8333333333, ans=0.025
2024-06-21 16:27:20,377 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0
2024-06-21 16:27:25,138 INFO [train.py:1028] (1/2) Epoch 23, batch 3350, loss[loss=0.1827, simple_loss=0.2308, pruned_loss=0.06731, over 12951.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2397, pruned_loss=0.06642, over 2577315.48 frames. ], batch size: 158, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:27:31,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=414217.8333333333, ans=0.0
2024-06-21 16:27:33,472 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=15.0
2024-06-21 16:27:38,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=414217.8333333333, ans=0.04949747468305833
2024-06-21 16:27:40,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=414236.1666666667, ans=0.025
2024-06-21 16:27:44,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=414236.1666666667, ans=0.0
2024-06-21 16:27:45,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=414236.1666666667, ans=0.125
2024-06-21 16:27:47,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=414254.5, ans=0.125
2024-06-21 16:27:53,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=414254.5, ans=0.2
2024-06-21 16:27:54,166 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.155e+02 2.244e+02 2.442e+02 2.825e+02, threshold=4.487e+02, percent-clipped=0.0
2024-06-21 16:27:57,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=414272.8333333333, ans=0.2
2024-06-21 16:28:00,682 INFO [train.py:1028] (1/2) Epoch 23, batch 3400, loss[loss=0.198, simple_loss=0.2453, pruned_loss=0.07538, over 12736.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2393, pruned_loss=0.06667, over 2576551.55 frames. ], batch size: 22, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:28:01,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=414291.1666666667, ans=0.125
2024-06-21 16:28:08,157 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.62 vs. limit=15.0
2024-06-21 16:28:19,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=414327.8333333333, ans=0.05
2024-06-21 16:28:19,289 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.51 vs. limit=15.0
2024-06-21 16:28:33,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=414364.5, ans=0.09899494936611666
2024-06-21 16:28:36,548 INFO [train.py:1028] (1/2) Epoch 23, batch 3450, loss[loss=0.1895, simple_loss=0.2398, pruned_loss=0.06961, over 12763.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2382, pruned_loss=0.06612, over 2576627.82 frames. ], batch size: 176, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:28:44,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=414401.1666666667, ans=0.125
2024-06-21 16:28:58,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.52 vs. limit=15.0
2024-06-21 16:29:02,272 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.105e+02 2.269e+02 2.452e+02 3.193e+02, threshold=4.537e+02, percent-clipped=0.0
2024-06-21 16:29:09,116 INFO [train.py:1028] (1/2) Epoch 23, batch 3500, loss[loss=0.1759, simple_loss=0.2349, pruned_loss=0.05847, over 12848.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.238, pruned_loss=0.06585, over 2575123.41 frames. ], batch size: 33, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:29:25,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=414511.1666666667, ans=0.125
2024-06-21 16:29:31,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=414529.5, ans=0.05
2024-06-21 16:29:32,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=414529.5, ans=0.2
2024-06-21 16:29:40,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=414547.8333333333, ans=0.025
2024-06-21 16:29:44,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=414547.8333333333, ans=0.125
2024-06-21 16:29:45,737 INFO [train.py:1028] (1/2) Epoch 23, batch 3550, loss[loss=0.1778, simple_loss=0.2298, pruned_loss=0.0629, over 13138.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2379, pruned_loss=0.06592, over 2576589.01 frames. ], batch size: 95, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:29:46,111 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.70 vs. limit=15.0
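The scaling.py:1023 Whitening entries compare a per-module metric against a (possibly scheduled) limit; larger values mean the feature covariance is further from white. One quantity with the right behaviour is D * trace(C^2) / trace(C)^2 for a D-dimensional covariance C: it equals 1.0 when C is proportional to the identity and grows as the eigenvalue spread widens. The sketch below computes that quantity as an assumed stand-in for the exact metric scaling.py reports:

    import torch

    # Sketch: a whiteness measure for activations x of shape (N, D).
    # Equals 1.0 when the covariance is proportional to the identity and grows
    # with eigenvalue spread; an assumed stand-in for the logged metric.
    def whitening_metric(x: torch.Tensor) -> float:
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / x.shape[0]  # (D, D) feature covariance
        d = cov.shape[0]
        return (d * (cov * cov).sum() / cov.trace() ** 2).item()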
2024-06-21 16:29:52,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=414584.5, ans=0.125
2024-06-21 16:29:58,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.95 vs. limit=15.0
2024-06-21 16:29:59,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=414602.8333333333, ans=0.125
2024-06-21 16:30:04,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=414621.1666666667, ans=0.0
2024-06-21 16:30:11,211 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 2.067e+02 2.199e+02 2.413e+02 3.033e+02, threshold=4.399e+02, percent-clipped=0.0
2024-06-21 16:30:15,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.96 vs. limit=22.5
2024-06-21 16:30:21,196 INFO [train.py:1028] (1/2) Epoch 23, batch 3600, loss[loss=0.1838, simple_loss=0.2407, pruned_loss=0.06339, over 13276.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2374, pruned_loss=0.06558, over 2580301.76 frames. ], batch size: 49, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:30:25,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414657.8333333333, ans=0.1
2024-06-21 16:30:26,212 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:30:27,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.78 vs. limit=22.5
2024-06-21 16:30:27,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=414676.1666666667, ans=0.125
2024-06-21 16:30:30,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=414676.1666666667, ans=0.025
2024-06-21 16:30:36,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=414694.5, ans=0.125
2024-06-21 16:30:48,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=414731.1666666667, ans=0.125
2024-06-21 16:30:54,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=414731.1666666667, ans=0.0
2024-06-21 16:30:55,882 INFO [train.py:1028] (1/2) Epoch 23, batch 3650, loss[loss=0.1738, simple_loss=0.2222, pruned_loss=0.06272, over 13048.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2368, pruned_loss=0.06526, over 2579476.40 frames. ], batch size: 102, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:31:13,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=414786.1666666667, ans=0.125
2024-06-21 16:31:16,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=414804.5, ans=0.125
2024-06-21 16:31:22,540 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.067e+02 2.165e+02 2.287e+02 3.362e+02, threshold=4.330e+02, percent-clipped=0.0
2024-06-21 16:31:29,538 INFO [train.py:1028] (1/2) Epoch 23, batch 3700, loss[loss=0.1619, simple_loss=0.2235, pruned_loss=0.05015, over 13309.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.236, pruned_loss=0.06497, over 2584753.75 frames. ], batch size: 72, lr: 2.53e-03, grad_scale: 32.0
2024-06-21 16:31:34,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=414841.1666666667, ans=0.09899494936611666
2024-06-21 16:31:56,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=414896.1666666667, ans=0.0
2024-06-21 16:32:07,803 INFO [train.py:1028] (1/2) Epoch 23, batch 3750, loss[loss=0.1856, simple_loss=0.2429, pruned_loss=0.06414, over 12663.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2362, pruned_loss=0.06519, over 2586614.35 frames. ], batch size: 22, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:32:12,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=414932.8333333333, ans=0.025
2024-06-21 16:32:16,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=414951.1666666667, ans=0.1
2024-06-21 16:32:32,316 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:32:32,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414987.8333333333, ans=0.1
2024-06-21 16:32:34,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=414987.8333333333, ans=0.125
2024-06-21 16:32:35,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=414987.8333333333, ans=0.04949747468305833
2024-06-21 16:32:36,928 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.032e+02 2.163e+02 2.344e+02 3.282e+02, threshold=4.326e+02, percent-clipped=0.0
2024-06-21 16:32:43,773 INFO [train.py:1028] (1/2) Epoch 23, batch 3800, loss[loss=0.1815, simple_loss=0.2284, pruned_loss=0.06731, over 13231.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2359, pruned_loss=0.06491, over 2584001.27 frames. ], batch size: 83, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:32:46,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=415024.5, ans=0.125
2024-06-21 16:32:48,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415024.5, ans=0.1
2024-06-21 16:32:51,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=415042.8333333333, ans=0.0
2024-06-21 16:33:00,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=415061.1666666667, ans=0.0
2024-06-21 16:33:04,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=415079.5, ans=0.2
2024-06-21 16:33:10,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=415097.8333333333, ans=0.035
2024-06-21 16:33:10,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=415097.8333333333, ans=0.5
2024-06-21 16:33:11,195 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.12 vs. limit=22.5
2024-06-21 16:33:14,395 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:33:17,573 INFO [train.py:1028] (1/2) Epoch 23, batch 3850, loss[loss=0.1784, simple_loss=0.227, pruned_loss=0.06496, over 13054.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.2355, pruned_loss=0.06449, over 2582988.39 frames. ], batch size: 144, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:33:20,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=415116.1666666667, ans=0.125
2024-06-21 16:33:26,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=415134.5, ans=0.125
2024-06-21 16:33:32,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=415152.8333333333, ans=0.2
2024-06-21 16:33:35,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=415152.8333333333, ans=0.125
2024-06-21 16:33:38,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=415171.1666666667, ans=0.125
2024-06-21 16:33:38,931 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0
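The scaling.py:214 ScheduledFloat entries print the current value (ans) of a hyperparameter that is annealed as a function of batch_count; the dropout_p, skip_rate, scale_min, and whitening_limit values above all follow such schedules. A minimal sketch assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the breakpoints in the example are hypothetical, and each parameter in the recipe defines its own schedule:

    # Sketch: a float hyperparameter scheduled on batch count, linearly
    # interpolated between breakpoints (assumed schedule shape).
    def scheduled_float(batch_count: float,
                        points: list[tuple[float, float]]) -> float:
        if batch_count <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return points[-1][1]

    # Hypothetical: a dropout decaying from 0.3 to 0.1 over the first 20k batches.
    print(scheduled_float(413301.0, [(0.0, 0.3), (20000.0, 0.1)]))  # -> 0.1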
2024-06-21 16:33:42,830 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.064e+02 2.244e+02 2.445e+02 3.288e+02, threshold=4.488e+02, percent-clipped=0.0
2024-06-21 16:33:47,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=415189.5, ans=0.125
2024-06-21 16:33:47,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415189.5, ans=0.1
2024-06-21 16:33:48,982 INFO [train.py:1028] (1/2) Epoch 23, batch 3900, loss[loss=0.1922, simple_loss=0.2377, pruned_loss=0.07336, over 13206.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2354, pruned_loss=0.06436, over 2585861.84 frames. ], batch size: 83, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:33:57,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=415207.8333333333, ans=0.0
2024-06-21 16:33:57,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=415226.1666666667, ans=0.125
2024-06-21 16:34:01,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=415226.1666666667, ans=0.95
2024-06-21 16:34:01,249 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0
2024-06-21 16:34:07,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=415244.5, ans=0.1
2024-06-21 16:34:10,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=415262.8333333333, ans=0.0
2024-06-21 16:34:11,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=415262.8333333333, ans=0.0
2024-06-21 16:34:15,140 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0
2024-06-21 16:34:16,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=415262.8333333333, ans=0.125
2024-06-21 16:34:16,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=415262.8333333333, ans=0.025
2024-06-21 16:34:21,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=415281.1666666667, ans=0.125
2024-06-21 16:34:23,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415281.1666666667, ans=0.1
2024-06-21 16:34:23,522 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.03 vs. limit=15.0
2024-06-21 16:34:24,480 INFO [train.py:1028] (1/2) Epoch 23, batch 3950, loss[loss=0.1798, simple_loss=0.2226, pruned_loss=0.06856, over 13154.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2352, pruned_loss=0.06405, over 2589330.33 frames. ], batch size: 132, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:34:27,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=415299.5, ans=0.0
2024-06-21 16:34:43,456 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:34:44,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=415336.1666666667, ans=0.125
2024-06-21 16:34:53,089 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.032e+02 2.128e+02 2.271e+02 3.267e+02, threshold=4.257e+02, percent-clipped=0.0
2024-06-21 16:34:57,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=415372.8333333333, ans=0.0
2024-06-21 16:34:59,677 INFO [train.py:1028] (1/2) Epoch 23, batch 4000, loss[loss=0.181, simple_loss=0.2476, pruned_loss=0.05722, over 12912.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.2349, pruned_loss=0.06388, over 2583608.04 frames. ], batch size: 39, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:35:06,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=415409.5, ans=0.5
2024-06-21 16:35:16,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=415427.8333333333, ans=0.0
2024-06-21 16:35:20,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=415446.1666666667, ans=0.0
2024-06-21 16:35:30,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.82 vs. limit=15.0
2024-06-21 16:35:30,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=415464.5, ans=0.125
2024-06-21 16:35:33,799 INFO [train.py:1028] (1/2) Epoch 23, batch 4050, loss[loss=0.1938, simple_loss=0.2331, pruned_loss=0.07725, over 11094.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2344, pruned_loss=0.06377, over 2581321.03 frames. ], batch size: 303, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:35:59,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=415537.8333333333, ans=0.125
2024-06-21 16:36:03,092 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.014e+02 2.190e+02 2.392e+02 3.266e+02, threshold=4.380e+02, percent-clipped=0.0
2024-06-21 16:36:09,693 INFO [train.py:1028] (1/2) Epoch 23, batch 4100, loss[loss=0.1662, simple_loss=0.2175, pruned_loss=0.05743, over 13024.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2341, pruned_loss=0.06386, over 2577327.78 frames. ], batch size: 102, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:36:19,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=415592.8333333333, ans=0.2
2024-06-21 16:36:19,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.55 vs. limit=15.0
2024-06-21 16:36:21,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=415592.8333333333, ans=0.0
2024-06-21 16:36:25,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=415611.1666666667, ans=0.0
2024-06-21 16:36:31,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=415629.5, ans=0.125
2024-06-21 16:36:45,361 INFO [train.py:1028] (1/2) Epoch 23, batch 4150, loss[loss=0.1923, simple_loss=0.2432, pruned_loss=0.07072, over 13134.00 frames. ], tot_loss[loss=0.1811, simple_loss=0.2344, pruned_loss=0.06388, over 2575169.75 frames. ], batch size: 55, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:36:46,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=415666.1666666667, ans=0.2
2024-06-21 16:36:46,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=415666.1666666667, ans=0.125
2024-06-21 16:36:49,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=415666.1666666667, ans=0.125
2024-06-21 16:36:52,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=415684.5, ans=0.125
2024-06-21 16:36:53,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=415684.5, ans=0.0
2024-06-21 16:36:53,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.81 vs. limit=15.0
2024-06-21 16:36:59,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=415702.8333333333, ans=0.0
2024-06-21 16:37:03,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=415702.8333333333, ans=0.125
2024-06-21 16:37:05,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=415721.1666666667, ans=0.0
2024-06-21 16:37:07,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=415721.1666666667, ans=0.025
2024-06-21 16:37:08,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=415721.1666666667, ans=0.025
2024-06-21 16:37:09,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=415721.1666666667, ans=0.2
2024-06-21 16:37:12,234 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.836e+02 2.067e+02 2.180e+02 2.371e+02 2.955e+02, threshold=4.361e+02, percent-clipped=0.0
2024-06-21 16:37:17,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=415739.5, ans=0.035
2024-06-21 16:37:18,718 INFO [train.py:1028] (1/2) Epoch 23, batch 4200, loss[loss=0.1834, simple_loss=0.2342, pruned_loss=0.06626, over 13007.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.234, pruned_loss=0.06388, over 2578033.57 frames. ], batch size: 102, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:37:23,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=415757.8333333333, ans=0.2
2024-06-21 16:37:29,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=415776.1666666667, ans=0.2
2024-06-21 16:37:32,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=415794.5, ans=0.5
2024-06-21 16:37:43,406 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.42 vs. limit=6.0
2024-06-21 16:37:55,103 INFO [train.py:1028] (1/2) Epoch 23, batch 4250, loss[loss=0.1772, simple_loss=0.2456, pruned_loss=0.05444, over 13318.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2336, pruned_loss=0.06364, over 2579848.93 frames. ], batch size: 46, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:38:08,539 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:38:14,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=415904.5, ans=0.025
2024-06-21 16:38:21,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=415922.8333333333, ans=0.125
2024-06-21 16:38:21,491 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.017e+02 2.105e+02 2.277e+02 3.385e+02, threshold=4.210e+02, percent-clipped=0.0
2024-06-21 16:38:21,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=415922.8333333333, ans=0.025
2024-06-21 16:38:22,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=415922.8333333333, ans=0.2
2024-06-21 16:38:23,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.25 vs. limit=15.0
2024-06-21 16:38:28,103 INFO [train.py:1028] (1/2) Epoch 23, batch 4300, loss[loss=0.1785, simple_loss=0.2341, pruned_loss=0.06149, over 13201.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2335, pruned_loss=0.06382, over 2580276.11 frames. ], batch size: 59, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:38:38,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=415959.5, ans=0.125
2024-06-21 16:38:40,066 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.77 vs. limit=5.0
2024-06-21 16:38:53,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=415996.1666666667, ans=0.04949747468305833
2024-06-21 16:38:57,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=416014.5, ans=0.125
2024-06-21 16:39:03,988 INFO [train.py:1028] (1/2) Epoch 23, batch 4350, loss[loss=0.1844, simple_loss=0.2358, pruned_loss=0.06651, over 13205.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2327, pruned_loss=0.06368, over 2584872.99 frames. ], batch size: 59, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:39:11,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=416051.1666666667, ans=0.1
2024-06-21 16:39:12,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=416051.1666666667, ans=0.125
2024-06-21 16:39:12,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=416051.1666666667, ans=0.05
2024-06-21 16:39:20,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=416069.5, ans=0.125
2024-06-21 16:39:23,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=416087.8333333333, ans=0.0
2024-06-21 16:39:29,871 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.086e+02 2.197e+02 2.398e+02 3.484e+02, threshold=4.393e+02, percent-clipped=0.0
2024-06-21 16:39:36,903 INFO [train.py:1028] (1/2) Epoch 23, batch 4400, loss[loss=0.177, simple_loss=0.2276, pruned_loss=0.06313, over 13248.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2328, pruned_loss=0.06371, over 2585050.07 frames. ], batch size: 83, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:39:38,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=416124.5, ans=0.025
2024-06-21 16:39:48,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.30 vs. limit=15.0
2024-06-21 16:39:51,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=416161.1666666667, ans=0.0
2024-06-21 16:40:03,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416179.5, ans=0.1
2024-06-21 16:40:06,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416197.8333333333, ans=0.1
2024-06-21 16:40:13,469 INFO [train.py:1028] (1/2) Epoch 23, batch 4450, loss[loss=0.1638, simple_loss=0.2171, pruned_loss=0.0553, over 13037.00 frames. ], tot_loss[loss=0.1799, simple_loss=0.2325, pruned_loss=0.06364, over 2581251.44 frames. ], batch size: 33, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:40:20,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=416234.5, ans=0.0
2024-06-21 16:40:33,110 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.95 vs. limit=22.5
2024-06-21 16:40:42,528 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.158e+02 2.289e+02 2.418e+02 3.102e+02, threshold=4.579e+02, percent-clipped=0.0
2024-06-21 16:40:44,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=416289.5, ans=0.125
2024-06-21 16:40:49,049 INFO [train.py:1028] (1/2) Epoch 23, batch 4500, loss[loss=0.1712, simple_loss=0.2245, pruned_loss=0.05892, over 13226.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2318, pruned_loss=0.06348, over 2585123.33 frames. ], batch size: 89, lr: 2.52e-03, grad_scale: 32.0
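Each train.py:1028 entry pairs a per-batch loss[... over N frames] with a tot_loss[... over ~2.58M frames]: the aggregate is weighted by frame counts rather than by batches, which is why batches of size 22 and 176 move it by very different amounts. A sketch of a frames-weighted aggregate of this kind, assuming plain accumulation (the training loop may additionally apply decay or periodic resets):

    # Sketch: frames-weighted aggregation of per-batch losses, the kind of
    # statistic behind tot_loss[...]; assumed mechanics, details may differ.
    class RunningLoss:
        def __init__(self) -> None:
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss: float, num_frames: float) -> None:
            self.loss_sum += loss * num_frames
            self.frames += num_frames

        @property
        def average(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)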
2024-06-21 16:40:51,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=416307.8333333333, ans=0.0
2024-06-21 16:40:52,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416307.8333333333, ans=0.1
2024-06-21 16:40:53,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=416307.8333333333, ans=0.0
2024-06-21 16:40:59,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=416326.1666666667, ans=0.125
2024-06-21 16:41:07,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=416344.5, ans=0.125
2024-06-21 16:41:10,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=416362.8333333333, ans=0.0
2024-06-21 16:41:12,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=416362.8333333333, ans=0.125
2024-06-21 16:41:19,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=416381.1666666667, ans=0.125
2024-06-21 16:41:22,270 INFO [train.py:1028] (1/2) Epoch 23, batch 4550, loss[loss=0.1815, simple_loss=0.2323, pruned_loss=0.06528, over 13191.00 frames. ], tot_loss[loss=0.1799, simple_loss=0.2324, pruned_loss=0.06371, over 2588669.95 frames. ], batch size: 52, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:41:26,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.95 vs. limit=15.0
2024-06-21 16:41:30,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=416417.8333333333, ans=0.2
2024-06-21 16:41:41,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=416454.5, ans=0.0
2024-06-21 16:41:45,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=416454.5, ans=0.1
2024-06-21 16:41:51,701 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.060e+02 2.246e+02 2.493e+02 3.852e+02, threshold=4.493e+02, percent-clipped=0.0
2024-06-21 16:41:52,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=416472.8333333333, ans=0.0
2024-06-21 16:41:58,189 INFO [train.py:1028] (1/2) Epoch 23, batch 4600, loss[loss=0.2052, simple_loss=0.2531, pruned_loss=0.07864, over 12530.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2324, pruned_loss=0.06348, over 2585169.31 frames. ], batch size: 202, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:42:00,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416491.1666666667, ans=0.1
2024-06-21 16:42:06,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=416509.5, ans=0.0
2024-06-21 16:42:24,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416564.5, ans=0.1
2024-06-21 16:42:25,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=416564.5, ans=0.125
2024-06-21 16:42:26,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=416564.5, ans=0.125
2024-06-21 16:42:31,096 INFO [train.py:1028] (1/2) Epoch 23, batch 4650, loss[loss=0.174, simple_loss=0.2184, pruned_loss=0.06483, over 13104.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2318, pruned_loss=0.06336, over 2587538.69 frames. ], batch size: 132, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:42:35,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=416582.8333333333, ans=0.125
2024-06-21 16:42:42,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=416601.1666666667, ans=0.2
2024-06-21 16:42:46,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416601.1666666667, ans=0.1
2024-06-21 16:43:00,082 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.045e+02 2.200e+02 2.369e+02 2.947e+02, threshold=4.400e+02, percent-clipped=0.0
2024-06-21 16:43:00,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=416656.1666666667, ans=0.0
2024-06-21 16:43:04,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=416656.1666666667, ans=0.025
2024-06-21 16:43:06,818 INFO [train.py:1028] (1/2) Epoch 23, batch 4700, loss[loss=0.1601, simple_loss=0.2165, pruned_loss=0.05187, over 12992.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2321, pruned_loss=0.06369, over 2582563.21 frames. ], batch size: 26, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:43:06,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=416674.5, ans=0.02
2024-06-21 16:43:25,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=416729.5, ans=0.0
2024-06-21 16:43:39,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.49 vs. limit=15.0
2024-06-21 16:43:39,982 INFO [train.py:1028] (1/2) Epoch 23, batch 4750, loss[loss=0.2072, simple_loss=0.2503, pruned_loss=0.08211, over 12516.00 frames. ], tot_loss[loss=0.1799, simple_loss=0.2323, pruned_loss=0.06381, over 2580729.32 frames. ], batch size: 202, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:43:53,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=416784.5, ans=0.2
2024-06-21 16:43:53,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=416784.5, ans=0.125
2024-06-21 16:43:54,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=416784.5, ans=0.0
2024-06-21 16:43:57,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=416802.8333333333, ans=0.125
2024-06-21 16:44:05,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=416821.1666666667, ans=0.025
2024-06-21 16:44:09,689 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.052e+02 2.194e+02 2.357e+02 3.097e+02, threshold=4.388e+02, percent-clipped=0.0
2024-06-21 16:44:16,134 INFO [train.py:1028] (1/2) Epoch 23, batch 4800, loss[loss=0.1831, simple_loss=0.2387, pruned_loss=0.06378, over 13235.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2321, pruned_loss=0.06365, over 2578015.86 frames. ], batch size: 63, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:44:23,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=416876.1666666667, ans=0.1
2024-06-21 16:44:25,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=416876.1666666667, ans=0.2
2024-06-21 16:44:26,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=416876.1666666667, ans=0.125
2024-06-21 16:44:28,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416876.1666666667, ans=0.1
2024-06-21 16:44:30,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.76 vs. limit=15.0
2024-06-21 16:44:35,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.44 vs. limit=12.0
2024-06-21 16:44:36,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.66 vs. limit=10.0
2024-06-21 16:44:45,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416912.8333333333, ans=0.1
2024-06-21 16:44:46,935 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:44:52,908 INFO [train.py:1028] (1/2) Epoch 23, batch 4850, loss[loss=0.1698, simple_loss=0.2197, pruned_loss=0.05995, over 13259.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2312, pruned_loss=0.06304, over 2576451.03 frames. ], batch size: 89, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:45:14,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=417004.5, ans=0.0
2024-06-21 16:45:20,064 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.009e+02 2.120e+02 2.254e+02 3.202e+02, threshold=4.240e+02, percent-clipped=0.0
2024-06-21 16:45:24,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=417022.8333333333, ans=0.0
2024-06-21 16:45:26,690 INFO [train.py:1028] (1/2) Epoch 23, batch 4900, loss[loss=0.1781, simple_loss=0.2355, pruned_loss=0.06037, over 13177.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2312, pruned_loss=0.06321, over 2576020.53 frames. ], batch size: 59, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:45:35,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0
2024-06-21 16:45:38,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=417059.5, ans=0.04949747468305833
2024-06-21 16:45:39,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=417077.8333333333, ans=0.025
2024-06-21 16:46:04,969 INFO [train.py:1028] (1/2) Epoch 23, batch 4950, loss[loss=0.1976, simple_loss=0.2365, pruned_loss=0.07932, over 10992.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2312, pruned_loss=0.06344, over 2569580.14 frames. ], batch size: 304, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:46:14,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=417151.1666666667, ans=0.0
2024-06-21 16:46:14,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=417151.1666666667, ans=0.125
2024-06-21 16:46:20,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=417169.5, ans=0.025
2024-06-21 16:46:21,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=417169.5, ans=0.125
2024-06-21 16:46:22,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=417169.5, ans=0.04949747468305833
2024-06-21 16:46:30,891 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.018e+02 2.135e+02 2.283e+02 2.656e+02, threshold=4.270e+02, percent-clipped=0.0
2024-06-21 16:46:31,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=417206.1666666667, ans=0.125
2024-06-21 16:46:37,225 INFO [train.py:1028] (1/2) Epoch 23, batch 5000, loss[loss=0.1948, simple_loss=0.2395, pruned_loss=0.07505, over 13189.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2311, pruned_loss=0.06316, over 2574006.39 frames. ], batch size: 95, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:46:37,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=417224.5, ans=0.0
2024-06-21 16:47:14,660 INFO [train.py:1028] (1/2) Epoch 23, batch 5050, loss[loss=0.1885, simple_loss=0.2421, pruned_loss=0.06741, over 12991.00 frames. ], tot_loss[loss=0.1789, simple_loss=0.2314, pruned_loss=0.06319, over 2573809.03 frames. ], batch size: 36, lr: 2.52e-03, grad_scale: 32.0
2024-06-21 16:47:19,899 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=15.0
2024-06-21 16:47:22,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=417334.5, ans=0.95
2024-06-21 16:47:34,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=417371.1666666667, ans=0.5
2024-06-21 16:47:35,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=417371.1666666667, ans=0.0
2024-06-21 16:47:38,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=417371.1666666667, ans=0.125
2024-06-21 16:47:41,328 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.113e+02 2.228e+02 2.433e+02 3.192e+02, threshold=4.457e+02, percent-clipped=0.0
2024-06-21 16:47:42,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=417389.5, ans=0.2
2024-06-21 16:47:50,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=417389.5, ans=0.0
2024-06-21 16:47:51,529 INFO [train.py:1028] (1/2) Epoch 23, batch 5100, loss[loss=0.179, simple_loss=0.2318, pruned_loss=0.06307, over 12896.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.2314, pruned_loss=0.06345, over 2569606.42 frames. ], batch size: 39, lr: 2.52e-03, grad_scale: 64.0
2024-06-21 16:47:51,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=417407.8333333333, ans=0.1
2024-06-21 16:47:53,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=417407.8333333333, ans=0.025
2024-06-21 16:48:12,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=417462.8333333333, ans=0.2
2024-06-21 16:48:24,812 INFO [train.py:1028] (1/2) Epoch 23, batch 5150, loss[loss=0.1694, simple_loss=0.2181, pruned_loss=0.0603, over 13066.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2314, pruned_loss=0.06366, over 2570699.37 frames. ], batch size: 132, lr: 2.52e-03, grad_scale: 64.0
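grad_scale, 32.0 through most of this section (after briefly dipping to 16.0 around batch 3100), doubles to 64.0 at batch 5100 and stays there. That pattern is characteristic of dynamic loss scaling in fp16 training: halve the scale when a step overflows, double it after a long overflow-free stretch. A sketch of that policy with an assumed growth interval; PyTorch's torch.cuda.amp.GradScaler implements the same idea via growth_factor, backoff_factor, and growth_interval:

    # Sketch: dynamic loss-scale policy of the kind implied by the grad_scale
    # column (halve on overflow, double after a run of clean steps).
    class LossScale:
        def __init__(self, scale: float = 32.0, growth_interval: int = 1000) -> None:
            self.scale = scale
            self.growth_interval = growth_interval
            self.good_steps = 0

        def step(self, found_overflow: bool) -> None:
            if found_overflow:
                self.scale /= 2.0  # e.g. the dip to 16.0 near batch 3100
                self.good_steps = 0
            else:
                self.good_steps += 1
                if self.good_steps >= self.growth_interval:
                    self.scale *= 2.0  # e.g. the 32.0 -> 64.0 jump at batch 5100
                    self.good_steps = 0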
2024-06-21 16:48:28,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=417499.5, ans=0.2
2024-06-21 16:48:39,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=417517.8333333333, ans=0.125
2024-06-21 16:48:42,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=417536.1666666667, ans=0.0
2024-06-21 16:48:42,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=417536.1666666667, ans=0.5
2024-06-21 16:48:44,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=417536.1666666667, ans=0.1
2024-06-21 16:48:48,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=417554.5, ans=0.0
2024-06-21 16:48:49,296 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.70 vs. limit=22.5
2024-06-21 16:48:54,815 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.029e+02 2.159e+02 2.371e+02 3.039e+02, threshold=4.317e+02, percent-clipped=0.0
2024-06-21 16:48:56,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=417572.8333333333, ans=15.0
2024-06-21 16:48:58,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417572.8333333333, ans=0.1
2024-06-21 16:48:59,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=417572.8333333333, ans=0.125
2024-06-21 16:49:01,410 INFO [train.py:1028] (1/2) Epoch 23, batch 5200, loss[loss=0.1808, simple_loss=0.2267, pruned_loss=0.06745, over 13159.00 frames. ], tot_loss[loss=0.1789, simple_loss=0.2309, pruned_loss=0.06344, over 2574810.22 frames. ], batch size: 95, lr: 2.52e-03, grad_scale: 64.0
2024-06-21 16:49:09,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=417609.5, ans=0.125
2024-06-21 16:49:26,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.94 vs. limit=10.0
2024-06-21 16:49:29,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=417664.5, ans=0.125
2024-06-21 16:49:34,781 INFO [train.py:1028] (1/2) Epoch 23, batch 5250, loss[loss=0.1763, simple_loss=0.2366, pruned_loss=0.058, over 13256.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2309, pruned_loss=0.06322, over 2571686.90 frames. ], batch size: 52, lr: 2.52e-03, grad_scale: 64.0
2024-06-21 16:49:36,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=417682.8333333333, ans=0.1
2024-06-21 16:50:04,341 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.065e+02 2.283e+02 2.534e+02 3.251e+02, threshold=4.566e+02, percent-clipped=0.0
2024-06-21 16:50:09,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=417756.1666666667, ans=0.0
2024-06-21 16:50:11,024 INFO [train.py:1028] (1/2) Epoch 23, batch 5300, loss[loss=0.1584, simple_loss=0.2119, pruned_loss=0.05242, over 13038.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.231, pruned_loss=0.06327, over 2567099.64 frames. ], batch size: 144, lr: 2.52e-03, grad_scale: 64.0
2024-06-21 16:50:12,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=417774.5, ans=0.07
2024-06-21 16:50:18,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=417792.8333333333, ans=0.1
2024-06-21 16:50:24,578 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:50:38,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=417829.5, ans=0.04949747468305833
2024-06-21 16:50:39,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=417829.5, ans=0.025
2024-06-21 16:50:43,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=417847.8333333333, ans=0.0
2024-06-21 16:50:48,461 INFO [train.py:1028] (1/2) Epoch 23, batch 5350, loss[loss=0.1657, simple_loss=0.236, pruned_loss=0.04772, over 11858.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2304, pruned_loss=0.06307, over 2573594.90 frames. ], batch size: 17, lr: 2.52e-03, grad_scale: 64.0
2024-06-21 16:50:51,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=417866.1666666667, ans=0.2
2024-06-21 16:50:51,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=417866.1666666667, ans=0.0
2024-06-21 16:51:00,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=417902.8333333333, ans=0.0
2024-06-21 16:51:14,401 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.062e+02 2.185e+02 2.313e+02 2.927e+02, threshold=4.369e+02, percent-clipped=0.0
2024-06-21 16:51:15,435 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0
2024-06-21 16:51:20,897 INFO [train.py:1028] (1/2) Epoch 23, batch 5400, loss[loss=0.2036, simple_loss=0.2445, pruned_loss=0.08131, over 12267.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2305, pruned_loss=0.06343, over 2566530.09 frames. ], batch size: 240, lr: 2.52e-03, grad_scale: 64.0
2024-06-21 16:51:22,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=417957.8333333333, ans=0.2
2024-06-21 16:51:27,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=417976.1666666667, ans=0.0
2024-06-21 16:51:29,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=417976.1666666667, ans=0.125
2024-06-21 16:51:33,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=417994.5, ans=0.95
2024-06-21 16:51:51,381 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=22.5
2024-06-21 16:51:52,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=418031.1666666667, ans=12.0
2024-06-21 16:51:58,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=418049.5, ans=15.0
2024-06-21 16:51:59,114 INFO [train.py:1028] (1/2) Epoch 23, batch 5450, loss[loss=0.171, simple_loss=0.2244, pruned_loss=0.05882, over 12820.00 frames. ], tot_loss[loss=0.1789, simple_loss=0.2307, pruned_loss=0.0635, over 2570685.26 frames. ], batch size: 26, lr: 2.52e-03, grad_scale: 64.0
2024-06-21 16:52:04,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.95 vs. limit=15.0
2024-06-21 16:52:04,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=418049.5, ans=0.0
2024-06-21 16:52:08,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=418067.8333333333, ans=0.125
2024-06-21 16:52:20,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=418104.5, ans=0.125
2024-06-21 16:52:21,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=418104.5, ans=0.0
2024-06-21 16:52:27,647 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 16:52:28,794 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.036e+02 2.159e+02 2.324e+02 2.858e+02, threshold=4.318e+02, percent-clipped=0.0
2024-06-21 16:52:33,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=418122.8333333333, ans=0.0
2024-06-21 16:52:35,493 INFO [train.py:1028] (1/2) Epoch 23, batch 5500, loss[loss=0.1854, simple_loss=0.2275, pruned_loss=0.07165, over 12127.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2305, pruned_loss=0.06339, over 2562899.60 frames.
], batch size: 241, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:52:43,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=418159.5, ans=15.0 2024-06-21 16:53:01,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418214.5, ans=0.1 2024-06-21 16:53:08,056 INFO [train.py:1028] (1/2) Epoch 23, batch 5550, loss[loss=0.1785, simple_loss=0.2361, pruned_loss=0.06045, over 13253.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2302, pruned_loss=0.063, over 2567911.82 frames. ], batch size: 43, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:53:14,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=418251.1666666667, ans=0.125 2024-06-21 16:53:15,847 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2024-06-21 16:53:28,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=418287.8333333333, ans=0.125 2024-06-21 16:53:34,093 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 2.032e+02 2.154e+02 2.273e+02 2.743e+02, threshold=4.309e+02, percent-clipped=0.0 2024-06-21 16:53:43,684 INFO [train.py:1028] (1/2) Epoch 23, batch 5600, loss[loss=0.1813, simple_loss=0.2276, pruned_loss=0.06754, over 13251.00 frames. ], tot_loss[loss=0.1779, simple_loss=0.2299, pruned_loss=0.06291, over 2569660.71 frames. ], batch size: 89, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:53:44,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=418324.5, ans=0.0 2024-06-21 16:53:45,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=418324.5, ans=0.0 2024-06-21 16:53:53,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=418342.8333333333, ans=0.125 2024-06-21 16:53:53,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=418342.8333333333, ans=10.0 2024-06-21 16:53:53,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=418342.8333333333, ans=0.125 2024-06-21 16:53:58,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=418361.1666666667, ans=0.0 2024-06-21 16:53:58,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418361.1666666667, ans=0.1 2024-06-21 16:54:00,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=418361.1666666667, ans=0.125 2024-06-21 16:54:06,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=418379.5, ans=0.0 2024-06-21 16:54:06,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=418379.5, ans=0.125 2024-06-21 16:54:09,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, 
batch_count=418397.8333333333, ans=0.125 2024-06-21 16:54:11,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=418397.8333333333, ans=0.2 2024-06-21 16:54:16,178 INFO [train.py:1028] (1/2) Epoch 23, batch 5650, loss[loss=0.1906, simple_loss=0.2341, pruned_loss=0.0736, over 12505.00 frames. ], tot_loss[loss=0.1778, simple_loss=0.23, pruned_loss=0.06284, over 2574165.03 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:54:19,932 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.21 vs. limit=15.0 2024-06-21 16:54:20,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=418416.1666666667, ans=0.2 2024-06-21 16:54:21,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=418416.1666666667, ans=0.125 2024-06-21 16:54:23,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=418434.5, ans=0.125 2024-06-21 16:54:45,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=418471.1666666667, ans=0.2 2024-06-21 16:54:46,585 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.049e+02 2.187e+02 2.377e+02 3.305e+02, threshold=4.373e+02, percent-clipped=0.0 2024-06-21 16:54:51,515 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=8.936e+00 2024-06-21 16:54:53,295 INFO [train.py:1028] (1/2) Epoch 23, batch 5700, loss[loss=0.1743, simple_loss=0.2313, pruned_loss=0.05866, over 13252.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2304, pruned_loss=0.06313, over 2577176.70 frames. ], batch size: 63, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:54:57,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=418507.8333333333, ans=0.0 2024-06-21 16:54:59,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=418526.1666666667, ans=0.0 2024-06-21 16:55:17,186 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2024-06-21 16:55:17,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.28 vs. limit=15.0 2024-06-21 16:55:23,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=418581.1666666667, ans=0.2 2024-06-21 16:55:26,554 INFO [train.py:1028] (1/2) Epoch 23, batch 5750, loss[loss=0.2083, simple_loss=0.2571, pruned_loss=0.07974, over 12755.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.2316, pruned_loss=0.06373, over 2577629.97 frames. 
], batch size: 176, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:55:26,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=418599.5, ans=0.0 2024-06-21 16:55:28,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=418599.5, ans=0.2 2024-06-21 16:55:34,368 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.52 vs. limit=15.0 2024-06-21 16:55:47,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=418636.1666666667, ans=0.0 2024-06-21 16:55:51,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.00 vs. limit=22.5 2024-06-21 16:55:56,674 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.052e+02 2.216e+02 2.399e+02 3.058e+02, threshold=4.431e+02, percent-clipped=0.0 2024-06-21 16:56:01,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=418672.8333333333, ans=0.2 2024-06-21 16:56:02,973 INFO [train.py:1028] (1/2) Epoch 23, batch 5800, loss[loss=0.1863, simple_loss=0.2369, pruned_loss=0.06784, over 12759.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.2327, pruned_loss=0.06437, over 2576504.64 frames. ], batch size: 176, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:56:03,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=418691.1666666667, ans=0.07 2024-06-21 16:56:07,983 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.47 vs. limit=22.5 2024-06-21 16:56:14,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=418709.5, ans=0.125 2024-06-21 16:56:19,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=418727.8333333333, ans=0.5 2024-06-21 16:56:26,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418746.1666666667, ans=0.1 2024-06-21 16:56:27,385 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.14 vs. limit=15.0 2024-06-21 16:56:39,888 INFO [train.py:1028] (1/2) Epoch 23, batch 5850, loss[loss=0.2005, simple_loss=0.2492, pruned_loss=0.07588, over 12598.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2348, pruned_loss=0.06503, over 2575177.83 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:56:54,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=418819.5, ans=0.125 2024-06-21 16:57:07,048 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.171e+02 2.301e+02 2.500e+02 3.619e+02, threshold=4.603e+02, percent-clipped=0.0 2024-06-21 16:57:09,988 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.41 vs. 
limit=15.0 2024-06-21 16:57:12,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418856.1666666667, ans=0.1 2024-06-21 16:57:13,695 INFO [train.py:1028] (1/2) Epoch 23, batch 5900, loss[loss=0.1753, simple_loss=0.2219, pruned_loss=0.06431, over 13062.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2359, pruned_loss=0.06538, over 2575957.19 frames. ], batch size: 121, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:57:15,511 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.01 vs. limit=15.0 2024-06-21 16:57:34,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=418929.5, ans=0.0 2024-06-21 16:57:44,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.50 vs. limit=15.0 2024-06-21 16:57:46,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=418947.8333333333, ans=0.0 2024-06-21 16:57:48,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=418947.8333333333, ans=0.125 2024-06-21 16:57:52,429 INFO [train.py:1028] (1/2) Epoch 23, batch 5950, loss[loss=0.1578, simple_loss=0.2077, pruned_loss=0.05396, over 13099.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2364, pruned_loss=0.06551, over 2579901.71 frames. ], batch size: 121, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:58:00,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=418984.5, ans=0.025 2024-06-21 16:58:02,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.69 vs. limit=10.0 2024-06-21 16:58:13,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=419021.1666666667, ans=0.125 2024-06-21 16:58:19,029 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.113e+02 2.280e+02 2.544e+02 4.069e+02, threshold=4.560e+02, percent-clipped=0.0 2024-06-21 16:58:21,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=419039.5, ans=0.125 2024-06-21 16:58:23,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=419039.5, ans=0.125 2024-06-21 16:58:25,492 INFO [train.py:1028] (1/2) Epoch 23, batch 6000, loss[loss=0.2151, simple_loss=0.2672, pruned_loss=0.08146, over 12255.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2372, pruned_loss=0.06578, over 2573878.93 frames. ], batch size: 240, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:58:25,493 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 16:58:31,185 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.6372, 3.1272, 3.0707, 2.4223, 2.8738, 2.9448, 3.0538, 2.8591], device='cuda:1') 2024-06-21 16:58:38,707 INFO [train.py:1060] (1/2) Epoch 23, validation: loss=0.1878, simple_loss=0.2508, pruned_loss=0.06241, over 351949.00 frames. 
2024-06-21 16:58:38,707 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 16:59:12,948 INFO [train.py:1028] (1/2) Epoch 23, batch 6050, loss[loss=0.208, simple_loss=0.2618, pruned_loss=0.07713, over 12934.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2394, pruned_loss=0.06657, over 2576585.52 frames. ], batch size: 39, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:59:29,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=419186.1666666667, ans=0.025 2024-06-21 16:59:36,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2024-06-21 16:59:39,472 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.190e+02 2.300e+02 2.576e+02 3.315e+02, threshold=4.601e+02, percent-clipped=0.0 2024-06-21 16:59:46,438 INFO [train.py:1028] (1/2) Epoch 23, batch 6100, loss[loss=0.1713, simple_loss=0.217, pruned_loss=0.06278, over 13180.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2401, pruned_loss=0.06665, over 2580265.66 frames. ], batch size: 121, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:59:54,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.47 vs. limit=22.5 2024-06-21 16:59:57,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419259.5, ans=0.1 2024-06-21 17:00:03,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=419277.8333333333, ans=0.125 2024-06-21 17:00:06,643 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:00:11,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=419296.1666666667, ans=0.125 2024-06-21 17:00:12,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.14 vs. limit=15.0 2024-06-21 17:00:19,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=419314.5, ans=0.125 2024-06-21 17:00:20,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=419314.5, ans=0.125 2024-06-21 17:00:25,313 INFO [train.py:1028] (1/2) Epoch 23, batch 6150, loss[loss=0.203, simple_loss=0.2386, pruned_loss=0.08367, over 11174.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2417, pruned_loss=0.06722, over 2578709.33 frames. ], batch size: 303, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:00:25,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419332.8333333333, ans=0.1 2024-06-21 17:00:31,863 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.48 vs. 
limit=15.0 2024-06-21 17:00:39,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=419369.5, ans=0.0 2024-06-21 17:00:49,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=419369.5, ans=0.125 2024-06-21 17:00:52,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=419387.8333333333, ans=0.025 2024-06-21 17:00:56,492 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.47 vs. limit=15.0 2024-06-21 17:00:57,463 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.168e+02 2.448e+02 2.867e+02 4.396e+02, threshold=4.896e+02, percent-clipped=0.0 2024-06-21 17:01:04,693 INFO [train.py:1028] (1/2) Epoch 23, batch 6200, loss[loss=0.2266, simple_loss=0.2756, pruned_loss=0.08882, over 13250.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2434, pruned_loss=0.06803, over 2576462.98 frames. ], batch size: 89, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:01:08,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=419424.5, ans=0.09899494936611666 2024-06-21 17:01:12,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=419442.8333333333, ans=0.125 2024-06-21 17:01:15,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=419442.8333333333, ans=0.025 2024-06-21 17:01:26,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=419479.5, ans=0.0 2024-06-21 17:01:28,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=419479.5, ans=0.125 2024-06-21 17:01:39,300 INFO [train.py:1028] (1/2) Epoch 23, batch 6250, loss[loss=0.1908, simple_loss=0.2492, pruned_loss=0.06619, over 13238.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2441, pruned_loss=0.06824, over 2568970.81 frames. 
], batch size: 83, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:01:41,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=419516.1666666667, ans=0.125 2024-06-21 17:01:42,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=419516.1666666667, ans=0.125 2024-06-21 17:01:56,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419552.8333333333, ans=0.1 2024-06-21 17:01:58,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=419552.8333333333, ans=10.0 2024-06-21 17:01:58,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=419552.8333333333, ans=0.125 2024-06-21 17:01:59,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=419552.8333333333, ans=0.0 2024-06-21 17:01:59,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=419552.8333333333, ans=0.09899494936611666 2024-06-21 17:02:00,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=419552.8333333333, ans=0.2 2024-06-21 17:02:08,634 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.127e+02 2.363e+02 2.536e+02 3.890e+02, threshold=4.725e+02, percent-clipped=0.0 2024-06-21 17:02:12,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=419589.5, ans=0.125 2024-06-21 17:02:14,933 INFO [train.py:1028] (1/2) Epoch 23, batch 6300, loss[loss=0.177, simple_loss=0.2299, pruned_loss=0.06211, over 11175.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2453, pruned_loss=0.06842, over 2563568.89 frames. ], batch size: 16, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:02:22,754 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.00 vs. limit=12.0 2024-06-21 17:02:29,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2024-06-21 17:02:48,342 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.69 vs. limit=15.0 2024-06-21 17:02:49,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.29 vs. limit=15.0 2024-06-21 17:02:51,128 INFO [train.py:1028] (1/2) Epoch 23, batch 6350, loss[loss=0.2149, simple_loss=0.26, pruned_loss=0.08489, over 12501.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.247, pruned_loss=0.06893, over 2573477.32 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:02:54,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=419699.5, ans=0.125 2024-06-21 17:03:02,442 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.75 vs. 
limit=10.0 2024-06-21 17:03:10,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=419754.5, ans=0.2 2024-06-21 17:03:17,239 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.244e+02 2.400e+02 2.607e+02 3.431e+02, threshold=4.801e+02, percent-clipped=0.0 2024-06-21 17:03:23,888 INFO [train.py:1028] (1/2) Epoch 23, batch 6400, loss[loss=0.1849, simple_loss=0.248, pruned_loss=0.06092, over 13205.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2494, pruned_loss=0.06992, over 2574513.07 frames. ], batch size: 67, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:03:29,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.23 vs. limit=15.0 2024-06-21 17:03:33,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=419809.5, ans=0.125 2024-06-21 17:03:35,054 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:03:40,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=419827.8333333333, ans=0.125 2024-06-21 17:03:44,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=419846.1666666667, ans=0.2 2024-06-21 17:03:54,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419864.5, ans=0.1 2024-06-21 17:03:57,124 INFO [train.py:1028] (1/2) Epoch 23, batch 6450, loss[loss=0.2304, simple_loss=0.2749, pruned_loss=0.09299, over 12574.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2512, pruned_loss=0.07062, over 2580304.11 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:04:22,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=419937.8333333333, ans=0.125 2024-06-21 17:04:28,399 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.221e+02 2.366e+02 2.529e+02 3.614e+02, threshold=4.733e+02, percent-clipped=0.0 2024-06-21 17:04:34,834 INFO [train.py:1028] (1/2) Epoch 23, batch 6500, loss[loss=0.2153, simple_loss=0.2609, pruned_loss=0.08487, over 10899.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2527, pruned_loss=0.07092, over 2584358.80 frames. ], batch size: 303, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:05:05,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=420047.8333333333, ans=0.2 2024-06-21 17:05:11,452 INFO [train.py:1028] (1/2) Epoch 23, batch 6550, loss[loss=0.2003, simple_loss=0.2616, pruned_loss=0.06946, over 12530.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2536, pruned_loss=0.07123, over 2589304.35 frames. 
], batch size: 22, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:05:18,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=420084.5, ans=0.2 2024-06-21 17:05:18,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=420084.5, ans=0.125 2024-06-21 17:05:24,449 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:05:26,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=420102.8333333333, ans=0.125 2024-06-21 17:05:33,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=420121.1666666667, ans=0.1 2024-06-21 17:05:38,180 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.203e+02 2.341e+02 2.506e+02 3.544e+02, threshold=4.683e+02, percent-clipped=0.0 2024-06-21 17:05:43,732 INFO [train.py:1028] (1/2) Epoch 23, batch 6600, loss[loss=0.1772, simple_loss=0.2432, pruned_loss=0.05566, over 13251.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2536, pruned_loss=0.07107, over 2592609.37 frames. ], batch size: 72, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:05:51,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=420176.1666666667, ans=0.2 2024-06-21 17:05:53,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=420176.1666666667, ans=0.1 2024-06-21 17:06:03,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=420212.8333333333, ans=0.125 2024-06-21 17:06:07,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=420212.8333333333, ans=0.125 2024-06-21 17:06:11,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=420212.8333333333, ans=0.025 2024-06-21 17:06:16,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=420231.1666666667, ans=0.2 2024-06-21 17:06:18,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=420231.1666666667, ans=0.0 2024-06-21 17:06:20,112 INFO [train.py:1028] (1/2) Epoch 23, batch 6650, loss[loss=0.2003, simple_loss=0.2569, pruned_loss=0.07189, over 12913.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2553, pruned_loss=0.07165, over 2587062.67 frames. 
], batch size: 158, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:06:27,796 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:06:38,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=420286.1666666667, ans=0.0 2024-06-21 17:06:43,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=420304.5, ans=0.125 2024-06-21 17:06:47,493 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.193e+02 2.403e+02 2.706e+02 4.064e+02, threshold=4.806e+02, percent-clipped=0.0 2024-06-21 17:06:48,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=420322.8333333333, ans=0.1 2024-06-21 17:06:53,403 INFO [train.py:1028] (1/2) Epoch 23, batch 6700, loss[loss=0.2276, simple_loss=0.2789, pruned_loss=0.08811, over 12758.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2565, pruned_loss=0.07191, over 2586236.86 frames. ], batch size: 176, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:07:03,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420359.5, ans=0.125 2024-06-21 17:07:03,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=420359.5, ans=0.125 2024-06-21 17:07:23,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=420414.5, ans=0.1 2024-06-21 17:07:28,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=420414.5, ans=0.125 2024-06-21 17:07:30,149 INFO [train.py:1028] (1/2) Epoch 23, batch 6750, loss[loss=0.2448, simple_loss=0.2869, pruned_loss=0.1013, over 12247.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2572, pruned_loss=0.07241, over 2579381.51 frames. ], batch size: 241, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:07:40,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=420451.1666666667, ans=0.1 2024-06-21 17:07:42,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420469.5, ans=0.125 2024-06-21 17:07:46,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=420469.5, ans=0.125 2024-06-21 17:07:49,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=420487.8333333333, ans=0.1 2024-06-21 17:07:56,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=420506.1666666667, ans=0.0 2024-06-21 17:07:57,235 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.308e+02 2.490e+02 2.770e+02 3.712e+02, threshold=4.981e+02, percent-clipped=0.0 2024-06-21 17:08:03,108 INFO [train.py:1028] (1/2) Epoch 23, batch 6800, loss[loss=0.1762, simple_loss=0.2376, pruned_loss=0.05745, over 13206.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2579, pruned_loss=0.07254, over 2581003.01 frames. 
], batch size: 67, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:08:09,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=420524.5, ans=0.125 2024-06-21 17:08:13,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=420542.8333333333, ans=0.2 2024-06-21 17:08:14,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=420542.8333333333, ans=0.0 2024-06-21 17:08:19,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=420561.1666666667, ans=0.0 2024-06-21 17:08:20,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=420561.1666666667, ans=0.0 2024-06-21 17:08:34,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=420597.8333333333, ans=0.0 2024-06-21 17:08:35,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=420597.8333333333, ans=0.125 2024-06-21 17:08:38,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=420616.1666666667, ans=0.1 2024-06-21 17:08:38,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=420616.1666666667, ans=0.125 2024-06-21 17:08:39,262 INFO [train.py:1028] (1/2) Epoch 23, batch 6850, loss[loss=0.2068, simple_loss=0.2719, pruned_loss=0.07083, over 13209.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2585, pruned_loss=0.07239, over 2584862.60 frames. 
], batch size: 63, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:08:42,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=420616.1666666667, ans=0.1 2024-06-21 17:08:43,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=420616.1666666667, ans=0.0 2024-06-21 17:08:46,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=420634.5, ans=0.2 2024-06-21 17:08:48,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=420634.5, ans=0.0 2024-06-21 17:08:50,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=420634.5, ans=0.125 2024-06-21 17:08:54,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=420652.8333333333, ans=0.025 2024-06-21 17:08:54,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=420652.8333333333, ans=0.0 2024-06-21 17:09:05,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=420671.1666666667, ans=0.0 2024-06-21 17:09:08,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=420689.5, ans=0.5 2024-06-21 17:09:09,821 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.215e+02 2.460e+02 2.801e+02 3.938e+02, threshold=4.919e+02, percent-clipped=0.0 2024-06-21 17:09:14,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=420689.5, ans=0.02 2024-06-21 17:09:15,962 INFO [train.py:1028] (1/2) Epoch 23, batch 6900, loss[loss=0.1814, simple_loss=0.2459, pruned_loss=0.05842, over 13028.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2595, pruned_loss=0.07264, over 2585809.12 frames. ], batch size: 48, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:09:28,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=420744.5, ans=0.025 2024-06-21 17:09:32,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=420744.5, ans=0.2 2024-06-21 17:09:44,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=420781.1666666667, ans=0.125 2024-06-21 17:09:49,124 INFO [train.py:1028] (1/2) Epoch 23, batch 6950, loss[loss=0.1798, simple_loss=0.2364, pruned_loss=0.06163, over 10988.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2592, pruned_loss=0.07217, over 2579470.88 frames. 
], batch size: 16, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:09:49,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=420799.5, ans=0.125 2024-06-21 17:10:00,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=420817.8333333333, ans=0.0 2024-06-21 17:10:19,609 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.171e+02 2.391e+02 2.564e+02 3.307e+02, threshold=4.782e+02, percent-clipped=0.0 2024-06-21 17:10:21,954 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.47 vs. limit=22.5 2024-06-21 17:10:25,307 INFO [train.py:1028] (1/2) Epoch 23, batch 7000, loss[loss=0.211, simple_loss=0.2654, pruned_loss=0.07834, over 12921.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2597, pruned_loss=0.07245, over 2575913.37 frames. ], batch size: 158, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:10:27,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=420891.1666666667, ans=0.125 2024-06-21 17:10:31,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=420909.5, ans=0.125 2024-06-21 17:10:34,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=420909.5, ans=0.125 2024-06-21 17:10:39,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2024-06-21 17:10:45,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=420927.8333333333, ans=0.125 2024-06-21 17:10:48,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=420946.1666666667, ans=0.1 2024-06-21 17:10:50,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.92 vs. limit=15.0 2024-06-21 17:10:59,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=420982.8333333333, ans=0.1 2024-06-21 17:11:00,150 INFO [train.py:1028] (1/2) Epoch 23, batch 7050, loss[loss=0.2124, simple_loss=0.2631, pruned_loss=0.08087, over 12787.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2609, pruned_loss=0.07273, over 2582893.34 frames. 
], batch size: 177, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:11:01,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=420982.8333333333, ans=0.125 2024-06-21 17:11:27,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=421037.8333333333, ans=0.125 2024-06-21 17:11:28,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=421056.1666666667, ans=0.125 2024-06-21 17:11:29,656 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.236e+02 2.440e+02 2.712e+02 3.589e+02, threshold=4.880e+02, percent-clipped=0.0 2024-06-21 17:11:31,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=421056.1666666667, ans=10.0 2024-06-21 17:11:33,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=421056.1666666667, ans=0.125 2024-06-21 17:11:35,224 INFO [train.py:1028] (1/2) Epoch 23, batch 7100, loss[loss=0.2388, simple_loss=0.2968, pruned_loss=0.09038, over 13168.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2615, pruned_loss=0.0732, over 2575913.33 frames. ], batch size: 112, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:11:44,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.37 vs. limit=22.5 2024-06-21 17:11:45,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=421092.8333333333, ans=0.125 2024-06-21 17:11:47,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=421092.8333333333, ans=0.0 2024-06-21 17:11:53,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.10 vs. limit=22.5 2024-06-21 17:11:57,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=421129.5, ans=0.125 2024-06-21 17:11:58,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=421129.5, ans=0.015 2024-06-21 17:12:01,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=421129.5, ans=0.1 2024-06-21 17:12:05,177 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.22 vs. limit=15.0 2024-06-21 17:12:08,883 INFO [train.py:1028] (1/2) Epoch 23, batch 7150, loss[loss=0.229, simple_loss=0.2814, pruned_loss=0.08833, over 12501.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.262, pruned_loss=0.07317, over 2574647.34 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:12:17,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.35 vs. 
limit=10.0 2024-06-21 17:12:18,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=421184.5, ans=0.025 2024-06-21 17:12:25,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=421202.8333333333, ans=0.125 2024-06-21 17:12:26,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2024-06-21 17:12:39,421 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.241e+02 2.374e+02 2.639e+02 4.131e+02, threshold=4.749e+02, percent-clipped=0.0 2024-06-21 17:12:41,514 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:12:45,137 INFO [train.py:1028] (1/2) Epoch 23, batch 7200, loss[loss=0.2177, simple_loss=0.2706, pruned_loss=0.08237, over 13152.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2629, pruned_loss=0.07343, over 2579335.20 frames. ], batch size: 112, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:12:45,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=421257.8333333333, ans=0.0 2024-06-21 17:12:47,458 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.42 vs. limit=15.0 2024-06-21 17:12:58,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.65 vs. limit=15.0 2024-06-21 17:13:04,660 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.57 vs. limit=15.0 2024-06-21 17:13:11,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.55 vs. limit=10.0 2024-06-21 17:13:13,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=421331.1666666667, ans=0.1 2024-06-21 17:13:15,660 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2024-06-21 17:13:21,877 INFO [train.py:1028] (1/2) Epoch 23, batch 7250, loss[loss=0.1993, simple_loss=0.2605, pruned_loss=0.06904, over 12963.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2634, pruned_loss=0.07354, over 2580727.41 frames. 
], batch size: 36, lr: 2.51e-03, grad_scale: 32.0
2024-06-21 17:13:26,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=421349.5, ans=0.125
2024-06-21 17:13:30,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=421367.8333333333, ans=0.0
2024-06-21 17:13:33,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=421367.8333333333, ans=10.0
2024-06-21 17:13:40,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=421386.1666666667, ans=0.125
2024-06-21 17:13:47,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=421404.5, ans=0.0
2024-06-21 17:13:48,969 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.269e+02 2.470e+02 2.809e+02 3.868e+02, threshold=4.940e+02, percent-clipped=0.0
2024-06-21 17:13:50,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=421422.8333333333, ans=0.125
2024-06-21 17:13:52,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.27 vs. limit=12.0
2024-06-21 17:13:52,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=421422.8333333333, ans=0.125
2024-06-21 17:13:53,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=421441.1666666667, ans=0.2
2024-06-21 17:13:54,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=421441.1666666667, ans=0.125
2024-06-21 17:13:54,455 INFO [train.py:1028] (1/2) Epoch 23, batch 7300, loss[loss=0.2045, simple_loss=0.2693, pruned_loss=0.06985, over 12896.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2643, pruned_loss=0.07397, over 2579993.48 frames. ], batch size: 36, lr: 2.51e-03, grad_scale: 32.0
2024-06-21 17:13:54,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=421441.1666666667, ans=0.0
2024-06-21 17:13:59,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=421441.1666666667, ans=0.125
2024-06-21 17:14:00,059 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0
2024-06-21 17:14:00,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.07 vs. limit=22.5
2024-06-21 17:14:00,860 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.26 vs. limit=22.5
2024-06-21 17:14:05,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=421459.5, ans=0.025
2024-06-21 17:14:07,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=421477.8333333333, ans=0.125
2024-06-21 17:14:27,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=421514.5, ans=0.015
2024-06-21 17:14:30,770 INFO [train.py:1028] (1/2) Epoch 23, batch 7350, loss[loss=0.2195, simple_loss=0.2823, pruned_loss=0.07832, over 13279.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2654, pruned_loss=0.07487, over 2581437.62 frames. ], batch size: 46, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:14:31,258 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=15.0
2024-06-21 17:14:37,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=421551.1666666667, ans=0.125
2024-06-21 17:14:39,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=421551.1666666667, ans=0.2
2024-06-21 17:14:42,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=421551.1666666667, ans=0.125
2024-06-21 17:14:50,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=421587.8333333333, ans=0.125
2024-06-21 17:14:55,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=421587.8333333333, ans=0.2
2024-06-21 17:14:58,014 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.271e+02 2.454e+02 2.689e+02 3.742e+02, threshold=4.907e+02, percent-clipped=0.0
2024-06-21 17:14:58,099 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 17:15:01,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=421606.1666666667, ans=0.025
2024-06-21 17:15:03,918 INFO [train.py:1028] (1/2) Epoch 23, batch 7400, loss[loss=0.2019, simple_loss=0.2598, pruned_loss=0.07196, over 13236.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2651, pruned_loss=0.07456, over 2587279.92 frames. ], batch size: 63, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:15:09,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=421624.5, ans=0.125
2024-06-21 17:15:15,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=421642.8333333333, ans=0.0
2024-06-21 17:15:41,622 INFO [train.py:1028] (1/2) Epoch 23, batch 7450, loss[loss=0.1858, simple_loss=0.2466, pruned_loss=0.06248, over 12603.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2645, pruned_loss=0.07392, over 2580026.84 frames. ], batch size: 29, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:15:44,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.50 vs. limit=22.5
2024-06-21 17:15:48,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=421734.5, ans=0.125
2024-06-21 17:15:58,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=421752.8333333333, ans=0.0
2024-06-21 17:16:09,613 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.263e+02 2.400e+02 2.749e+02 3.809e+02, threshold=4.801e+02, percent-clipped=0.0
2024-06-21 17:16:15,115 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0
2024-06-21 17:16:15,285 INFO [train.py:1028] (1/2) Epoch 23, batch 7500, loss[loss=0.2286, simple_loss=0.2728, pruned_loss=0.09221, over 10429.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2659, pruned_loss=0.07483, over 2577342.64 frames. ], batch size: 303, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:16:24,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=421807.8333333333, ans=0.125
2024-06-21 17:16:26,187 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=421826.1666666667, ans=0.125
2024-06-21 17:16:28,437 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=15.0
2024-06-21 17:16:31,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=421844.5, ans=0.2
2024-06-21 17:16:39,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=421862.8333333333, ans=0.125
2024-06-21 17:16:51,335 INFO [train.py:1028] (1/2) Epoch 23, batch 7550, loss[loss=0.181, simple_loss=0.2388, pruned_loss=0.06155, over 12958.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2665, pruned_loss=0.07528, over 2576893.78 frames. ], batch size: 158, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:16:56,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=421899.5, ans=0.07
2024-06-21 17:17:10,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.37 vs. limit=22.5
2024-06-21 17:17:18,466 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.284e+02 2.397e+02 2.675e+02 3.911e+02, threshold=4.793e+02, percent-clipped=0.0
2024-06-21 17:17:20,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=421972.8333333333, ans=0.125
2024-06-21 17:17:27,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=421972.8333333333, ans=0.0
2024-06-21 17:17:28,247 INFO [train.py:1028] (1/2) Epoch 23, batch 7600, loss[loss=0.2113, simple_loss=0.2763, pruned_loss=0.07313, over 13211.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2674, pruned_loss=0.07553, over 2575988.25 frames. ], batch size: 83, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:17:30,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=421991.1666666667, ans=0.125
2024-06-21 17:17:36,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=422009.5, ans=0.125
2024-06-21 17:17:47,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=422046.1666666667, ans=0.2
2024-06-21 17:17:50,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.09 vs. limit=22.5
2024-06-21 17:17:52,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.28 vs. limit=10.0
2024-06-21 17:17:56,097 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 17:17:56,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=422064.5, ans=0.125
2024-06-21 17:18:01,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=422082.8333333333, ans=0.125
2024-06-21 17:18:02,104 INFO [train.py:1028] (1/2) Epoch 23, batch 7650, loss[loss=0.2032, simple_loss=0.2662, pruned_loss=0.0701, over 12926.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2678, pruned_loss=0.0756, over 2571752.15 frames. ], batch size: 33, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:18:05,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=422082.8333333333, ans=0.125
2024-06-21 17:18:10,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0
2024-06-21 17:18:26,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=422137.8333333333, ans=0.125
2024-06-21 17:18:27,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=422137.8333333333, ans=0.125
2024-06-21 17:18:30,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=422137.8333333333, ans=0.125
2024-06-21 17:18:31,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=422137.8333333333, ans=0.125
2024-06-21 17:18:33,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=422156.1666666667, ans=0.125
2024-06-21 17:18:33,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.270e+02 2.433e+02 2.654e+02 3.335e+02, threshold=4.867e+02, percent-clipped=0.0
2024-06-21 17:18:34,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=422156.1666666667, ans=0.2
2024-06-21 17:18:39,969 INFO [train.py:1028] (1/2) Epoch 23, batch 7700, loss[loss=0.2277, simple_loss=0.2982, pruned_loss=0.0786, over 13285.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2688, pruned_loss=0.07605, over 2569378.19 frames. ], batch size: 63, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:18:44,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=422174.5, ans=0.125
2024-06-21 17:18:48,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=422192.8333333333, ans=0.125
2024-06-21 17:18:52,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.89 vs. limit=6.0
2024-06-21 17:18:55,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=422211.1666666667, ans=0.0
2024-06-21 17:18:56,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=422211.1666666667, ans=0.125
2024-06-21 17:18:57,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=422211.1666666667, ans=0.0
2024-06-21 17:19:03,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=422229.5, ans=0.02
2024-06-21 17:19:12,670 INFO [train.py:1028] (1/2) Epoch 23, batch 7750, loss[loss=0.2185, simple_loss=0.2811, pruned_loss=0.07796, over 13320.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2693, pruned_loss=0.07651, over 2574144.94 frames. ], batch size: 72, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:19:27,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=422284.5, ans=0.0
2024-06-21 17:19:29,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0
2024-06-21 17:19:43,203 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.297e+02 2.433e+02 2.676e+02 3.659e+02, threshold=4.866e+02, percent-clipped=0.0
2024-06-21 17:19:45,100 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.93 vs. limit=15.0
2024-06-21 17:19:49,475 INFO [train.py:1028] (1/2) Epoch 23, batch 7800, loss[loss=0.2179, simple_loss=0.2723, pruned_loss=0.08177, over 13158.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2705, pruned_loss=0.07689, over 2578922.81 frames. ], batch size: 95, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:19:55,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=422376.1666666667, ans=0.025
2024-06-21 17:20:01,131 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.31 vs. limit=22.5
2024-06-21 17:20:05,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=422394.5, ans=15.0
2024-06-21 17:20:19,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=422431.1666666667, ans=0.1
2024-06-21 17:20:23,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=422431.1666666667, ans=0.0
2024-06-21 17:20:25,831 INFO [train.py:1028] (1/2) Epoch 23, batch 7850, loss[loss=0.1878, simple_loss=0.2497, pruned_loss=0.06299, over 12086.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2712, pruned_loss=0.07725, over 2574532.86 frames. ], batch size: 18, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:20:35,191 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0
2024-06-21 17:20:49,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=422504.5, ans=0.1
2024-06-21 17:20:52,649 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.296e+02 2.425e+02 2.552e+02 3.284e+02, threshold=4.850e+02, percent-clipped=0.0
2024-06-21 17:20:58,639 INFO [train.py:1028] (1/2) Epoch 23, batch 7900, loss[loss=0.209, simple_loss=0.2744, pruned_loss=0.07182, over 13143.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2719, pruned_loss=0.07737, over 2574195.04 frames. ], batch size: 77, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:21:09,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=422559.5, ans=0.125
2024-06-21 17:21:11,061 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0
2024-06-21 17:21:11,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=422577.8333333333, ans=0.0
2024-06-21 17:21:19,606 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.38 vs. limit=22.5
2024-06-21 17:21:34,974 INFO [train.py:1028] (1/2) Epoch 23, batch 7950, loss[loss=0.2012, simple_loss=0.2491, pruned_loss=0.0766, over 10619.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2718, pruned_loss=0.07714, over 2576697.03 frames. ], batch size: 304, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:21:47,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=422669.5, ans=0.125
2024-06-21 17:21:59,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=422687.8333333333, ans=0.125
2024-06-21 17:22:01,899 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.249e+02 2.398e+02 2.663e+02 3.324e+02, threshold=4.796e+02, percent-clipped=0.0
2024-06-21 17:22:08,113 INFO [train.py:1028] (1/2) Epoch 23, batch 8000, loss[loss=0.1937, simple_loss=0.2605, pruned_loss=0.06342, over 12717.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2727, pruned_loss=0.07749, over 2574024.47 frames. ], batch size: 29, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:22:08,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=422724.5, ans=0.125
2024-06-21 17:22:08,436 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.64 vs. limit=22.5
2024-06-21 17:22:11,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=422724.5, ans=0.2
2024-06-21 17:22:26,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=422761.1666666667, ans=0.2
2024-06-21 17:22:41,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=422797.8333333333, ans=0.0
2024-06-21 17:22:45,761 INFO [train.py:1028] (1/2) Epoch 23, batch 8050, loss[loss=0.2277, simple_loss=0.2844, pruned_loss=0.08549, over 13244.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2719, pruned_loss=0.07698, over 2573582.56 frames. ], batch size: 83, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:22:50,664 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.66 vs. limit=15.0
2024-06-21 17:23:00,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=422852.8333333333, ans=0.125
2024-06-21 17:23:03,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=422852.8333333333, ans=0.0
2024-06-21 17:23:11,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=422889.5, ans=0.125
2024-06-21 17:23:11,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.66 vs. limit=22.5
2024-06-21 17:23:12,258 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.258e+02 2.389e+02 2.590e+02 3.796e+02, threshold=4.777e+02, percent-clipped=0.0
2024-06-21 17:23:21,339 INFO [train.py:1028] (1/2) Epoch 23, batch 8100, loss[loss=0.2227, simple_loss=0.2786, pruned_loss=0.08339, over 13178.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2726, pruned_loss=0.07747, over 2577465.62 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:23:33,455 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=22.5
2024-06-21 17:23:41,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=422962.8333333333, ans=0.125
2024-06-21 17:23:51,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=422981.1666666667, ans=0.2
2024-06-21 17:23:54,835 INFO [train.py:1028] (1/2) Epoch 23, batch 8150, loss[loss=0.2031, simple_loss=0.264, pruned_loss=0.07114, over 13084.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2729, pruned_loss=0.07717, over 2581913.48 frames. ], batch size: 121, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:24:07,953 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0
2024-06-21 17:24:10,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423036.1666666667, ans=0.1
2024-06-21 17:24:12,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=423036.1666666667, ans=0.2
2024-06-21 17:24:13,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=423054.5, ans=0.125
2024-06-21 17:24:14,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=423054.5, ans=0.0
2024-06-21 17:24:23,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.81 vs. limit=22.5
2024-06-21 17:24:24,796 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.289e+02 2.395e+02 2.540e+02 3.096e+02, threshold=4.791e+02, percent-clipped=0.0
2024-06-21 17:24:28,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423072.8333333333, ans=0.1
2024-06-21 17:24:30,765 INFO [train.py:1028] (1/2) Epoch 23, batch 8200, loss[loss=0.2199, simple_loss=0.2803, pruned_loss=0.07976, over 13128.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2734, pruned_loss=0.07745, over 2585013.36 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:24:39,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=423109.5, ans=0.125
2024-06-21 17:24:44,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=423127.8333333333, ans=0.1
2024-06-21 17:24:49,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=423127.8333333333, ans=0.125
2024-06-21 17:24:53,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=423146.1666666667, ans=0.125
2024-06-21 17:24:53,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423146.1666666667, ans=0.1
2024-06-21 17:25:02,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=423164.5, ans=0.125
2024-06-21 17:25:04,420 INFO [train.py:1028] (1/2) Epoch 23, batch 8250, loss[loss=0.1956, simple_loss=0.2505, pruned_loss=0.07037, over 13238.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2731, pruned_loss=0.07748, over 2584810.78 frames. ], batch size: 52, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:25:15,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=423201.1666666667, ans=0.0
2024-06-21 17:25:17,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=423219.5, ans=0.125
2024-06-21 17:25:30,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=423237.8333333333, ans=0.125
2024-06-21 17:25:33,557 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.244e+02 2.353e+02 2.485e+02 4.515e+02, threshold=4.706e+02, percent-clipped=0.0
2024-06-21 17:25:39,136 INFO [train.py:1028] (1/2) Epoch 23, batch 8300, loss[loss=0.2166, simple_loss=0.2713, pruned_loss=0.08096, over 13022.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2723, pruned_loss=0.07701, over 2581817.96 frames. ], batch size: 102, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:25:50,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0
2024-06-21 17:25:53,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=423311.1666666667, ans=0.125
2024-06-21 17:25:54,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=423311.1666666667, ans=0.0
2024-06-21 17:25:54,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=423311.1666666667, ans=0.125
2024-06-21 17:25:57,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=423311.1666666667, ans=0.125
2024-06-21 17:26:03,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=423329.5, ans=0.125
2024-06-21 17:26:05,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0
2024-06-21 17:26:06,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0
2024-06-21 17:26:08,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=423347.8333333333, ans=0.0
2024-06-21 17:26:08,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=423347.8333333333, ans=0.0
2024-06-21 17:26:12,258 INFO [train.py:1028] (1/2) Epoch 23, batch 8350, loss[loss=0.2302, simple_loss=0.286, pruned_loss=0.08716, over 13208.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2721, pruned_loss=0.07686, over 2582823.38 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:26:31,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=423402.8333333333, ans=0.0
2024-06-21 17:26:38,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=423421.1666666667, ans=0.125
2024-06-21 17:26:43,211 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.281e+02 2.427e+02 2.658e+02 3.952e+02, threshold=4.854e+02, percent-clipped=0.0
2024-06-21 17:26:43,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.17 vs. limit=6.0
2024-06-21 17:26:44,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=423439.5, ans=0.125
2024-06-21 17:26:49,486 INFO [train.py:1028] (1/2) Epoch 23, batch 8400, loss[loss=0.2129, simple_loss=0.27, pruned_loss=0.07793, over 12906.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2723, pruned_loss=0.07702, over 2578871.43 frames. ], batch size: 39, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:26:50,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=423457.8333333333, ans=0.2
2024-06-21 17:26:52,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423457.8333333333, ans=0.1
2024-06-21 17:26:58,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423476.1666666667, ans=0.1
2024-06-21 17:27:02,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=423494.5, ans=0.0
2024-06-21 17:27:13,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423512.8333333333, ans=0.1
2024-06-21 17:27:24,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=423531.1666666667, ans=0.125
2024-06-21 17:27:26,510 INFO [train.py:1028] (1/2) Epoch 23, batch 8450, loss[loss=0.2212, simple_loss=0.2831, pruned_loss=0.07966, over 13137.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.273, pruned_loss=0.0773, over 2580235.17 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:27:31,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.38 vs. limit=15.0
2024-06-21 17:27:53,922 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.283e+02 2.427e+02 2.737e+02 3.622e+02, threshold=4.853e+02, percent-clipped=0.0
2024-06-21 17:27:59,906 INFO [train.py:1028] (1/2) Epoch 23, batch 8500, loss[loss=0.1924, simple_loss=0.2554, pruned_loss=0.06465, over 12646.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2741, pruned_loss=0.0775, over 2577408.27 frames. ], batch size: 29, lr: 2.50e-03, grad_scale: 64.0
2024-06-21 17:28:01,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=423641.1666666667, ans=0.0
2024-06-21 17:28:08,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.80 vs. limit=15.0
2024-06-21 17:28:10,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=423659.5, ans=0.0
2024-06-21 17:28:37,451 INFO [train.py:1028] (1/2) Epoch 23, batch 8550, loss[loss=0.2085, simple_loss=0.2695, pruned_loss=0.0738, over 12573.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2737, pruned_loss=0.07708, over 2576805.49 frames. ], batch size: 22, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:28:40,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423732.8333333333, ans=0.1
2024-06-21 17:28:40,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=423732.8333333333, ans=0.2
2024-06-21 17:28:44,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423751.1666666667, ans=0.1
2024-06-21 17:29:05,013 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.307e+02 2.470e+02 2.765e+02 3.807e+02, threshold=4.940e+02, percent-clipped=0.0
2024-06-21 17:29:08,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=423806.1666666667, ans=0.1
2024-06-21 17:29:10,356 INFO [train.py:1028] (1/2) Epoch 23, batch 8600, loss[loss=0.2117, simple_loss=0.2639, pruned_loss=0.07969, over 13134.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2741, pruned_loss=0.07733, over 2573060.43 frames. ], batch size: 121, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:29:11,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=423824.5, ans=0.125
2024-06-21 17:29:47,800 INFO [train.py:1028] (1/2) Epoch 23, batch 8650, loss[loss=0.1949, simple_loss=0.2528, pruned_loss=0.06853, over 13064.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2746, pruned_loss=0.07725, over 2575899.80 frames. ], batch size: 102, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:29:53,325 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.39 vs. limit=15.0
2024-06-21 17:29:58,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.42 vs. limit=15.0
2024-06-21 17:30:03,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=423952.8333333333, ans=0.125
2024-06-21 17:30:06,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=423971.1666666667, ans=0.05
2024-06-21 17:30:08,939 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=15.0
2024-06-21 17:30:15,189 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.277e+02 2.375e+02 2.589e+02 3.238e+02, threshold=4.751e+02, percent-clipped=0.0
2024-06-21 17:30:26,226 INFO [train.py:1028] (1/2) Epoch 23, batch 8700, loss[loss=0.2272, simple_loss=0.2941, pruned_loss=0.08017, over 13202.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2745, pruned_loss=0.07735, over 2573157.06 frames. ], batch size: 59, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:30:27,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=424007.8333333333, ans=0.125
2024-06-21 17:30:30,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=424007.8333333333, ans=0.125
2024-06-21 17:30:30,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=424007.8333333333, ans=0.125
2024-06-21 17:30:32,886 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 17:30:35,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=424026.1666666667, ans=0.125
2024-06-21 17:30:38,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=424026.1666666667, ans=0.125
2024-06-21 17:30:43,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=15.0
2024-06-21 17:30:56,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=424081.1666666667, ans=0.125
2024-06-21 17:30:57,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=424081.1666666667, ans=0.125
2024-06-21 17:31:00,915 INFO [train.py:1028] (1/2) Epoch 23, batch 8750, loss[loss=0.2134, simple_loss=0.269, pruned_loss=0.07886, over 13121.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.275, pruned_loss=0.07757, over 2568759.71 frames. ], batch size: 121, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:31:04,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=424099.5, ans=0.125
2024-06-21 17:31:08,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=424117.8333333333, ans=0.0
2024-06-21 17:31:18,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=424136.1666666667, ans=0.0
2024-06-21 17:31:23,224 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 17:31:24,742 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0
2024-06-21 17:31:25,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=424154.5, ans=0.125
2024-06-21 17:31:32,904 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.322e+02 2.457e+02 2.775e+02 3.821e+02, threshold=4.913e+02, percent-clipped=0.0
2024-06-21 17:31:34,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.22 vs. limit=10.0
2024-06-21 17:31:36,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=424172.8333333333, ans=0.0
2024-06-21 17:31:38,485 INFO [train.py:1028] (1/2) Epoch 23, batch 8800, loss[loss=0.2026, simple_loss=0.2603, pruned_loss=0.07238, over 13260.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2753, pruned_loss=0.07779, over 2574264.94 frames. ], batch size: 72, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:31:40,688 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=424191.1666666667, ans=0.2
2024-06-21 17:31:54,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=424227.8333333333, ans=0.125
2024-06-21 17:32:12,748 INFO [train.py:1028] (1/2) Epoch 23, batch 8850, loss[loss=0.2471, simple_loss=0.2973, pruned_loss=0.09845, over 12637.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2753, pruned_loss=0.07812, over 2561979.52 frames. ], batch size: 203, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:32:21,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=424282.8333333333, ans=0.125
2024-06-21 17:32:27,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=424301.1666666667, ans=0.125
2024-06-21 17:32:45,763 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.298e+02 2.512e+02 2.729e+02 3.474e+02, threshold=5.024e+02, percent-clipped=0.0
2024-06-21 17:32:46,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.43 vs. limit=5.0
2024-06-21 17:32:51,223 INFO [train.py:1028] (1/2) Epoch 23, batch 8900, loss[loss=0.2199, simple_loss=0.2886, pruned_loss=0.07565, over 12936.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2757, pruned_loss=0.07849, over 2560211.69 frames. ], batch size: 33, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:32:52,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.26 vs. limit=15.0
2024-06-21 17:32:53,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=424374.5, ans=0.0
2024-06-21 17:32:53,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=424374.5, ans=0.125
2024-06-21 17:32:54,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=424374.5, ans=0.0
2024-06-21 17:32:58,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=424392.8333333333, ans=0.125
2024-06-21 17:33:04,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=424411.1666666667, ans=0.125
2024-06-21 17:33:12,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=424429.5, ans=0.025
2024-06-21 17:33:21,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=424447.8333333333, ans=0.125
2024-06-21 17:33:21,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424447.8333333333, ans=0.1
2024-06-21 17:33:28,689 INFO [train.py:1028] (1/2) Epoch 23, batch 8950, loss[loss=0.2349, simple_loss=0.292, pruned_loss=0.08892, over 12606.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2758, pruned_loss=0.07796, over 2560837.96 frames. ], batch size: 202, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:33:38,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=424484.5, ans=0.125
2024-06-21 17:33:39,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.01 vs. limit=15.0
2024-06-21 17:33:41,160 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.69 vs. limit=15.0
2024-06-21 17:33:43,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=424502.8333333333, ans=0.125
2024-06-21 17:33:52,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=424521.1666666667, ans=0.2
2024-06-21 17:33:54,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=424521.1666666667, ans=0.025
2024-06-21 17:33:56,811 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.317e+02 2.447e+02 2.721e+02 3.537e+02, threshold=4.893e+02, percent-clipped=0.0
2024-06-21 17:33:57,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=424539.5, ans=0.1
2024-06-21 17:33:58,598 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.56 vs. limit=15.0
2024-06-21 17:34:02,015 INFO [train.py:1028] (1/2) Epoch 23, batch 9000, loss[loss=0.2006, simple_loss=0.2673, pruned_loss=0.06689, over 13257.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2754, pruned_loss=0.07745, over 2566164.91 frames. ], batch size: 46, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:34:02,016 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 17:34:10,000 INFO [train.py:1060] (1/2) Epoch 23, validation: loss=0.1885, simple_loss=0.2513, pruned_loss=0.06289, over 351949.00 frames.
2024-06-21 17:34:10,000 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 17:34:39,801 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 17:34:45,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424649.5, ans=0.1
2024-06-21 17:34:46,055 INFO [train.py:1028] (1/2) Epoch 23, batch 9050, loss[loss=0.2258, simple_loss=0.2878, pruned_loss=0.08196, over 11867.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2765, pruned_loss=0.0779, over 2565376.38 frames. ], batch size: 17, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:34:52,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=424667.8333333333, ans=0.125
2024-06-21 17:35:13,532 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.312e+02 2.452e+02 2.669e+02 3.532e+02, threshold=4.905e+02, percent-clipped=0.0
2024-06-21 17:35:18,760 INFO [train.py:1028] (1/2) Epoch 23, batch 9100, loss[loss=0.1756, simple_loss=0.243, pruned_loss=0.05408, over 13303.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2753, pruned_loss=0.07706, over 2566259.52 frames. ], batch size: 72, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:35:19,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=424741.1666666667, ans=0.125
2024-06-21 17:35:50,331 INFO [train.py:1028] (1/2) Epoch 23, batch 9150, loss[loss=0.1981, simple_loss=0.2607, pruned_loss=0.06778, over 13232.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2757, pruned_loss=0.07756, over 2567332.43 frames. ], batch size: 77, lr: 2.50e-03, grad_scale: 32.0
2024-06-21 17:35:56,568 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=15.0
2024-06-21 17:35:58,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=424851.1666666667, ans=0.125
2024-06-21 17:36:00,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=424851.1666666667, ans=0.025
2024-06-21 17:36:02,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=424851.1666666667, ans=0.125
2024-06-21 17:36:17,201 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.199e+02 2.303e+02 2.511e+02 3.040e+02, threshold=4.605e+02, percent-clipped=0.0
2024-06-21 17:36:22,382 INFO [train.py:1028] (1/2) Epoch 23, batch 9200, loss[loss=0.2232, simple_loss=0.294, pruned_loss=0.07614, over 13028.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2754, pruned_loss=0.07701, over 2571330.90 frames. ], batch size: 36, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:36:43,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=424979.5, ans=0.125
2024-06-21 17:36:48,429 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.51 vs. limit=12.0
2024-06-21 17:36:51,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=424997.8333333333, ans=0.025
2024-06-21 17:36:53,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=424997.8333333333, ans=0.025
2024-06-21 17:36:54,272 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.11 vs. limit=10.0
2024-06-21 17:36:56,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=424997.8333333333, ans=0.0
2024-06-21 17:36:57,428 INFO [train.py:1028] (1/2) Epoch 23, batch 9250, loss[loss=0.1993, simple_loss=0.2673, pruned_loss=0.06564, over 13159.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2757, pruned_loss=0.07686, over 2574666.42 frames. ], batch size: 67, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:36:59,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=425016.1666666667, ans=0.125
2024-06-21 17:37:00,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=425016.1666666667, ans=0.125
2024-06-21 17:37:00,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0
2024-06-21 17:37:12,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425052.8333333333, ans=0.1
2024-06-21 17:37:12,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425052.8333333333, ans=0.1
2024-06-21 17:37:18,226 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.24 vs. limit=12.0
2024-06-21 17:37:24,474 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.269e+02 2.451e+02 2.577e+02 3.290e+02, threshold=4.903e+02, percent-clipped=0.0
2024-06-21 17:37:29,673 INFO [train.py:1028] (1/2) Epoch 23, batch 9300, loss[loss=0.1854, simple_loss=0.2503, pruned_loss=0.06022, over 12949.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2763, pruned_loss=0.07698, over 2571641.68 frames. ], batch size: 39, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:37:48,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=425162.8333333333, ans=0.0
2024-06-21 17:37:53,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=425162.8333333333, ans=0.125
2024-06-21 17:38:01,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.96 vs. limit=22.5
2024-06-21 17:38:01,555 INFO [train.py:1028] (1/2) Epoch 23, batch 9350, loss[loss=0.2111, simple_loss=0.2824, pruned_loss=0.0699, over 12688.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2767, pruned_loss=0.07708, over 2569187.83 frames. ], batch size: 22, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:38:10,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=425217.8333333333, ans=0.125
2024-06-21 17:38:11,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=425217.8333333333, ans=0.125
2024-06-21 17:38:28,759 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0
2024-06-21 17:38:29,410 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.296e+02 2.428e+02 2.618e+02 3.615e+02, threshold=4.857e+02, percent-clipped=0.0
2024-06-21 17:38:30,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=425272.8333333333, ans=0.125
2024-06-21 17:38:30,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=425272.8333333333, ans=0.0
2024-06-21 17:38:34,223 INFO [train.py:1028] (1/2) Epoch 23, batch 9400, loss[loss=0.2436, simple_loss=0.3065, pruned_loss=0.09033, over 13231.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2775, pruned_loss=0.07742, over 2569938.45 frames. ], batch size: 52, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:38:34,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=425291.1666666667, ans=0.125
2024-06-21 17:38:46,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=425327.8333333333, ans=0.125
2024-06-21 17:38:47,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=425327.8333333333, ans=0.0
2024-06-21 17:39:00,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=425346.1666666667, ans=0.125
2024-06-21 17:39:04,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=425364.5, ans=0.1
2024-06-21 17:39:04,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=425364.5, ans=0.125
2024-06-21 17:39:07,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=425364.5, ans=0.07
2024-06-21 17:39:09,992 INFO [train.py:1028] (1/2) Epoch 23, batch 9450, loss[loss=0.2433, simple_loss=0.3002, pruned_loss=0.09322, over 12463.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2781, pruned_loss=0.07787, over 2569214.12 frames. ], batch size: 22, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:39:13,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=425382.8333333333, ans=0.95
2024-06-21 17:39:27,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.75 vs. limit=15.0
2024-06-21 17:39:31,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=425437.8333333333, ans=0.0
2024-06-21 17:39:32,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=425437.8333333333, ans=0.02
2024-06-21 17:39:34,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=425456.1666666667, ans=0.125
2024-06-21 17:39:35,538 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.271e+02 2.413e+02 2.592e+02 3.343e+02, threshold=4.826e+02, percent-clipped=0.0
2024-06-21 17:39:39,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425474.5, ans=0.1
2024-06-21 17:39:40,427 INFO [train.py:1028] (1/2) Epoch 23, batch 9500, loss[loss=0.2069, simple_loss=0.2704, pruned_loss=0.07174, over 13185.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2778, pruned_loss=0.07746, over 2577947.47 frames. ], batch size: 43, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:39:45,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=425474.5, ans=0.0
2024-06-21 17:39:52,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.34 vs. limit=22.5
2024-06-21 17:39:56,774 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0
2024-06-21 17:40:13,281 INFO [train.py:1028] (1/2) Epoch 23, batch 9550, loss[loss=0.193, simple_loss=0.255, pruned_loss=0.06545, over 12955.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2773, pruned_loss=0.07739, over 2573914.21 frames. ], batch size: 39, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:40:31,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425621.1666666667, ans=0.1
2024-06-21 17:40:32,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=425621.1666666667, ans=0.125
2024-06-21 17:40:33,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=425621.1666666667, ans=0.125
2024-06-21 17:40:39,093 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 2.265e+02 2.426e+02 2.648e+02 3.442e+02, threshold=4.853e+02, percent-clipped=0.0
2024-06-21 17:40:41,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=425639.5, ans=0.0
2024-06-21 17:40:44,249 INFO [train.py:1028] (1/2) Epoch 23, batch 9600, loss[loss=0.2234, simple_loss=0.274, pruned_loss=0.08642, over 10572.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2768, pruned_loss=0.07748, over 2570971.55 frames. ], batch size: 304, lr: 2.49e-03, grad_scale: 32.0
], batch size: 304, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:41:07,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425712.8333333333, ans=0.1 2024-06-21 17:41:14,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=425731.1666666667, ans=0.125 2024-06-21 17:41:16,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425731.1666666667, ans=0.1 2024-06-21 17:41:17,017 INFO [train.py:1028] (1/2) Epoch 23, batch 9650, loss[loss=0.2027, simple_loss=0.2536, pruned_loss=0.07591, over 13107.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2771, pruned_loss=0.07826, over 2561823.46 frames. ], batch size: 132, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:41:22,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=425767.8333333333, ans=0.025 2024-06-21 17:41:22,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=425767.8333333333, ans=0.05 2024-06-21 17:41:26,956 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:41:43,029 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.324e+02 2.526e+02 2.748e+02 4.281e+02, threshold=5.052e+02, percent-clipped=0.0 2024-06-21 17:41:44,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.67 vs. limit=22.5 2024-06-21 17:41:46,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425822.8333333333, ans=0.1 2024-06-21 17:41:46,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=425822.8333333333, ans=0.125 2024-06-21 17:41:46,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=425822.8333333333, ans=0.2 2024-06-21 17:41:47,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=425841.1666666667, ans=0.125 2024-06-21 17:41:47,892 INFO [train.py:1028] (1/2) Epoch 23, batch 9700, loss[loss=0.2071, simple_loss=0.2622, pruned_loss=0.07604, over 13019.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2762, pruned_loss=0.07795, over 2556601.06 frames. ], batch size: 144, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:41:59,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=425859.5, ans=0.125 2024-06-21 17:42:10,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=425896.1666666667, ans=0.125 2024-06-21 17:42:15,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0 2024-06-21 17:42:20,186 INFO [train.py:1028] (1/2) Epoch 23, batch 9750, loss[loss=0.22, simple_loss=0.2692, pruned_loss=0.08546, over 13088.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2749, pruned_loss=0.07734, over 2552695.38 frames. 
], batch size: 132, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:42:28,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=425951.1666666667, ans=0.2 2024-06-21 17:42:31,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=425951.1666666667, ans=0.2 2024-06-21 17:42:42,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=425987.8333333333, ans=0.2 2024-06-21 17:42:44,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.41 vs. limit=15.0 2024-06-21 17:42:46,365 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.276e+02 2.368e+02 2.518e+02 3.119e+02, threshold=4.736e+02, percent-clipped=0.0 2024-06-21 17:42:51,377 INFO [train.py:1028] (1/2) Epoch 23, batch 9800, loss[loss=0.1999, simple_loss=0.2606, pruned_loss=0.06961, over 12925.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2744, pruned_loss=0.07685, over 2546424.03 frames. ], batch size: 39, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:43:02,948 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.66 vs. limit=10.0 2024-06-21 17:43:04,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=426061.1666666667, ans=0.0 2024-06-21 17:43:05,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=426061.1666666667, ans=0.125 2024-06-21 17:43:07,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=426061.1666666667, ans=0.0 2024-06-21 17:43:22,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.39 vs. limit=22.5 2024-06-21 17:43:23,860 INFO [train.py:1028] (1/2) Epoch 23, batch 9850, loss[loss=0.2138, simple_loss=0.2642, pruned_loss=0.08171, over 13006.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2736, pruned_loss=0.07664, over 2537705.52 frames. ], batch size: 102, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:43:23,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=426116.1666666667, ans=0.0 2024-06-21 17:43:23,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426116.1666666667, ans=0.1 2024-06-21 17:43:50,158 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.280e+02 2.409e+02 2.617e+02 3.405e+02, threshold=4.817e+02, percent-clipped=0.0 2024-06-21 17:43:54,964 INFO [train.py:1028] (1/2) Epoch 23, batch 9900, loss[loss=0.2055, simple_loss=0.2672, pruned_loss=0.0719, over 12936.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.273, pruned_loss=0.0768, over 2530666.16 frames. 
2024-06-21 17:44:02,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=426226.1666666667, ans=0.0
2024-06-21 17:44:03,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=426226.1666666667, ans=0.0
2024-06-21 17:44:06,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=426226.1666666667, ans=0.125
2024-06-21 17:44:11,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=426244.5, ans=0.125
2024-06-21 17:44:14,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=426262.8333333333, ans=0.0
2024-06-21 17:44:18,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.73 vs. limit=22.5
2024-06-21 17:44:21,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=426281.1666666667, ans=0.2
2024-06-21 17:44:23,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=426281.1666666667, ans=0.025
2024-06-21 17:44:23,979 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.00 vs. limit=15.0
2024-06-21 17:44:26,865 INFO [train.py:1028] (1/2) Epoch 23, batch 9950, loss[loss=0.1936, simple_loss=0.2566, pruned_loss=0.06525, over 12612.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2719, pruned_loss=0.07678, over 2523214.23 frames. ], batch size: 29, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:44:28,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=426299.5, ans=0.125
2024-06-21 17:44:29,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=426299.5, ans=0.125
2024-06-21 17:44:36,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0
2024-06-21 17:44:41,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=426336.1666666667, ans=0.1
2024-06-21 17:44:47,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=426354.5, ans=0.125
2024-06-21 17:44:53,927 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.277e+02 2.439e+02 2.633e+02 3.435e+02, threshold=4.879e+02, percent-clipped=0.0
2024-06-21 17:44:55,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=426372.8333333333, ans=0.125
2024-06-21 17:44:58,886 INFO [train.py:1028] (1/2) Epoch 23, batch 10000, loss[loss=0.2366, simple_loss=0.2932, pruned_loss=0.08999, over 12421.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2722, pruned_loss=0.07748, over 2486133.32 frames. ], batch size: 22, lr: 2.49e-03, grad_scale: 32.0
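
Each train.py:1028 record pairs the current batch's loss (loss[... over N frames]) with a running frame-weighted average (tot_loss[... over M frames]). Below is a sketch of the plain frame-weighted version; the fractional frame counts in the log suggest some decay or smoothing may also be applied on top of simple accumulation.

# Sketch of the frame-weighted running average behind the
# "tot_loss[... over M frames]" entries. Plain accumulation only;
# any decay/smoothing hinted at by the fractional counts is omitted.
class RunningLoss:
    def __init__(self) -> None:
        self.weighted_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        # each batch contributes in proportion to its frame count
        self.weighted_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def value(self) -> float:
        return self.weighted_sum / max(self.frames, 1.0)

tot = RunningLoss()
tot.update(0.2366, 12421.0)  # the batch 10000 loss[...] entry above
print(f"tot_loss[loss={tot.value:.4g}, over {tot.frames:.2f} frames.]")
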
2024-06-21 17:45:17,390 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 17:45:19,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=426446.1666666667, ans=0.05
2024-06-21 17:45:21,870 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.92 vs. limit=10.0
2024-06-21 17:45:22,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=426446.1666666667, ans=0.05
2024-06-21 17:45:29,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=426464.5, ans=0.0
2024-06-21 17:45:31,344 INFO [train.py:1028] (1/2) Epoch 23, batch 10050, loss[loss=0.2199, simple_loss=0.2845, pruned_loss=0.07763, over 12482.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2731, pruned_loss=0.07823, over 2444555.26 frames. ], batch size: 22, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:45:51,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=426537.8333333333, ans=0.125
2024-06-21 17:45:51,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426537.8333333333, ans=0.1
2024-06-21 17:45:56,638 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.314e+02 2.443e+02 2.672e+02 4.002e+02, threshold=4.887e+02, percent-clipped=0.0
2024-06-21 17:46:01,840 INFO [train.py:1028] (1/2) Epoch 23, batch 10100, loss[loss=0.2058, simple_loss=0.2776, pruned_loss=0.06696, over 10860.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2724, pruned_loss=0.07751, over 2425689.18 frames. ], batch size: 16, lr: 2.49e-03, grad_scale: 32.0
2024-06-21 17:46:01,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=426574.5, ans=0.1
2024-06-21 17:46:03,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=426574.5, ans=0.1
2024-06-21 17:48:14,255 INFO [train.py:1028] (1/2) Epoch 24, batch 0, loss[loss=0.1647, simple_loss=0.2246, pruned_loss=0.05236, over 12966.00 frames. ], tot_loss[loss=0.1647, simple_loss=0.2246, pruned_loss=0.05236, over 12966.00 frames. ], batch size: 36, lr: 2.44e-03, grad_scale: 32.0
2024-06-21 17:48:14,256 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 17:48:21,353 INFO [train.py:1060] (1/2) Epoch 24, validation: loss=0.189, simple_loss=0.252, pruned_loss=0.06296, over 351949.00 frames.
2024-06-21 17:48:21,353 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 17:48:55,163 INFO [train.py:1028] (1/2) Epoch 24, batch 50, loss[loss=0.1891, simple_loss=0.2563, pruned_loss=0.06094, over 12594.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2547, pruned_loss=0.07045, over 574314.81 frames. ], batch size: 29, lr: 2.44e-03, grad_scale: 32.0
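
At the epoch 24 boundary above, the "Computing validation loss" step runs the model over the full dev set (351949.00 frames) before training resumes. A frame-weighted sketch of that pass follows; the batch keys and the model call are assumptions for illustration, not train.py's exact signatures.

import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader, device):
    """Frame-weighted average loss over the whole dev set, as in the
    'Epoch 24, validation: ...' record above. Batch layout and the
    model call here are assumed for illustration."""
    model.eval()
    total, frames = 0.0, 0.0
    for batch in dev_loader:
        feats = batch["inputs"].to(device)
        # assume the model returns (loss, num_frames) for a batch
        loss, num_frames = model(feats, batch["supervisions"])
        total += loss.item() * num_frames
        frames += num_frames
    model.train()
    return total / frames, frames
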
2024-06-21 17:49:02,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=426715.6666666667, ans=0.025
2024-06-21 17:49:02,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=426715.6666666667, ans=0.1
2024-06-21 17:49:10,706 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.162e+02 2.257e+02 2.402e+02 2.871e+02, threshold=4.515e+02, percent-clipped=0.0
2024-06-21 17:49:14,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=426752.3333333333, ans=0.125
2024-06-21 17:49:16,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=426752.3333333333, ans=0.2
2024-06-21 17:49:29,104 INFO [train.py:1028] (1/2) Epoch 24, batch 100, loss[loss=0.1846, simple_loss=0.2526, pruned_loss=0.05827, over 13293.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2534, pruned_loss=0.07037, over 1017978.71 frames. ], batch size: 46, lr: 2.44e-03, grad_scale: 32.0
2024-06-21 17:49:34,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=426789.0, ans=0.0
2024-06-21 17:49:51,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=426825.6666666667, ans=0.0
2024-06-21 17:49:55,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=426844.0, ans=0.0
2024-06-21 17:50:02,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=426862.3333333333, ans=0.125
2024-06-21 17:50:05,784 INFO [train.py:1028] (1/2) Epoch 24, batch 150, loss[loss=0.2021, simple_loss=0.2603, pruned_loss=0.07195, over 12572.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2538, pruned_loss=0.06971, over 1366446.50 frames. ], batch size: 29, lr: 2.44e-03, grad_scale: 32.0
2024-06-21 17:50:06,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.71 vs. limit=22.5
2024-06-21 17:50:10,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=426880.6666666667, ans=0.09899494936611666
2024-06-21 17:50:21,744 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.187e+02 2.331e+02 2.568e+02 3.088e+02, threshold=4.663e+02, percent-clipped=0.0
2024-06-21 17:50:23,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=426917.3333333333, ans=0.1
2024-06-21 17:50:26,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=426935.6666666667, ans=0.1
2024-06-21 17:50:29,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=426935.6666666667, ans=0.125
2024-06-21 17:50:30,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=426954.0, ans=0.125
2024-06-21 17:50:34,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=426954.0, ans=0.125
2024-06-21 17:50:37,820 INFO [train.py:1028] (1/2) Epoch 24, batch 200, loss[loss=0.204, simple_loss=0.2562, pruned_loss=0.07588, over 12520.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2546, pruned_loss=0.07008, over 1636439.03 frames. ], batch size: 202, lr: 2.44e-03, grad_scale: 32.0
2024-06-21 17:50:37,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=426972.3333333333, ans=0.125
2024-06-21 17:51:05,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.28 vs. limit=15.0
2024-06-21 17:51:09,669 INFO [train.py:1028] (1/2) Epoch 24, batch 250, loss[loss=0.1924, simple_loss=0.2421, pruned_loss=0.07134, over 13013.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2545, pruned_loss=0.06994, over 1847214.74 frames. ], batch size: 144, lr: 2.44e-03, grad_scale: 32.0
2024-06-21 17:51:14,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=427064.0, ans=0.125
2024-06-21 17:51:17,278 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.00 vs. limit=22.5
2024-06-21 17:51:26,390 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.204e+02 2.311e+02 2.506e+02 3.068e+02, threshold=4.622e+02, percent-clipped=0.0
2024-06-21 17:51:48,455 INFO [train.py:1028] (1/2) Epoch 24, batch 300, loss[loss=0.188, simple_loss=0.2458, pruned_loss=0.06513, over 13163.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2548, pruned_loss=0.0701, over 2010322.26 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 32.0
2024-06-21 17:51:52,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.57 vs. limit=15.0
2024-06-21 17:51:55,721 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=15.0
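
The Whitening lines fire when a module's feature-covariance "metric" crosses its limit; such a metric is 1.0 for perfectly whitened (scaled-identity-covariance) activations and grows as channels become correlated. One plausible way to compute it, not necessarily scaling.py's exact formula:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """How far the per-group feature covariance is from a scaled identity:
    1.0 when perfectly whitened, larger when channels are correlated.
    A plausible reconstruction, not necessarily scaling.py's formula."""
    n, c = x.shape
    assert c % num_groups == 0
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * (c // num_groups):(g + 1) * (c // num_groups)]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = (xg.T @ xg) / n
        d = cov.shape[0]
        # d * trace(cov @ cov) / trace(cov)^2 == 1 iff cov = scalar * I
        metrics.append((d * (cov * cov).sum() / cov.trace() ** 2).item())
    return max(metrics)

x = torch.randn(1000, 256) @ torch.randn(256, 256)  # correlated channels
print(f"metric={whitening_metric(x):.2f} vs. limit=22.5")
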
2024-06-21 17:51:55,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=427174.0, ans=0.0
2024-06-21 17:51:56,376 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0
2024-06-21 17:51:57,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427174.0, ans=0.1
2024-06-21 17:51:59,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=427174.0, ans=0.125
2024-06-21 17:52:04,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=427192.3333333333, ans=0.025
2024-06-21 17:52:04,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427192.3333333333, ans=0.1
2024-06-21 17:52:06,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=427192.3333333333, ans=0.0
2024-06-21 17:52:06,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=15.0
2024-06-21 17:52:20,307 INFO [train.py:1028] (1/2) Epoch 24, batch 350, loss[loss=0.208, simple_loss=0.2691, pruned_loss=0.07345, over 12994.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2533, pruned_loss=0.06922, over 2139355.53 frames. ], batch size: 33, lr: 2.43e-03, grad_scale: 32.0
2024-06-21 17:52:21,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=427247.3333333333, ans=0.2
2024-06-21 17:52:33,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=427284.0, ans=0.0
2024-06-21 17:52:34,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0
2024-06-21 17:52:36,322 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.136e+02 2.248e+02 2.490e+02 3.097e+02, threshold=4.495e+02, percent-clipped=0.0
2024-06-21 17:52:49,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427320.6666666667, ans=0.1
2024-06-21 17:52:52,007 INFO [train.py:1028] (1/2) Epoch 24, batch 400, loss[loss=0.191, simple_loss=0.2515, pruned_loss=0.06526, over 13223.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2532, pruned_loss=0.06894, over 2239871.72 frames. ], batch size: 63, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:52:56,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=427339.0, ans=0.125
2024-06-21 17:53:01,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=15.0
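
The jump from grad_scale: 32.0 to grad_scale: 64.0 at batch 400 above (and later to 128.0) is the fp16 dynamic loss scale being doubled after a stretch of overflow-free steps. PyTorch's stock scaler behaves the same way; the arguments below are its documented knobs with illustrative values, not necessarily the recipe's.

import torch

# Dynamic fp16 loss scaling consistent with the grad_scale progression
# in the log (32 -> 64 -> 128): grow when stable, back off on inf/nan.
scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,       # matches the grad_scale seen early on
    growth_factor=2.0,     # double the scale when training is stable
    backoff_factor=0.5,    # halve it when gradients overflow
    growth_interval=2000,  # overflow-free steps required before growing
)

# Typical training step:
#   with torch.cuda.amp.autocast():
#       loss = compute_loss(batch)   # compute_loss is a placeholder
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()  # this is where the scale grows or backs off
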
2024-06-21 17:53:04,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=427357.3333333333, ans=10.0
2024-06-21 17:53:06,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=427375.6666666667, ans=0.125
2024-06-21 17:53:14,303 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=12.0
2024-06-21 17:53:21,339 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.84 vs. limit=22.5
2024-06-21 17:53:23,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=427412.3333333333, ans=0.2
2024-06-21 17:53:24,129 INFO [train.py:1028] (1/2) Epoch 24, batch 450, loss[loss=0.1811, simple_loss=0.2431, pruned_loss=0.05956, over 13204.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2529, pruned_loss=0.06873, over 2313969.06 frames. ], batch size: 67, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:53:31,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.28 vs. limit=15.0
2024-06-21 17:53:40,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427467.3333333333, ans=0.1
2024-06-21 17:53:43,025 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.139e+02 2.272e+02 2.402e+02 2.964e+02, threshold=4.544e+02, percent-clipped=0.0
2024-06-21 17:53:50,422 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.39 vs. limit=6.0
2024-06-21 17:53:51,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427485.6666666667, ans=0.1
2024-06-21 17:53:58,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=427504.0, ans=0.125
2024-06-21 17:54:00,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=427504.0, ans=0.125
2024-06-21 17:54:02,095 INFO [train.py:1028] (1/2) Epoch 24, batch 500, loss[loss=0.1883, simple_loss=0.2462, pruned_loss=0.06521, over 13108.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2532, pruned_loss=0.06879, over 2375550.72 frames. ], batch size: 121, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:54:11,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=427540.6666666667, ans=0.125
2024-06-21 17:54:23,219 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.80 vs. limit=22.5
2024-06-21 17:54:23,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=427577.3333333333, ans=0.02
2024-06-21 17:54:28,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=427595.6666666667, ans=0.125
2024-06-21 17:54:34,233 INFO [train.py:1028] (1/2) Epoch 24, batch 550, loss[loss=0.1957, simple_loss=0.2456, pruned_loss=0.07291, over 12951.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2532, pruned_loss=0.06891, over 2420272.28 frames. ], batch size: 158, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:54:49,969 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.213e+02 2.331e+02 2.537e+02 3.094e+02, threshold=4.661e+02, percent-clipped=0.0
2024-06-21 17:54:54,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=427669.0, ans=0.125
2024-06-21 17:54:55,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=427669.0, ans=0.125
2024-06-21 17:55:03,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=427687.3333333333, ans=0.5
2024-06-21 17:55:05,773 INFO [train.py:1028] (1/2) Epoch 24, batch 600, loss[loss=0.1829, simple_loss=0.2328, pruned_loss=0.06653, over 13049.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.253, pruned_loss=0.06865, over 2459051.26 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:55:13,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=427724.0, ans=0.0
2024-06-21 17:55:35,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=427779.0, ans=0.0
2024-06-21 17:55:41,207 INFO [train.py:1028] (1/2) Epoch 24, batch 650, loss[loss=0.1865, simple_loss=0.2553, pruned_loss=0.05884, over 13185.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.253, pruned_loss=0.06823, over 2490020.59 frames. ], batch size: 59, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:55:57,076 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.080e+02 2.204e+02 2.343e+02 2.882e+02, threshold=4.408e+02, percent-clipped=0.0
2024-06-21 17:56:05,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.88 vs. limit=15.0
2024-06-21 17:56:08,327 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5
2024-06-21 17:56:11,627 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.45 vs. limit=15.0
2024-06-21 17:56:11,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=427870.6666666667, ans=0.125
2024-06-21 17:56:15,668 INFO [train.py:1028] (1/2) Epoch 24, batch 700, loss[loss=0.1835, simple_loss=0.2496, pruned_loss=0.05877, over 13291.00 frames. ], tot_loss[loss=0.194, simple_loss=0.252, pruned_loss=0.06803, over 2512529.34 frames. ], batch size: 46, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:56:23,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=427907.3333333333, ans=0.125
2024-06-21 17:56:30,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=427925.6666666667, ans=0.035
2024-06-21 17:56:31,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=427925.6666666667, ans=0.125
2024-06-21 17:56:48,134 INFO [train.py:1028] (1/2) Epoch 24, batch 750, loss[loss=0.1886, simple_loss=0.2512, pruned_loss=0.06306, over 13250.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2522, pruned_loss=0.06779, over 2528344.37 frames. ], batch size: 63, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:56:49,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.98 vs. limit=15.0
2024-06-21 17:56:56,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=427999.0, ans=0.0
2024-06-21 17:56:59,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=427999.0, ans=0.0
2024-06-21 17:57:03,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=428017.3333333333, ans=0.2
2024-06-21 17:57:04,496 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.150e+02 2.249e+02 2.402e+02 2.853e+02, threshold=4.498e+02, percent-clipped=0.0
2024-06-21 17:57:20,428 INFO [train.py:1028] (1/2) Epoch 24, batch 800, loss[loss=0.2075, simple_loss=0.2616, pruned_loss=0.0767, over 12946.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2522, pruned_loss=0.06802, over 2541362.26 frames. ], batch size: 36, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:57:25,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=428072.3333333333, ans=0.0
2024-06-21 17:57:27,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=428090.6666666667, ans=0.125
2024-06-21 17:57:27,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=428090.6666666667, ans=0.125
2024-06-21 17:57:37,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428109.0, ans=0.1
2024-06-21 17:57:37,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=428109.0, ans=0.125
2024-06-21 17:57:41,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=428127.3333333333, ans=0.0
2024-06-21 17:57:42,031 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.14 vs. limit=10.0
2024-06-21 17:57:42,783 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=8.0
2024-06-21 17:57:50,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428145.6666666667, ans=0.1
2024-06-21 17:57:54,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=428145.6666666667, ans=0.125
2024-06-21 17:57:56,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.32 vs. limit=12.0
2024-06-21 17:57:56,427 INFO [train.py:1028] (1/2) Epoch 24, batch 850, loss[loss=0.1887, simple_loss=0.2427, pruned_loss=0.06736, over 13116.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2517, pruned_loss=0.06772, over 2552069.34 frames. ], batch size: 95, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:57:58,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=428164.0, ans=0.0
2024-06-21 17:58:03,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=428182.3333333333, ans=0.2
2024-06-21 17:58:15,214 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.188e+02 2.338e+02 2.577e+02 3.264e+02, threshold=4.675e+02, percent-clipped=0.0
2024-06-21 17:58:17,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=428200.6666666667, ans=0.0
2024-06-21 17:58:18,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=428219.0, ans=0.07
2024-06-21 17:58:28,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=428237.3333333333, ans=0.125
2024-06-21 17:58:31,199 INFO [train.py:1028] (1/2) Epoch 24, batch 900, loss[loss=0.1826, simple_loss=0.2427, pruned_loss=0.06122, over 12964.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.252, pruned_loss=0.06809, over 2557177.71 frames. ], batch size: 36, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:58:59,749 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 17:59:03,573 INFO [train.py:1028] (1/2) Epoch 24, batch 950, loss[loss=0.169, simple_loss=0.2334, pruned_loss=0.05233, over 12935.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.252, pruned_loss=0.06785, over 2561103.37 frames. ], batch size: 39, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 17:59:04,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=428347.3333333333, ans=0.125
2024-06-21 17:59:07,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428347.3333333333, ans=0.1
2024-06-21 17:59:09,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=428365.6666666667, ans=0.125
2024-06-21 17:59:19,416 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.193e+02 2.334e+02 2.505e+02 3.278e+02, threshold=4.668e+02, percent-clipped=0.0
2024-06-21 17:59:34,718 INFO [train.py:1028] (1/2) Epoch 24, batch 1000, loss[loss=0.1957, simple_loss=0.258, pruned_loss=0.06673, over 12989.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2517, pruned_loss=0.06812, over 2563062.67 frames. ], batch size: 48, lr: 2.43e-03, grad_scale: 64.0
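
The WithLoss records track an auxiliary penalty attached to the attention weights; loss-sum=0.000e+00 means no penalty accrued in the reporting window. Below is a generic sketch of the underlying trick of piggybacking an extra loss on an intermediate tensor so it joins the main backward pass; scaling.py's actual mechanism may differ.

import torch

class AttachLoss(torch.autograd.Function):
    """Pass x through unchanged, but arrange for aux_loss to receive
    gradient 1.0 whenever x is backpropagated through, so the penalty's
    own graph joins the main backward pass."""

    @staticmethod
    def forward(ctx, x, aux_loss):
        ctx.save_for_backward(aux_loss)
        return x

    @staticmethod
    def backward(ctx, grad_out):
        (aux_loss,) = ctx.saved_tensors
        return grad_out, torch.ones_like(aux_loss)

attn = torch.rand(4, 16, 16, requires_grad=True)
penalty = (attn.sum(dim=-1) - 1.0).pow(2).mean()  # an illustrative penalty
out = AttachLoss.apply(attn, penalty)
out.mean().backward()  # backprops the main path and the penalty together
print(f"WithLoss: loss-sum={penalty.item():.3e}")
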
2024-06-21 17:59:46,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=428457.3333333333, ans=0.125
2024-06-21 17:59:51,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=428475.6666666667, ans=0.025
2024-06-21 17:59:53,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=428475.6666666667, ans=0.0
2024-06-21 17:59:57,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.32 vs. limit=22.5
2024-06-21 17:59:59,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=428494.0, ans=0.0
2024-06-21 18:00:09,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. limit=15.0
2024-06-21 18:00:14,762 INFO [train.py:1028] (1/2) Epoch 24, batch 1050, loss[loss=0.1837, simple_loss=0.2465, pruned_loss=0.06041, over 13212.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2524, pruned_loss=0.06831, over 2565980.16 frames. ], batch size: 77, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:00:19,960 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0
2024-06-21 18:00:30,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=428567.3333333333, ans=0.1
2024-06-21 18:00:31,196 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.118e+02 2.255e+02 2.442e+02 3.113e+02, threshold=4.509e+02, percent-clipped=0.0
2024-06-21 18:00:32,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=428567.3333333333, ans=0.125
2024-06-21 18:00:48,191 INFO [train.py:1028] (1/2) Epoch 24, batch 1100, loss[loss=0.1835, simple_loss=0.2465, pruned_loss=0.06025, over 13250.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2531, pruned_loss=0.06861, over 2570919.09 frames. ], batch size: 52, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:00:48,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=428622.3333333333, ans=0.0
2024-06-21 18:00:59,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=428640.6666666667, ans=0.025
2024-06-21 18:01:02,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=428659.0, ans=0.2
2024-06-21 18:01:12,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=428677.3333333333, ans=0.0
2024-06-21 18:01:21,322 INFO [train.py:1028] (1/2) Epoch 24, batch 1150, loss[loss=0.1947, simple_loss=0.2534, pruned_loss=0.06803, over 13234.00 frames. ], tot_loss[loss=0.195, simple_loss=0.253, pruned_loss=0.06853, over 2571493.26 frames. ], batch size: 52, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:01:23,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=428714.0, ans=0.0
2024-06-21 18:01:34,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=428750.6666666667, ans=0.125
2024-06-21 18:01:37,329 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.164e+02 2.379e+02 2.600e+02 3.386e+02, threshold=4.758e+02, percent-clipped=0.0
2024-06-21 18:01:38,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=428750.6666666667, ans=0.125
2024-06-21 18:01:46,225 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.36 vs. limit=22.5
2024-06-21 18:01:51,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=428787.3333333333, ans=0.0
2024-06-21 18:01:55,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=428787.3333333333, ans=0.125
2024-06-21 18:01:56,177 INFO [train.py:1028] (1/2) Epoch 24, batch 1200, loss[loss=0.1691, simple_loss=0.2273, pruned_loss=0.05539, over 13156.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.253, pruned_loss=0.06869, over 2574296.32 frames. ], batch size: 77, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:02:06,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=428824.0, ans=0.0
2024-06-21 18:02:30,772 INFO [train.py:1028] (1/2) Epoch 24, batch 1250, loss[loss=0.1832, simple_loss=0.2392, pruned_loss=0.06363, over 13139.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2531, pruned_loss=0.06866, over 2584327.93 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:02:31,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=428897.3333333333, ans=0.125
2024-06-21 18:02:31,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. limit=6.0
2024-06-21 18:02:46,794 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.135e+02 2.257e+02 2.419e+02 3.120e+02, threshold=4.515e+02, percent-clipped=0.0
2024-06-21 18:03:00,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=428970.6666666667, ans=0.0
2024-06-21 18:03:03,086 INFO [train.py:1028] (1/2) Epoch 24, batch 1300, loss[loss=0.2026, simple_loss=0.254, pruned_loss=0.07556, over 12815.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2538, pruned_loss=0.06896, over 2584199.36 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:03:04,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=428989.0, ans=0.125
2024-06-21 18:03:05,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=428989.0, ans=0.125
2024-06-21 18:03:07,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=428989.0, ans=0.0
2024-06-21 18:03:21,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=429025.6666666667, ans=0.2
2024-06-21 18:03:36,150 INFO [train.py:1028] (1/2) Epoch 24, batch 1350, loss[loss=0.1851, simple_loss=0.2465, pruned_loss=0.06189, over 13184.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2538, pruned_loss=0.06884, over 2586550.46 frames. ], batch size: 59, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:03:37,226 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0
2024-06-21 18:03:46,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=429099.0, ans=0.2
2024-06-21 18:03:51,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=429117.3333333333, ans=0.125
2024-06-21 18:03:55,690 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.126e+02 2.280e+02 2.421e+02 3.286e+02, threshold=4.559e+02, percent-clipped=0.0
2024-06-21 18:04:11,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.11 vs. limit=15.0
2024-06-21 18:04:11,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=429154.0, ans=22.5
2024-06-21 18:04:15,457 INFO [train.py:1028] (1/2) Epoch 24, batch 1400, loss[loss=0.1938, simple_loss=0.261, pruned_loss=0.06336, over 12907.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2537, pruned_loss=0.06883, over 2587382.99 frames. ], batch size: 26, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:04:17,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=429172.3333333333, ans=0.0
2024-06-21 18:04:24,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=429190.6666666667, ans=0.09899494936611666
2024-06-21 18:04:30,680 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.51 vs. limit=22.5
2024-06-21 18:04:31,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=429209.0, ans=0.125
2024-06-21 18:04:32,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=429209.0, ans=0.125
2024-06-21 18:04:34,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=429227.3333333333, ans=0.05
2024-06-21 18:04:36,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=15.0
2024-06-21 18:04:45,170 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 18:04:45,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=429245.6666666667, ans=0.02
2024-06-21 18:04:47,691 INFO [train.py:1028] (1/2) Epoch 24, batch 1450, loss[loss=0.1884, simple_loss=0.2411, pruned_loss=0.06783, over 13137.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2539, pruned_loss=0.06898, over 2587346.56 frames. ], batch size: 121, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:04:50,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=429264.0, ans=0.125
2024-06-21 18:04:53,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429264.0, ans=0.1
2024-06-21 18:04:58,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=429282.3333333333, ans=0.0
2024-06-21 18:05:03,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.51 vs. limit=15.0
2024-06-21 18:05:04,180 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.150e+02 2.242e+02 2.405e+02 2.827e+02, threshold=4.484e+02, percent-clipped=0.0
2024-06-21 18:05:04,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.75 vs. limit=6.0
2024-06-21 18:05:17,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=429337.3333333333, ans=0.125
2024-06-21 18:05:20,539 INFO [train.py:1028] (1/2) Epoch 24, batch 1500, loss[loss=0.2029, simple_loss=0.2587, pruned_loss=0.07354, over 13249.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2535, pruned_loss=0.06884, over 2589353.57 frames. ], batch size: 83, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:05:21,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=429355.6666666667, ans=0.125
2024-06-21 18:05:26,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=429374.0, ans=0.125
2024-06-21 18:05:36,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=429392.3333333333, ans=0.125
2024-06-21 18:05:52,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=429429.0, ans=0.0
2024-06-21 18:05:56,850 INFO [train.py:1028] (1/2) Epoch 24, batch 1550, loss[loss=0.1808, simple_loss=0.2324, pruned_loss=0.06458, over 13078.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2538, pruned_loss=0.06916, over 2584549.37 frames. ], batch size: 102, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:06:06,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.61 vs. limit=22.5
2024-06-21 18:06:16,388 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.208e+02 2.302e+02 2.469e+02 3.333e+02, threshold=4.604e+02, percent-clipped=0.0
2024-06-21 18:06:17,466 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0
2024-06-21 18:06:17,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=429484.0, ans=0.0
2024-06-21 18:06:23,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=429502.3333333333, ans=0.0
2024-06-21 18:06:33,063 INFO [train.py:1028] (1/2) Epoch 24, batch 1600, loss[loss=0.1795, simple_loss=0.2401, pruned_loss=0.0595, over 13109.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2541, pruned_loss=0.06945, over 2579988.76 frames. ], batch size: 77, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:06:35,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.05 vs. limit=6.0
2024-06-21 18:06:38,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.64 vs. limit=15.0
2024-06-21 18:06:41,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.43 vs. limit=22.5
2024-06-21 18:06:59,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=429612.3333333333, ans=0.125
2024-06-21 18:07:02,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=429612.3333333333, ans=0.0
2024-06-21 18:07:05,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=429630.6666666667, ans=15.0
2024-06-21 18:07:05,638 INFO [train.py:1028] (1/2) Epoch 24, batch 1650, loss[loss=0.1973, simple_loss=0.2512, pruned_loss=0.07169, over 13170.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2544, pruned_loss=0.06988, over 2576341.17 frames. ], batch size: 95, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:07:11,636 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.92 vs. limit=22.5
2024-06-21 18:07:19,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=429667.3333333333, ans=15.0
2024-06-21 18:07:21,608 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.179e+02 2.299e+02 2.474e+02 3.086e+02, threshold=4.598e+02, percent-clipped=0.0
2024-06-21 18:07:26,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=429685.6666666667, ans=0.0
2024-06-21 18:07:30,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=429685.6666666667, ans=0.125
2024-06-21 18:07:37,665 INFO [train.py:1028] (1/2) Epoch 24, batch 1700, loss[loss=0.2097, simple_loss=0.2719, pruned_loss=0.07373, over 12716.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2543, pruned_loss=0.06948, over 2581594.71 frames. ], batch size: 26, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:07:39,960 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0
2024-06-21 18:07:48,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429740.6666666667, ans=0.1
2024-06-21 18:08:01,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=429777.3333333333, ans=0.125
2024-06-21 18:08:06,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.22 vs. limit=15.0
2024-06-21 18:08:14,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=429814.0, ans=0.125
2024-06-21 18:08:15,273 INFO [train.py:1028] (1/2) Epoch 24, batch 1750, loss[loss=0.2053, simple_loss=0.2701, pruned_loss=0.07029, over 12654.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2536, pruned_loss=0.06922, over 2582123.74 frames. ], batch size: 22, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:08:18,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=429814.0, ans=0.125
2024-06-21 18:08:28,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=429850.6666666667, ans=0.125
2024-06-21 18:08:31,342 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.148e+02 2.317e+02 2.451e+02 4.030e+02, threshold=4.633e+02, percent-clipped=0.0
2024-06-21 18:08:36,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=429869.0, ans=0.04949747468305833
2024-06-21 18:08:39,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=429869.0, ans=0.125
2024-06-21 18:08:40,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=429887.3333333333, ans=0.1
2024-06-21 18:08:44,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=429887.3333333333, ans=0.125
2024-06-21 18:08:47,389 INFO [train.py:1028] (1/2) Epoch 24, batch 1800, loss[loss=0.1915, simple_loss=0.2553, pruned_loss=0.06381, over 13229.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2534, pruned_loss=0.06883, over 2582446.16 frames. ], batch size: 67, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:08:54,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=429924.0, ans=0.125
2024-06-21 18:09:10,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=429960.6666666667, ans=0.2
2024-06-21 18:09:16,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=429979.0, ans=0.2
2024-06-21 18:09:17,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=429979.0, ans=0.2
2024-06-21 18:09:19,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=429997.3333333333, ans=0.0
2024-06-21 18:09:20,247 INFO [train.py:1028] (1/2) Epoch 24, batch 1850, loss[loss=0.1983, simple_loss=0.2515, pruned_loss=0.07251, over 13233.00 frames. ], tot_loss[loss=0.195, simple_loss=0.253, pruned_loss=0.06851, over 2583828.44 frames. ], batch size: 83, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:09:22,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=429997.3333333333, ans=0.0
2024-06-21 18:09:31,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=430015.6666666667, ans=0.95
2024-06-21 18:09:31,968 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 18:09:36,134 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.152e+02 2.261e+02 2.429e+02 2.923e+02, threshold=4.523e+02, percent-clipped=0.0
2024-06-21 18:09:36,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=430034.0, ans=0.125
2024-06-21 18:09:52,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=430070.6666666667, ans=0.025
2024-06-21 18:09:56,667 INFO [train.py:1028] (1/2) Epoch 24, batch 1900, loss[loss=0.1957, simple_loss=0.2572, pruned_loss=0.06707, over 13128.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2527, pruned_loss=0.0685, over 2586755.47 frames. ], batch size: 95, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:10:08,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=430107.3333333333, ans=0.125
2024-06-21 18:10:11,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=430107.3333333333, ans=0.0
2024-06-21 18:10:14,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=430125.6666666667, ans=0.2
2024-06-21 18:10:14,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=430125.6666666667, ans=0.025
2024-06-21 18:10:21,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=430144.0, ans=0.125
2024-06-21 18:10:30,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=430162.3333333333, ans=0.0
2024-06-21 18:10:32,070 INFO [train.py:1028] (1/2) Epoch 24, batch 1950, loss[loss=0.1888, simple_loss=0.2475, pruned_loss=0.06507, over 13297.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2526, pruned_loss=0.06888, over 2592100.00 frames. ], batch size: 52, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:10:40,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.73 vs. limit=15.0
2024-06-21 18:10:45,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=430217.3333333333, ans=0.125
2024-06-21 18:10:48,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.210e+02 2.339e+02 2.462e+02 3.423e+02, threshold=4.679e+02, percent-clipped=0.0
2024-06-21 18:10:49,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=430217.3333333333, ans=0.0
2024-06-21 18:10:50,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=430235.6666666667, ans=0.125
2024-06-21 18:10:52,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=430235.6666666667, ans=0.0
2024-06-21 18:11:00,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=430254.0, ans=0.125
2024-06-21 18:11:04,160 INFO [train.py:1028] (1/2) Epoch 24, batch 2000, loss[loss=0.2064, simple_loss=0.2703, pruned_loss=0.07128, over 12623.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2532, pruned_loss=0.06936, over 2588650.19 frames. ], batch size: 22, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:11:08,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=430272.3333333333, ans=0.09899494936611666
2024-06-21 18:11:20,279 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0
2024-06-21 18:11:26,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=430327.3333333333, ans=0.0
2024-06-21 18:11:35,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=430345.6666666667, ans=0.0
2024-06-21 18:11:36,824 INFO [train.py:1028] (1/2) Epoch 24, batch 2050, loss[loss=0.2007, simple_loss=0.2688, pruned_loss=0.06628, over 12785.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2542, pruned_loss=0.0699, over 2583595.79 frames. ], batch size: 29, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:11:38,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=430364.0, ans=0.125
2024-06-21 18:11:46,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=430382.3333333333, ans=0.0
2024-06-21 18:11:48,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=430382.3333333333, ans=0.125
2024-06-21 18:11:55,984 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.195e+02 2.324e+02 2.549e+02 3.134e+02, threshold=4.648e+02, percent-clipped=0.0
2024-06-21 18:11:56,088 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 18:12:00,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=430400.6666666667, ans=0.125
2024-06-21 18:12:03,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=430419.0, ans=0.125
2024-06-21 18:12:04,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.04 vs. limit=15.0
2024-06-21 18:12:14,702 INFO [train.py:1028] (1/2) Epoch 24, batch 2100, loss[loss=0.1869, simple_loss=0.25, pruned_loss=0.06192, over 13192.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2535, pruned_loss=0.06934, over 2586203.30 frames. ], batch size: 59, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:12:15,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=430455.6666666667, ans=0.125
2024-06-21 18:12:33,824 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 18:12:42,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=430529.0, ans=0.125
2024-06-21 18:12:43,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=430529.0, ans=0.1
2024-06-21 18:12:47,440 INFO [train.py:1028] (1/2) Epoch 24, batch 2150, loss[loss=0.1784, simple_loss=0.2389, pruned_loss=0.05898, over 13221.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2534, pruned_loss=0.06887, over 2589341.45 frames. ], batch size: 52, lr: 2.43e-03, grad_scale: 64.0
2024-06-21 18:12:55,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430565.6666666667, ans=0.1
2024-06-21 18:13:03,662 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.150e+02 2.252e+02 2.394e+02 3.073e+02, threshold=4.504e+02, percent-clipped=0.0
2024-06-21 18:13:11,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430602.3333333333, ans=0.1
2024-06-21 18:13:17,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0
limit=15.0 2024-06-21 18:13:17,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=430620.6666666667, ans=0.2 2024-06-21 18:13:20,287 INFO [train.py:1028] (1/2) Epoch 24, batch 2200, loss[loss=0.2267, simple_loss=0.2804, pruned_loss=0.08653, over 13170.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2542, pruned_loss=0.06959, over 2588543.79 frames. ], batch size: 83, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:13:26,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=430657.3333333333, ans=0.0 2024-06-21 18:13:35,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=430675.6666666667, ans=0.125 2024-06-21 18:13:38,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=430675.6666666667, ans=0.125 2024-06-21 18:13:41,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=430694.0, ans=0.0 2024-06-21 18:13:57,054 INFO [train.py:1028] (1/2) Epoch 24, batch 2250, loss[loss=0.1942, simple_loss=0.2565, pruned_loss=0.06596, over 13256.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2545, pruned_loss=0.06979, over 2586885.41 frames. ], batch size: 63, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:14:00,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=430730.6666666667, ans=0.125 2024-06-21 18:14:06,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=430749.0, ans=0.0 2024-06-21 18:14:16,315 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.186e+02 2.359e+02 2.509e+02 3.108e+02, threshold=4.718e+02, percent-clipped=0.0 2024-06-21 18:14:20,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=430785.6666666667, ans=0.025 2024-06-21 18:14:24,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430785.6666666667, ans=0.1 2024-06-21 18:14:32,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=430804.0, ans=0.125 2024-06-21 18:14:33,304 INFO [train.py:1028] (1/2) Epoch 24, batch 2300, loss[loss=0.1896, simple_loss=0.2468, pruned_loss=0.06615, over 12938.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2542, pruned_loss=0.06984, over 2581118.17 frames. 
], batch size: 33, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:14:39,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=430840.6666666667, ans=0.125 2024-06-21 18:14:45,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=430840.6666666667, ans=0.125 2024-06-21 18:14:54,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=430877.3333333333, ans=0.125 2024-06-21 18:15:00,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=430895.6666666667, ans=0.125 2024-06-21 18:15:06,396 INFO [train.py:1028] (1/2) Epoch 24, batch 2350, loss[loss=0.1926, simple_loss=0.253, pruned_loss=0.06612, over 13218.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2547, pruned_loss=0.07008, over 2584544.25 frames. ], batch size: 67, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:15:11,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=430914.0, ans=0.125 2024-06-21 18:15:22,957 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.173e+02 2.304e+02 2.444e+02 3.454e+02, threshold=4.608e+02, percent-clipped=0.0 2024-06-21 18:15:39,747 INFO [train.py:1028] (1/2) Epoch 24, batch 2400, loss[loss=0.2084, simple_loss=0.2691, pruned_loss=0.0738, over 13289.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2538, pruned_loss=0.06989, over 2587021.54 frames. ], batch size: 46, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:15:41,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=431005.6666666667, ans=0.2 2024-06-21 18:15:50,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2024-06-21 18:15:54,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.17 vs. limit=15.0 2024-06-21 18:15:54,988 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.84 vs. limit=15.0 2024-06-21 18:16:13,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=431079.0, ans=0.125 2024-06-21 18:16:15,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=431079.0, ans=0.125 2024-06-21 18:16:15,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=431079.0, ans=0.0 2024-06-21 18:16:19,189 INFO [train.py:1028] (1/2) Epoch 24, batch 2450, loss[loss=0.1835, simple_loss=0.2413, pruned_loss=0.06287, over 13260.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2533, pruned_loss=0.06992, over 2584508.85 frames. ], batch size: 63, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:16:20,892 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.20 vs. 
limit=15.0 2024-06-21 18:16:31,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=431134.0, ans=0.125 2024-06-21 18:16:34,881 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.173e+02 2.332e+02 2.539e+02 3.140e+02, threshold=4.664e+02, percent-clipped=0.0 2024-06-21 18:16:37,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=431134.0, ans=0.125 2024-06-21 18:16:43,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=431152.3333333333, ans=0.125 2024-06-21 18:16:51,289 INFO [train.py:1028] (1/2) Epoch 24, batch 2500, loss[loss=0.1801, simple_loss=0.2365, pruned_loss=0.06182, over 13153.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2522, pruned_loss=0.06936, over 2586416.35 frames. ], batch size: 83, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:16:54,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=431189.0, ans=0.125 2024-06-21 18:16:57,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=431207.3333333333, ans=0.125 2024-06-21 18:16:58,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=431207.3333333333, ans=0.125 2024-06-21 18:17:02,440 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:17:07,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.72 vs. limit=15.0 2024-06-21 18:17:11,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=431244.0, ans=0.07 2024-06-21 18:17:13,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=431244.0, ans=0.0 2024-06-21 18:17:13,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=431244.0, ans=0.125 2024-06-21 18:17:23,637 INFO [train.py:1028] (1/2) Epoch 24, batch 2550, loss[loss=0.209, simple_loss=0.2617, pruned_loss=0.0781, over 12661.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2513, pruned_loss=0.0692, over 2585546.14 frames. 
], batch size: 22, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:17:28,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=431280.6666666667, ans=0.09899494936611666 2024-06-21 18:17:31,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=431299.0, ans=0.125 2024-06-21 18:17:33,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=431299.0, ans=10.0 2024-06-21 18:17:34,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=431299.0, ans=0.0 2024-06-21 18:17:42,296 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.135e+02 2.254e+02 2.392e+02 3.018e+02, threshold=4.509e+02, percent-clipped=0.0 2024-06-21 18:17:46,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0 2024-06-21 18:17:59,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=431354.0, ans=0.0 2024-06-21 18:18:01,534 INFO [train.py:1028] (1/2) Epoch 24, batch 2600, loss[loss=0.1869, simple_loss=0.2488, pruned_loss=0.06251, over 13257.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.25, pruned_loss=0.06883, over 2586999.72 frames. ], batch size: 52, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:18:22,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=431427.3333333333, ans=0.025 2024-06-21 18:18:23,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=431427.3333333333, ans=0.0 2024-06-21 18:18:25,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=431427.3333333333, ans=0.125 2024-06-21 18:18:27,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.66 vs. limit=15.0 2024-06-21 18:18:33,771 INFO [train.py:1028] (1/2) Epoch 24, batch 2650, loss[loss=0.184, simple_loss=0.2337, pruned_loss=0.06713, over 13015.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.249, pruned_loss=0.06827, over 2588325.19 frames. ], batch size: 144, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:18:34,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=431464.0, ans=0.125 2024-06-21 18:18:39,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=431482.3333333333, ans=0.04949747468305833 2024-06-21 18:18:49,876 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.198e+02 2.343e+02 2.646e+02 3.229e+02, threshold=4.685e+02, percent-clipped=0.0 2024-06-21 18:18:51,043 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.82 vs. limit=15.0 2024-06-21 18:19:06,175 INFO [train.py:1028] (1/2) Epoch 24, batch 2700, loss[loss=0.1983, simple_loss=0.2467, pruned_loss=0.07497, over 13254.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.248, pruned_loss=0.06819, over 2585055.19 frames. 
], batch size: 89, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:19:09,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=431555.6666666667, ans=0.125 2024-06-21 18:19:12,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.90 vs. limit=15.0 2024-06-21 18:19:29,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=431610.6666666667, ans=0.0 2024-06-21 18:19:40,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=431629.0, ans=0.05 2024-06-21 18:19:42,977 INFO [train.py:1028] (1/2) Epoch 24, batch 2750, loss[loss=0.1961, simple_loss=0.2422, pruned_loss=0.07503, over 13327.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2466, pruned_loss=0.06733, over 2582744.40 frames. ], batch size: 43, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:19:48,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=431647.3333333333, ans=0.05 2024-06-21 18:19:59,903 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.175e+02 2.264e+02 2.535e+02 3.560e+02, threshold=4.528e+02, percent-clipped=0.0 2024-06-21 18:20:06,976 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:20:10,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=431720.6666666667, ans=10.0 2024-06-21 18:20:16,517 INFO [train.py:1028] (1/2) Epoch 24, batch 2800, loss[loss=0.1957, simple_loss=0.2405, pruned_loss=0.07545, over 10861.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2463, pruned_loss=0.0673, over 2580002.31 frames. ], batch size: 304, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:20:20,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=431739.0, ans=0.2 2024-06-21 18:20:32,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=431775.6666666667, ans=0.125 2024-06-21 18:20:38,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=431794.0, ans=0.125 2024-06-21 18:20:48,787 INFO [train.py:1028] (1/2) Epoch 24, batch 2850, loss[loss=0.1917, simple_loss=0.2519, pruned_loss=0.06577, over 13052.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2452, pruned_loss=0.06697, over 2577467.27 frames. ], batch size: 48, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:20:49,886 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.39 vs. 
limit=15.0 2024-06-21 18:20:51,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=431830.6666666667, ans=0.125 2024-06-21 18:20:54,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=431849.0, ans=0.0 2024-06-21 18:20:56,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=431849.0, ans=0.0 2024-06-21 18:21:03,821 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.35 vs. limit=22.5 2024-06-21 18:21:05,182 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.139e+02 2.287e+02 2.457e+02 3.072e+02, threshold=4.575e+02, percent-clipped=0.0 2024-06-21 18:21:08,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431885.6666666667, ans=0.1 2024-06-21 18:21:12,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2024-06-21 18:21:21,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=431904.0, ans=0.0 2024-06-21 18:21:24,373 INFO [train.py:1028] (1/2) Epoch 24, batch 2900, loss[loss=0.1825, simple_loss=0.2393, pruned_loss=0.06283, over 13128.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2437, pruned_loss=0.06646, over 2585744.46 frames. ], batch size: 55, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:21:35,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=431940.6666666667, ans=0.125 2024-06-21 18:21:38,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=431940.6666666667, ans=0.125 2024-06-21 18:21:41,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.79 vs. limit=15.0 2024-06-21 18:22:01,223 INFO [train.py:1028] (1/2) Epoch 24, batch 2950, loss[loss=0.1639, simple_loss=0.2231, pruned_loss=0.05242, over 13258.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2437, pruned_loss=0.06673, over 2579327.73 frames. ], batch size: 43, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:22:01,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=432014.0, ans=0.0 2024-06-21 18:22:03,233 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:22:05,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=432014.0, ans=0.0 2024-06-21 18:22:05,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=432014.0, ans=0.0 2024-06-21 18:22:06,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.86 vs. limit=22.5 2024-06-21 18:22:08,385 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.14 vs. 
limit=15.0 2024-06-21 18:22:08,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=432032.3333333333, ans=0.125 2024-06-21 18:22:12,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432032.3333333333, ans=0.1 2024-06-21 18:22:13,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.59 vs. limit=15.0 2024-06-21 18:22:14,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432050.6666666667, ans=0.1 2024-06-21 18:22:16,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=432050.6666666667, ans=0.2 2024-06-21 18:22:17,827 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.158e+02 2.332e+02 2.540e+02 3.517e+02, threshold=4.663e+02, percent-clipped=0.0 2024-06-21 18:22:33,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=432105.6666666667, ans=0.0 2024-06-21 18:22:34,084 INFO [train.py:1028] (1/2) Epoch 24, batch 3000, loss[loss=0.1933, simple_loss=0.2488, pruned_loss=0.06891, over 13143.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2428, pruned_loss=0.06624, over 2577673.98 frames. ], batch size: 59, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:22:34,085 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 18:22:42,122 INFO [train.py:1060] (1/2) Epoch 24, validation: loss=0.1881, simple_loss=0.2507, pruned_loss=0.0627, over 351949.00 frames. 2024-06-21 18:22:42,123 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 18:22:51,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2024-06-21 18:23:06,965 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2024-06-21 18:23:09,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=432179.0, ans=0.125 2024-06-21 18:23:14,065 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.11 vs. limit=15.0 2024-06-21 18:23:15,065 INFO [train.py:1028] (1/2) Epoch 24, batch 3050, loss[loss=0.1658, simple_loss=0.23, pruned_loss=0.0508, over 13273.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2421, pruned_loss=0.06613, over 2578081.42 frames. 
], batch size: 46, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:23:22,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=432197.3333333333, ans=0.125 2024-06-21 18:23:24,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=432215.6666666667, ans=0.0 2024-06-21 18:23:25,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=432215.6666666667, ans=0.0 2024-06-21 18:23:26,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=432215.6666666667, ans=0.125 2024-06-21 18:23:31,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=432215.6666666667, ans=0.025 2024-06-21 18:23:33,189 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.19 vs. limit=15.0 2024-06-21 18:23:37,554 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.131e+02 2.236e+02 2.419e+02 2.947e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 18:23:38,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=432234.0, ans=0.0 2024-06-21 18:23:53,595 INFO [train.py:1028] (1/2) Epoch 24, batch 3100, loss[loss=0.1889, simple_loss=0.238, pruned_loss=0.06991, over 13063.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2412, pruned_loss=0.06551, over 2579109.52 frames. ], batch size: 144, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:23:55,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=432289.0, ans=0.0 2024-06-21 18:23:56,588 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2024-06-21 18:24:13,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.46 vs. limit=15.0 2024-06-21 18:24:18,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=432344.0, ans=0.0 2024-06-21 18:24:23,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432362.3333333333, ans=0.1 2024-06-21 18:24:26,244 INFO [train.py:1028] (1/2) Epoch 24, batch 3150, loss[loss=0.186, simple_loss=0.2399, pruned_loss=0.06603, over 12891.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2401, pruned_loss=0.06527, over 2581033.41 frames. ], batch size: 158, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:24:27,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=432380.6666666667, ans=0.2 2024-06-21 18:24:32,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.94 vs. 
limit=10.0 2024-06-21 18:24:35,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432399.0, ans=0.1 2024-06-21 18:24:43,123 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 2.115e+02 2.287e+02 2.443e+02 3.354e+02, threshold=4.574e+02, percent-clipped=0.0 2024-06-21 18:24:47,926 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:24:50,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=432435.6666666667, ans=0.0 2024-06-21 18:24:54,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=432454.0, ans=0.125 2024-06-21 18:24:58,603 INFO [train.py:1028] (1/2) Epoch 24, batch 3200, loss[loss=0.184, simple_loss=0.2379, pruned_loss=0.06507, over 13143.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2399, pruned_loss=0.06522, over 2580594.04 frames. ], batch size: 55, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:24:58,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=432472.3333333333, ans=0.125 2024-06-21 18:25:20,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=432527.3333333333, ans=0.125 2024-06-21 18:25:21,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432527.3333333333, ans=0.1 2024-06-21 18:25:21,560 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.79 vs. limit=22.5 2024-06-21 18:25:37,041 INFO [train.py:1028] (1/2) Epoch 24, batch 3250, loss[loss=0.1656, simple_loss=0.2248, pruned_loss=0.05321, over 13280.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2392, pruned_loss=0.06515, over 2584798.65 frames. ], batch size: 72, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:25:48,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.35 vs. limit=15.0 2024-06-21 18:25:50,593 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:25:52,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432600.6666666667, ans=0.1 2024-06-21 18:25:55,125 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.158e+02 2.271e+02 2.513e+02 4.932e+02, threshold=4.543e+02, percent-clipped=1.0 2024-06-21 18:26:10,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=432655.6666666667, ans=0.2 2024-06-21 18:26:10,911 INFO [train.py:1028] (1/2) Epoch 24, batch 3300, loss[loss=0.1779, simple_loss=0.2349, pruned_loss=0.06043, over 12720.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2386, pruned_loss=0.06477, over 2581759.30 frames. 
], batch size: 176, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:26:11,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=432655.6666666667, ans=0.2 2024-06-21 18:26:11,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=432655.6666666667, ans=0.125 2024-06-21 18:26:12,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=432655.6666666667, ans=0.125 2024-06-21 18:26:20,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=432655.6666666667, ans=0.0 2024-06-21 18:26:22,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2024-06-21 18:26:31,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432692.3333333333, ans=0.1 2024-06-21 18:26:36,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.15 vs. limit=10.0 2024-06-21 18:26:37,915 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. limit=15.0 2024-06-21 18:26:38,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=432710.6666666667, ans=0.0 2024-06-21 18:26:48,250 INFO [train.py:1028] (1/2) Epoch 24, batch 3350, loss[loss=0.1853, simple_loss=0.2381, pruned_loss=0.06619, over 12894.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2382, pruned_loss=0.06505, over 2577759.05 frames. ], batch size: 158, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:26:50,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=432747.3333333333, ans=0.0 2024-06-21 18:27:05,789 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.187e+02 2.430e+02 2.629e+02 3.288e+02, threshold=4.860e+02, percent-clipped=0.0 2024-06-21 18:27:06,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=432784.0, ans=0.125 2024-06-21 18:27:09,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=432802.3333333333, ans=0.2 2024-06-21 18:27:12,858 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.70 vs. limit=22.5 2024-06-21 18:27:27,259 INFO [train.py:1028] (1/2) Epoch 24, batch 3400, loss[loss=0.1955, simple_loss=0.2415, pruned_loss=0.07473, over 12693.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2373, pruned_loss=0.06486, over 2576804.80 frames. ], batch size: 22, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:27:33,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432857.3333333333, ans=0.1 2024-06-21 18:27:45,079 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.55 vs. 
limit=6.0 2024-06-21 18:27:50,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.55 vs. limit=15.0 2024-06-21 18:27:50,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=432894.0, ans=0.125 2024-06-21 18:28:00,670 INFO [train.py:1028] (1/2) Epoch 24, batch 3450, loss[loss=0.1894, simple_loss=0.2366, pruned_loss=0.0711, over 12728.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2371, pruned_loss=0.06482, over 2577787.12 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:28:08,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=432949.0, ans=0.0 2024-06-21 18:28:17,889 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.120e+02 2.236e+02 2.448e+02 3.660e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 18:28:21,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=432985.6666666667, ans=0.1 2024-06-21 18:28:27,850 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:28:33,081 INFO [train.py:1028] (1/2) Epoch 24, batch 3500, loss[loss=0.1504, simple_loss=0.2076, pruned_loss=0.04662, over 12920.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.2364, pruned_loss=0.06402, over 2576232.81 frames. ], batch size: 33, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:28:36,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=433022.3333333333, ans=0.035 2024-06-21 18:28:48,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=433059.0, ans=0.125 2024-06-21 18:28:55,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.91 vs. limit=15.0 2024-06-21 18:29:00,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=433095.6666666667, ans=0.125 2024-06-21 18:29:06,621 INFO [train.py:1028] (1/2) Epoch 24, batch 3550, loss[loss=0.1833, simple_loss=0.2322, pruned_loss=0.06717, over 13173.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.236, pruned_loss=0.0637, over 2578928.87 frames. ], batch size: 95, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:29:26,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2024-06-21 18:29:30,595 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.095e+02 2.211e+02 2.403e+02 3.107e+02, threshold=4.422e+02, percent-clipped=0.0 2024-06-21 18:29:30,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=433150.6666666667, ans=0.0 2024-06-21 18:29:31,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=433150.6666666667, ans=0.125 2024-06-21 18:29:32,248 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=15.0 2024-06-21 18:29:34,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2024-06-21 18:29:45,813 INFO [train.py:1028] (1/2) Epoch 24, batch 3600, loss[loss=0.1923, simple_loss=0.2488, pruned_loss=0.0679, over 13096.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2359, pruned_loss=0.06388, over 2581915.47 frames. ], batch size: 48, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:29:49,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.00 vs. limit=6.0 2024-06-21 18:30:03,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=433242.3333333333, ans=0.025 2024-06-21 18:30:19,164 INFO [train.py:1028] (1/2) Epoch 24, batch 3650, loss[loss=0.1688, simple_loss=0.2177, pruned_loss=0.05996, over 13064.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2358, pruned_loss=0.06371, over 2580188.59 frames. ], batch size: 102, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:30:23,589 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.76 vs. limit=22.5 2024-06-21 18:30:26,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=433315.6666666667, ans=0.125 2024-06-21 18:30:28,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=433315.6666666667, ans=0.0 2024-06-21 18:30:30,494 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=15.0 2024-06-21 18:30:32,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=433334.0, ans=0.1 2024-06-21 18:30:34,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. 
limit=15.0 2024-06-21 18:30:37,302 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.089e+02 2.208e+02 2.378e+02 3.146e+02, threshold=4.415e+02, percent-clipped=0.0 2024-06-21 18:30:40,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=433352.3333333333, ans=0.125 2024-06-21 18:30:42,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=433352.3333333333, ans=0.5 2024-06-21 18:30:42,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=433352.3333333333, ans=0.125 2024-06-21 18:30:46,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433370.6666666667, ans=0.1 2024-06-21 18:30:49,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=433370.6666666667, ans=0.0 2024-06-21 18:30:51,208 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:30:53,735 INFO [train.py:1028] (1/2) Epoch 24, batch 3700, loss[loss=0.1728, simple_loss=0.2353, pruned_loss=0.05519, over 13239.00 frames. ], tot_loss[loss=0.1802, simple_loss=0.2344, pruned_loss=0.06302, over 2585056.14 frames. ], batch size: 72, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:30:53,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=433389.0, ans=0.125 2024-06-21 18:31:03,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=433407.3333333333, ans=0.07 2024-06-21 18:31:16,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=433444.0, ans=0.035 2024-06-21 18:31:23,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=433444.0, ans=0.125 2024-06-21 18:31:35,100 INFO [train.py:1028] (1/2) Epoch 24, batch 3750, loss[loss=0.1818, simple_loss=0.2475, pruned_loss=0.05802, over 12610.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2342, pruned_loss=0.06296, over 2587014.88 frames. ], batch size: 22, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:31:52,425 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.118e+02 2.251e+02 2.479e+02 3.160e+02, threshold=4.502e+02, percent-clipped=0.0 2024-06-21 18:31:56,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=433535.6666666667, ans=0.125 2024-06-21 18:32:07,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=433572.3333333333, ans=0.0 2024-06-21 18:32:08,319 INFO [train.py:1028] (1/2) Epoch 24, batch 3800, loss[loss=0.1862, simple_loss=0.2383, pruned_loss=0.06702, over 13250.00 frames. ], tot_loss[loss=0.1803, simple_loss=0.2344, pruned_loss=0.06305, over 2586157.79 frames. ], batch size: 83, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:32:09,954 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.65 vs. 
limit=15.0 2024-06-21 18:32:26,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=433609.0, ans=0.2 2024-06-21 18:32:41,385 INFO [train.py:1028] (1/2) Epoch 24, batch 3850, loss[loss=0.1752, simple_loss=0.2221, pruned_loss=0.06419, over 13082.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2341, pruned_loss=0.06298, over 2584719.54 frames. ], batch size: 144, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:32:44,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=433664.0, ans=0.025 2024-06-21 18:32:48,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2024-06-21 18:32:59,053 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.108e+02 2.245e+02 2.477e+02 3.503e+02, threshold=4.489e+02, percent-clipped=0.0 2024-06-21 18:33:00,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433719.0, ans=0.1 2024-06-21 18:33:02,426 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:33:10,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=433737.3333333333, ans=0.2 2024-06-21 18:33:13,991 INFO [train.py:1028] (1/2) Epoch 24, batch 3900, loss[loss=0.186, simple_loss=0.2392, pruned_loss=0.06639, over 13212.00 frames. ], tot_loss[loss=0.1799, simple_loss=0.234, pruned_loss=0.06292, over 2587656.60 frames. ], batch size: 83, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:33:28,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=433792.3333333333, ans=0.125 2024-06-21 18:33:39,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433810.6666666667, ans=0.1 2024-06-21 18:33:40,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433810.6666666667, ans=0.1 2024-06-21 18:33:50,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=433829.0, ans=0.025 2024-06-21 18:33:51,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.00 vs. limit=6.0 2024-06-21 18:33:53,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=433829.0, ans=0.0 2024-06-21 18:33:54,693 INFO [train.py:1028] (1/2) Epoch 24, batch 3950, loss[loss=0.1785, simple_loss=0.2313, pruned_loss=0.06282, over 13106.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2332, pruned_loss=0.06254, over 2590080.11 frames. 
], batch size: 132, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:33:55,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=433847.3333333333, ans=0.125 2024-06-21 18:34:12,579 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.067e+02 2.149e+02 2.262e+02 2.872e+02, threshold=4.297e+02, percent-clipped=0.0 2024-06-21 18:34:28,118 INFO [train.py:1028] (1/2) Epoch 24, batch 4000, loss[loss=0.1839, simple_loss=0.2448, pruned_loss=0.06146, over 12916.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2331, pruned_loss=0.06281, over 2584823.61 frames. ], batch size: 39, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:34:28,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=433939.0, ans=0.0 2024-06-21 18:34:37,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=433957.3333333333, ans=0.125 2024-06-21 18:34:43,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=433975.6666666667, ans=0.0 2024-06-21 18:34:53,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=433994.0, ans=0.0 2024-06-21 18:34:56,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=434012.3333333333, ans=0.2 2024-06-21 18:34:57,682 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.56 vs. limit=10.0 2024-06-21 18:35:00,949 INFO [train.py:1028] (1/2) Epoch 24, batch 4050, loss[loss=0.1815, simple_loss=0.2304, pruned_loss=0.06635, over 11030.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.2331, pruned_loss=0.06295, over 2580873.42 frames. ], batch size: 304, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:35:18,305 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.072e+02 2.235e+02 2.378e+02 3.027e+02, threshold=4.470e+02, percent-clipped=0.0 2024-06-21 18:35:29,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=434104.0, ans=0.2 2024-06-21 18:35:33,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=434122.3333333333, ans=0.125 2024-06-21 18:35:33,633 INFO [train.py:1028] (1/2) Epoch 24, batch 4100, loss[loss=0.1777, simple_loss=0.2199, pruned_loss=0.06771, over 13010.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.233, pruned_loss=0.06285, over 2576961.28 frames. ], batch size: 102, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:35:57,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.90 vs. 
limit=10.0 2024-06-21 18:35:59,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=434177.3333333333, ans=0.125 2024-06-21 18:36:02,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=434177.3333333333, ans=0.125 2024-06-21 18:36:07,059 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.46 vs. limit=10.0 2024-06-21 18:36:10,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=434195.6666666667, ans=0.125 2024-06-21 18:36:13,411 INFO [train.py:1028] (1/2) Epoch 24, batch 4150, loss[loss=0.1784, simple_loss=0.2275, pruned_loss=0.06465, over 13109.00 frames. ], tot_loss[loss=0.1785, simple_loss=0.2322, pruned_loss=0.06246, over 2574951.52 frames. ], batch size: 55, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:36:22,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=434232.3333333333, ans=0.125 2024-06-21 18:36:31,305 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.063e+02 2.268e+02 2.474e+02 3.628e+02, threshold=4.536e+02, percent-clipped=0.0 2024-06-21 18:36:46,591 INFO [train.py:1028] (1/2) Epoch 24, batch 4200, loss[loss=0.1724, simple_loss=0.2232, pruned_loss=0.06085, over 13045.00 frames. ], tot_loss[loss=0.1785, simple_loss=0.2319, pruned_loss=0.06259, over 2577128.71 frames. ], batch size: 102, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:36:49,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=434305.6666666667, ans=15.0 2024-06-21 18:36:50,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=434305.6666666667, ans=0.125 2024-06-21 18:36:52,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=434324.0, ans=0.0 2024-06-21 18:37:01,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=434342.3333333333, ans=0.0 2024-06-21 18:37:02,441 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.74 vs. limit=15.0 2024-06-21 18:37:08,067 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=12.0 2024-06-21 18:37:11,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=434360.6666666667, ans=0.125 2024-06-21 18:37:12,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=434379.0, ans=0.0 2024-06-21 18:37:18,401 INFO [train.py:1028] (1/2) Epoch 24, batch 4250, loss[loss=0.1771, simple_loss=0.2335, pruned_loss=0.06038, over 13280.00 frames. ], tot_loss[loss=0.178, simple_loss=0.2316, pruned_loss=0.06219, over 2580276.72 frames. 
], batch size: 46, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:37:26,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=434415.6666666667, ans=0.2 2024-06-21 18:37:39,490 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.073e+02 2.207e+02 2.347e+02 4.161e+02, threshold=4.413e+02, percent-clipped=0.0 2024-06-21 18:37:44,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=434452.3333333333, ans=0.2 2024-06-21 18:37:49,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=434452.3333333333, ans=0.0 2024-06-21 18:37:58,258 INFO [train.py:1028] (1/2) Epoch 24, batch 4300, loss[loss=0.1802, simple_loss=0.2349, pruned_loss=0.06278, over 13178.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.231, pruned_loss=0.06179, over 2580823.95 frames. ], batch size: 59, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:38:04,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=434507.3333333333, ans=0.0 2024-06-21 18:38:06,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=434507.3333333333, ans=0.0 2024-06-21 18:38:07,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=434507.3333333333, ans=0.2 2024-06-21 18:38:14,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=434525.6666666667, ans=0.1 2024-06-21 18:38:18,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.81 vs. limit=15.0 2024-06-21 18:38:18,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=434544.0, ans=0.09899494936611666 2024-06-21 18:38:30,241 INFO [train.py:1028] (1/2) Epoch 24, batch 4350, loss[loss=0.1818, simple_loss=0.2345, pruned_loss=0.06453, over 13196.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2311, pruned_loss=0.06184, over 2585715.18 frames. ], batch size: 59, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:38:32,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=434580.6666666667, ans=0.0 2024-06-21 18:38:45,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=434617.3333333333, ans=0.025 2024-06-21 18:38:47,886 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.086e+02 2.187e+02 2.344e+02 2.916e+02, threshold=4.373e+02, percent-clipped=0.0 2024-06-21 18:39:02,913 INFO [train.py:1028] (1/2) Epoch 24, batch 4400, loss[loss=0.1842, simple_loss=0.2322, pruned_loss=0.06812, over 13162.00 frames. ], tot_loss[loss=0.1776, simple_loss=0.231, pruned_loss=0.06213, over 2585402.60 frames. ], batch size: 83, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:39:10,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.97 vs. limit=22.5 2024-06-21 18:39:16,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. 
limit=15.0 2024-06-21 18:39:28,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=434727.3333333333, ans=0.125 2024-06-21 18:39:32,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=434745.6666666667, ans=0.125 2024-06-21 18:39:39,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=434745.6666666667, ans=0.0 2024-06-21 18:39:42,755 INFO [train.py:1028] (1/2) Epoch 24, batch 4450, loss[loss=0.2078, simple_loss=0.2641, pruned_loss=0.07577, over 12919.00 frames. ], tot_loss[loss=0.1784, simple_loss=0.2316, pruned_loss=0.06257, over 2579833.99 frames. ], batch size: 33, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:39:51,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=434782.3333333333, ans=0.0 2024-06-21 18:39:59,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=434800.6666666667, ans=0.0 2024-06-21 18:40:00,005 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.054e+02 2.175e+02 2.325e+02 3.144e+02, threshold=4.351e+02, percent-clipped=0.0 2024-06-21 18:40:00,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434800.6666666667, ans=0.1 2024-06-21 18:40:00,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=434800.6666666667, ans=0.0 2024-06-21 18:40:01,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2024-06-21 18:40:04,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=434819.0, ans=0.125 2024-06-21 18:40:08,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=434837.3333333333, ans=0.95 2024-06-21 18:40:11,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2024-06-21 18:40:14,919 INFO [train.py:1028] (1/2) Epoch 24, batch 4500, loss[loss=0.1582, simple_loss=0.2157, pruned_loss=0.05036, over 13274.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2307, pruned_loss=0.06204, over 2584513.86 frames. 
], batch size: 89, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:40:18,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=434855.6666666667, ans=0.125 2024-06-21 18:40:21,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=434874.0, ans=0.0 2024-06-21 18:40:24,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=434874.0, ans=0.125 2024-06-21 18:40:29,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=434892.3333333333, ans=0.07 2024-06-21 18:40:30,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=434892.3333333333, ans=0.0 2024-06-21 18:40:34,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=434910.6666666667, ans=0.2 2024-06-21 18:40:37,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=434910.6666666667, ans=0.05 2024-06-21 18:40:40,510 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2024-06-21 18:40:48,030 INFO [train.py:1028] (1/2) Epoch 24, batch 4550, loss[loss=0.1556, simple_loss=0.2188, pruned_loss=0.04623, over 13268.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2307, pruned_loss=0.0618, over 2588776.51 frames. ], batch size: 52, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:40:48,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=434947.3333333333, ans=0.2 2024-06-21 18:40:54,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=434965.6666666667, ans=0.125 2024-06-21 18:40:58,311 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.86 vs. limit=15.0 2024-06-21 18:40:58,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=434965.6666666667, ans=0.125 2024-06-21 18:41:05,842 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.057e+02 2.156e+02 2.362e+02 3.371e+02, threshold=4.313e+02, percent-clipped=0.0 2024-06-21 18:41:08,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=435002.3333333333, ans=0.125 2024-06-21 18:41:14,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=435020.6666666667, ans=0.2 2024-06-21 18:41:20,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=435039.0, ans=0.0 2024-06-21 18:41:21,349 INFO [train.py:1028] (1/2) Epoch 24, batch 4600, loss[loss=0.1908, simple_loss=0.2393, pruned_loss=0.07116, over 12475.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2307, pruned_loss=0.06167, over 2583899.77 frames. 
], batch size: 202, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:41:26,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=435039.0, ans=0.2 2024-06-21 18:41:26,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=435039.0, ans=0.0 2024-06-21 18:41:31,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=435057.3333333333, ans=0.125 2024-06-21 18:41:41,476 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:41:41,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=435075.6666666667, ans=0.125 2024-06-21 18:41:49,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=435094.0, ans=0.0 2024-06-21 18:41:53,417 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.47 vs. limit=22.5 2024-06-21 18:41:55,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=435112.3333333333, ans=0.125 2024-06-21 18:41:59,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=435112.3333333333, ans=0.0 2024-06-21 18:42:00,481 INFO [train.py:1028] (1/2) Epoch 24, batch 4650, loss[loss=0.1804, simple_loss=0.2304, pruned_loss=0.06519, over 13093.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2302, pruned_loss=0.06168, over 2587479.92 frames. ], batch size: 132, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:42:05,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=435130.6666666667, ans=0.2 2024-06-21 18:42:05,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=435130.6666666667, ans=0.125 2024-06-21 18:42:15,273 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:42:18,514 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.077e+02 2.210e+02 2.490e+02 3.095e+02, threshold=4.419e+02, percent-clipped=0.0 2024-06-21 18:42:19,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.95 vs. limit=22.5 2024-06-21 18:42:20,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=435185.6666666667, ans=0.2 2024-06-21 18:42:27,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=435204.0, ans=0.2 2024-06-21 18:42:34,287 INFO [train.py:1028] (1/2) Epoch 24, batch 4700, loss[loss=0.1725, simple_loss=0.2367, pruned_loss=0.05418, over 12397.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2306, pruned_loss=0.06197, over 2583323.00 frames. 
], batch size: 25, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:42:37,887 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:42:40,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=435222.3333333333, ans=0.125 2024-06-21 18:42:44,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=435240.6666666667, ans=0.025 2024-06-21 18:43:04,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=435295.6666666667, ans=0.125 2024-06-21 18:43:07,908 INFO [train.py:1028] (1/2) Epoch 24, batch 4750, loss[loss=0.1904, simple_loss=0.241, pruned_loss=0.06989, over 12442.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2299, pruned_loss=0.06182, over 2580271.72 frames. ], batch size: 202, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:43:10,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=435314.0, ans=0.125 2024-06-21 18:43:17,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=435332.3333333333, ans=0.125 2024-06-21 18:43:19,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=435332.3333333333, ans=0.2 2024-06-21 18:43:25,303 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.149e+02 2.272e+02 2.512e+02 3.513e+02, threshold=4.544e+02, percent-clipped=0.0 2024-06-21 18:43:30,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435369.0, ans=0.1 2024-06-21 18:43:30,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=435369.0, ans=0.0 2024-06-21 18:43:39,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=435387.3333333333, ans=0.2 2024-06-21 18:43:42,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=435387.3333333333, ans=0.125 2024-06-21 18:43:43,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=435387.3333333333, ans=0.125 2024-06-21 18:43:44,258 INFO [train.py:1028] (1/2) Epoch 24, batch 4800, loss[loss=0.1785, simple_loss=0.2363, pruned_loss=0.06036, over 13345.00 frames. ], tot_loss[loss=0.1767, simple_loss=0.2301, pruned_loss=0.06165, over 2577315.30 frames. 
], batch size: 63, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:43:54,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=435424.0, ans=0.05 2024-06-21 18:44:09,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=435460.6666666667, ans=0.0 2024-06-21 18:44:13,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=435479.0, ans=0.125 2024-06-21 18:44:15,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=435479.0, ans=0.025 2024-06-21 18:44:20,649 INFO [train.py:1028] (1/2) Epoch 24, batch 4850, loss[loss=0.1687, simple_loss=0.2195, pruned_loss=0.05902, over 13255.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2296, pruned_loss=0.06158, over 2574282.60 frames. ], batch size: 89, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:44:34,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=435534.0, ans=0.2 2024-06-21 18:44:38,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=435534.0, ans=0.025 2024-06-21 18:44:38,665 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.043e+02 2.175e+02 2.367e+02 3.258e+02, threshold=4.350e+02, percent-clipped=0.0 2024-06-21 18:44:49,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=435570.6666666667, ans=0.125 2024-06-21 18:44:54,758 INFO [train.py:1028] (1/2) Epoch 24, batch 4900, loss[loss=0.1621, simple_loss=0.2165, pruned_loss=0.0539, over 13185.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.2299, pruned_loss=0.06159, over 2573865.38 frames. ], batch size: 59, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:44:56,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=435589.0, ans=0.125 2024-06-21 18:44:57,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.31 vs. limit=15.0 2024-06-21 18:45:02,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435607.3333333333, ans=0.1 2024-06-21 18:45:31,455 INFO [train.py:1028] (1/2) Epoch 24, batch 4950, loss[loss=0.2035, simple_loss=0.2401, pruned_loss=0.08347, over 11089.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2301, pruned_loss=0.06217, over 2569507.49 frames. 
], batch size: 303, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:45:38,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=435699.0, ans=0.0 2024-06-21 18:45:46,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=435699.0, ans=0.125 2024-06-21 18:45:49,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=435717.3333333333, ans=0.2 2024-06-21 18:45:51,918 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.109e+02 2.238e+02 2.396e+02 2.990e+02, threshold=4.476e+02, percent-clipped=0.0 2024-06-21 18:45:53,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.53 vs. limit=10.0 2024-06-21 18:46:03,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=435754.0, ans=0.025 2024-06-21 18:46:06,974 INFO [train.py:1028] (1/2) Epoch 24, batch 5000, loss[loss=0.1747, simple_loss=0.221, pruned_loss=0.0642, over 13109.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2301, pruned_loss=0.06215, over 2572415.28 frames. ], batch size: 95, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:46:07,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=435772.3333333333, ans=0.125 2024-06-21 18:46:11,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=435772.3333333333, ans=0.0 2024-06-21 18:46:20,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435809.0, ans=0.1 2024-06-21 18:46:20,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=435809.0, ans=0.0 2024-06-21 18:46:24,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=435809.0, ans=0.07 2024-06-21 18:46:24,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435809.0, ans=0.1 2024-06-21 18:46:28,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=435827.3333333333, ans=0.125 2024-06-21 18:46:33,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2024-06-21 18:46:34,153 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=435845.6666666667, ans=0.0 2024-06-21 18:46:36,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-21 18:46:37,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=435845.6666666667, ans=0.125 2024-06-21 18:46:40,187 INFO [train.py:1028] (1/2) Epoch 24, batch 5050, loss[loss=0.148, simple_loss=0.2058, pruned_loss=0.04506, over 12898.00 frames. ], tot_loss[loss=0.1779, simple_loss=0.231, pruned_loss=0.06239, over 2571778.11 frames. 
], batch size: 36, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:46:45,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=435864.0, ans=0.5 2024-06-21 18:46:54,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435900.6666666667, ans=0.1 2024-06-21 18:46:57,535 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.127e+02 2.254e+02 2.556e+02 3.101e+02, threshold=4.507e+02, percent-clipped=0.0 2024-06-21 18:47:03,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=435919.0, ans=0.125 2024-06-21 18:47:05,861 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2024-06-21 18:47:06,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435937.3333333333, ans=0.1 2024-06-21 18:47:12,816 INFO [train.py:1028] (1/2) Epoch 24, batch 5100, loss[loss=0.1884, simple_loss=0.2468, pruned_loss=0.06499, over 12907.00 frames. ], tot_loss[loss=0.1778, simple_loss=0.2305, pruned_loss=0.06254, over 2568731.88 frames. ], batch size: 39, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:47:13,896 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.30 vs. limit=15.0 2024-06-21 18:47:17,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=435955.6666666667, ans=0.04949747468305833 2024-06-21 18:47:18,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=435955.6666666667, ans=0.0 2024-06-21 18:47:19,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=435974.0, ans=0.0 2024-06-21 18:47:21,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=435974.0, ans=0.125 2024-06-21 18:47:26,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=435974.0, ans=0.5 2024-06-21 18:47:31,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435992.3333333333, ans=0.1 2024-06-21 18:47:34,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=435992.3333333333, ans=0.125 2024-06-21 18:47:51,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.71 vs. limit=15.0 2024-06-21 18:47:53,305 INFO [train.py:1028] (1/2) Epoch 24, batch 5150, loss[loss=0.1788, simple_loss=0.2222, pruned_loss=0.06767, over 13128.00 frames. ], tot_loss[loss=0.1777, simple_loss=0.2303, pruned_loss=0.06253, over 2571634.01 frames. 
], batch size: 132, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:47:55,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=436047.3333333333, ans=0.125 2024-06-21 18:47:56,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.25 vs. limit=22.5 2024-06-21 18:48:02,277 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.65 vs. limit=22.5 2024-06-21 18:48:09,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=436084.0, ans=0.125 2024-06-21 18:48:10,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=436084.0, ans=0.0 2024-06-21 18:48:11,133 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.126e+02 2.304e+02 2.475e+02 3.617e+02, threshold=4.607e+02, percent-clipped=0.0 2024-06-21 18:48:22,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=436120.6666666667, ans=0.125 2024-06-21 18:48:26,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436139.0, ans=0.1 2024-06-21 18:48:26,731 INFO [train.py:1028] (1/2) Epoch 24, batch 5200, loss[loss=0.1657, simple_loss=0.2192, pruned_loss=0.05609, over 13210.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2302, pruned_loss=0.06229, over 2575048.83 frames. ], batch size: 95, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:48:34,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=436157.3333333333, ans=0.125 2024-06-21 18:48:41,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=436175.6666666667, ans=0.025 2024-06-21 18:48:47,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=436194.0, ans=0.2 2024-06-21 18:48:49,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2024-06-21 18:49:00,066 INFO [train.py:1028] (1/2) Epoch 24, batch 5250, loss[loss=0.1643, simple_loss=0.2246, pruned_loss=0.05197, over 13312.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2301, pruned_loss=0.06245, over 2570338.30 frames. ], batch size: 52, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:49:05,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.14 vs. limit=12.0 2024-06-21 18:49:09,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.14 vs. 
limit=22.5 2024-06-21 18:49:17,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=436267.3333333333, ans=0.125 2024-06-21 18:49:17,992 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.067e+02 2.139e+02 2.296e+02 2.879e+02, threshold=4.278e+02, percent-clipped=0.0 2024-06-21 18:49:20,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=436285.6666666667, ans=0.0 2024-06-21 18:49:36,733 INFO [train.py:1028] (1/2) Epoch 24, batch 5300, loss[loss=0.1906, simple_loss=0.2371, pruned_loss=0.072, over 13064.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2296, pruned_loss=0.06225, over 2566642.06 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:49:40,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=436322.3333333333, ans=0.04949747468305833 2024-06-21 18:49:51,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=436340.6666666667, ans=0.2 2024-06-21 18:50:00,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436359.0, ans=0.1 2024-06-21 18:50:02,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=436377.3333333333, ans=0.125 2024-06-21 18:50:10,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=436395.6666666667, ans=0.1 2024-06-21 18:50:15,236 INFO [train.py:1028] (1/2) Epoch 24, batch 5350, loss[loss=0.1855, simple_loss=0.2358, pruned_loss=0.06762, over 11860.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2295, pruned_loss=0.06229, over 2573628.50 frames. ], batch size: 17, lr: 2.41e-03, grad_scale: 64.0 2024-06-21 18:50:16,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=436414.0, ans=0.125 2024-06-21 18:50:22,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=436432.3333333333, ans=0.07 2024-06-21 18:50:27,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=436432.3333333333, ans=0.09899494936611666 2024-06-21 18:50:28,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436450.6666666667, ans=0.1 2024-06-21 18:50:29,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=436450.6666666667, ans=0.015 2024-06-21 18:50:32,893 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.071e+02 2.179e+02 2.326e+02 2.934e+02, threshold=4.357e+02, percent-clipped=0.0 2024-06-21 18:50:37,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436469.0, ans=0.1 2024-06-21 18:50:37,341 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.29 vs. 
limit=22.5 2024-06-21 18:50:44,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=436487.3333333333, ans=0.125 2024-06-21 18:50:47,625 INFO [train.py:1028] (1/2) Epoch 24, batch 5400, loss[loss=0.2133, simple_loss=0.2563, pruned_loss=0.08516, over 12218.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2292, pruned_loss=0.06252, over 2566694.02 frames. ], batch size: 240, lr: 2.41e-03, grad_scale: 64.0 2024-06-21 18:50:48,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=436505.6666666667, ans=0.2 2024-06-21 18:50:55,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=436524.0, ans=0.125 2024-06-21 18:50:56,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=436524.0, ans=0.0 2024-06-21 18:51:00,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=436542.3333333333, ans=0.125 2024-06-21 18:51:04,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436542.3333333333, ans=0.1 2024-06-21 18:51:07,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=436560.6666666667, ans=0.125 2024-06-21 18:51:25,426 INFO [train.py:1028] (1/2) Epoch 24, batch 5450, loss[loss=0.1795, simple_loss=0.2325, pruned_loss=0.06323, over 12365.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2297, pruned_loss=0.06258, over 2571403.98 frames. ], batch size: 25, lr: 2.41e-03, grad_scale: 64.0 2024-06-21 18:51:27,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=436597.3333333333, ans=0.125 2024-06-21 18:51:36,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=436615.6666666667, ans=0.125 2024-06-21 18:51:36,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=436615.6666666667, ans=0.125 2024-06-21 18:51:46,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=436634.0, ans=0.0 2024-06-21 18:51:46,568 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.073e+02 2.187e+02 2.390e+02 3.210e+02, threshold=4.375e+02, percent-clipped=0.0 2024-06-21 18:51:50,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=436652.3333333333, ans=0.0 2024-06-21 18:51:52,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=436652.3333333333, ans=0.125 2024-06-21 18:51:55,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=436670.6666666667, ans=0.125 2024-06-21 18:52:01,740 INFO [train.py:1028] (1/2) Epoch 24, batch 5500, loss[loss=0.1955, simple_loss=0.236, pruned_loss=0.07747, over 12190.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2296, pruned_loss=0.06252, over 2563367.94 frames. 
], batch size: 240, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:52:04,594 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.20 vs. limit=15.0 2024-06-21 18:52:05,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=436689.0, ans=0.2 2024-06-21 18:52:06,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=436689.0, ans=15.0 2024-06-21 18:52:13,986 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:52:16,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=436725.6666666667, ans=0.125 2024-06-21 18:52:16,169 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2024-06-21 18:52:19,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=436725.6666666667, ans=0.0 2024-06-21 18:52:21,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436744.0, ans=0.1 2024-06-21 18:52:23,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2024-06-21 18:52:28,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=436762.3333333333, ans=0.125 2024-06-21 18:52:29,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=436762.3333333333, ans=0.1 2024-06-21 18:52:31,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=436762.3333333333, ans=0.07 2024-06-21 18:52:35,156 INFO [train.py:1028] (1/2) Epoch 24, batch 5550, loss[loss=0.1796, simple_loss=0.2377, pruned_loss=0.06077, over 13236.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2295, pruned_loss=0.06224, over 2567122.10 frames. ], batch size: 43, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:52:53,218 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=15.0 2024-06-21 18:52:53,311 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.068e+02 2.190e+02 2.424e+02 3.307e+02, threshold=4.379e+02, percent-clipped=0.0 2024-06-21 18:52:56,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=436835.6666666667, ans=0.125 2024-06-21 18:52:57,683 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.62 vs. 
limit=15.0 2024-06-21 18:53:03,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=436854.0, ans=0.125 2024-06-21 18:53:03,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=436854.0, ans=0.0 2024-06-21 18:53:07,630 INFO [train.py:1028] (1/2) Epoch 24, batch 5600, loss[loss=0.181, simple_loss=0.2284, pruned_loss=0.06679, over 13304.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.2289, pruned_loss=0.06204, over 2569544.42 frames. ], batch size: 89, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:53:26,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=436909.0, ans=0.2 2024-06-21 18:53:48,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=436945.6666666667, ans=0.125 2024-06-21 18:53:49,585 INFO [train.py:1028] (1/2) Epoch 24, batch 5650, loss[loss=0.183, simple_loss=0.2283, pruned_loss=0.06889, over 12573.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2289, pruned_loss=0.0619, over 2575148.74 frames. ], batch size: 202, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:54:08,348 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.103e+02 2.231e+02 2.388e+02 2.995e+02, threshold=4.462e+02, percent-clipped=0.0 2024-06-21 18:54:12,316 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:54:13,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=437019.0, ans=0.95 2024-06-21 18:54:17,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=437037.3333333333, ans=0.1 2024-06-21 18:54:21,077 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=12.0 2024-06-21 18:54:22,668 INFO [train.py:1028] (1/2) Epoch 24, batch 5700, loss[loss=0.1868, simple_loss=0.2341, pruned_loss=0.06978, over 13303.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.229, pruned_loss=0.06192, over 2578838.49 frames. ], batch size: 63, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:54:29,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=437074.0, ans=0.2 2024-06-21 18:54:36,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437092.3333333333, ans=0.1 2024-06-21 18:54:43,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=437110.6666666667, ans=0.125 2024-06-21 18:54:46,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=437110.6666666667, ans=0.125 2024-06-21 18:54:55,706 INFO [train.py:1028] (1/2) Epoch 24, batch 5750, loss[loss=0.1973, simple_loss=0.239, pruned_loss=0.07782, over 12742.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2293, pruned_loss=0.06215, over 2579469.42 frames. 
], batch size: 176, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:54:56,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=437147.3333333333, ans=0.0 2024-06-21 18:54:58,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=437147.3333333333, ans=0.125 2024-06-21 18:55:00,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=437147.3333333333, ans=0.125 2024-06-21 18:55:03,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=437165.6666666667, ans=0.0 2024-06-21 18:55:06,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=437165.6666666667, ans=0.09899494936611666 2024-06-21 18:55:08,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=437165.6666666667, ans=0.2 2024-06-21 18:55:13,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=437184.0, ans=0.02 2024-06-21 18:55:14,457 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 2.072e+02 2.176e+02 2.316e+02 2.997e+02, threshold=4.353e+02, percent-clipped=0.0 2024-06-21 18:55:25,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=437220.6666666667, ans=0.2 2024-06-21 18:55:27,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=437220.6666666667, ans=0.125 2024-06-21 18:55:28,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. limit=6.0 2024-06-21 18:55:32,051 INFO [train.py:1028] (1/2) Epoch 24, batch 5800, loss[loss=0.2026, simple_loss=0.2476, pruned_loss=0.07878, over 12755.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.231, pruned_loss=0.06323, over 2577856.24 frames. 
], batch size: 177, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:55:35,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=437239.0, ans=0.1 2024-06-21 18:55:47,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=437275.6666666667, ans=0.125 2024-06-21 18:55:52,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=437275.6666666667, ans=0.125 2024-06-21 18:55:54,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=437294.0, ans=0.025 2024-06-21 18:55:57,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=437294.0, ans=0.125 2024-06-21 18:55:59,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=437294.0, ans=0.2 2024-06-21 18:56:00,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=437294.0, ans=0.125 2024-06-21 18:56:02,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.01 vs. limit=22.5 2024-06-21 18:56:06,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=437312.3333333333, ans=0.1 2024-06-21 18:56:08,264 INFO [train.py:1028] (1/2) Epoch 24, batch 5850, loss[loss=0.1904, simple_loss=0.2421, pruned_loss=0.06934, over 12605.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2322, pruned_loss=0.06386, over 2575293.33 frames. ], batch size: 202, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:56:20,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437349.0, ans=0.1 2024-06-21 18:56:26,887 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.176e+02 2.343e+02 2.576e+02 3.487e+02, threshold=4.686e+02, percent-clipped=0.0 2024-06-21 18:56:28,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=437385.6666666667, ans=0.2 2024-06-21 18:56:29,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=437385.6666666667, ans=0.0 2024-06-21 18:56:31,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437385.6666666667, ans=0.1 2024-06-21 18:56:41,109 INFO [train.py:1028] (1/2) Epoch 24, batch 5900, loss[loss=0.1801, simple_loss=0.2313, pruned_loss=0.06449, over 13093.00 frames. ], tot_loss[loss=0.1811, simple_loss=0.2338, pruned_loss=0.0642, over 2576182.72 frames. 
], batch size: 121, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:56:46,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=437422.3333333333, ans=0.125 2024-06-21 18:56:52,735 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:56:53,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=437459.0, ans=0.0 2024-06-21 18:56:56,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=437459.0, ans=0.125 2024-06-21 18:56:57,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.67 vs. limit=10.0 2024-06-21 18:57:14,167 INFO [train.py:1028] (1/2) Epoch 24, batch 5950, loss[loss=0.1695, simple_loss=0.2206, pruned_loss=0.05924, over 13125.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.2352, pruned_loss=0.0646, over 2580454.74 frames. ], batch size: 121, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:57:14,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.17 vs. limit=8.0 2024-06-21 18:57:15,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=437514.0, ans=0.0 2024-06-21 18:57:18,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=437514.0, ans=0.125 2024-06-21 18:57:32,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=437550.6666666667, ans=0.2 2024-06-21 18:57:39,232 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.181e+02 2.407e+02 2.594e+02 3.795e+02, threshold=4.814e+02, percent-clipped=0.0 2024-06-21 18:57:44,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=437569.0, ans=0.0 2024-06-21 18:57:44,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=437569.0, ans=0.0 2024-06-21 18:57:49,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=437587.3333333333, ans=0.1 2024-06-21 18:57:49,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.85 vs. limit=15.0 2024-06-21 18:57:53,721 INFO [train.py:1028] (1/2) Epoch 24, batch 6000, loss[loss=0.2248, simple_loss=0.2718, pruned_loss=0.08892, over 12352.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.237, pruned_loss=0.06523, over 2574823.97 frames. ], batch size: 241, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:57:53,722 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 18:58:01,877 INFO [train.py:1060] (1/2) Epoch 24, validation: loss=0.1893, simple_loss=0.2514, pruned_loss=0.06354, over 351949.00 frames. 
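The train.py:1028 entries above report, for each logged batch, the combined objective (loss=) together with its two pruned-RNN-T components (simple_loss=, pruned_loss=), both for the current batch and as a frame-weighted running average (tot_loss[...] over N frames); the train.py:1060 entry just above reports the same triple over the fixed validation set. As a minimal sketch of how such a combined objective and running average can be maintained (the simple_loss_scale weighting and the RunningLoss bookkeeping below are illustrative assumptions in the style of the pruned-transducer recipe family, not an excerpt of this repository's train.py):

    import torch

    def combine_losses(simple_loss: torch.Tensor,
                       pruned_loss: torch.Tensor,
                       simple_loss_scale: float = 0.5) -> torch.Tensor:
        # Weighted sum of the two pruned-RNN-T terms behind the logged "loss=";
        # the 0.5 scale is an assumed default, not a value read out of this log.
        return simple_loss_scale * simple_loss + pruned_loss

    class RunningLoss:
        """Frame-weighted running average, analogous to the logged tot_loss."""

        def __init__(self) -> None:
            self.weighted_sum = 0.0
            self.num_frames = 0.0

        def update(self, per_frame_loss: float, frames_in_batch: float) -> None:
            # Weight each batch by its frame count so long and short batches
            # contribute proportionally, as in the "over N frames" suffix.
            self.weighted_sum += per_frame_loss * frames_in_batch
            self.num_frames += frames_in_batch

        def average(self) -> float:
            # The normalizer is the accumulated frame count printed in the log.
            return self.weighted_sum / max(self.num_frames, 1.0)

Under this bookkeeping, the "over N frames" suffix in each entry is simply the accumulated frame count used as the normalizer, which is why tot_loss moves much more smoothly than the per-batch loss.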
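The recurring optim.py:487 WARNING entries are periodic diagnostics rather than errors: they print five quantiles (min, Q1, median, Q3, max) of recently observed gradient norms plus the clipping threshold derived from them, and percent-clipped=0.0 throughout this stretch means no batch exceeded that threshold. The logged numbers are consistent with the threshold tracking the scaled median, e.g. 2.0 x 2.175e+02 = 4.350e+02 against the reported threshold=4.351e+02. A rough sketch of such a quantile-based clipping diagnostic (the scaled-median rule and window size are assumptions chosen to match the logged fields, not the actual optimizer implementation):

    from collections import deque

    import torch

    class GradNormClipper:
        """Track recent gradient norms and report a scaled-median threshold."""

        def __init__(self, window: int = 200, clipping_scale: float = 2.0) -> None:
            self.history = deque(maxlen=window)  # recent total grad norms
            self.clipping_scale = clipping_scale

        def step(self, model: torch.nn.Module) -> float:
            # max_norm=inf measures the total gradient norm without clipping;
            # call this after loss.backward() so gradients are populated.
            total_norm = float(torch.nn.utils.clip_grad_norm_(
                model.parameters(), max_norm=float("inf")))
            self.history.append(total_norm)
            norms = sorted(self.history)
            quantiles = [norms[int(q * (len(norms) - 1))]
                         for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.clipping_scale * quantiles[2]  # scale the median
            clipped = sum(n > threshold for n in self.history)
            print(f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
                  + " ".join(f"{q:.3e}" for q in quantiles)
                  + f", threshold={threshold:.3e}, "
                  + f"percent-clipped={100.0 * clipped / len(self.history):.1f}")
            return threshold

Because a scaled-median threshold sits roughly twice as high as the typical gradient norm, percent-clipped stays at 0.0 for as long as the norms remain in the band shown by the quartiles, which is the healthy steady state visible throughout this stretch of training.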
2024-06-21 18:58:01,878 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 18:58:06,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=437605.6666666667, ans=0.0 2024-06-21 18:58:21,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=437660.6666666667, ans=0.025 2024-06-21 18:58:29,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=437679.0, ans=0.125 2024-06-21 18:58:34,931 INFO [train.py:1028] (1/2) Epoch 24, batch 6050, loss[loss=0.1769, simple_loss=0.238, pruned_loss=0.05785, over 12981.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2381, pruned_loss=0.0651, over 2578057.78 frames. ], batch size: 39, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:58:38,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=437697.3333333333, ans=0.2 2024-06-21 18:58:53,836 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.170e+02 2.300e+02 2.500e+02 4.242e+02, threshold=4.599e+02, percent-clipped=0.0 2024-06-21 18:59:08,371 INFO [train.py:1028] (1/2) Epoch 24, batch 6100, loss[loss=0.1561, simple_loss=0.213, pruned_loss=0.04957, over 13083.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2396, pruned_loss=0.06561, over 2579330.49 frames. ], batch size: 121, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:59:14,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=437807.3333333333, ans=0.0 2024-06-21 18:59:15,440 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.24 vs. limit=15.0 2024-06-21 18:59:16,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=437807.3333333333, ans=0.0 2024-06-21 18:59:18,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=437807.3333333333, ans=0.0 2024-06-21 18:59:48,713 INFO [train.py:1028] (1/2) Epoch 24, batch 6150, loss[loss=0.2077, simple_loss=0.2485, pruned_loss=0.08347, over 11051.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.241, pruned_loss=0.06609, over 2579343.50 frames. ], batch size: 304, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:59:51,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=437880.6666666667, ans=0.025 2024-06-21 19:00:06,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=437917.3333333333, ans=0.0 2024-06-21 19:00:07,533 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.211e+02 2.349e+02 2.648e+02 4.121e+02, threshold=4.697e+02, percent-clipped=0.0 2024-06-21 19:00:10,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=437935.6666666667, ans=0.125 2024-06-21 19:00:22,184 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. 
limit=15.0 2024-06-21 19:00:23,086 INFO [train.py:1028] (1/2) Epoch 24, batch 6200, loss[loss=0.1965, simple_loss=0.2595, pruned_loss=0.06676, over 13260.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2421, pruned_loss=0.06655, over 2577203.16 frames. ], batch size: 89, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:00:25,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=437972.3333333333, ans=0.125 2024-06-21 19:00:26,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=437972.3333333333, ans=0.0 2024-06-21 19:00:32,404 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.87 vs. limit=22.5 2024-06-21 19:00:37,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=438009.0, ans=0.125 2024-06-21 19:00:38,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=438009.0, ans=0.125 2024-06-21 19:00:59,515 INFO [train.py:1028] (1/2) Epoch 24, batch 6250, loss[loss=0.1907, simple_loss=0.2407, pruned_loss=0.07033, over 13224.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2432, pruned_loss=0.0669, over 2570374.25 frames. ], batch size: 83, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:00:59,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=438064.0, ans=0.125 2024-06-21 19:01:01,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=438064.0, ans=0.125 2024-06-21 19:01:02,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=438064.0, ans=0.125 2024-06-21 19:01:16,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=438100.6666666667, ans=0.125 2024-06-21 19:01:18,532 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.200e+02 2.375e+02 2.548e+02 4.082e+02, threshold=4.750e+02, percent-clipped=0.0 2024-06-21 19:01:22,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=438119.0, ans=0.125 2024-06-21 19:01:34,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=438137.3333333333, ans=0.125 2024-06-21 19:01:36,509 INFO [train.py:1028] (1/2) Epoch 24, batch 6300, loss[loss=0.1852, simple_loss=0.2386, pruned_loss=0.06596, over 11238.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2452, pruned_loss=0.06756, over 2565146.47 frames. ], batch size: 16, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:01:37,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438155.6666666667, ans=0.1 2024-06-21 19:01:45,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2024-06-21 19:01:53,261 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.91 vs. 
limit=22.5 2024-06-21 19:01:55,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=438192.3333333333, ans=0.125 2024-06-21 19:02:07,985 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.26 vs. limit=22.5 2024-06-21 19:02:13,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=438247.3333333333, ans=0.0 2024-06-21 19:02:14,279 INFO [train.py:1028] (1/2) Epoch 24, batch 6350, loss[loss=0.2135, simple_loss=0.2694, pruned_loss=0.07877, over 12504.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2466, pruned_loss=0.06788, over 2575403.22 frames. ], batch size: 202, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:02:16,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2024-06-21 19:02:32,205 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.174e+02 2.374e+02 2.694e+02 3.876e+02, threshold=4.748e+02, percent-clipped=0.0 2024-06-21 19:02:33,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=438302.3333333333, ans=0.07 2024-06-21 19:02:34,833 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.30 vs. limit=22.5 2024-06-21 19:02:40,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=438320.6666666667, ans=0.125 2024-06-21 19:02:46,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.74 vs. limit=10.0 2024-06-21 19:02:47,058 INFO [train.py:1028] (1/2) Epoch 24, batch 6400, loss[loss=0.1706, simple_loss=0.2334, pruned_loss=0.05393, over 13268.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2484, pruned_loss=0.06859, over 2576576.41 frames. ], batch size: 67, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:02:54,882 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.42 vs. 
limit=15.0 2024-06-21 19:02:59,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=438357.3333333333, ans=0.125 2024-06-21 19:03:08,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=438394.0, ans=0.95 2024-06-21 19:03:10,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=438394.0, ans=0.125 2024-06-21 19:03:15,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438412.3333333333, ans=0.1 2024-06-21 19:03:16,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=438412.3333333333, ans=0.09899494936611666 2024-06-21 19:03:16,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=438412.3333333333, ans=0.125 2024-06-21 19:03:18,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=438412.3333333333, ans=0.125 2024-06-21 19:03:19,869 INFO [train.py:1028] (1/2) Epoch 24, batch 6450, loss[loss=0.236, simple_loss=0.2892, pruned_loss=0.09142, over 12524.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2496, pruned_loss=0.06904, over 2581648.20 frames. ], batch size: 202, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:03:21,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=438430.6666666667, ans=0.0 2024-06-21 19:03:29,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=438449.0, ans=0.025 2024-06-21 19:03:33,793 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2024-06-21 19:03:34,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=438467.3333333333, ans=0.0 2024-06-21 19:03:41,251 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.263e+02 2.564e+02 2.841e+02 4.670e+02, threshold=5.128e+02, percent-clipped=0.0 2024-06-21 19:03:46,442 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.11 vs. limit=15.0 2024-06-21 19:03:48,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.85 vs. limit=15.0 2024-06-21 19:03:51,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=438504.0, ans=0.0 2024-06-21 19:03:53,406 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2024-06-21 19:03:55,779 INFO [train.py:1028] (1/2) Epoch 24, batch 6500, loss[loss=0.2218, simple_loss=0.2598, pruned_loss=0.09193, over 10842.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2515, pruned_loss=0.06945, over 2584448.85 frames. 
], batch size: 304, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:03:57,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=438522.3333333333, ans=0.125
2024-06-21 19:04:31,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=438595.6666666667, ans=0.125
2024-06-21 19:04:31,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=438595.6666666667, ans=0.0
2024-06-21 19:04:32,371 INFO [train.py:1028] (1/2) Epoch 24, batch 6550, loss[loss=0.182, simple_loss=0.251, pruned_loss=0.05648, over 12404.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2519, pruned_loss=0.06901, over 2588075.39 frames. ], batch size: 22, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:04:33,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=438614.0, ans=0.025
2024-06-21 19:04:43,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=438632.3333333333, ans=0.0
2024-06-21 19:04:50,802 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.206e+02 2.383e+02 2.546e+02 3.191e+02, threshold=4.766e+02, percent-clipped=0.0
2024-06-21 19:04:54,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=438669.0, ans=0.2
2024-06-21 19:04:54,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=438669.0, ans=0.2
2024-06-21 19:04:54,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438669.0, ans=0.1
2024-06-21 19:05:04,995 INFO [train.py:1028] (1/2) Epoch 24, batch 6600, loss[loss=0.2053, simple_loss=0.2671, pruned_loss=0.07174, over 13184.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2523, pruned_loss=0.0692, over 2591082.51 frames. ], batch size: 72, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:05:06,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=438705.6666666667, ans=0.125
2024-06-21 19:05:09,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=438705.6666666667, ans=0.95
2024-06-21 19:05:17,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=438724.0, ans=0.0
2024-06-21 19:05:18,707 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.67 vs. limit=15.0
2024-06-21 19:05:38,362 INFO [train.py:1028] (1/2) Epoch 24, batch 6650, loss[loss=0.2033, simple_loss=0.2529, pruned_loss=0.07683, over 12940.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.254, pruned_loss=0.06986, over 2584842.83 frames. ], batch size: 158, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:06:04,009 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.238e+02 2.444e+02 2.679e+02 4.277e+02, threshold=4.887e+02, percent-clipped=0.0
2024-06-21 19:06:06,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=438852.3333333333, ans=0.2
2024-06-21 19:06:08,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=438852.3333333333, ans=0.04949747468305833
2024-06-21 19:06:16,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=438870.6666666667, ans=0.025
2024-06-21 19:06:18,435 INFO [train.py:1028] (1/2) Epoch 24, batch 6700, loss[loss=0.2081, simple_loss=0.2585, pruned_loss=0.07882, over 12784.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.255, pruned_loss=0.07059, over 2583851.04 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:06:28,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=438907.3333333333, ans=0.1
2024-06-21 19:06:32,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438925.6666666667, ans=0.1
2024-06-21 19:06:34,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=438925.6666666667, ans=0.125
2024-06-21 19:06:37,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=438925.6666666667, ans=0.0
2024-06-21 19:06:50,960 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.52 vs. limit=15.0
2024-06-21 19:06:51,788 INFO [train.py:1028] (1/2) Epoch 24, batch 6750, loss[loss=0.27, simple_loss=0.3069, pruned_loss=0.1166, over 12141.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2559, pruned_loss=0.07113, over 2579340.72 frames. ], batch size: 240, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:06:53,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=438980.6666666667, ans=0.125
2024-06-21 19:06:53,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=438980.6666666667, ans=10.0
2024-06-21 19:06:55,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=438980.6666666667, ans=0.125
2024-06-21 19:06:56,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=438980.6666666667, ans=0.125
2024-06-21 19:06:57,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=438999.0, ans=0.0
2024-06-21 19:07:07,847 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:07:09,587 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.275e+02 2.404e+02 2.555e+02 3.190e+02, threshold=4.807e+02, percent-clipped=0.0
2024-06-21 19:07:24,086 INFO [train.py:1028] (1/2) Epoch 24, batch 6800, loss[loss=0.1674, simple_loss=0.2299, pruned_loss=0.05249, over 13260.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2572, pruned_loss=0.07139, over 2580601.52 frames. ], batch size: 67, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:07:28,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=439072.3333333333, ans=0.0
2024-06-21 19:07:30,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=439090.6666666667, ans=0.125
2024-06-21 19:07:31,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=439090.6666666667, ans=0.125
2024-06-21 19:07:34,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=439090.6666666667, ans=0.125
2024-06-21 19:07:42,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=439109.0, ans=0.125
2024-06-21 19:07:54,473 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:07:56,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=439145.6666666667, ans=0.0
2024-06-21 19:07:56,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=439145.6666666667, ans=0.125
2024-06-21 19:08:00,960 INFO [train.py:1028] (1/2) Epoch 24, batch 6850, loss[loss=0.2017, simple_loss=0.2697, pruned_loss=0.06681, over 13320.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2573, pruned_loss=0.07109, over 2583559.01 frames. ], batch size: 63, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:08:22,850 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.329e+02 2.497e+02 2.818e+02 3.585e+02, threshold=4.994e+02, percent-clipped=0.0
2024-06-21 19:08:30,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=439237.3333333333, ans=0.0
2024-06-21 19:08:33,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=439237.3333333333, ans=0.125
2024-06-21 19:08:37,655 INFO [train.py:1028] (1/2) Epoch 24, batch 6900, loss[loss=0.1912, simple_loss=0.2522, pruned_loss=0.06511, over 12982.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2582, pruned_loss=0.07127, over 2585752.41 frames. ], batch size: 48, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:08:59,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439310.6666666667, ans=0.1
2024-06-21 19:08:59,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=439310.6666666667, ans=0.125
2024-06-21 19:09:01,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=439310.6666666667, ans=0.125
2024-06-21 19:09:05,482 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.84 vs. limit=15.0
2024-06-21 19:09:07,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439329.0, ans=0.1
2024-06-21 19:09:11,173 INFO [train.py:1028] (1/2) Epoch 24, batch 6950, loss[loss=0.1884, simple_loss=0.2563, pruned_loss=0.06032, over 11506.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2585, pruned_loss=0.07104, over 2579745.87 frames. ], batch size: 17, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:09:12,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=439347.3333333333, ans=0.0
2024-06-21 19:09:15,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=439347.3333333333, ans=0.025
2024-06-21 19:09:18,763 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.57 vs. limit=22.5
2024-06-21 19:09:28,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=439384.0, ans=0.09899494936611666
2024-06-21 19:09:29,562 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.233e+02 2.405e+02 2.651e+02 3.460e+02, threshold=4.809e+02, percent-clipped=0.0
2024-06-21 19:09:30,867 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.21 vs. limit=10.0
2024-06-21 19:09:40,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=439420.6666666667, ans=0.025
2024-06-21 19:09:43,919 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.61 vs. limit=15.0
2024-06-21 19:09:44,204 INFO [train.py:1028] (1/2) Epoch 24, batch 7000, loss[loss=0.1927, simple_loss=0.2479, pruned_loss=0.06872, over 12942.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2585, pruned_loss=0.07095, over 2576176.60 frames. ], batch size: 158, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:09:59,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=439457.3333333333, ans=0.2
2024-06-21 19:10:03,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=439475.6666666667, ans=0.0
2024-06-21 19:10:07,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=439494.0, ans=0.0
2024-06-21 19:10:08,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=439494.0, ans=0.125
2024-06-21 19:10:11,404 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=22.5
2024-06-21 19:10:19,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=439512.3333333333, ans=0.125
2024-06-21 19:10:24,339 INFO [train.py:1028] (1/2) Epoch 24, batch 7050, loss[loss=0.2225, simple_loss=0.2773, pruned_loss=0.08382, over 12773.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2598, pruned_loss=0.07173, over 2583239.09 frames. ], batch size: 177, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:10:28,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=439530.6666666667, ans=0.125
2024-06-21 19:10:34,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=439549.0, ans=0.125
2024-06-21 19:10:34,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=439549.0, ans=0.125
2024-06-21 19:10:42,072 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.248e+02 2.420e+02 2.628e+02 3.867e+02, threshold=4.841e+02, percent-clipped=0.0
2024-06-21 19:10:46,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=439585.6666666667, ans=0.0
2024-06-21 19:10:53,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.38 vs. limit=12.0
2024-06-21 19:10:54,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=439604.0, ans=0.0
2024-06-21 19:10:56,698 INFO [train.py:1028] (1/2) Epoch 24, batch 7100, loss[loss=0.2385, simple_loss=0.2942, pruned_loss=0.09141, over 13176.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2605, pruned_loss=0.07231, over 2574291.13 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:11:06,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=439640.6666666667, ans=0.0
2024-06-21 19:11:25,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=439695.6666666667, ans=0.125
2024-06-21 19:11:30,975 INFO [train.py:1028] (1/2) Epoch 24, batch 7150, loss[loss=0.2228, simple_loss=0.2745, pruned_loss=0.08555, over 12519.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2606, pruned_loss=0.07223, over 2573589.81 frames. ], batch size: 203, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:11:31,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=439714.0, ans=0.125
2024-06-21 19:11:31,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=439714.0, ans=0.0
2024-06-21 19:11:33,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.09 vs. limit=22.5
2024-06-21 19:11:35,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=439714.0, ans=0.125
2024-06-21 19:11:39,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=439732.3333333333, ans=0.125
2024-06-21 19:11:43,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=439750.6666666667, ans=0.125
2024-06-21 19:11:46,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=439750.6666666667, ans=0.125
2024-06-21 19:11:47,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=439750.6666666667, ans=0.125
2024-06-21 19:11:49,313 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.248e+02 2.454e+02 2.638e+02 3.580e+02, threshold=4.908e+02, percent-clipped=0.0
2024-06-21 19:12:01,250 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0
2024-06-21 19:12:07,977 INFO [train.py:1028] (1/2) Epoch 24, batch 7200, loss[loss=0.1975, simple_loss=0.2542, pruned_loss=0.07038, over 13151.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2607, pruned_loss=0.07205, over 2579131.11 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:12:08,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=439805.6666666667, ans=0.125
2024-06-21 19:12:17,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=439824.0, ans=0.0
2024-06-21 19:12:28,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=439842.3333333333, ans=0.125
2024-06-21 19:12:33,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=439860.6666666667, ans=0.2
2024-06-21 19:12:33,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=439860.6666666667, ans=0.125
2024-06-21 19:12:33,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=439860.6666666667, ans=0.125
2024-06-21 19:12:34,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=439860.6666666667, ans=0.2
2024-06-21 19:12:40,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=439879.0, ans=0.2
2024-06-21 19:12:40,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=439879.0, ans=0.125
2024-06-21 19:12:43,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=439879.0, ans=0.125
2024-06-21 19:12:44,297 INFO [train.py:1028] (1/2) Epoch 24, batch 7250, loss[loss=0.1836, simple_loss=0.2552, pruned_loss=0.05597, over 12892.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2622, pruned_loss=0.07246, over 2579636.08 frames. ], batch size: 36, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:12:49,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=439897.3333333333, ans=0.2
2024-06-21 19:12:58,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=439934.0, ans=0.07
2024-06-21 19:13:02,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=439934.0, ans=0.125
2024-06-21 19:13:02,714 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.253e+02 2.371e+02 2.585e+02 4.023e+02, threshold=4.742e+02, percent-clipped=0.0
2024-06-21 19:13:16,891 INFO [train.py:1028] (1/2) Epoch 24, batch 7300, loss[loss=0.2071, simple_loss=0.2726, pruned_loss=0.07083, over 12894.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2638, pruned_loss=0.07302, over 2579533.04 frames. ], batch size: 36, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:13:30,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=440007.3333333333, ans=0.0
2024-06-21 19:13:31,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.43 vs. limit=15.0
2024-06-21 19:13:33,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=440007.3333333333, ans=0.125
2024-06-21 19:13:36,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=440025.6666666667, ans=0.0
2024-06-21 19:13:37,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=440025.6666666667, ans=0.025
2024-06-21 19:13:52,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=440062.3333333333, ans=0.125
2024-06-21 19:13:55,516 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.34 vs. limit=15.0
2024-06-21 19:13:55,782 INFO [train.py:1028] (1/2) Epoch 24, batch 7350, loss[loss=0.2197, simple_loss=0.277, pruned_loss=0.08124, over 13246.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2649, pruned_loss=0.07374, over 2580341.34 frames. ], batch size: 46, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:13:59,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.47 vs. limit=22.5
2024-06-21 19:14:10,369 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.18 vs. limit=12.0
2024-06-21 19:14:17,622 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.269e+02 2.422e+02 2.630e+02 3.965e+02, threshold=4.845e+02, percent-clipped=0.0
2024-06-21 19:14:23,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=440135.6666666667, ans=0.0
2024-06-21 19:14:35,645 INFO [train.py:1028] (1/2) Epoch 24, batch 7400, loss[loss=0.2232, simple_loss=0.2847, pruned_loss=0.08088, over 13231.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2654, pruned_loss=0.07375, over 2585459.51 frames. ], batch size: 63, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:14:39,297 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.99 vs. limit=15.0
2024-06-21 19:14:39,480 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0
2024-06-21 19:14:51,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=440209.0, ans=0.125
2024-06-21 19:14:53,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=440209.0, ans=0.1
2024-06-21 19:14:57,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=440227.3333333333, ans=0.125
2024-06-21 19:14:59,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=440227.3333333333, ans=0.125
2024-06-21 19:15:09,020 INFO [train.py:1028] (1/2) Epoch 24, batch 7450, loss[loss=0.2257, simple_loss=0.2846, pruned_loss=0.08337, over 12527.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2654, pruned_loss=0.07363, over 2578765.46 frames. ], batch size: 29, lr: 2.40e-03, grad_scale: 32.0
2024-06-21 19:15:09,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=440264.0, ans=0.125
2024-06-21 19:15:10,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=440264.0, ans=0.035
2024-06-21 19:15:10,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440264.0, ans=0.1
2024-06-21 19:15:11,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=440264.0, ans=0.125
2024-06-21 19:15:13,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=440264.0, ans=0.0
2024-06-21 19:15:13,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=440264.0, ans=0.125
2024-06-21 19:15:26,912 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0
2024-06-21 19:15:27,882 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.236e+02 2.388e+02 2.547e+02 3.014e+02, threshold=4.777e+02, percent-clipped=0.0
2024-06-21 19:15:28,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=440319.0, ans=0.035
2024-06-21 19:15:33,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=440319.0, ans=0.07
2024-06-21 19:15:42,794 INFO [train.py:1028] (1/2) Epoch 24, batch 7500, loss[loss=0.2249, simple_loss=0.2755, pruned_loss=0.08717, over 10741.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2654, pruned_loss=0.07337, over 2576660.18 frames. ], batch size: 304, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:15:54,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5
2024-06-21 19:16:08,290 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.87 vs. limit=15.0
2024-06-21 19:16:14,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=6.0
2024-06-21 19:16:19,239 INFO [train.py:1028] (1/2) Epoch 24, batch 7550, loss[loss=0.2215, simple_loss=0.2815, pruned_loss=0.08079, over 12950.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2663, pruned_loss=0.07374, over 2576015.67 frames. ], batch size: 158, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:16:40,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=440484.0, ans=0.0
2024-06-21 19:16:41,400 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.386e+02 2.562e+02 2.861e+02 3.776e+02, threshold=5.123e+02, percent-clipped=0.0
2024-06-21 19:16:46,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=440502.3333333333, ans=0.0
2024-06-21 19:16:55,781 INFO [train.py:1028] (1/2) Epoch 24, batch 7600, loss[loss=0.2189, simple_loss=0.2758, pruned_loss=0.08101, over 13202.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.267, pruned_loss=0.07412, over 2576656.00 frames. ], batch size: 83, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:17:07,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=440557.3333333333, ans=0.125
2024-06-21 19:17:12,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=440575.6666666667, ans=0.125
2024-06-21 19:17:12,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=440575.6666666667, ans=0.125
2024-06-21 19:17:18,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440594.0, ans=0.1
2024-06-21 19:17:20,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=440594.0, ans=0.1
2024-06-21 19:17:24,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.40 vs. limit=15.0
2024-06-21 19:17:29,687 INFO [train.py:1028] (1/2) Epoch 24, batch 7650, loss[loss=0.1995, simple_loss=0.2614, pruned_loss=0.0688, over 12952.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2669, pruned_loss=0.07394, over 2571164.57 frames. ], batch size: 33, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:17:48,846 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.282e+02 2.496e+02 2.783e+02 3.716e+02, threshold=4.992e+02, percent-clipped=0.0
2024-06-21 19:17:51,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=440685.6666666667, ans=0.5
2024-06-21 19:17:55,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=440685.6666666667, ans=0.2
2024-06-21 19:17:59,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=440704.0, ans=0.0
2024-06-21 19:18:06,832 INFO [train.py:1028] (1/2) Epoch 24, batch 7700, loss[loss=0.2066, simple_loss=0.2675, pruned_loss=0.07288, over 13271.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2677, pruned_loss=0.07422, over 2569226.78 frames. ], batch size: 63, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:18:07,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=440722.3333333333, ans=0.1
2024-06-21 19:18:14,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.63 vs. limit=15.0
2024-06-21 19:18:16,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=440740.6666666667, ans=0.125
2024-06-21 19:18:28,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=440777.3333333333, ans=0.125
2024-06-21 19:18:29,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=440777.3333333333, ans=0.125
2024-06-21 19:18:38,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=440795.6666666667, ans=0.0
2024-06-21 19:18:40,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=440795.6666666667, ans=0.125
2024-06-21 19:18:40,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=440795.6666666667, ans=0.0
2024-06-21 19:18:41,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=440795.6666666667, ans=0.0
2024-06-21 19:18:43,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=440795.6666666667, ans=10.0
2024-06-21 19:18:45,227 INFO [train.py:1028] (1/2) Epoch 24, batch 7750, loss[loss=0.1942, simple_loss=0.2612, pruned_loss=0.06357, over 13094.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2687, pruned_loss=0.07514, over 2572891.31 frames. ], batch size: 71, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:18:46,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=440814.0, ans=0.125
2024-06-21 19:18:54,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=440832.3333333333, ans=0.0
2024-06-21 19:18:54,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=440832.3333333333, ans=0.125
2024-06-21 19:18:55,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=440832.3333333333, ans=0.125
2024-06-21 19:18:56,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=440832.3333333333, ans=0.05
2024-06-21 19:18:57,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=440832.3333333333, ans=0.025
2024-06-21 19:18:58,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=440850.6666666667, ans=0.2
2024-06-21 19:19:02,450 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0
2024-06-21 19:19:03,976 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.321e+02 2.505e+02 2.717e+02 3.883e+02, threshold=5.009e+02, percent-clipped=0.0
2024-06-21 19:19:06,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=440869.0, ans=0.125
2024-06-21 19:19:18,789 INFO [train.py:1028] (1/2) Epoch 24, batch 7800, loss[loss=0.1996, simple_loss=0.2537, pruned_loss=0.07281, over 13171.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2687, pruned_loss=0.07492, over 2576897.83 frames. ], batch size: 95, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:19:25,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=440924.0, ans=0.125
2024-06-21 19:19:32,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=440942.3333333333, ans=0.125
2024-06-21 19:19:46,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=440979.0, ans=0.125
2024-06-21 19:19:49,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=440979.0, ans=0.125
2024-06-21 19:19:52,747 INFO [train.py:1028] (1/2) Epoch 24, batch 7850, loss[loss=0.1802, simple_loss=0.2461, pruned_loss=0.05713, over 11235.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2694, pruned_loss=0.07529, over 2570354.90 frames. ], batch size: 16, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:19:56,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=440997.3333333333, ans=0.125
2024-06-21 19:20:06,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=441015.6666666667, ans=0.125
2024-06-21 19:20:12,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=441034.0, ans=0.125
2024-06-21 19:20:14,036 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.311e+02 2.472e+02 2.673e+02 3.323e+02, threshold=4.944e+02, percent-clipped=0.0
2024-06-21 19:20:14,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=441034.0, ans=0.2
2024-06-21 19:20:17,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=441052.3333333333, ans=0.1
2024-06-21 19:20:18,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=441052.3333333333, ans=0.025
2024-06-21 19:20:20,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=441052.3333333333, ans=0.0
2024-06-21 19:20:24,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=441070.6666666667, ans=0.09899494936611666
2024-06-21 19:20:32,122 INFO [train.py:1028] (1/2) Epoch 24, batch 7900, loss[loss=0.1937, simple_loss=0.2551, pruned_loss=0.06617, over 13166.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2701, pruned_loss=0.07555, over 2569478.83 frames. ], batch size: 77, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:20:33,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=441089.0, ans=0.2
2024-06-21 19:20:43,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=441107.3333333333, ans=0.2
2024-06-21 19:21:05,969 INFO [train.py:1028] (1/2) Epoch 24, batch 7950, loss[loss=0.2269, simple_loss=0.2759, pruned_loss=0.08889, over 10720.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2708, pruned_loss=0.07595, over 2572828.39 frames. ], batch size: 304, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:21:08,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=441180.6666666667, ans=0.125
2024-06-21 19:21:12,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=441199.0, ans=0.125
2024-06-21 19:21:13,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.32 vs. limit=15.0
2024-06-21 19:21:14,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=441199.0, ans=0.0
2024-06-21 19:21:17,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=441199.0, ans=0.0
2024-06-21 19:21:20,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=441217.3333333333, ans=0.2
2024-06-21 19:21:20,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=15.0
2024-06-21 19:21:22,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=441217.3333333333, ans=0.125
2024-06-21 19:21:23,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=441217.3333333333, ans=0.125
2024-06-21 19:21:24,538 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.362e+02 2.522e+02 2.856e+02 3.555e+02, threshold=5.043e+02, percent-clipped=0.0
2024-06-21 19:21:25,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.86 vs. limit=10.0
2024-06-21 19:21:26,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=441235.6666666667, ans=0.0
2024-06-21 19:21:33,163 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=441254.0, ans=0.0
2024-06-21 19:21:36,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.05 vs. limit=22.5
2024-06-21 19:21:38,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=441272.3333333333, ans=0.125
2024-06-21 19:21:39,125 INFO [train.py:1028] (1/2) Epoch 24, batch 8000, loss[loss=0.1983, simple_loss=0.2614, pruned_loss=0.06763, over 12724.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2712, pruned_loss=0.07601, over 2569863.98 frames. ], batch size: 29, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:21:44,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=441272.3333333333, ans=0.125
2024-06-21 19:21:55,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.87 vs. limit=10.0
2024-06-21 19:22:02,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.07 vs. limit=15.0
2024-06-21 19:22:05,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=441327.3333333333, ans=0.125
2024-06-21 19:22:14,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=441364.0, ans=0.125
2024-06-21 19:22:15,010 INFO [train.py:1028] (1/2) Epoch 24, batch 8050, loss[loss=0.1997, simple_loss=0.2602, pruned_loss=0.06961, over 13185.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2706, pruned_loss=0.07576, over 2570507.83 frames. ], batch size: 83, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:22:19,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=441364.0, ans=0.125
2024-06-21 19:22:24,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=441382.3333333333, ans=0.125
2024-06-21 19:22:25,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=441382.3333333333, ans=0.07
2024-06-21 19:22:25,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=441382.3333333333, ans=0.0
2024-06-21 19:22:35,741 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:22:36,845 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.270e+02 2.388e+02 2.672e+02 3.699e+02, threshold=4.775e+02, percent-clipped=0.0
2024-06-21 19:22:38,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441419.0, ans=0.1
2024-06-21 19:22:50,889 INFO [train.py:1028] (1/2) Epoch 24, batch 8100, loss[loss=0.2126, simple_loss=0.2748, pruned_loss=0.07515, over 13113.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2711, pruned_loss=0.07603, over 2575648.40 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:22:51,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=441455.6666666667, ans=0.2
2024-06-21 19:22:56,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=441455.6666666667, ans=0.125
2024-06-21 19:22:59,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=441474.0, ans=0.0
2024-06-21 19:23:06,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=441492.3333333333, ans=0.125
2024-06-21 19:23:17,407 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.29 vs. limit=12.0
2024-06-21 19:23:19,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=441529.0, ans=0.125
2024-06-21 19:23:24,517 INFO [train.py:1028] (1/2) Epoch 24, batch 8150, loss[loss=0.2166, simple_loss=0.2676, pruned_loss=0.0828, over 13117.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2711, pruned_loss=0.07576, over 2578588.53 frames. ], batch size: 121, lr: 2.40e-03, grad_scale: 64.0
2024-06-21 19:23:32,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=441565.6666666667, ans=0.0
2024-06-21 19:23:33,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=441565.6666666667, ans=0.125
2024-06-21 19:23:34,864 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.97 vs. limit=10.0
2024-06-21 19:23:35,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.56 vs. limit=22.5
2024-06-21 19:23:40,139 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0
2024-06-21 19:23:43,109 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.292e+02 2.415e+02 2.546e+02 3.613e+02, threshold=4.829e+02, percent-clipped=0.0
2024-06-21 19:23:43,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=441584.0, ans=0.5
2024-06-21 19:23:57,685 INFO [train.py:1028] (1/2) Epoch 24, batch 8200, loss[loss=0.2298, simple_loss=0.2886, pruned_loss=0.08552, over 13171.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2712, pruned_loss=0.07574, over 2582800.94 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:23:59,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=441639.0, ans=0.05
2024-06-21 19:24:38,406 INFO [train.py:1028] (1/2) Epoch 24, batch 8250, loss[loss=0.1969, simple_loss=0.2618, pruned_loss=0.06595, over 13227.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2717, pruned_loss=0.07589, over 2581823.84 frames. ], batch size: 52, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:24:53,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=441767.3333333333, ans=0.125
2024-06-21 19:24:56,419 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.287e+02 2.417e+02 2.601e+02 3.005e+02, threshold=4.834e+02, percent-clipped=0.0
2024-06-21 19:25:10,783 INFO [train.py:1028] (1/2) Epoch 24, batch 8300, loss[loss=0.2328, simple_loss=0.294, pruned_loss=0.08578, over 13028.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2708, pruned_loss=0.07542, over 2579635.47 frames. ], batch size: 102, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:25:18,745 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:25:21,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=441840.6666666667, ans=0.125
2024-06-21 19:25:24,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441859.0, ans=0.1
2024-06-21 19:25:27,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=441859.0, ans=0.025
2024-06-21 19:25:28,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=441859.0, ans=0.125
2024-06-21 19:25:30,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=441877.3333333333, ans=0.95
2024-06-21 19:25:37,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=441895.6666666667, ans=0.125
2024-06-21 19:25:37,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=441895.6666666667, ans=0.0
2024-06-21 19:25:44,443 INFO [train.py:1028] (1/2) Epoch 24, batch 8350, loss[loss=0.1969, simple_loss=0.2554, pruned_loss=0.06919, over 13176.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2701, pruned_loss=0.07492, over 2579157.28 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:26:00,268 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0
2024-06-21 19:26:06,388 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.337e+02 2.510e+02 2.765e+02 3.654e+02, threshold=5.020e+02, percent-clipped=0.0
2024-06-21 19:26:08,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=441969.0, ans=0.0
2024-06-21 19:26:12,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=441969.0, ans=0.2
2024-06-21 19:26:17,493 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.20 vs. limit=15.0
2024-06-21 19:26:21,061 INFO [train.py:1028] (1/2) Epoch 24, batch 8400, loss[loss=0.192, simple_loss=0.2536, pruned_loss=0.06518, over 12950.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2703, pruned_loss=0.07509, over 2575840.10 frames. ], batch size: 39, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:26:23,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442005.6666666667, ans=0.1
2024-06-21 19:26:36,352 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0
2024-06-21 19:26:38,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=442042.3333333333, ans=0.125
2024-06-21 19:26:49,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=442060.6666666667, ans=15.0
2024-06-21 19:26:56,816 INFO [train.py:1028] (1/2) Epoch 24, batch 8450, loss[loss=0.2241, simple_loss=0.2895, pruned_loss=0.07932, over 13199.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2713, pruned_loss=0.0751, over 2577768.30 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:27:11,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5
2024-06-21 19:27:14,701 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.350e+02 2.526e+02 2.716e+02 3.213e+02, threshold=5.052e+02, percent-clipped=0.0
2024-06-21 19:27:16,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=442152.3333333333, ans=0.125
2024-06-21 19:27:17,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=442152.3333333333, ans=0.0
2024-06-21 19:27:18,394 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0
2024-06-21 19:27:25,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=442170.6666666667, ans=0.0
2024-06-21 19:27:29,613 INFO [train.py:1028] (1/2) Epoch 24, batch 8500, loss[loss=0.1951, simple_loss=0.2616, pruned_loss=0.06426, over 12728.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2722, pruned_loss=0.07545, over 2577003.91 frames. ], batch size: 29, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:27:30,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=442189.0, ans=0.0
2024-06-21 19:27:34,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=442189.0, ans=0.0
2024-06-21 19:27:38,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=442207.3333333333, ans=0.025
2024-06-21 19:27:44,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=442225.6666666667, ans=0.1
2024-06-21 19:27:45,628 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.21 vs. limit=22.5
2024-06-21 19:27:50,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=442244.0, ans=0.0
2024-06-21 19:28:03,159 INFO [train.py:1028] (1/2) Epoch 24, batch 8550, loss[loss=0.2, simple_loss=0.2672, pruned_loss=0.06636, over 12611.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.272, pruned_loss=0.07528, over 2574739.47 frames. ], batch size: 22, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:28:12,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=442299.0, ans=0.0
2024-06-21 19:28:23,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=442317.3333333333, ans=0.0
2024-06-21 19:28:24,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=442317.3333333333, ans=0.125
2024-06-21 19:28:25,091 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.373e+02 2.480e+02 2.671e+02 3.416e+02, threshold=4.961e+02, percent-clipped=0.0
2024-06-21 19:28:32,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442354.0, ans=0.1
2024-06-21 19:28:43,585 INFO [train.py:1028] (1/2) Epoch 24, batch 8600, loss[loss=0.2127, simple_loss=0.2739, pruned_loss=0.07572, over 13087.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2722, pruned_loss=0.07512, over 2572704.40 frames. ], batch size: 121, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:28:48,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=442372.3333333333, ans=0.0
2024-06-21 19:28:53,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442390.6666666667, ans=0.1
2024-06-21 19:28:57,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=442409.0, ans=0.125
2024-06-21 19:28:58,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.63 vs. limit=22.5
2024-06-21 19:29:00,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=442409.0, ans=0.2
2024-06-21 19:29:01,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=442409.0, ans=0.125
2024-06-21 19:29:01,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=442409.0, ans=0.02
2024-06-21 19:29:10,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442445.6666666667, ans=0.1
2024-06-21 19:29:17,667 INFO [train.py:1028] (1/2) Epoch 24, batch 8650, loss[loss=0.2127, simple_loss=0.2705, pruned_loss=0.07751, over 13002.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2727, pruned_loss=0.07525, over 2577104.24 frames. ], batch size: 102, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:29:22,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=442464.0, ans=0.125
2024-06-21 19:29:25,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=442482.3333333333, ans=0.125
2024-06-21 19:29:33,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=442500.6666666667, ans=0.125
2024-06-21 19:29:36,078 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.348e+02 2.523e+02 2.682e+02 3.493e+02, threshold=5.046e+02, percent-clipped=0.0
2024-06-21 19:29:40,468 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0
2024-06-21 19:29:41,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=442519.0, ans=0.2
2024-06-21 19:29:44,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=442537.3333333333, ans=0.025
2024-06-21 19:29:49,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=442537.3333333333, ans=0.0
2024-06-21 19:29:50,438 INFO [train.py:1028] (1/2) Epoch 24, batch 8700, loss[loss=0.2152, simple_loss=0.2837, pruned_loss=0.07333, over 13161.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2728, pruned_loss=0.07527, over 2574826.64 frames. ], batch size: 59, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:29:52,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=442555.6666666667, ans=0.125
2024-06-21 19:30:00,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=442574.0, ans=0.125
2024-06-21 19:30:12,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442592.3333333333, ans=0.1
2024-06-21 19:30:14,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=442610.6666666667, ans=0.125
2024-06-21 19:30:16,484 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0
2024-06-21 19:30:18,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=442610.6666666667, ans=0.0
2024-06-21 19:30:25,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=442629.0, ans=0.0
2024-06-21 19:30:28,356 INFO [train.py:1028] (1/2) Epoch 24, batch 8750, loss[loss=0.2215, simple_loss=0.2778, pruned_loss=0.08258, over 13143.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2736, pruned_loss=0.07584, over 2569679.47 frames. ], batch size: 121, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:30:39,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=442665.6666666667, ans=0.0
2024-06-21 19:30:39,455 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.90 vs. limit=6.0
2024-06-21 19:30:50,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=442684.0, ans=0.0
2024-06-21 19:30:51,105 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.348e+02 2.544e+02 2.717e+02 3.609e+02, threshold=5.088e+02, percent-clipped=0.0
2024-06-21 19:30:54,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.61 vs. limit=22.5
2024-06-21 19:31:02,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=442720.6666666667, ans=0.125
2024-06-21 19:31:06,652 INFO [train.py:1028] (1/2) Epoch 24, batch 8800, loss[loss=0.2042, simple_loss=0.263, pruned_loss=0.07268, over 13247.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2738, pruned_loss=0.07609, over 2574673.41 frames. ], batch size: 72, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:31:16,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442757.3333333333, ans=0.1
2024-06-21 19:31:30,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=442794.0, ans=0.125
2024-06-21 19:31:35,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=442812.3333333333, ans=0.0
2024-06-21 19:31:40,875 INFO [train.py:1028] (1/2) Epoch 24, batch 8850, loss[loss=0.2162, simple_loss=0.2797, pruned_loss=0.07634, over 12499.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2734, pruned_loss=0.07626, over 2564433.05 frames. ], batch size: 202, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:32:02,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=442867.3333333333, ans=0.0
2024-06-21 19:32:03,225 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.320e+02 2.490e+02 2.681e+02 3.797e+02, threshold=4.979e+02, percent-clipped=0.0
2024-06-21 19:32:03,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=442867.3333333333, ans=0.0
2024-06-21 19:32:12,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=442904.0, ans=0.0
2024-06-21 19:32:14,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=442904.0, ans=0.125
2024-06-21 19:32:17,992 INFO [train.py:1028] (1/2) Epoch 24, batch 8900, loss[loss=0.2178, simple_loss=0.2868, pruned_loss=0.07439, over 12993.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2738, pruned_loss=0.07649, over 2563018.31 frames. ], batch size: 33, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:32:21,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=442922.3333333333, ans=0.07
2024-06-21 19:32:31,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=442959.0, ans=0.125
2024-06-21 19:32:31,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=442959.0, ans=0.0
2024-06-21 19:32:49,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=442995.6666666667, ans=0.0
2024-06-21 19:32:55,393 INFO [train.py:1028] (1/2) Epoch 24, batch 8950, loss[loss=0.2389, simple_loss=0.2931, pruned_loss=0.09238, over 12536.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2742, pruned_loss=0.07629, over 2562852.83 frames. ], batch size: 202, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:32:56,966 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:32:56,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=443014.0, ans=0.2
2024-06-21 19:32:59,340 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0
2024-06-21 19:33:07,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=443032.3333333333, ans=0.025
2024-06-21 19:33:08,498 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.23 vs. limit=22.5
2024-06-21 19:33:13,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=443050.6666666667, ans=0.125
2024-06-21 19:33:14,112 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.407e+02 2.540e+02 2.749e+02 4.410e+02, threshold=5.081e+02, percent-clipped=0.0
2024-06-21 19:33:29,179 INFO [train.py:1028] (1/2) Epoch 24, batch 9000, loss[loss=0.2182, simple_loss=0.2777, pruned_loss=0.07938, over 13259.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2746, pruned_loss=0.07607, over 2568652.60 frames. ], batch size: 46, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:33:29,179 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-21 19:33:37,067 INFO [train.py:1060] (1/2) Epoch 24, validation: loss=0.1894, simple_loss=0.2515, pruned_loss=0.06369, over 351949.00 frames.
2024-06-21 19:33:37,067 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-21 19:33:42,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=443105.6666666667, ans=0.015
2024-06-21 19:33:42,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=443105.6666666667, ans=0.2
2024-06-21 19:33:45,977 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.16 vs. limit=15.0
2024-06-21 19:33:47,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=443124.0, ans=0.2
2024-06-21 19:34:06,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=443179.0, ans=0.2
2024-06-21 19:34:09,197 INFO [train.py:1028] (1/2) Epoch 24, batch 9050, loss[loss=0.1928, simple_loss=0.2592, pruned_loss=0.06317, over 10641.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2754, pruned_loss=0.07639, over 2566962.04 frames. ], batch size: 16, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:34:09,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443197.3333333333, ans=0.1
2024-06-21 19:34:15,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=443215.6666666667, ans=0.2
2024-06-21 19:34:21,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=443234.0, ans=0.2
2024-06-21 19:34:22,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=443234.0, ans=0.125
2024-06-21 19:34:26,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=443234.0, ans=0.0
2024-06-21 19:34:26,975 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.374e+02 2.536e+02 2.733e+02 3.653e+02, threshold=5.072e+02, percent-clipped=0.0
2024-06-21 19:34:30,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=443252.3333333333, ans=15.0
2024-06-21 19:34:31,912 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.17 vs. limit=10.0
2024-06-21 19:34:35,384 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.93 vs. limit=22.5
2024-06-21 19:34:35,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=443270.6666666667, ans=0.0
2024-06-21 19:34:36,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=443270.6666666667, ans=0.125
2024-06-21 19:34:38,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=443270.6666666667, ans=0.025
2024-06-21 19:34:44,600 INFO [train.py:1028] (1/2) Epoch 24, batch 9100, loss[loss=0.2298, simple_loss=0.2948, pruned_loss=0.08237, over 13300.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2752, pruned_loss=0.07617, over 2566864.78 frames. ], batch size: 72, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:34:51,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=443307.3333333333, ans=0.0
2024-06-21 19:34:52,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=443307.3333333333, ans=0.0
2024-06-21 19:35:07,393 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.83 vs.
limit=22.5 2024-06-21 19:35:09,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.09 vs. limit=22.5 2024-06-21 19:35:16,380 INFO [train.py:1028] (1/2) Epoch 24, batch 9150, loss[loss=0.2037, simple_loss=0.2682, pruned_loss=0.06965, over 13151.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2756, pruned_loss=0.07642, over 2568093.07 frames. ], batch size: 77, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:35:27,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443399.0, ans=0.1 2024-06-21 19:35:29,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.25 vs. limit=22.5 2024-06-21 19:35:30,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=443417.3333333333, ans=0.09899494936611666 2024-06-21 19:35:33,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=443417.3333333333, ans=0.0 2024-06-21 19:35:33,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=443417.3333333333, ans=0.125 2024-06-21 19:35:35,038 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.280e+02 2.401e+02 2.548e+02 3.542e+02, threshold=4.801e+02, percent-clipped=0.0 2024-06-21 19:35:36,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.76 vs. limit=6.0 2024-06-21 19:35:40,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=443435.6666666667, ans=0.1 2024-06-21 19:35:42,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.46 vs. limit=15.0 2024-06-21 19:35:51,267 INFO [train.py:1028] (1/2) Epoch 24, batch 9200, loss[loss=0.2081, simple_loss=0.2706, pruned_loss=0.07284, over 12928.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2751, pruned_loss=0.07599, over 2571506.84 frames. ], batch size: 36, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:35:59,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=443490.6666666667, ans=0.04949747468305833 2024-06-21 19:36:00,270 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:36:12,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=443527.3333333333, ans=0.1 2024-06-21 19:36:22,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=443564.0, ans=0.0 2024-06-21 19:36:22,652 INFO [train.py:1028] (1/2) Epoch 24, batch 9250, loss[loss=0.1892, simple_loss=0.2601, pruned_loss=0.05916, over 13206.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2745, pruned_loss=0.07573, over 2571952.27 frames. 
], batch size: 67, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:36:30,300 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.69 vs. limit=15.0 2024-06-21 19:36:36,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=443600.6666666667, ans=0.125 2024-06-21 19:36:41,022 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.307e+02 2.462e+02 2.633e+02 3.050e+02, threshold=4.924e+02, percent-clipped=0.0 2024-06-21 19:36:54,464 INFO [train.py:1028] (1/2) Epoch 24, batch 9300, loss[loss=0.2009, simple_loss=0.2609, pruned_loss=0.07046, over 12979.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2744, pruned_loss=0.07572, over 2569026.19 frames. ], batch size: 39, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:37:11,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=443692.3333333333, ans=0.025 2024-06-21 19:37:12,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=443692.3333333333, ans=0.125 2024-06-21 19:37:18,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=443710.6666666667, ans=0.125 2024-06-21 19:37:22,618 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:37:23,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=443729.0, ans=0.125 2024-06-21 19:37:25,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=443747.3333333333, ans=0.125 2024-06-21 19:37:26,190 INFO [train.py:1028] (1/2) Epoch 24, batch 9350, loss[loss=0.2394, simple_loss=0.3065, pruned_loss=0.08614, over 12783.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2749, pruned_loss=0.07632, over 2567647.86 frames. 
], batch size: 22, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:37:27,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=443747.3333333333, ans=0.05 2024-06-21 19:37:33,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=443765.6666666667, ans=0.0 2024-06-21 19:37:35,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=443765.6666666667, ans=0.0 2024-06-21 19:37:39,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=443784.0, ans=0.0 2024-06-21 19:37:44,049 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.333e+02 2.515e+02 2.682e+02 3.328e+02, threshold=5.029e+02, percent-clipped=0.0 2024-06-21 19:37:48,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=443802.3333333333, ans=0.125 2024-06-21 19:37:49,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=443802.3333333333, ans=0.125 2024-06-21 19:37:51,013 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:37:56,641 INFO [train.py:1028] (1/2) Epoch 24, batch 9400, loss[loss=0.2224, simple_loss=0.285, pruned_loss=0.07985, over 13251.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2756, pruned_loss=0.07676, over 2567178.17 frames. ], batch size: 52, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:37:59,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=443839.0, ans=0.125 2024-06-21 19:38:07,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.98 vs. limit=6.0 2024-06-21 19:38:20,310 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.37 vs. limit=10.0 2024-06-21 19:38:25,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=443912.3333333333, ans=0.125 2024-06-21 19:38:29,763 INFO [train.py:1028] (1/2) Epoch 24, batch 9450, loss[loss=0.2208, simple_loss=0.2905, pruned_loss=0.07558, over 12694.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2762, pruned_loss=0.07731, over 2567165.45 frames. 
], batch size: 22, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:38:29,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=443930.6666666667, ans=0.025 2024-06-21 19:38:32,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=443930.6666666667, ans=0.2 2024-06-21 19:38:37,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=443949.0, ans=0.125 2024-06-21 19:38:47,471 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.355e+02 2.491e+02 2.682e+02 3.504e+02, threshold=4.982e+02, percent-clipped=0.0 2024-06-21 19:38:48,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=443985.6666666667, ans=0.125 2024-06-21 19:39:02,350 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.38 vs. limit=15.0 2024-06-21 19:39:02,560 INFO [train.py:1028] (1/2) Epoch 24, batch 9500, loss[loss=0.2037, simple_loss=0.2661, pruned_loss=0.07064, over 13207.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2752, pruned_loss=0.07636, over 2576168.32 frames. ], batch size: 43, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:39:04,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444022.3333333333, ans=0.1 2024-06-21 19:39:14,684 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.12 vs. limit=22.5 2024-06-21 19:39:15,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=444059.0, ans=0.125 2024-06-21 19:39:19,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=444059.0, ans=0.0 2024-06-21 19:39:28,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=444095.6666666667, ans=0.02 2024-06-21 19:39:33,301 INFO [train.py:1028] (1/2) Epoch 24, batch 9550, loss[loss=0.1955, simple_loss=0.2645, pruned_loss=0.06329, over 12895.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2758, pruned_loss=0.07664, over 2571221.21 frames. 
], batch size: 39, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:39:42,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=444132.3333333333, ans=0.125 2024-06-21 19:39:44,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=444132.3333333333, ans=0.125 2024-06-21 19:39:51,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=444150.6666666667, ans=0.025 2024-06-21 19:39:51,588 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 2.316e+02 2.414e+02 2.591e+02 3.190e+02, threshold=4.829e+02, percent-clipped=0.0 2024-06-21 19:39:54,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=444169.0, ans=0.025 2024-06-21 19:39:56,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=444169.0, ans=0.125 2024-06-21 19:40:04,497 INFO [train.py:1028] (1/2) Epoch 24, batch 9600, loss[loss=0.2437, simple_loss=0.2862, pruned_loss=0.1006, over 10605.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.275, pruned_loss=0.07639, over 2570592.50 frames. ], batch size: 304, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:40:08,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=444205.6666666667, ans=0.0 2024-06-21 19:40:10,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=444224.0, ans=0.0 2024-06-21 19:40:17,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=444242.3333333333, ans=0.0 2024-06-21 19:40:20,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=444242.3333333333, ans=0.125 2024-06-21 19:40:24,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.16 vs. limit=15.0 2024-06-21 19:40:26,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=444260.6666666667, ans=0.0 2024-06-21 19:40:35,438 INFO [train.py:1028] (1/2) Epoch 24, batch 9650, loss[loss=0.2017, simple_loss=0.2534, pruned_loss=0.07497, over 13099.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.275, pruned_loss=0.07694, over 2561014.21 frames. ], batch size: 132, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:40:41,753 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:40:46,583 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.77 vs. 
limit=12.0 2024-06-21 19:40:46,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444315.6666666667, ans=0.1 2024-06-21 19:40:55,248 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.348e+02 2.570e+02 2.816e+02 3.829e+02, threshold=5.139e+02, percent-clipped=0.0 2024-06-21 19:40:57,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=444352.3333333333, ans=0.0 2024-06-21 19:41:08,327 INFO [train.py:1028] (1/2) Epoch 24, batch 9700, loss[loss=0.2258, simple_loss=0.2778, pruned_loss=0.08684, over 12987.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2744, pruned_loss=0.07678, over 2556276.42 frames. ], batch size: 144, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:41:09,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=444389.0, ans=0.2 2024-06-21 19:41:12,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.43 vs. limit=6.0 2024-06-21 19:41:12,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.79 vs. limit=10.0 2024-06-21 19:41:29,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=444444.0, ans=0.0 2024-06-21 19:41:33,784 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.99 vs. limit=10.0 2024-06-21 19:41:41,429 INFO [train.py:1028] (1/2) Epoch 24, batch 9750, loss[loss=0.1962, simple_loss=0.2495, pruned_loss=0.0715, over 13059.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.273, pruned_loss=0.07588, over 2552456.22 frames. ], batch size: 132, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:41:43,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=444480.6666666667, ans=0.125 2024-06-21 19:41:52,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=444499.0, ans=0.125 2024-06-21 19:41:57,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.12 vs. 
limit=15.0 2024-06-21 19:41:58,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=444517.3333333333, ans=0.0 2024-06-21 19:41:59,674 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.253e+02 2.395e+02 2.635e+02 3.379e+02, threshold=4.790e+02, percent-clipped=0.0 2024-06-21 19:42:03,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=444535.6666666667, ans=0.2 2024-06-21 19:42:04,162 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:42:09,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=444554.0, ans=0.125 2024-06-21 19:42:12,638 INFO [train.py:1028] (1/2) Epoch 24, batch 9800, loss[loss=0.2104, simple_loss=0.2604, pruned_loss=0.08025, over 12924.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2721, pruned_loss=0.07522, over 2544387.66 frames. ], batch size: 39, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:42:14,231 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-21 19:42:23,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=444590.6666666667, ans=10.0 2024-06-21 19:42:30,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=444609.0, ans=0.0 2024-06-21 19:42:30,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.67 vs. limit=15.0 2024-06-21 19:42:37,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=444645.6666666667, ans=0.0 2024-06-21 19:42:43,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.79 vs. limit=22.5 2024-06-21 19:42:43,910 INFO [train.py:1028] (1/2) Epoch 24, batch 9850, loss[loss=0.1958, simple_loss=0.2573, pruned_loss=0.06715, over 13008.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2718, pruned_loss=0.07511, over 2536528.66 frames. ], batch size: 102, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:42:44,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=444664.0, ans=0.125 2024-06-21 19:42:45,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=444664.0, ans=0.125 2024-06-21 19:42:50,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.36 vs. 
limit=15.0 2024-06-21 19:42:55,984 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.963e+00 2024-06-21 19:43:02,982 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.325e+02 2.452e+02 2.627e+02 3.314e+02, threshold=4.903e+02, percent-clipped=0.0 2024-06-21 19:43:04,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=444719.0, ans=0.0 2024-06-21 19:43:15,393 INFO [train.py:1028] (1/2) Epoch 24, batch 9900, loss[loss=0.1994, simple_loss=0.2622, pruned_loss=0.06832, over 12890.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2709, pruned_loss=0.07509, over 2529052.99 frames. ], batch size: 39, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:43:46,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=444829.0, ans=0.125 2024-06-21 19:43:47,506 INFO [train.py:1028] (1/2) Epoch 24, batch 9950, loss[loss=0.2268, simple_loss=0.2835, pruned_loss=0.08507, over 12753.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2696, pruned_loss=0.07523, over 2522204.71 frames. ], batch size: 29, lr: 2.39e-03, grad_scale: 16.0 2024-06-21 19:43:49,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.87 vs. limit=15.0 2024-06-21 19:43:50,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=444847.3333333333, ans=0.0 2024-06-21 19:43:59,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=444884.0, ans=0.0 2024-06-21 19:44:06,213 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.304e+02 2.432e+02 2.633e+02 3.200e+02, threshold=4.863e+02, percent-clipped=0.0 2024-06-21 19:44:09,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=444902.3333333333, ans=0.025 2024-06-21 19:44:13,180 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:44:19,366 INFO [train.py:1028] (1/2) Epoch 24, batch 10000, loss[loss=0.206, simple_loss=0.2754, pruned_loss=0.06833, over 12554.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2706, pruned_loss=0.07603, over 2485357.98 frames. ], batch size: 22, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:44:20,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.04 vs. 
limit=15.0 2024-06-21 19:44:28,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=444957.3333333333, ans=0.125 2024-06-21 19:44:34,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=444975.6666666667, ans=0.2 2024-06-21 19:44:36,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=444975.6666666667, ans=0.125 2024-06-21 19:44:39,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=444994.0, ans=0.125 2024-06-21 19:44:40,440 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:44:50,223 INFO [train.py:1028] (1/2) Epoch 24, batch 10050, loss[loss=0.1859, simple_loss=0.249, pruned_loss=0.06137, over 12592.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2711, pruned_loss=0.07697, over 2441496.14 frames. ], batch size: 22, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:45:04,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=445067.3333333333, ans=0.0 2024-06-21 19:45:06,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=445067.3333333333, ans=0.125 2024-06-21 19:45:07,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=445085.6666666667, ans=0.0 2024-06-21 19:45:08,039 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.338e+02 2.538e+02 2.839e+02 4.307e+02, threshold=5.076e+02, percent-clipped=0.0 2024-06-21 19:45:13,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=445104.0, ans=0.2 2024-06-21 19:45:20,740 INFO [train.py:1028] (1/2) Epoch 24, batch 10100, loss[loss=0.1743, simple_loss=0.2326, pruned_loss=0.05802, over 11564.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.271, pruned_loss=0.07676, over 2423522.77 frames. ], batch size: 17, lr: 2.39e-03, grad_scale: 32.0 2024-06-21 19:45:21,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.43 vs. limit=10.0 2024-06-21 19:47:37,737 INFO [train.py:1028] (1/2) Epoch 25, batch 0, loss[loss=0.1607, simple_loss=0.223, pruned_loss=0.04915, over 13053.00 frames. ], tot_loss[loss=0.1607, simple_loss=0.223, pruned_loss=0.04915, over 13053.00 frames. ], batch size: 36, lr: 2.34e-03, grad_scale: 32.0 2024-06-21 19:47:37,738 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 19:47:44,610 INFO [train.py:1060] (1/2) Epoch 25, validation: loss=0.1898, simple_loss=0.2523, pruned_loss=0.06367, over 351949.00 frames. 2024-06-21 19:47:44,611 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 19:48:07,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=445210.3333333333, ans=0.125 2024-06-21 19:48:08,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.05 vs. 
limit=15.0 2024-06-21 19:48:15,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=445228.6666666667, ans=0.025 2024-06-21 19:48:21,489 INFO [train.py:1028] (1/2) Epoch 25, batch 50, loss[loss=0.209, simple_loss=0.2778, pruned_loss=0.07004, over 12539.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2514, pruned_loss=0.06873, over 574508.54 frames. ], batch size: 29, lr: 2.34e-03, grad_scale: 32.0 2024-06-21 19:48:23,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=445247.0, ans=15.0 2024-06-21 19:48:29,104 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.162e+02 2.321e+02 2.491e+02 3.123e+02, threshold=4.643e+02, percent-clipped=0.0 2024-06-21 19:48:30,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=445265.3333333333, ans=0.2 2024-06-21 19:48:36,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445283.6666666667, ans=0.1 2024-06-21 19:48:43,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445302.0, ans=0.1 2024-06-21 19:48:54,830 INFO [train.py:1028] (1/2) Epoch 25, batch 100, loss[loss=0.186, simple_loss=0.2509, pruned_loss=0.06054, over 13301.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2508, pruned_loss=0.06795, over 1018503.09 frames. ], batch size: 46, lr: 2.34e-03, grad_scale: 32.0 2024-06-21 19:49:26,567 INFO [train.py:1028] (1/2) Epoch 25, batch 150, loss[loss=0.1824, simple_loss=0.2414, pruned_loss=0.06168, over 13093.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2505, pruned_loss=0.06703, over 1365756.42 frames. ], batch size: 30, lr: 2.34e-03, grad_scale: 32.0 2024-06-21 19:49:33,972 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.146e+02 2.262e+02 2.491e+02 3.058e+02, threshold=4.524e+02, percent-clipped=0.0 2024-06-21 19:49:34,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445448.6666666667, ans=0.1 2024-06-21 19:49:35,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.94 vs. limit=8.0 2024-06-21 19:49:36,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=445448.6666666667, ans=0.2 2024-06-21 19:49:46,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=445467.0, ans=0.0 2024-06-21 19:49:47,659 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.03 vs. limit=15.0 2024-06-21 19:50:01,358 INFO [train.py:1028] (1/2) Epoch 25, batch 200, loss[loss=0.2084, simple_loss=0.2612, pruned_loss=0.07774, over 12531.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2511, pruned_loss=0.06755, over 1635384.73 frames. 
], batch size: 202, lr: 2.34e-03, grad_scale: 32.0 2024-06-21 19:50:13,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=445540.3333333333, ans=0.125 2024-06-21 19:50:16,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=445558.6666666667, ans=0.0 2024-06-21 19:50:18,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=445558.6666666667, ans=0.125 2024-06-21 19:50:19,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=445558.6666666667, ans=0.125 2024-06-21 19:50:19,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=445558.6666666667, ans=0.125 2024-06-21 19:50:20,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=445577.0, ans=0.05 2024-06-21 19:50:30,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=445595.3333333333, ans=0.1 2024-06-21 19:50:33,266 INFO [train.py:1028] (1/2) Epoch 25, batch 250, loss[loss=0.1846, simple_loss=0.2327, pruned_loss=0.06827, over 13053.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2507, pruned_loss=0.06733, over 1846382.71 frames. ], batch size: 144, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:50:35,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=445613.6666666667, ans=0.04949747468305833 2024-06-21 19:50:35,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2024-06-21 19:50:40,875 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.168e+02 2.263e+02 2.467e+02 3.296e+02, threshold=4.526e+02, percent-clipped=0.0 2024-06-21 19:50:43,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=445632.0, ans=0.125 2024-06-21 19:50:53,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=445668.6666666667, ans=0.2 2024-06-21 19:50:55,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=445668.6666666667, ans=0.125 2024-06-21 19:51:09,410 INFO [train.py:1028] (1/2) Epoch 25, batch 300, loss[loss=0.1978, simple_loss=0.2533, pruned_loss=0.07121, over 13228.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2512, pruned_loss=0.06761, over 2010260.60 frames. 
], batch size: 112, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:51:13,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=445705.3333333333, ans=0.125 2024-06-21 19:51:26,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=445742.0, ans=0.025 2024-06-21 19:51:30,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445760.3333333333, ans=0.1 2024-06-21 19:51:32,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=445760.3333333333, ans=0.125 2024-06-21 19:51:34,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=445778.6666666667, ans=0.125 2024-06-21 19:51:38,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=445778.6666666667, ans=0.0 2024-06-21 19:51:41,716 INFO [train.py:1028] (1/2) Epoch 25, batch 350, loss[loss=0.1701, simple_loss=0.2358, pruned_loss=0.05219, over 12849.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.251, pruned_loss=0.06762, over 2138899.29 frames. ], batch size: 33, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:51:49,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445797.0, ans=0.1 2024-06-21 19:51:52,148 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.191e+02 2.350e+02 2.550e+02 3.526e+02, threshold=4.700e+02, percent-clipped=0.0 2024-06-21 19:52:16,154 INFO [train.py:1028] (1/2) Epoch 25, batch 400, loss[loss=0.1927, simple_loss=0.2478, pruned_loss=0.06875, over 13226.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2506, pruned_loss=0.06734, over 2239679.38 frames. ], batch size: 63, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:52:19,890 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.69 vs. 
limit=15.0 2024-06-21 19:52:22,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=445907.0, ans=0.2 2024-06-21 19:52:26,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=445907.0, ans=0.125 2024-06-21 19:52:29,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=445925.3333333333, ans=0.125 2024-06-21 19:52:33,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=445925.3333333333, ans=0.125 2024-06-21 19:52:34,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=445925.3333333333, ans=0.125 2024-06-21 19:52:38,070 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.945e+00 2024-06-21 19:52:43,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=445962.0, ans=0.2 2024-06-21 19:52:43,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=445962.0, ans=0.125 2024-06-21 19:52:48,275 INFO [train.py:1028] (1/2) Epoch 25, batch 450, loss[loss=0.1937, simple_loss=0.261, pruned_loss=0.0632, over 13238.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2509, pruned_loss=0.06712, over 2312620.48 frames. ], batch size: 67, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:52:56,060 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.152e+02 2.270e+02 2.439e+02 3.581e+02, threshold=4.541e+02, percent-clipped=0.0 2024-06-21 19:53:08,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=446017.0, ans=0.1 2024-06-21 19:53:14,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=446035.3333333333, ans=0.0 2024-06-21 19:53:16,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=446053.6666666667, ans=10.0 2024-06-21 19:53:23,769 INFO [train.py:1028] (1/2) Epoch 25, batch 500, loss[loss=0.2027, simple_loss=0.2553, pruned_loss=0.07506, over 13073.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2518, pruned_loss=0.06718, over 2375985.14 frames. ], batch size: 121, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:53:28,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.20 vs. limit=15.0 2024-06-21 19:53:36,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446108.6666666667, ans=0.1 2024-06-21 19:53:49,266 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.53 vs. limit=6.0 2024-06-21 19:53:58,301 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:53:58,820 INFO [train.py:1028] (1/2) Epoch 25, batch 550, loss[loss=0.1873, simple_loss=0.2506, pruned_loss=0.06207, over 12958.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2517, pruned_loss=0.06705, over 2420910.00 frames. 
], batch size: 158, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:54:06,547 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.233e+02 2.339e+02 2.564e+02 3.148e+02, threshold=4.678e+02, percent-clipped=0.0 2024-06-21 19:54:08,143 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.72 vs. limit=22.5 2024-06-21 19:54:20,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.35 vs. limit=15.0 2024-06-21 19:54:30,300 INFO [train.py:1028] (1/2) Epoch 25, batch 600, loss[loss=0.1729, simple_loss=0.2228, pruned_loss=0.06154, over 13052.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2521, pruned_loss=0.06746, over 2457944.81 frames. ], batch size: 144, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:54:31,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=446255.3333333333, ans=0.125 2024-06-21 19:54:35,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=446255.3333333333, ans=0.125 2024-06-21 19:54:36,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=446273.6666666667, ans=0.0 2024-06-21 19:54:40,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=446273.6666666667, ans=0.125 2024-06-21 19:54:46,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=446292.0, ans=0.025 2024-06-21 19:54:51,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=446310.3333333333, ans=0.0 2024-06-21 19:54:57,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=446328.6666666667, ans=0.1 2024-06-21 19:55:02,410 INFO [train.py:1028] (1/2) Epoch 25, batch 650, loss[loss=0.1783, simple_loss=0.2416, pruned_loss=0.0575, over 13111.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2512, pruned_loss=0.06662, over 2489777.67 frames. 
], batch size: 59, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:55:08,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=446347.0, ans=0.0 2024-06-21 19:55:12,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=446365.3333333333, ans=0.125 2024-06-21 19:55:12,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=446365.3333333333, ans=0.125 2024-06-21 19:55:12,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=446365.3333333333, ans=0.125 2024-06-21 19:55:14,016 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.166e+02 2.281e+02 2.405e+02 3.067e+02, threshold=4.562e+02, percent-clipped=0.0 2024-06-21 19:55:16,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=446365.3333333333, ans=0.1 2024-06-21 19:55:22,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2024-06-21 19:55:30,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=446402.0, ans=0.0 2024-06-21 19:55:31,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=446420.3333333333, ans=0.125 2024-06-21 19:55:38,480 INFO [train.py:1028] (1/2) Epoch 25, batch 700, loss[loss=0.2058, simple_loss=0.2643, pruned_loss=0.07369, over 13356.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2517, pruned_loss=0.06711, over 2511851.49 frames. ], batch size: 46, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:55:38,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=446438.6666666667, ans=0.0 2024-06-21 19:55:42,703 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.09 vs. limit=15.0 2024-06-21 19:55:43,978 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2024-06-21 19:55:55,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=446475.3333333333, ans=0.07 2024-06-21 19:55:59,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=446493.6666666667, ans=0.125 2024-06-21 19:56:13,167 INFO [train.py:1028] (1/2) Epoch 25, batch 750, loss[loss=0.1687, simple_loss=0.2277, pruned_loss=0.05482, over 13316.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2514, pruned_loss=0.06659, over 2527483.51 frames. 
], batch size: 63, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:56:14,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=446530.3333333333, ans=0.0 2024-06-21 19:56:19,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=446548.6666666667, ans=0.125 2024-06-21 19:56:20,589 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.134e+02 2.285e+02 2.445e+02 2.856e+02, threshold=4.570e+02, percent-clipped=0.0 2024-06-21 19:56:26,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=446567.0, ans=0.125 2024-06-21 19:56:30,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=446567.0, ans=0.125 2024-06-21 19:56:33,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=446585.3333333333, ans=0.0 2024-06-21 19:56:45,040 INFO [train.py:1028] (1/2) Epoch 25, batch 800, loss[loss=0.2022, simple_loss=0.2586, pruned_loss=0.07289, over 12964.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2511, pruned_loss=0.06638, over 2540025.28 frames. ], batch size: 36, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:56:49,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=446622.0, ans=0.0 2024-06-21 19:56:55,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=446640.3333333333, ans=0.2 2024-06-21 19:56:59,232 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:57:02,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=446658.6666666667, ans=0.125 2024-06-21 19:57:07,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=446677.0, ans=0.0 2024-06-21 19:57:16,139 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=446695.3333333333, ans=0.2 2024-06-21 19:57:16,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=446695.3333333333, ans=0.95 2024-06-21 19:57:19,788 INFO [train.py:1028] (1/2) Epoch 25, batch 850, loss[loss=0.1853, simple_loss=0.2433, pruned_loss=0.06361, over 13134.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2507, pruned_loss=0.06597, over 2550569.99 frames. 
], batch size: 95, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:57:26,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=446732.0, ans=0.07 2024-06-21 19:57:26,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=446732.0, ans=0.1 2024-06-21 19:57:27,243 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.149e+02 2.247e+02 2.412e+02 3.418e+02, threshold=4.493e+02, percent-clipped=0.0 2024-06-21 19:57:29,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=446732.0, ans=0.125 2024-06-21 19:57:38,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=446768.6666666667, ans=0.2 2024-06-21 19:57:40,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.09 vs. limit=15.0 2024-06-21 19:57:41,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.17 vs. limit=15.0 2024-06-21 19:57:54,369 INFO [train.py:1028] (1/2) Epoch 25, batch 900, loss[loss=0.2034, simple_loss=0.2617, pruned_loss=0.07254, over 12829.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2506, pruned_loss=0.06637, over 2555580.72 frames. ], batch size: 36, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:58:00,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=446805.3333333333, ans=15.0 2024-06-21 19:58:01,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=446823.6666666667, ans=0.0 2024-06-21 19:58:02,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=446823.6666666667, ans=0.0 2024-06-21 19:58:05,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=446823.6666666667, ans=0.125 2024-06-21 19:58:10,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=446842.0, ans=0.125 2024-06-21 19:58:13,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=446860.3333333333, ans=15.0 2024-06-21 19:58:15,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=446860.3333333333, ans=0.0 2024-06-21 19:58:23,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=446878.6666666667, ans=0.1 2024-06-21 19:58:27,185 INFO [train.py:1028] (1/2) Epoch 25, batch 950, loss[loss=0.1963, simple_loss=0.2602, pruned_loss=0.06622, over 13009.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2507, pruned_loss=0.06635, over 2558764.87 frames. 
], batch size: 39, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:58:30,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446897.0, ans=0.1 2024-06-21 19:58:33,322 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.66 vs. limit=15.0 2024-06-21 19:58:34,686 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.180e+02 2.343e+02 2.592e+02 3.113e+02, threshold=4.686e+02, percent-clipped=0.0 2024-06-21 19:58:42,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=446933.6666666667, ans=0.125 2024-06-21 19:58:49,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446952.0, ans=0.1 2024-06-21 19:58:50,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=446952.0, ans=0.0 2024-06-21 19:58:52,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=446970.3333333333, ans=0.0 2024-06-21 19:58:55,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=446970.3333333333, ans=0.125 2024-06-21 19:58:59,267 INFO [train.py:1028] (1/2) Epoch 25, batch 1000, loss[loss=0.1881, simple_loss=0.2435, pruned_loss=0.06639, over 13293.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2507, pruned_loss=0.06656, over 2561083.05 frames. ], batch size: 49, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:59:10,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=447007.0, ans=0.125 2024-06-21 19:59:10,564 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=15.0 2024-06-21 19:59:13,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=447007.0, ans=0.0 2024-06-21 19:59:19,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=447025.3333333333, ans=0.5 2024-06-21 19:59:22,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=447043.6666666667, ans=0.125 2024-06-21 19:59:34,318 INFO [train.py:1028] (1/2) Epoch 25, batch 1050, loss[loss=0.1775, simple_loss=0.2423, pruned_loss=0.05637, over 13134.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2514, pruned_loss=0.06664, over 2566057.29 frames. ], batch size: 77, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:59:35,092 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:59:35,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.90 vs. 
limit=10.0 2024-06-21 19:59:35,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=447080.3333333333, ans=0.0 2024-06-21 19:59:44,912 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.193e+02 2.284e+02 2.427e+02 3.148e+02, threshold=4.569e+02, percent-clipped=0.0 2024-06-21 19:59:55,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=447135.3333333333, ans=0.2 2024-06-21 19:59:57,843 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-21 20:00:09,362 INFO [train.py:1028] (1/2) Epoch 25, batch 1100, loss[loss=0.1871, simple_loss=0.239, pruned_loss=0.06756, over 13265.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2517, pruned_loss=0.06683, over 2571404.96 frames. ], batch size: 52, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:00:13,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2024-06-21 20:00:20,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447190.3333333333, ans=0.1 2024-06-21 20:00:20,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2024-06-21 20:00:22,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=447208.6666666667, ans=0.125 2024-06-21 20:00:25,896 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.13 vs. limit=22.5 2024-06-21 20:00:40,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=447245.3333333333, ans=0.1 2024-06-21 20:00:42,345 INFO [train.py:1028] (1/2) Epoch 25, batch 1150, loss[loss=0.2058, simple_loss=0.2646, pruned_loss=0.07357, over 13240.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.252, pruned_loss=0.06733, over 2572146.91 frames. 
], batch size: 52, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:00:45,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=447263.6666666667, ans=0.125 2024-06-21 20:00:47,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=447263.6666666667, ans=0.0 2024-06-21 20:00:50,144 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.182e+02 2.315e+02 2.456e+02 2.888e+02, threshold=4.629e+02, percent-clipped=0.0 2024-06-21 20:00:57,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=447300.3333333333, ans=0.125 2024-06-21 20:01:07,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=447318.6666666667, ans=0.2 2024-06-21 20:01:09,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=447318.6666666667, ans=0.0 2024-06-21 20:01:18,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=447337.0, ans=0.0 2024-06-21 20:01:25,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=447355.3333333333, ans=0.2 2024-06-21 20:01:26,151 INFO [train.py:1028] (1/2) Epoch 25, batch 1200, loss[loss=0.1887, simple_loss=0.2493, pruned_loss=0.06402, over 13128.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2517, pruned_loss=0.06749, over 2574281.95 frames. ], batch size: 77, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:01:44,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=447392.0, ans=0.125 2024-06-21 20:01:49,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=447410.3333333333, ans=0.0 2024-06-21 20:01:51,174 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.39 vs. limit=15.0 2024-06-21 20:01:52,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=447428.6666666667, ans=0.05 2024-06-21 20:01:52,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=447428.6666666667, ans=0.125 2024-06-21 20:02:01,705 INFO [train.py:1028] (1/2) Epoch 25, batch 1250, loss[loss=0.1811, simple_loss=0.236, pruned_loss=0.06306, over 13160.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2517, pruned_loss=0.06739, over 2583473.21 frames. ], batch size: 112, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:02:09,330 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.197e+02 2.286e+02 2.506e+02 3.473e+02, threshold=4.573e+02, percent-clipped=0.0 2024-06-21 20:02:18,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=447483.6666666667, ans=0.125 2024-06-21 20:02:26,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.42 vs. 
limit=15.0 2024-06-21 20:02:26,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2024-06-21 20:02:28,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=447520.3333333333, ans=0.125 2024-06-21 20:02:32,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.52 vs. limit=22.5 2024-06-21 20:02:33,758 INFO [train.py:1028] (1/2) Epoch 25, batch 1300, loss[loss=0.2143, simple_loss=0.2604, pruned_loss=0.0841, over 12690.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.252, pruned_loss=0.0674, over 2583440.57 frames. ], batch size: 176, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:02:36,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.89 vs. limit=15.0 2024-06-21 20:02:40,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=447557.0, ans=0.125 2024-06-21 20:02:44,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.47 vs. limit=22.5 2024-06-21 20:02:53,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=447593.6666666667, ans=0.025 2024-06-21 20:02:55,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=447593.6666666667, ans=0.2 2024-06-21 20:03:00,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.99 vs. limit=15.0 2024-06-21 20:03:04,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=447612.0, ans=0.125 2024-06-21 20:03:05,818 INFO [train.py:1028] (1/2) Epoch 25, batch 1350, loss[loss=0.1869, simple_loss=0.2558, pruned_loss=0.059, over 13155.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2519, pruned_loss=0.06702, over 2586011.99 frames. ], batch size: 59, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:03:06,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.53 vs. limit=6.0 2024-06-21 20:03:13,715 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.171e+02 2.310e+02 2.528e+02 3.139e+02, threshold=4.621e+02, percent-clipped=0.0 2024-06-21 20:03:15,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=447648.6666666667, ans=0.125 2024-06-21 20:03:17,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. 
limit=15.0 2024-06-21 20:03:21,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=447667.0, ans=0.2 2024-06-21 20:03:22,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=447667.0, ans=0.1 2024-06-21 20:03:29,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=447685.3333333333, ans=0.125 2024-06-21 20:03:31,808 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=447685.3333333333, ans=0.125 2024-06-21 20:03:35,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447703.6666666667, ans=0.1 2024-06-21 20:03:36,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=447703.6666666667, ans=0.0 2024-06-21 20:03:41,214 INFO [train.py:1028] (1/2) Epoch 25, batch 1400, loss[loss=0.1981, simple_loss=0.2576, pruned_loss=0.06934, over 12446.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2521, pruned_loss=0.06713, over 2586896.70 frames. ], batch size: 25, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:03:41,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=447722.0, ans=0.0 2024-06-21 20:03:47,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447740.3333333333, ans=0.1 2024-06-21 20:03:56,630 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2024-06-21 20:04:12,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=447795.3333333333, ans=0.2 2024-06-21 20:04:14,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=447795.3333333333, ans=0.125 2024-06-21 20:04:16,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2024-06-21 20:04:16,147 INFO [train.py:1028] (1/2) Epoch 25, batch 1450, loss[loss=0.184, simple_loss=0.2445, pruned_loss=0.0617, over 13072.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2518, pruned_loss=0.06722, over 2586750.89 frames. ], batch size: 121, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:04:18,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2024-06-21 20:04:24,124 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.201e+02 2.312e+02 2.519e+02 3.489e+02, threshold=4.625e+02, percent-clipped=0.0 2024-06-21 20:04:33,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.69 vs. limit=10.0 2024-06-21 20:04:35,337 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.80 vs. 
limit=12.0 2024-06-21 20:04:36,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=15.0 2024-06-21 20:04:37,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=447868.6666666667, ans=0.04949747468305833 2024-06-21 20:04:43,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=447887.0, ans=0.0 2024-06-21 20:04:44,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=447887.0, ans=0.0 2024-06-21 20:04:48,753 INFO [train.py:1028] (1/2) Epoch 25, batch 1500, loss[loss=0.1942, simple_loss=0.2489, pruned_loss=0.06973, over 13188.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2522, pruned_loss=0.0677, over 2588509.68 frames. ], batch size: 83, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:05:18,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=447978.6666666667, ans=0.125 2024-06-21 20:05:19,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=447978.6666666667, ans=0.09899494936611666 2024-06-21 20:05:24,107 INFO [train.py:1028] (1/2) Epoch 25, batch 1550, loss[loss=0.1908, simple_loss=0.2451, pruned_loss=0.06822, over 13085.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2525, pruned_loss=0.068, over 2583351.77 frames. ], batch size: 102, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:05:25,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=447997.0, ans=0.0 2024-06-21 20:05:32,043 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.179e+02 2.296e+02 2.461e+02 3.419e+02, threshold=4.591e+02, percent-clipped=0.0 2024-06-21 20:05:36,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=448033.6666666667, ans=0.2 2024-06-21 20:05:44,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=448052.0, ans=0.0 2024-06-21 20:05:47,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448052.0, ans=0.1 2024-06-21 20:05:49,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=448052.0, ans=0.125 2024-06-21 20:05:59,283 INFO [train.py:1028] (1/2) Epoch 25, batch 1600, loss[loss=0.203, simple_loss=0.2597, pruned_loss=0.07313, over 13195.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2526, pruned_loss=0.06788, over 2578158.88 frames. 
], batch size: 77, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:06:00,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=448088.6666666667, ans=0.125 2024-06-21 20:06:11,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=448125.3333333333, ans=0.025 2024-06-21 20:06:16,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=448125.3333333333, ans=0.125 2024-06-21 20:06:16,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=448125.3333333333, ans=0.125 2024-06-21 20:06:18,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=448143.6666666667, ans=0.025 2024-06-21 20:06:19,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.51 vs. limit=22.5 2024-06-21 20:06:25,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=448162.0, ans=0.0 2024-06-21 20:06:31,180 INFO [train.py:1028] (1/2) Epoch 25, batch 1650, loss[loss=0.2046, simple_loss=0.2649, pruned_loss=0.0721, over 13187.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2525, pruned_loss=0.0679, over 2574767.92 frames. ], batch size: 95, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:06:35,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=448180.3333333333, ans=0.0 2024-06-21 20:06:37,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=448198.6666666667, ans=0.125 2024-06-21 20:06:38,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=448198.6666666667, ans=0.09899494936611666 2024-06-21 20:06:38,831 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.191e+02 2.307e+02 2.446e+02 3.117e+02, threshold=4.613e+02, percent-clipped=0.0 2024-06-21 20:06:39,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=448198.6666666667, ans=0.2 2024-06-21 20:06:40,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=448198.6666666667, ans=0.5 2024-06-21 20:06:47,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=448217.0, ans=0.0 2024-06-21 20:06:48,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=448217.0, ans=0.0 2024-06-21 20:06:56,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=448235.3333333333, ans=0.05 2024-06-21 20:06:57,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=448253.6666666667, ans=0.125 2024-06-21 20:07:04,513 INFO [train.py:1028] (1/2) Epoch 25, batch 1700, loss[loss=0.1911, simple_loss=0.2501, pruned_loss=0.06608, over 12357.00 frames. 
], tot_loss[loss=0.1943, simple_loss=0.2528, pruned_loss=0.06786, over 2580336.63 frames. ], batch size: 25, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:07:14,275 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.70 vs. limit=22.5 2024-06-21 20:07:15,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=448290.3333333333, ans=0.125 2024-06-21 20:07:16,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=448290.3333333333, ans=0.2 2024-06-21 20:07:22,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=448308.6666666667, ans=0.1 2024-06-21 20:07:38,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=448345.3333333333, ans=0.125 2024-06-21 20:07:39,949 INFO [train.py:1028] (1/2) Epoch 25, batch 1750, loss[loss=0.2001, simple_loss=0.2673, pruned_loss=0.06642, over 12502.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.253, pruned_loss=0.06765, over 2581142.41 frames. ], batch size: 22, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:07:40,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=448363.6666666667, ans=0.125 2024-06-21 20:07:46,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=448382.0, ans=0.125 2024-06-21 20:07:47,594 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.245e+02 2.363e+02 2.492e+02 3.175e+02, threshold=4.725e+02, percent-clipped=0.0 2024-06-21 20:08:03,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448418.6666666667, ans=0.1 2024-06-21 20:08:08,522 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2024-06-21 20:08:16,659 INFO [train.py:1028] (1/2) Epoch 25, batch 1800, loss[loss=0.191, simple_loss=0.2512, pruned_loss=0.06536, over 13279.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.253, pruned_loss=0.06788, over 2581102.70 frames. ], batch size: 67, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:08:24,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=448473.6666666667, ans=22.5 2024-06-21 20:08:34,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.48 vs. limit=6.0 2024-06-21 20:08:36,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=448510.3333333333, ans=0.0 2024-06-21 20:08:49,370 INFO [train.py:1028] (1/2) Epoch 25, batch 1850, loss[loss=0.184, simple_loss=0.2431, pruned_loss=0.06238, over 13222.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2534, pruned_loss=0.06791, over 2581854.05 frames. 
], batch size: 83, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:08:57,072 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.166e+02 2.270e+02 2.447e+02 3.342e+02, threshold=4.540e+02, percent-clipped=0.0 2024-06-21 20:09:01,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=448565.3333333333, ans=0.0 2024-06-21 20:09:07,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=448583.6666666667, ans=0.125 2024-06-21 20:09:08,330 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.45 vs. limit=15.0 2024-06-21 20:09:17,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=448620.3333333333, ans=0.125 2024-06-21 20:09:19,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=448620.3333333333, ans=0.0 2024-06-21 20:09:21,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=448620.3333333333, ans=0.0 2024-06-21 20:09:22,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=448620.3333333333, ans=0.125 2024-06-21 20:09:22,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=448620.3333333333, ans=0.0 2024-06-21 20:09:24,894 INFO [train.py:1028] (1/2) Epoch 25, batch 1900, loss[loss=0.1852, simple_loss=0.2388, pruned_loss=0.06574, over 13137.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2528, pruned_loss=0.06797, over 2584336.49 frames. ], batch size: 95, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:09:40,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.46 vs. limit=15.0 2024-06-21 20:09:43,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=448675.3333333333, ans=0.125 2024-06-21 20:09:48,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=448693.6666666667, ans=0.025 2024-06-21 20:09:54,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=448712.0, ans=0.125 2024-06-21 20:10:00,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=448730.3333333333, ans=0.125 2024-06-21 20:10:01,201 INFO [train.py:1028] (1/2) Epoch 25, batch 1950, loss[loss=0.1964, simple_loss=0.2653, pruned_loss=0.06373, over 13264.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2526, pruned_loss=0.06815, over 2590404.87 frames. 
], batch size: 52, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:10:09,075 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.243e+02 2.372e+02 2.540e+02 2.975e+02, threshold=4.744e+02, percent-clipped=0.0 2024-06-21 20:10:16,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=448767.0, ans=0.1 2024-06-21 20:10:24,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=448785.3333333333, ans=0.0 2024-06-21 20:10:25,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.19 vs. limit=15.0 2024-06-21 20:10:26,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=448803.6666666667, ans=0.035 2024-06-21 20:10:26,960 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.37 vs. limit=15.0 2024-06-21 20:10:33,613 INFO [train.py:1028] (1/2) Epoch 25, batch 2000, loss[loss=0.1767, simple_loss=0.2382, pruned_loss=0.05764, over 12709.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2521, pruned_loss=0.06791, over 2586777.97 frames. ], batch size: 22, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:10:34,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.32 vs. limit=10.0 2024-06-21 20:10:35,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=448822.0, ans=0.125 2024-06-21 20:10:39,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=448840.3333333333, ans=0.2 2024-06-21 20:10:52,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=448877.0, ans=0.125 2024-06-21 20:11:04,881 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.75 vs. limit=15.0 2024-06-21 20:11:05,011 INFO [train.py:1028] (1/2) Epoch 25, batch 2050, loss[loss=0.194, simple_loss=0.2624, pruned_loss=0.06283, over 12724.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2524, pruned_loss=0.06822, over 2582793.87 frames. ], batch size: 29, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:11:06,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0 2024-06-21 20:11:15,608 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.180e+02 2.292e+02 2.432e+02 2.900e+02, threshold=4.584e+02, percent-clipped=0.0 2024-06-21 20:11:24,220 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:11:35,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=448987.0, ans=0.2 2024-06-21 20:11:39,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.55 vs. 
limit=15.0 2024-06-21 20:11:40,298 INFO [train.py:1028] (1/2) Epoch 25, batch 2100, loss[loss=0.2055, simple_loss=0.2676, pruned_loss=0.07169, over 13243.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2523, pruned_loss=0.06779, over 2585696.45 frames. ], batch size: 59, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:11:40,760 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.95 vs. limit=10.0 2024-06-21 20:11:46,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449023.6666666667, ans=0.1 2024-06-21 20:11:55,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.00 vs. limit=22.5 2024-06-21 20:12:09,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.85 vs. limit=15.0 2024-06-21 20:12:13,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.82 vs. limit=15.0 2024-06-21 20:12:14,408 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.10 vs. limit=10.0 2024-06-21 20:12:15,836 INFO [train.py:1028] (1/2) Epoch 25, batch 2150, loss[loss=0.2003, simple_loss=0.259, pruned_loss=0.07078, over 13287.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2524, pruned_loss=0.06745, over 2588753.94 frames. ], batch size: 52, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:12:15,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=449097.0, ans=0.125 2024-06-21 20:12:17,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=449097.0, ans=0.2 2024-06-21 20:12:23,827 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.181e+02 2.277e+02 2.467e+02 3.878e+02, threshold=4.555e+02, percent-clipped=0.0 2024-06-21 20:12:28,210 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.36 vs. limit=15.0 2024-06-21 20:12:48,610 INFO [train.py:1028] (1/2) Epoch 25, batch 2200, loss[loss=0.1858, simple_loss=0.2438, pruned_loss=0.06385, over 13230.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2521, pruned_loss=0.06725, over 2589408.38 frames. ], batch size: 83, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:12:56,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=449207.0, ans=0.125 2024-06-21 20:13:05,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=449225.3333333333, ans=0.125 2024-06-21 20:13:23,278 INFO [train.py:1028] (1/2) Epoch 25, batch 2250, loss[loss=0.1822, simple_loss=0.2425, pruned_loss=0.06093, over 13249.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2519, pruned_loss=0.06715, over 2588409.75 frames. ], batch size: 63, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:13:27,112 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. 
limit=15.0 2024-06-21 20:13:30,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=449298.6666666667, ans=0.125 2024-06-21 20:13:31,117 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.225e+02 2.391e+02 2.732e+02 3.836e+02, threshold=4.783e+02, percent-clipped=0.0 2024-06-21 20:13:34,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.31 vs. limit=22.5 2024-06-21 20:13:37,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449317.0, ans=0.1 2024-06-21 20:13:45,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=449335.3333333333, ans=0.0 2024-06-21 20:13:47,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=449335.3333333333, ans=0.1 2024-06-21 20:13:48,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=449335.3333333333, ans=0.0 2024-06-21 20:13:48,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=449335.3333333333, ans=0.2 2024-06-21 20:13:56,326 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=449353.6666666667, ans=0.125 2024-06-21 20:13:57,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=449353.6666666667, ans=0.035 2024-06-21 20:14:00,824 INFO [train.py:1028] (1/2) Epoch 25, batch 2300, loss[loss=0.1886, simple_loss=0.2512, pruned_loss=0.06298, over 12950.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2527, pruned_loss=0.06759, over 2582437.36 frames. ], batch size: 33, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:14:06,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=449390.3333333333, ans=0.025 2024-06-21 20:14:07,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=449390.3333333333, ans=0.0 2024-06-21 20:14:10,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=449390.3333333333, ans=0.2 2024-06-21 20:14:22,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=449427.0, ans=0.125 2024-06-21 20:14:23,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=449427.0, ans=0.125 2024-06-21 20:14:27,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449445.3333333333, ans=0.1 2024-06-21 20:14:28,850 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.91 vs. limit=22.5 2024-06-21 20:14:33,759 INFO [train.py:1028] (1/2) Epoch 25, batch 2350, loss[loss=0.1869, simple_loss=0.2441, pruned_loss=0.06485, over 13218.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2529, pruned_loss=0.06786, over 2586192.36 frames. 
], batch size: 67, lr: 2.32e-03, grad_scale: 64.0 2024-06-21 20:14:40,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=449482.0, ans=0.125 2024-06-21 20:14:41,869 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.190e+02 2.326e+02 2.570e+02 3.067e+02, threshold=4.652e+02, percent-clipped=0.0 2024-06-21 20:14:51,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=449500.3333333333, ans=0.0 2024-06-21 20:15:03,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=449537.0, ans=0.2 2024-06-21 20:15:06,097 INFO [train.py:1028] (1/2) Epoch 25, batch 2400, loss[loss=0.194, simple_loss=0.2532, pruned_loss=0.06738, over 13298.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2528, pruned_loss=0.06789, over 2588020.87 frames. ], batch size: 46, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:15:16,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=449573.6666666667, ans=0.125 2024-06-21 20:15:24,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=449592.0, ans=0.125 2024-06-21 20:15:28,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=449610.3333333333, ans=0.125 2024-06-21 20:15:40,752 INFO [train.py:1028] (1/2) Epoch 25, batch 2450, loss[loss=0.1995, simple_loss=0.2549, pruned_loss=0.07202, over 13240.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.252, pruned_loss=0.06778, over 2584544.88 frames. ], batch size: 63, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:15:51,883 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.191e+02 2.307e+02 2.431e+02 3.621e+02, threshold=4.615e+02, percent-clipped=0.0 2024-06-21 20:16:03,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=449702.0, ans=0.125 2024-06-21 20:16:04,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0 2024-06-21 20:16:05,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=449702.0, ans=0.125 2024-06-21 20:16:11,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=449720.3333333333, ans=0.025 2024-06-21 20:16:15,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=449720.3333333333, ans=0.0 2024-06-21 20:16:16,447 INFO [train.py:1028] (1/2) Epoch 25, batch 2500, loss[loss=0.1966, simple_loss=0.2539, pruned_loss=0.06967, over 13257.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2505, pruned_loss=0.06702, over 2588126.42 frames. ], batch size: 83, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:16:16,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. 
limit=15.0 2024-06-21 20:16:20,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=449738.6666666667, ans=0.2 2024-06-21 20:16:24,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=449757.0, ans=0.95 2024-06-21 20:16:31,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=449775.3333333333, ans=0.125 2024-06-21 20:16:49,605 INFO [train.py:1028] (1/2) Epoch 25, batch 2550, loss[loss=0.1993, simple_loss=0.2602, pruned_loss=0.06919, over 12651.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2497, pruned_loss=0.06706, over 2587984.05 frames. ], batch size: 22, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:16:52,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=449830.3333333333, ans=0.0 2024-06-21 20:16:58,157 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.175e+02 2.285e+02 2.422e+02 2.710e+02, threshold=4.570e+02, percent-clipped=0.0 2024-06-21 20:16:59,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449848.6666666667, ans=0.1 2024-06-21 20:16:59,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=449848.6666666667, ans=0.0 2024-06-21 20:17:03,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=449867.0, ans=0.125 2024-06-21 20:17:08,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=449867.0, ans=0.125 2024-06-21 20:17:11,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=449885.3333333333, ans=0.5 2024-06-21 20:17:19,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=449903.6666666667, ans=0.1 2024-06-21 20:17:23,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=449903.6666666667, ans=0.125 2024-06-21 20:17:25,833 INFO [train.py:1028] (1/2) Epoch 25, batch 2600, loss[loss=0.1721, simple_loss=0.2308, pruned_loss=0.05675, over 13246.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2484, pruned_loss=0.0668, over 2587116.60 frames. 
], batch size: 52, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:17:33,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=449940.3333333333, ans=0.125 2024-06-21 20:17:35,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=449940.3333333333, ans=0.2 2024-06-21 20:17:40,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=449940.3333333333, ans=0.125 2024-06-21 20:17:41,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=449958.6666666667, ans=0.0 2024-06-21 20:17:43,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=449958.6666666667, ans=0.07 2024-06-21 20:17:51,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.42 vs. limit=15.0 2024-06-21 20:17:52,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=449977.0, ans=0.0 2024-06-21 20:18:01,729 INFO [train.py:1028] (1/2) Epoch 25, batch 2650, loss[loss=0.1895, simple_loss=0.2403, pruned_loss=0.06938, over 12993.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.247, pruned_loss=0.06612, over 2588829.80 frames. ], batch size: 144, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:18:10,281 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.166e+02 2.299e+02 2.413e+02 2.978e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-21 20:18:23,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.85 vs. limit=22.5 2024-06-21 20:18:30,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=450087.0, ans=0.125 2024-06-21 20:18:34,593 INFO [train.py:1028] (1/2) Epoch 25, batch 2700, loss[loss=0.1672, simple_loss=0.2239, pruned_loss=0.05529, over 13284.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2449, pruned_loss=0.06569, over 2587512.53 frames. ], batch size: 89, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:18:36,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=450105.3333333333, ans=0.0 2024-06-21 20:18:50,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=450142.0, ans=0.0 2024-06-21 20:18:51,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=450142.0, ans=0.125 2024-06-21 20:19:09,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=450197.0, ans=0.025 2024-06-21 20:19:10,387 INFO [train.py:1028] (1/2) Epoch 25, batch 2750, loss[loss=0.174, simple_loss=0.2331, pruned_loss=0.05745, over 13326.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2441, pruned_loss=0.06516, over 2585554.79 frames. 
], batch size: 43, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:19:18,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.132e+02 2.307e+02 2.548e+02 5.189e+02, threshold=4.614e+02, percent-clipped=1.0 2024-06-21 20:19:34,969 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2024-06-21 20:19:45,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=450270.3333333333, ans=0.125 2024-06-21 20:19:46,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=450270.3333333333, ans=0.1 2024-06-21 20:19:46,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=450288.6666666667, ans=0.0 2024-06-21 20:19:47,277 INFO [train.py:1028] (1/2) Epoch 25, batch 2800, loss[loss=0.1848, simple_loss=0.2324, pruned_loss=0.0686, over 10914.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2438, pruned_loss=0.06524, over 2582511.66 frames. ], batch size: 303, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:20:00,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=450325.3333333333, ans=0.125 2024-06-21 20:20:01,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=450325.3333333333, ans=0.0 2024-06-21 20:20:09,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450343.6666666667, ans=0.1 2024-06-21 20:20:12,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=450362.0, ans=0.125 2024-06-21 20:20:16,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=450362.0, ans=0.125 2024-06-21 20:20:19,455 INFO [train.py:1028] (1/2) Epoch 25, batch 2850, loss[loss=0.1765, simple_loss=0.2348, pruned_loss=0.05904, over 13069.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2428, pruned_loss=0.06481, over 2580026.32 frames. ], batch size: 48, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:20:27,478 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.281e+02 2.409e+02 2.625e+02 3.138e+02, threshold=4.817e+02, percent-clipped=0.0 2024-06-21 20:20:28,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=450398.6666666667, ans=0.125 2024-06-21 20:20:33,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=450417.0, ans=0.125 2024-06-21 20:20:34,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=450417.0, ans=0.125 2024-06-21 20:20:39,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.28 vs. 
limit=22.5 2024-06-21 20:20:44,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=450453.6666666667, ans=0.125 2024-06-21 20:20:44,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=450453.6666666667, ans=0.0 2024-06-21 20:20:51,226 INFO [train.py:1028] (1/2) Epoch 25, batch 2900, loss[loss=0.1826, simple_loss=0.2424, pruned_loss=0.06144, over 13118.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2412, pruned_loss=0.06425, over 2587772.80 frames. ], batch size: 55, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:21:02,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=450490.3333333333, ans=0.125 2024-06-21 20:21:08,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=450508.6666666667, ans=0.0 2024-06-21 20:21:08,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2024-06-21 20:21:09,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=450508.6666666667, ans=0.125 2024-06-21 20:21:13,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=450508.6666666667, ans=0.125 2024-06-21 20:21:26,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450545.3333333333, ans=0.1 2024-06-21 20:21:30,682 INFO [train.py:1028] (1/2) Epoch 25, batch 2950, loss[loss=0.1748, simple_loss=0.2277, pruned_loss=0.06099, over 13309.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2409, pruned_loss=0.0644, over 2580019.99 frames. ], batch size: 43, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:21:39,109 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.81 vs. limit=15.0 2024-06-21 20:21:39,405 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.116e+02 2.230e+02 2.402e+02 3.505e+02, threshold=4.460e+02, percent-clipped=0.0 2024-06-21 20:21:42,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=450582.0, ans=0.125 2024-06-21 20:22:03,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.24 vs. limit=15.0 2024-06-21 20:22:04,244 INFO [train.py:1028] (1/2) Epoch 25, batch 3000, loss[loss=0.1712, simple_loss=0.2376, pruned_loss=0.05243, over 13199.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2405, pruned_loss=0.06435, over 2579342.06 frames. ], batch size: 59, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:22:04,244 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 20:22:12,041 INFO [train.py:1060] (1/2) Epoch 25, validation: loss=0.1883, simple_loss=0.2506, pruned_loss=0.06299, over 351949.00 frames. 
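The per-batch summary records above carry the main training signal: `loss[...]` holds the per-frame loss components (total, simple, pruned) for the current batch, while `tot_loss[...]` is a running, smoothed aggregate (note its `over N frames` count hovers around 2.56M–2.59M rather than growing without bound), and the validation record just above reports the same per-frame quantities over the full 351,949-frame validation set. To pull the `tot_loss` trajectory out of a saved copy of this log without TensorBoard, a minimal stdlib-only sketch follows; the regex simply mirrors the field layout of the records shown here, and the `log-train` filename is a placeholder, not a path the training script prescribes.

```python
import re

# Matches per-batch summary records such as:
#   Epoch 25, batch 1500, loss[...], tot_loss[loss=0.1938, simple_loss=0.2522,
#   pruned_loss=0.0677, over 2588509.68 frames. ], batch size: 83, ...
# Validation records ("Epoch 25, validation: ...") intentionally do not match,
# because they have no ", batch N," part. re.S lets a record survive an
# accidental line wrap between its loss[...] and tot_loss[...] fields.
RECORD = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?"
    r"tot_loss\[loss=(?P<loss>[0-9.]+), "
    r"simple_loss=(?P<simple>[0-9.]+), "
    r"pruned_loss=(?P<pruned>[0-9.]+), "
    r"over (?P<frames>[0-9.]+) frames",
    re.S,
)

def tot_loss_curve(log_text: str):
    """Yield (epoch, batch, tot_loss) triples in logged order."""
    for m in RECORD.finditer(log_text):
        yield int(m["epoch"]), int(m["batch"]), float(m["loss"])

if __name__ == "__main__":
    with open("log-train") as f:  # placeholder filename for a saved log
        for epoch, batch, loss in tot_loss_curve(f.read()):
            print(f"epoch {epoch:>3} batch {batch:>5} tot_loss {loss:.4f}")
```

The same pattern extends naturally to the `WARNING [optim.py] ... grad-norm quartiles ... threshold=...` lines if the clipping threshold and `percent-clipped` statistics are of interest.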
2024-06-21 20:22:12,042 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 20:22:25,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=450692.0, ans=0.125 2024-06-21 20:22:25,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=450692.0, ans=0.0 2024-06-21 20:22:34,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=450710.3333333333, ans=0.07 2024-06-21 20:22:38,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=450728.6666666667, ans=0.1 2024-06-21 20:22:44,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450728.6666666667, ans=0.1 2024-06-21 20:22:45,244 INFO [train.py:1028] (1/2) Epoch 25, batch 3050, loss[loss=0.1685, simple_loss=0.2213, pruned_loss=0.0579, over 13338.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2399, pruned_loss=0.06433, over 2578545.45 frames. ], batch size: 46, lr: 2.32e-03, grad_scale: 16.0 2024-06-21 20:22:57,501 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.160e+02 2.353e+02 2.529e+02 3.412e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 20:22:59,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=450765.3333333333, ans=0.2 2024-06-21 20:23:02,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten.whitening_limit, batch_count=450783.6666666667, ans=22.5 2024-06-21 20:23:03,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=450783.6666666667, ans=0.125 2024-06-21 20:23:03,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=450783.6666666667, ans=0.0 2024-06-21 20:23:12,956 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.60 vs. limit=15.0 2024-06-21 20:23:16,665 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=15.0 2024-06-21 20:23:21,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=450838.6666666667, ans=0.0 2024-06-21 20:23:21,613 INFO [train.py:1028] (1/2) Epoch 25, batch 3100, loss[loss=0.1692, simple_loss=0.2146, pruned_loss=0.06192, over 13023.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2392, pruned_loss=0.06393, over 2578506.17 frames. ], batch size: 144, lr: 2.32e-03, grad_scale: 16.0 2024-06-21 20:23:36,537 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.43 vs. limit=6.0 2024-06-21 20:23:38,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=450875.3333333333, ans=0.2 2024-06-21 20:23:42,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.20 vs. 
limit=22.5 2024-06-21 20:23:48,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=450893.6666666667, ans=0.0 2024-06-21 20:23:48,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=450893.6666666667, ans=0.125 2024-06-21 20:23:57,914 INFO [train.py:1028] (1/2) Epoch 25, batch 3150, loss[loss=0.2064, simple_loss=0.253, pruned_loss=0.07988, over 12926.00 frames. ], tot_loss[loss=0.1827, simple_loss=0.2381, pruned_loss=0.06365, over 2580681.11 frames. ], batch size: 158, lr: 2.32e-03, grad_scale: 16.0 2024-06-21 20:24:07,158 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.158e+02 2.321e+02 2.501e+02 3.137e+02, threshold=4.641e+02, percent-clipped=0.0 2024-06-21 20:24:14,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=450967.0, ans=0.1 2024-06-21 20:24:20,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=450985.3333333333, ans=0.0 2024-06-21 20:24:24,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=451003.6666666667, ans=0.125 2024-06-21 20:24:30,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=451022.0, ans=0.125 2024-06-21 20:24:31,443 INFO [train.py:1028] (1/2) Epoch 25, batch 3200, loss[loss=0.1906, simple_loss=0.2508, pruned_loss=0.06519, over 13170.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2374, pruned_loss=0.06343, over 2581253.40 frames. ], batch size: 55, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:24:49,966 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.51 vs. limit=10.0 2024-06-21 20:24:53,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=451077.0, ans=0.2 2024-06-21 20:24:54,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=451077.0, ans=0.0 2024-06-21 20:25:02,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=451095.3333333333, ans=0.125 2024-06-21 20:25:06,688 INFO [train.py:1028] (1/2) Epoch 25, batch 3250, loss[loss=0.1848, simple_loss=0.2507, pruned_loss=0.05941, over 13245.00 frames. ], tot_loss[loss=0.1823, simple_loss=0.2376, pruned_loss=0.0635, over 2586058.53 frames. ], batch size: 72, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:25:16,488 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.170e+02 2.282e+02 2.473e+02 3.444e+02, threshold=4.564e+02, percent-clipped=0.0 2024-06-21 20:25:37,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451187.0, ans=0.1 2024-06-21 20:25:41,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=451187.0, ans=0.2 2024-06-21 20:25:43,787 INFO [train.py:1028] (1/2) Epoch 25, batch 3300, loss[loss=0.1896, simple_loss=0.2415, pruned_loss=0.06887, over 12773.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2374, pruned_loss=0.06333, over 2583467.51 frames. 
2024-06-21 20:25:48,021 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.38 vs. limit=15.0
2024-06-21 20:26:00,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=451242.0, ans=0.0
2024-06-21 20:26:06,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=451260.3333333333, ans=0.0
2024-06-21 20:26:06,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.29 vs. limit=22.5
2024-06-21 20:26:10,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=451278.6666666667, ans=0.125
2024-06-21 20:26:16,274 INFO [train.py:1028] (1/2) Epoch 25, batch 3350, loss[loss=0.1754, simple_loss=0.2232, pruned_loss=0.0638, over 12964.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2371, pruned_loss=0.06333, over 2579149.95 frames. ], batch size: 158, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:26:25,566 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.123e+02 2.235e+02 2.440e+02 3.127e+02, threshold=4.470e+02, percent-clipped=0.0
2024-06-21 20:26:41,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=451352.0, ans=0.0
2024-06-21 20:26:49,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=451370.3333333333, ans=0.125
2024-06-21 20:26:51,790 INFO [train.py:1028] (1/2) Epoch 25, batch 3400, loss[loss=0.1825, simple_loss=0.2429, pruned_loss=0.06105, over 12518.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2368, pruned_loss=0.06369, over 2576790.68 frames. ], batch size: 22, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:26:58,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=451407.0, ans=0.025
2024-06-21 20:26:58,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=451407.0, ans=0.2
2024-06-21 20:26:59,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=451407.0, ans=0.0
2024-06-21 20:27:08,174 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.20 vs. limit=15.0
2024-06-21 20:27:08,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.43 vs. limit=10.0
2024-06-21 20:27:08,828 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.03 vs. limit=22.5
2024-06-21 20:27:28,455 INFO [train.py:1028] (1/2) Epoch 25, batch 3450, loss[loss=0.1837, simple_loss=0.2401, pruned_loss=0.06368, over 12737.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.237, pruned_loss=0.06366, over 2577484.03 frames. ], batch size: 176, lr: 2.32e-03, grad_scale: 32.0
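The recurring `WARNING [optim.py:487] Clipping_scale=2.0, grad-norm quartiles ... threshold=...` entries summarize the optimizer's adaptive gradient clipping: it keeps a window of recent gradient norms, prints their five-number summary (min, Q1, median, Q3, max), and derives the clipping threshold from that history rather than from a fixed constant; note the logged thresholds sit at roughly `Clipping_scale` times the median (e.g. 2 x 2.235e+02 ~= 4.470e+02 just above). A toy version of the idea follows, assuming a simplified global-norm variant of what icefall's ScaledAdam does per parameter group.

```python
# Toy sketch of history-based gradient clipping (assumption: much simplified
# relative to icefall's optim.py). The threshold tracks clipping_scale times
# the median of recent grad norms, like the logged `threshold=...` values.
from collections import deque
import torch

class AdaptiveClipper:
    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.scale = clipping_scale
        self.norms = deque(maxlen=history)

    def clip_(self, params) -> float:
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * q[2].item()   # scale * median, as in the log
        if norm > threshold > 0:
            for g in grads:
                g.mul_(threshold / norm)       # rescale grads in place
        return threshold
```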
2024-06-21 20:27:33,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=451480.3333333333, ans=0.125
2024-06-21 20:27:37,860 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.157e+02 2.281e+02 2.461e+02 3.280e+02, threshold=4.561e+02, percent-clipped=0.0
2024-06-21 20:27:56,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=451553.6666666667, ans=0.125
2024-06-21 20:28:00,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=451572.0, ans=0.125
2024-06-21 20:28:01,148 INFO [train.py:1028] (1/2) Epoch 25, batch 3500, loss[loss=0.1633, simple_loss=0.2204, pruned_loss=0.05312, over 12875.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.236, pruned_loss=0.06332, over 2576814.13 frames. ], batch size: 33, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:28:07,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=451590.3333333333, ans=0.2
2024-06-21 20:28:14,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=451608.6666666667, ans=0.125
2024-06-21 20:28:26,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=451645.3333333333, ans=0.125
2024-06-21 20:28:30,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=451645.3333333333, ans=0.125
2024-06-21 20:28:33,782 INFO [train.py:1028] (1/2) Epoch 25, batch 3550, loss[loss=0.1651, simple_loss=0.2142, pruned_loss=0.05807, over 13086.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2353, pruned_loss=0.06294, over 2578138.01 frames. ], batch size: 95, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:28:42,283 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.101e+02 2.217e+02 2.330e+02 3.028e+02, threshold=4.434e+02, percent-clipped=0.0
2024-06-21 20:28:43,949 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.78 vs. limit=22.5
2024-06-21 20:28:51,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=451700.3333333333, ans=0.2
2024-06-21 20:28:55,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=451718.6666666667, ans=0.125
2024-06-21 20:29:01,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.30 vs. limit=15.0
2024-06-21 20:29:09,135 INFO [train.py:1028] (1/2) Epoch 25, batch 3600, loss[loss=0.1859, simple_loss=0.2435, pruned_loss=0.06412, over 13312.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.2342, pruned_loss=0.06238, over 2582297.60 frames. ], batch size: 49, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:29:16,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=451755.3333333333, ans=0.025
2024-06-21 20:29:25,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=451792.0, ans=0.125
2024-06-21 20:29:26,832 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.94 vs. limit=12.0
2024-06-21 20:29:29,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=451792.0, ans=0.0
2024-06-21 20:29:44,995 INFO [train.py:1028] (1/2) Epoch 25, batch 3650, loss[loss=0.1855, simple_loss=0.2368, pruned_loss=0.06713, over 13059.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2346, pruned_loss=0.06235, over 2579375.50 frames. ], batch size: 102, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:29:50,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451847.0, ans=0.1
2024-06-21 20:29:53,760 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.166e+02 2.291e+02 2.425e+02 3.154e+02, threshold=4.582e+02, percent-clipped=0.0
2024-06-21 20:29:58,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=451883.6666666667, ans=0.2
2024-06-21 20:30:00,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=451883.6666666667, ans=0.125
2024-06-21 20:30:04,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=451902.0, ans=0.125
2024-06-21 20:30:09,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=451902.0, ans=0.0
2024-06-21 20:30:17,830 INFO [train.py:1028] (1/2) Epoch 25, batch 3700, loss[loss=0.1711, simple_loss=0.2278, pruned_loss=0.05718, over 13168.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2338, pruned_loss=0.06218, over 2584852.11 frames. ], batch size: 72, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:30:24,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=451957.0, ans=0.1
2024-06-21 20:30:28,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=451957.0, ans=0.5
2024-06-21 20:30:30,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=451957.0, ans=0.125
2024-06-21 20:30:34,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.91 vs. limit=15.0
2024-06-21 20:30:51,457 INFO [train.py:1028] (1/2) Epoch 25, batch 3750, loss[loss=0.1698, simple_loss=0.2291, pruned_loss=0.05528, over 12517.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2339, pruned_loss=0.06218, over 2587484.81 frames. ], batch size: 22, lr: 2.32e-03, grad_scale: 32.0
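On the loss triplets in the `train.py:1028` lines: `simple_loss` is the cheap transducer loss from the simple (linear) joiner used to derive pruning bounds, `pruned_loss` is the exact RNN-T loss evaluated only inside the pruned lattice, and `loss` is their weighted sum; `loss[...]` reports the current batch and `tot_loss[...]` a running average over the epoch. The logged values are consistent with a weight of 0.5 on the simple term and full weight on the pruned term (presumably the recipe's simple-loss scale), which can be spot-checked directly:

```python
# Spot-check of loss = 0.5 * simple_loss + pruned_loss against the
# "batch 3750" line above (loss=0.1698, simple_loss=0.2291, pruned_loss=0.05528).
simple_loss, pruned_loss = 0.2291, 0.05528
print(0.5 * simple_loss + pruned_loss)  # 0.16983 ~= logged loss=0.1698
```

The same relation holds for the other batches in this stretch, e.g. batch 3400: 0.5 * 0.2429 + 0.06105 = 0.1825.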
2024-06-21 20:30:58,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=452030.3333333333, ans=0.125
2024-06-21 20:31:04,650 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.092e+02 2.221e+02 2.408e+02 3.356e+02, threshold=4.441e+02, percent-clipped=0.0
2024-06-21 20:31:04,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=452048.6666666667, ans=0.125
2024-06-21 20:31:07,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=452048.6666666667, ans=0.0
2024-06-21 20:31:08,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=452067.0, ans=0.0
2024-06-21 20:31:10,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452067.0, ans=0.1
2024-06-21 20:31:11,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=452067.0, ans=0.95
2024-06-21 20:31:15,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=452085.3333333333, ans=0.0
2024-06-21 20:31:29,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=452103.6666666667, ans=0.2
2024-06-21 20:31:31,738 INFO [train.py:1028] (1/2) Epoch 25, batch 3800, loss[loss=0.1758, simple_loss=0.2256, pruned_loss=0.06299, over 13176.00 frames. ], tot_loss[loss=0.1784, simple_loss=0.2333, pruned_loss=0.06169, over 2584734.55 frames. ], batch size: 83, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:31:36,160 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.66 vs. limit=5.0
2024-06-21 20:32:03,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452195.3333333333, ans=0.1
2024-06-21 20:32:05,221 INFO [train.py:1028] (1/2) Epoch 25, batch 3850, loss[loss=0.1802, simple_loss=0.228, pruned_loss=0.06617, over 13007.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2333, pruned_loss=0.0617, over 2585292.77 frames. ], batch size: 144, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:32:14,732 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.147e+02 2.332e+02 2.596e+02 3.213e+02, threshold=4.663e+02, percent-clipped=0.0
2024-06-21 20:32:24,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=452268.6666666667, ans=0.025
2024-06-21 20:32:38,118 INFO [train.py:1028] (1/2) Epoch 25, batch 3900, loss[loss=0.1777, simple_loss=0.225, pruned_loss=0.06519, over 13245.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2332, pruned_loss=0.06165, over 2587530.40 frames.
], batch size: 83, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:32:45,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=452323.6666666667, ans=0.125 2024-06-21 20:32:58,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452360.3333333333, ans=0.1 2024-06-21 20:33:12,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=452378.6666666667, ans=0.0 2024-06-21 20:33:14,681 INFO [train.py:1028] (1/2) Epoch 25, batch 3950, loss[loss=0.1803, simple_loss=0.2217, pruned_loss=0.06943, over 13094.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2322, pruned_loss=0.06099, over 2589335.24 frames. ], batch size: 132, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:33:24,072 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.120e+02 2.228e+02 2.476e+02 3.835e+02, threshold=4.456e+02, percent-clipped=0.0 2024-06-21 20:33:25,099 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2024-06-21 20:33:27,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=452415.3333333333, ans=0.0 2024-06-21 20:33:28,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452433.6666666667, ans=0.1 2024-06-21 20:33:32,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=452433.6666666667, ans=0.2 2024-06-21 20:33:33,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=452433.6666666667, ans=0.1 2024-06-21 20:33:36,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=452433.6666666667, ans=0.125 2024-06-21 20:33:49,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=452470.3333333333, ans=0.125 2024-06-21 20:33:51,786 INFO [train.py:1028] (1/2) Epoch 25, batch 4000, loss[loss=0.1821, simple_loss=0.2336, pruned_loss=0.06525, over 12963.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2322, pruned_loss=0.06137, over 2584444.92 frames. ], batch size: 39, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:33:51,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=452488.6666666667, ans=0.2 2024-06-21 20:33:54,982 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:33:55,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=452488.6666666667, ans=0.0 2024-06-21 20:33:57,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.15 vs. 
limit=12.0 2024-06-21 20:34:09,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=452525.3333333333, ans=0.0 2024-06-21 20:34:12,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=452543.6666666667, ans=0.125 2024-06-21 20:34:13,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=452543.6666666667, ans=0.2 2024-06-21 20:34:13,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=452543.6666666667, ans=0.125 2024-06-21 20:34:19,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=452562.0, ans=0.0 2024-06-21 20:34:25,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=452580.3333333333, ans=0.125 2024-06-21 20:34:25,449 INFO [train.py:1028] (1/2) Epoch 25, batch 4050, loss[loss=0.1996, simple_loss=0.2372, pruned_loss=0.08096, over 11093.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2317, pruned_loss=0.06141, over 2581622.96 frames. ], batch size: 303, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:34:26,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=452580.3333333333, ans=0.2 2024-06-21 20:34:30,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2024-06-21 20:34:30,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452580.3333333333, ans=0.1 2024-06-21 20:34:33,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=452598.6666666667, ans=0.0 2024-06-21 20:34:35,007 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.122e+02 2.253e+02 2.490e+02 3.072e+02, threshold=4.506e+02, percent-clipped=0.0 2024-06-21 20:34:49,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.67 vs. limit=15.0 2024-06-21 20:34:52,370 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.28 vs. limit=12.0 2024-06-21 20:34:58,982 INFO [train.py:1028] (1/2) Epoch 25, batch 4100, loss[loss=0.1816, simple_loss=0.2249, pruned_loss=0.06916, over 12995.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2318, pruned_loss=0.06137, over 2578973.17 frames. 
], batch size: 102, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:35:11,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=452690.3333333333, ans=0.0 2024-06-21 20:35:17,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452708.6666666667, ans=0.1 2024-06-21 20:35:27,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=452727.0, ans=0.125 2024-06-21 20:35:34,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=452745.3333333333, ans=0.125 2024-06-21 20:35:40,909 INFO [train.py:1028] (1/2) Epoch 25, batch 4150, loss[loss=0.1838, simple_loss=0.241, pruned_loss=0.06333, over 13218.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.2316, pruned_loss=0.06114, over 2576472.03 frames. ], batch size: 55, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:35:50,436 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.204e+02 2.371e+02 2.503e+02 3.689e+02, threshold=4.742e+02, percent-clipped=0.0 2024-06-21 20:35:56,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=452800.3333333333, ans=0.0 2024-06-21 20:36:01,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=452818.6666666667, ans=0.1 2024-06-21 20:36:06,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=452818.6666666667, ans=0.0 2024-06-21 20:36:11,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=452837.0, ans=0.09899494936611666 2024-06-21 20:36:14,651 INFO [train.py:1028] (1/2) Epoch 25, batch 4200, loss[loss=0.1903, simple_loss=0.2312, pruned_loss=0.07474, over 13200.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2307, pruned_loss=0.06106, over 2579724.45 frames. ], batch size: 103, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:36:23,110 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.03 vs. limit=12.0 2024-06-21 20:36:27,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=452892.0, ans=0.0 2024-06-21 20:36:28,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2024-06-21 20:36:34,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=452910.3333333333, ans=0.125 2024-06-21 20:36:36,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=452910.3333333333, ans=0.125 2024-06-21 20:36:42,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=452928.6666666667, ans=0.1 2024-06-21 20:36:43,110 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. 
limit=15.0 2024-06-21 20:36:48,102 INFO [train.py:1028] (1/2) Epoch 25, batch 4250, loss[loss=0.1686, simple_loss=0.2321, pruned_loss=0.05253, over 13302.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2306, pruned_loss=0.06106, over 2582045.87 frames. ], batch size: 46, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:36:48,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=452947.0, ans=0.125 2024-06-21 20:36:49,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452947.0, ans=0.1 2024-06-21 20:36:51,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=452947.0, ans=0.05 2024-06-21 20:36:53,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=452947.0, ans=0.125 2024-06-21 20:36:57,489 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.069e+02 2.210e+02 2.379e+02 4.416e+02, threshold=4.421e+02, percent-clipped=0.0 2024-06-21 20:37:04,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=452983.6666666667, ans=0.1 2024-06-21 20:37:05,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=452983.6666666667, ans=0.0 2024-06-21 20:37:20,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=453020.3333333333, ans=0.125 2024-06-21 20:37:21,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=453020.3333333333, ans=0.125 2024-06-21 20:37:24,415 INFO [train.py:1028] (1/2) Epoch 25, batch 4300, loss[loss=0.1811, simple_loss=0.2337, pruned_loss=0.06424, over 13224.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2305, pruned_loss=0.06108, over 2581736.58 frames. ], batch size: 59, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:37:39,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=453057.0, ans=0.125 2024-06-21 20:37:39,450 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:37:39,694 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2024-06-21 20:37:43,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=453075.3333333333, ans=0.0 2024-06-21 20:37:55,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=453112.0, ans=0.2 2024-06-21 20:37:59,938 INFO [train.py:1028] (1/2) Epoch 25, batch 4350, loss[loss=0.1939, simple_loss=0.2409, pruned_loss=0.07342, over 13193.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2304, pruned_loss=0.06114, over 2586261.94 frames. 
], batch size: 59, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:38:07,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=453148.6666666667, ans=0.0 2024-06-21 20:38:08,740 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.121e+02 2.264e+02 2.544e+02 3.532e+02, threshold=4.528e+02, percent-clipped=0.0 2024-06-21 20:38:14,624 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=453167.0, ans=0.0 2024-06-21 20:38:19,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=453185.3333333333, ans=0.0 2024-06-21 20:38:21,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453185.3333333333, ans=0.1 2024-06-21 20:38:24,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=453185.3333333333, ans=0.125 2024-06-21 20:38:29,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=453203.6666666667, ans=0.0 2024-06-21 20:38:30,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=453203.6666666667, ans=0.125 2024-06-21 20:38:32,034 INFO [train.py:1028] (1/2) Epoch 25, batch 4400, loss[loss=0.1775, simple_loss=0.2286, pruned_loss=0.06317, over 13226.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2302, pruned_loss=0.06124, over 2585420.63 frames. ], batch size: 83, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:38:32,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=453222.0, ans=0.125 2024-06-21 20:38:32,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=453222.0, ans=0.0 2024-06-21 20:38:33,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=453222.0, ans=0.125 2024-06-21 20:38:36,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453222.0, ans=0.1 2024-06-21 20:38:38,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=12.0 2024-06-21 20:38:38,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.06 vs. limit=15.0 2024-06-21 20:38:44,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=453240.3333333333, ans=0.125 2024-06-21 20:39:00,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=453295.3333333333, ans=0.1 2024-06-21 20:39:05,218 INFO [train.py:1028] (1/2) Epoch 25, batch 4450, loss[loss=0.1823, simple_loss=0.2326, pruned_loss=0.06604, over 12854.00 frames. ], tot_loss[loss=0.1766, simple_loss=0.2302, pruned_loss=0.06148, over 2580188.31 frames. 
], batch size: 33, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:39:12,122 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:39:12,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2024-06-21 20:39:17,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=453332.0, ans=0.0 2024-06-21 20:39:18,435 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.048e+02 2.153e+02 2.317e+02 3.246e+02, threshold=4.305e+02, percent-clipped=0.0 2024-06-21 20:39:26,483 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.01 vs. limit=15.0 2024-06-21 20:39:28,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=453368.6666666667, ans=0.125 2024-06-21 20:39:31,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453368.6666666667, ans=0.1 2024-06-21 20:39:40,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2024-06-21 20:39:44,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=453405.3333333333, ans=0.125 2024-06-21 20:39:45,126 INFO [train.py:1028] (1/2) Epoch 25, batch 4500, loss[loss=0.1817, simple_loss=0.2304, pruned_loss=0.06647, over 13226.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.229, pruned_loss=0.06107, over 2584973.49 frames. ], batch size: 89, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:39:57,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.14 vs. limit=15.0 2024-06-21 20:40:01,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=453442.0, ans=0.0 2024-06-21 20:40:05,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=453460.3333333333, ans=0.0 2024-06-21 20:40:09,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=453460.3333333333, ans=0.025 2024-06-21 20:40:10,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=453478.6666666667, ans=0.07 2024-06-21 20:40:13,648 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.99 vs. limit=8.0 2024-06-21 20:40:18,187 INFO [train.py:1028] (1/2) Epoch 25, batch 4550, loss[loss=0.1743, simple_loss=0.2308, pruned_loss=0.05893, over 13242.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.229, pruned_loss=0.061, over 2588047.93 frames. 
], batch size: 52, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:40:27,640 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.141e+02 2.299e+02 2.529e+02 3.471e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-21 20:40:33,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453533.6666666667, ans=0.1 2024-06-21 20:40:35,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0 2024-06-21 20:40:42,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=453552.0, ans=0.125 2024-06-21 20:40:51,879 INFO [train.py:1028] (1/2) Epoch 25, batch 4600, loss[loss=0.1885, simple_loss=0.2363, pruned_loss=0.07028, over 12525.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.2289, pruned_loss=0.06094, over 2583943.25 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:40:52,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=453588.6666666667, ans=0.0 2024-06-21 20:40:53,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=453588.6666666667, ans=0.2 2024-06-21 20:41:07,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.94 vs. limit=10.0 2024-06-21 20:41:29,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=453662.0, ans=0.125 2024-06-21 20:41:31,495 INFO [train.py:1028] (1/2) Epoch 25, batch 4650, loss[loss=0.1708, simple_loss=0.2212, pruned_loss=0.06023, over 13101.00 frames. ], tot_loss[loss=0.1747, simple_loss=0.2281, pruned_loss=0.06061, over 2586801.32 frames. ], batch size: 132, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:41:33,342 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2024-06-21 20:41:33,983 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.21 vs. limit=15.0 2024-06-21 20:41:40,799 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.102e+02 2.240e+02 2.514e+02 3.063e+02, threshold=4.479e+02, percent-clipped=0.0 2024-06-21 20:41:41,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453698.6666666667, ans=0.1 2024-06-21 20:41:43,177 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.99 vs. limit=10.0 2024-06-21 20:41:44,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=453717.0, ans=0.0 2024-06-21 20:41:50,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=453735.3333333333, ans=0.125 2024-06-21 20:42:04,962 INFO [train.py:1028] (1/2) Epoch 25, batch 4700, loss[loss=0.174, simple_loss=0.2366, pruned_loss=0.05569, over 12455.00 frames. 
], tot_loss[loss=0.176, simple_loss=0.2292, pruned_loss=0.06142, over 2582883.28 frames. ], batch size: 25, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:42:08,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=453772.0, ans=0.125 2024-06-21 20:42:18,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=453808.6666666667, ans=0.125 2024-06-21 20:42:31,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=453845.3333333333, ans=0.125 2024-06-21 20:42:33,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=453845.3333333333, ans=0.0 2024-06-21 20:42:33,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=453845.3333333333, ans=0.2 2024-06-21 20:42:38,058 INFO [train.py:1028] (1/2) Epoch 25, batch 4750, loss[loss=0.1876, simple_loss=0.2382, pruned_loss=0.06845, over 12516.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2292, pruned_loss=0.06143, over 2579745.26 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:42:38,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=453863.6666666667, ans=0.2 2024-06-21 20:42:42,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=453863.6666666667, ans=0.0 2024-06-21 20:42:47,464 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.137e+02 2.275e+02 2.431e+02 3.911e+02, threshold=4.549e+02, percent-clipped=0.0 2024-06-21 20:43:06,610 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.37 vs. limit=15.0 2024-06-21 20:43:15,057 INFO [train.py:1028] (1/2) Epoch 25, batch 4800, loss[loss=0.1824, simple_loss=0.2397, pruned_loss=0.06254, over 13230.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2292, pruned_loss=0.06123, over 2575774.85 frames. ], batch size: 63, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:43:40,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=454010.3333333333, ans=0.125 2024-06-21 20:43:41,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=454010.3333333333, ans=0.1 2024-06-21 20:43:44,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=454028.6666666667, ans=0.125 2024-06-21 20:43:46,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=454028.6666666667, ans=0.1 2024-06-21 20:43:51,128 INFO [train.py:1028] (1/2) Epoch 25, batch 4850, loss[loss=0.1693, simple_loss=0.2198, pruned_loss=0.05943, over 13253.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2282, pruned_loss=0.06078, over 2573435.76 frames. 
], batch size: 89, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:44:00,626 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.130e+02 2.237e+02 2.416e+02 3.039e+02, threshold=4.474e+02, percent-clipped=0.0
2024-06-21 20:44:03,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=454065.3333333333, ans=0.125
2024-06-21 20:44:11,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=454102.0, ans=0.125
2024-06-21 20:44:12,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=454102.0, ans=0.125
2024-06-21 20:44:14,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=15.0
2024-06-21 20:44:15,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=454102.0, ans=0.125
2024-06-21 20:44:22,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=454120.3333333333, ans=0.125
2024-06-21 20:44:23,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=454120.3333333333, ans=0.2
2024-06-21 20:44:25,084 INFO [train.py:1028] (1/2) Epoch 25, batch 4900, loss[loss=0.1765, simple_loss=0.2322, pruned_loss=0.06038, over 13150.00 frames. ], tot_loss[loss=0.175, simple_loss=0.2281, pruned_loss=0.06098, over 2573679.87 frames. ], batch size: 59, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:44:26,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=454138.6666666667, ans=0.0
2024-06-21 20:44:40,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=454175.3333333333, ans=0.125
2024-06-21 20:44:44,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0
2024-06-21 20:44:57,916 INFO [train.py:1028] (1/2) Epoch 25, batch 4950, loss[loss=0.1992, simple_loss=0.2328, pruned_loss=0.08281, over 11133.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2284, pruned_loss=0.06141, over 2568084.90 frames. ], batch size: 304, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:45:00,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=454230.3333333333, ans=0.0
2024-06-21 20:45:07,046 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.062e+02 2.224e+02 2.350e+02 3.118e+02, threshold=4.448e+02, percent-clipped=0.0
2024-06-21 20:45:17,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=454267.0, ans=0.1
2024-06-21 20:45:18,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=454267.0, ans=0.125
2024-06-21 20:45:32,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=454303.6666666667, ans=0.025
2024-06-21 20:45:33,695 INFO [train.py:1028] (1/2) Epoch 25, batch 5000, loss[loss=0.1752, simple_loss=0.2215, pruned_loss=0.06449, over 13138.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.2287, pruned_loss=0.06137, over 2573243.62 frames. ], batch size: 95, lr: 2.31e-03, grad_scale: 32.0
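The `batch size` field swings widely in this part of the log (from 16 up to 304) while the per-batch frame counts stay in a similar range, which is the signature of duration-capped dynamic bucketing: each batch packs utterances up to a fixed number of seconds, so a batch of short cuts holds many utterances and a batch of long cuts only a few. A sketch of that sampler setup with lhotse follows; the cut-manifest filename is hypothetical, and the options shown are trimmed to the ones visible in this run.

```python
# Sketch of duration-capped batching with lhotse's DynamicBucketingSampler
# (assumption: the manifest path below is illustrative, not this run's file).
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/ksponspeech_cuts_train.jsonl.gz")
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=550.0,  # seconds of audio per batch, so batch *size* varies
    num_buckets=30,      # group cuts of similar length to limit padding
    shuffle=True,
    drop_last=True,
)
for batch_cuts in sampler:
    # short cuts -> batches of a few hundred; long cuts -> batches of ~20
    pass
```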
2024-06-21 20:45:44,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.20 vs. limit=15.0
2024-06-21 20:45:48,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=454340.3333333333, ans=0.0
2024-06-21 20:45:50,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=454358.6666666667, ans=0.025
2024-06-21 20:45:51,772 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.63 vs. limit=15.0
2024-06-21 20:45:53,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.14 vs. limit=15.0
2024-06-21 20:45:54,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=454358.6666666667, ans=0.0
2024-06-21 20:45:56,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=454358.6666666667, ans=0.2
2024-06-21 20:45:58,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=454377.0, ans=0.2
2024-06-21 20:46:01,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=454377.0, ans=0.09899494936611666
2024-06-21 20:46:09,694 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:46:10,300 INFO [train.py:1028] (1/2) Epoch 25, batch 5050, loss[loss=0.1675, simple_loss=0.2227, pruned_loss=0.05617, over 12864.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.2291, pruned_loss=0.06119, over 2571458.70 frames. ], batch size: 36, lr: 2.31e-03, grad_scale: 64.0
2024-06-21 20:46:15,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=454413.6666666667, ans=0.125
2024-06-21 20:46:19,495 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.103e+02 2.227e+02 2.382e+02 3.210e+02, threshold=4.455e+02, percent-clipped=0.0
2024-06-21 20:46:33,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=454468.6666666667, ans=0.0
2024-06-21 20:46:43,217 INFO [train.py:1028] (1/2) Epoch 25, batch 5100, loss[loss=0.1953, simple_loss=0.2519, pruned_loss=0.06934, over 13035.00 frames. ], tot_loss[loss=0.176, simple_loss=0.229, pruned_loss=0.0615, over 2567385.58 frames. ], batch size: 39, lr: 2.31e-03, grad_scale: 64.0
2024-06-21 20:46:43,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0
2024-06-21 20:46:45,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0
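`grad_scale` in the loss lines is the dynamic loss scale of mixed-precision training: it doubles after a long streak of overflow-free steps (32.0 to 64.0 around batch 5050 above) and is halved whenever inf/nan gradients force a skipped step (back to 32.0 by batch 5200 below). The standard PyTorch pattern that produces this behavior is shown as a self-contained sketch; the tiny model and the GradScaler arguments are illustrative, not this recipe's values.

```python
# Standard torch.cuda.amp loop producing a dynamic grad_scale like the
# 32.0 -> 64.0 -> 32.0 movement seen in the surrounding log lines.
import torch

model = torch.nn.Linear(80, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 80, device="cuda")).sum()
    scaler.scale(loss).backward()  # backprop fp16-safe scaled gradients
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # grows the scale after a clean streak, halves on overflow
    print(scaler.get_scale())      # the number train.py logs as grad_scale
```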
2024-06-21 20:46:48,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=454505.3333333333, ans=0.125
2024-06-21 20:46:48,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=454505.3333333333, ans=0.125
2024-06-21 20:46:54,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=454523.6666666667, ans=0.125
2024-06-21 20:46:55,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=454523.6666666667, ans=0.125
2024-06-21 20:46:58,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=454542.0, ans=0.07
2024-06-21 20:47:10,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=454560.3333333333, ans=0.1
2024-06-21 20:47:13,092 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:47:22,223 INFO [train.py:1028] (1/2) Epoch 25, batch 5150, loss[loss=0.1656, simple_loss=0.211, pruned_loss=0.0601, over 13143.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2288, pruned_loss=0.06158, over 2570005.54 frames. ], batch size: 132, lr: 2.31e-03, grad_scale: 64.0
2024-06-21 20:47:31,389 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.139e+02 2.274e+02 2.485e+02 3.522e+02, threshold=4.548e+02, percent-clipped=0.0
2024-06-21 20:47:37,475 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0
2024-06-21 20:47:53,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=454670.3333333333, ans=0.1
2024-06-21 20:47:53,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=454670.3333333333, ans=0.2
2024-06-21 20:47:59,320 INFO [train.py:1028] (1/2) Epoch 25, batch 5200, loss[loss=0.1755, simple_loss=0.2194, pruned_loss=0.06576, over 13180.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2286, pruned_loss=0.06165, over 2573582.23 frames. ], batch size: 95, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:48:00,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=454688.6666666667, ans=0.2
2024-06-21 20:48:03,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.42 vs. limit=15.0
2024-06-21 20:48:07,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=454707.0, ans=0.0
2024-06-21 20:48:17,236 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs.
limit=15.0 2024-06-21 20:48:23,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=454743.6666666667, ans=0.2 2024-06-21 20:48:28,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=454762.0, ans=0.1 2024-06-21 20:48:30,587 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=5.238e-03 2024-06-21 20:48:31,865 INFO [train.py:1028] (1/2) Epoch 25, batch 5250, loss[loss=0.1647, simple_loss=0.2236, pruned_loss=0.05289, over 13246.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.2286, pruned_loss=0.06142, over 2569074.79 frames. ], batch size: 52, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:48:37,142 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.67 vs. limit=15.0 2024-06-21 20:48:41,236 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.140e+02 2.307e+02 2.518e+02 3.065e+02, threshold=4.614e+02, percent-clipped=0.0 2024-06-21 20:48:43,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=454798.6666666667, ans=0.125 2024-06-21 20:48:43,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=454817.0, ans=0.125 2024-06-21 20:48:44,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=454817.0, ans=0.2 2024-06-21 20:48:48,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=454817.0, ans=0.2 2024-06-21 20:48:49,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=454817.0, ans=0.2 2024-06-21 20:49:03,940 INFO [train.py:1028] (1/2) Epoch 25, batch 5300, loss[loss=0.1652, simple_loss=0.2166, pruned_loss=0.05689, over 13072.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.229, pruned_loss=0.0616, over 2566877.73 frames. ], batch size: 144, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:49:04,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.51 vs. limit=22.5 2024-06-21 20:49:13,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=454890.3333333333, ans=0.0 2024-06-21 20:49:14,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=454890.3333333333, ans=0.1 2024-06-21 20:49:38,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=454945.3333333333, ans=0.2 2024-06-21 20:49:42,724 INFO [train.py:1028] (1/2) Epoch 25, batch 5350, loss[loss=0.1903, simple_loss=0.2544, pruned_loss=0.06306, over 11284.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2287, pruned_loss=0.06146, over 2573196.71 frames. ], batch size: 16, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:49:43,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.70 vs. 
limit=15.0 2024-06-21 20:49:48,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=454982.0, ans=0.125 2024-06-21 20:49:52,083 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.057e+02 2.158e+02 2.308e+02 3.442e+02, threshold=4.315e+02, percent-clipped=0.0 2024-06-21 20:49:57,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.50 vs. limit=15.0 2024-06-21 20:49:59,819 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:50:03,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=455018.6666666667, ans=0.125 2024-06-21 20:50:14,227 INFO [train.py:1028] (1/2) Epoch 25, batch 5400, loss[loss=0.1866, simple_loss=0.2347, pruned_loss=0.06923, over 12232.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2287, pruned_loss=0.06161, over 2566332.35 frames. ], batch size: 240, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:50:15,016 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=455055.3333333333, ans=0.125 2024-06-21 20:50:17,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=455055.3333333333, ans=0.025 2024-06-21 20:50:30,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=455092.0, ans=0.5 2024-06-21 20:50:36,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=455110.3333333333, ans=0.0 2024-06-21 20:50:41,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=455128.6666666667, ans=0.2 2024-06-21 20:50:44,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=455128.6666666667, ans=0.125 2024-06-21 20:50:46,798 INFO [train.py:1028] (1/2) Epoch 25, batch 5450, loss[loss=0.1838, simple_loss=0.2383, pruned_loss=0.06462, over 12505.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2286, pruned_loss=0.06133, over 2569785.54 frames. ], batch size: 25, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:50:48,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=455147.0, ans=0.125 2024-06-21 20:50:50,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=455147.0, ans=0.2 2024-06-21 20:50:51,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=15.0 2024-06-21 20:50:56,523 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.115e+02 2.252e+02 2.391e+02 3.560e+02, threshold=4.505e+02, percent-clipped=0.0 2024-06-21 20:50:59,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=455183.6666666667, ans=0.125 2024-06-21 20:51:05,676 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.58 vs. 
limit=15.0 2024-06-21 20:51:20,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=455220.3333333333, ans=0.125 2024-06-21 20:51:24,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455220.3333333333, ans=0.1 2024-06-21 20:51:25,899 INFO [train.py:1028] (1/2) Epoch 25, batch 5500, loss[loss=0.2047, simple_loss=0.2437, pruned_loss=0.08286, over 12288.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2285, pruned_loss=0.06126, over 2564110.87 frames. ], batch size: 241, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:51:33,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=455257.0, ans=0.125 2024-06-21 20:51:36,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=455257.0, ans=0.125 2024-06-21 20:51:41,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=455275.3333333333, ans=0.2 2024-06-21 20:51:41,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=455275.3333333333, ans=0.125 2024-06-21 20:51:45,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.78 vs. limit=15.0 2024-06-21 20:51:58,606 INFO [train.py:1028] (1/2) Epoch 25, batch 5550, loss[loss=0.1771, simple_loss=0.2299, pruned_loss=0.06216, over 13278.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.228, pruned_loss=0.06077, over 2568196.55 frames. ], batch size: 43, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:52:02,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=455330.3333333333, ans=0.125 2024-06-21 20:52:06,465 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:52:08,388 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 2.083e+02 2.224e+02 2.415e+02 3.176e+02, threshold=4.448e+02, percent-clipped=0.0 2024-06-21 20:52:30,742 INFO [train.py:1028] (1/2) Epoch 25, batch 5600, loss[loss=0.1781, simple_loss=0.2306, pruned_loss=0.0628, over 13298.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2274, pruned_loss=0.06051, over 2570773.80 frames. ], batch size: 89, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:52:31,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. 
limit=6.0 2024-06-21 20:52:36,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455422.0, ans=0.1 2024-06-21 20:52:39,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=455440.3333333333, ans=0.0 2024-06-21 20:52:48,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=455458.6666666667, ans=0.125 2024-06-21 20:52:55,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=455477.0, ans=0.125 2024-06-21 20:52:57,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=455495.3333333333, ans=0.125 2024-06-21 20:53:07,400 INFO [train.py:1028] (1/2) Epoch 25, batch 5650, loss[loss=0.2403, simple_loss=0.2702, pruned_loss=0.1052, over 12523.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2277, pruned_loss=0.06047, over 2575517.97 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:53:07,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=455513.6666666667, ans=0.0 2024-06-21 20:53:09,860 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.86 vs. limit=22.5 2024-06-21 20:53:18,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=455532.0, ans=0.2 2024-06-21 20:53:20,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=455532.0, ans=0.125 2024-06-21 20:53:20,517 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.089e+02 2.205e+02 2.368e+02 2.955e+02, threshold=4.409e+02, percent-clipped=0.0 2024-06-21 20:53:20,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=455532.0, ans=0.0 2024-06-21 20:53:25,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=455550.3333333333, ans=0.0 2024-06-21 20:53:29,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=455550.3333333333, ans=0.2 2024-06-21 20:53:29,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=455568.6666666667, ans=0.0 2024-06-21 20:53:30,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=455568.6666666667, ans=0.0 2024-06-21 20:53:37,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455587.0, ans=0.1 2024-06-21 20:53:40,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.07 vs. limit=15.0 2024-06-21 20:53:43,738 INFO [train.py:1028] (1/2) Epoch 25, batch 5700, loss[loss=0.1636, simple_loss=0.2209, pruned_loss=0.0532, over 13290.00 frames. ], tot_loss[loss=0.1739, simple_loss=0.2271, pruned_loss=0.0603, over 2578990.46 frames. 
], batch size: 63, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:53:45,377 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2024-06-21 20:53:54,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=455623.6666666667, ans=0.125 2024-06-21 20:54:14,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=455678.6666666667, ans=0.0 2024-06-21 20:54:14,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=455678.6666666667, ans=0.125 2024-06-21 20:54:15,654 INFO [train.py:1028] (1/2) Epoch 25, batch 5750, loss[loss=0.1988, simple_loss=0.2408, pruned_loss=0.07839, over 12772.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.2282, pruned_loss=0.06068, over 2579007.55 frames. ], batch size: 176, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:54:22,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=455715.3333333333, ans=0.1 2024-06-21 20:54:25,603 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.042e+02 2.208e+02 2.378e+02 3.358e+02, threshold=4.415e+02, percent-clipped=0.0 2024-06-21 20:54:38,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.92 vs. limit=22.5 2024-06-21 20:54:45,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=455770.3333333333, ans=0.09899494936611666 2024-06-21 20:54:46,674 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.95 vs. limit=10.0 2024-06-21 20:54:48,745 INFO [train.py:1028] (1/2) Epoch 25, batch 5800, loss[loss=0.1888, simple_loss=0.2382, pruned_loss=0.06974, over 12805.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.229, pruned_loss=0.06131, over 2579046.57 frames. ], batch size: 176, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:55:17,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=455843.6666666667, ans=0.125 2024-06-21 20:55:19,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.86 vs. limit=10.0 2024-06-21 20:55:26,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=455862.0, ans=0.0 2024-06-21 20:55:28,100 INFO [train.py:1028] (1/2) Epoch 25, batch 5850, loss[loss=0.1968, simple_loss=0.246, pruned_loss=0.07385, over 12512.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2304, pruned_loss=0.06179, over 2576452.76 frames. 
], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:55:37,899 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.106e+02 2.254e+02 2.391e+02 2.959e+02, threshold=4.507e+02, percent-clipped=0.0 2024-06-21 20:55:39,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=455898.6666666667, ans=0.1 2024-06-21 20:55:41,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=455917.0, ans=0.125 2024-06-21 20:55:41,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=455917.0, ans=0.125 2024-06-21 20:55:44,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=455917.0, ans=0.125 2024-06-21 20:55:47,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.44 vs. limit=15.0 2024-06-21 20:55:51,880 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.49 vs. limit=22.5 2024-06-21 20:55:54,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=455953.6666666667, ans=0.125 2024-06-21 20:56:01,449 INFO [train.py:1028] (1/2) Epoch 25, batch 5900, loss[loss=0.1763, simple_loss=0.2285, pruned_loss=0.06204, over 13074.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2319, pruned_loss=0.06233, over 2576216.24 frames. ], batch size: 121, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:56:02,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=455972.0, ans=0.0 2024-06-21 20:56:04,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=455972.0, ans=0.2 2024-06-21 20:56:14,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=456008.6666666667, ans=0.0 2024-06-21 20:56:21,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=456027.0, ans=0.1 2024-06-21 20:56:21,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=456027.0, ans=0.0 2024-06-21 20:56:29,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.36 vs. limit=15.0 2024-06-21 20:56:31,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=456045.3333333333, ans=0.0 2024-06-21 20:56:34,403 INFO [train.py:1028] (1/2) Epoch 25, batch 5950, loss[loss=0.1734, simple_loss=0.2279, pruned_loss=0.05946, over 13145.00 frames. ], tot_loss[loss=0.1803, simple_loss=0.234, pruned_loss=0.06328, over 2581360.52 frames. 
], batch size: 121, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:56:44,468 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.212e+02 2.411e+02 2.606e+02 3.285e+02, threshold=4.821e+02, percent-clipped=0.0 2024-06-21 20:56:46,134 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.38 vs. limit=15.0 2024-06-21 20:56:55,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2024-06-21 20:57:01,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=456118.6666666667, ans=0.125 2024-06-21 20:57:10,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=456137.0, ans=0.125 2024-06-21 20:57:11,847 INFO [train.py:1028] (1/2) Epoch 25, batch 6000, loss[loss=0.233, simple_loss=0.2691, pruned_loss=0.09845, over 12134.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2356, pruned_loss=0.06379, over 2573444.20 frames. ], batch size: 240, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:57:11,848 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 20:57:20,654 INFO [train.py:1060] (1/2) Epoch 25, validation: loss=0.1898, simple_loss=0.2515, pruned_loss=0.06411, over 351949.00 frames. 2024-06-21 20:57:20,654 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 20:57:36,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=456192.0, ans=0.125 2024-06-21 20:57:45,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.76 vs. limit=10.0 2024-06-21 20:57:48,414 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.18 vs. limit=15.0 2024-06-21 20:57:52,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=456228.6666666667, ans=0.125 2024-06-21 20:57:54,445 INFO [train.py:1028] (1/2) Epoch 25, batch 6050, loss[loss=0.181, simple_loss=0.2336, pruned_loss=0.0642, over 12901.00 frames. ], tot_loss[loss=0.1826, simple_loss=0.2369, pruned_loss=0.06421, over 2576409.20 frames. ], batch size: 39, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:58:03,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=456265.3333333333, ans=0.0 2024-06-21 20:58:04,171 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.179e+02 2.294e+02 2.441e+02 3.429e+02, threshold=4.587e+02, percent-clipped=0.0 2024-06-21 20:58:19,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=456302.0, ans=15.0 2024-06-21 20:58:19,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. 
limit=15.0 2024-06-21 20:58:23,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=456320.3333333333, ans=15.0 2024-06-21 20:58:26,976 INFO [train.py:1028] (1/2) Epoch 25, batch 6100, loss[loss=0.1699, simple_loss=0.2258, pruned_loss=0.05705, over 13092.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2378, pruned_loss=0.06444, over 2579371.62 frames. ], batch size: 121, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:58:35,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=456357.0, ans=0.125 2024-06-21 20:58:42,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=456375.3333333333, ans=0.125 2024-06-21 20:58:45,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=456375.3333333333, ans=0.125 2024-06-21 20:58:52,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456393.6666666667, ans=0.1 2024-06-21 20:59:00,279 INFO [train.py:1028] (1/2) Epoch 25, batch 6150, loss[loss=0.187, simple_loss=0.2357, pruned_loss=0.06918, over 11040.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2395, pruned_loss=0.06491, over 2577445.51 frames. ], batch size: 304, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:59:08,305 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2024-06-21 20:59:10,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=456448.6666666667, ans=0.015 2024-06-21 20:59:17,777 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.216e+02 2.434e+02 2.756e+02 3.812e+02, threshold=4.867e+02, percent-clipped=0.0 2024-06-21 20:59:20,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=456467.0, ans=0.125 2024-06-21 20:59:26,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456485.3333333333, ans=0.1 2024-06-21 20:59:41,391 INFO [train.py:1028] (1/2) Epoch 25, batch 6200, loss[loss=0.2189, simple_loss=0.2772, pruned_loss=0.0803, over 13231.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2406, pruned_loss=0.06498, over 2574461.76 frames. ], batch size: 89, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:59:41,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.11 vs. 
limit=15.0 2024-06-21 20:59:42,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=456522.0, ans=0.025 2024-06-21 20:59:46,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=456522.0, ans=0.125 2024-06-21 20:59:58,039 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=456558.6666666667, ans=0.0 2024-06-21 20:59:59,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=456558.6666666667, ans=0.125 2024-06-21 21:00:01,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=456577.0, ans=0.125 2024-06-21 21:00:11,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=456595.3333333333, ans=0.0 2024-06-21 21:00:15,675 INFO [train.py:1028] (1/2) Epoch 25, batch 6250, loss[loss=0.1892, simple_loss=0.2508, pruned_loss=0.06378, over 13175.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2415, pruned_loss=0.06543, over 2567293.64 frames. ], batch size: 83, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:00:23,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=456632.0, ans=0.2 2024-06-21 21:00:26,241 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.281e+02 2.459e+02 2.826e+02 4.417e+02, threshold=4.918e+02, percent-clipped=0.0 2024-06-21 21:00:41,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.25 vs. limit=22.5 2024-06-21 21:00:49,383 INFO [train.py:1028] (1/2) Epoch 25, batch 6300, loss[loss=0.189, simple_loss=0.2452, pruned_loss=0.06635, over 11592.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2424, pruned_loss=0.06551, over 2563294.18 frames. ], batch size: 17, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:00:49,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=456705.3333333333, ans=0.0 2024-06-21 21:00:52,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=456705.3333333333, ans=0.0 2024-06-21 21:00:53,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=456705.3333333333, ans=0.1 2024-06-21 21:01:03,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456742.0, ans=0.1 2024-06-21 21:01:26,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=456778.6666666667, ans=0.0 2024-06-21 21:01:30,609 INFO [train.py:1028] (1/2) Epoch 25, batch 6350, loss[loss=0.215, simple_loss=0.2662, pruned_loss=0.08191, over 12541.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2438, pruned_loss=0.0657, over 2573210.24 frames. 
], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:01:39,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=456815.3333333333, ans=0.125 2024-06-21 21:01:40,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=456815.3333333333, ans=0.125 2024-06-21 21:01:40,988 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.217e+02 2.393e+02 2.606e+02 3.577e+02, threshold=4.786e+02, percent-clipped=0.0 2024-06-21 21:01:44,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2024-06-21 21:01:54,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=456852.0, ans=0.125 2024-06-21 21:01:55,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=456852.0, ans=0.1 2024-06-21 21:02:01,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=456870.3333333333, ans=0.2 2024-06-21 21:02:04,497 INFO [train.py:1028] (1/2) Epoch 25, batch 6400, loss[loss=0.1677, simple_loss=0.2307, pruned_loss=0.05235, over 13203.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2459, pruned_loss=0.06671, over 2574015.88 frames. ], batch size: 67, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:02:09,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.68 vs. limit=22.5 2024-06-21 21:02:16,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=456907.0, ans=0.0 2024-06-21 21:02:31,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.43 vs. limit=10.0 2024-06-21 21:02:37,659 INFO [train.py:1028] (1/2) Epoch 25, batch 6450, loss[loss=0.2147, simple_loss=0.2631, pruned_loss=0.08311, over 12550.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2478, pruned_loss=0.06741, over 2580413.59 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:02:46,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=456998.6666666667, ans=0.125 2024-06-21 21:02:47,703 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.292e+02 2.450e+02 2.726e+02 3.688e+02, threshold=4.900e+02, percent-clipped=0.0 2024-06-21 21:02:47,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=456998.6666666667, ans=0.125 2024-06-21 21:02:51,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.71 vs. limit=6.0 2024-06-21 21:02:54,918 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.55 vs. 
limit=22.5 2024-06-21 21:03:07,650 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:03:10,839 INFO [train.py:1028] (1/2) Epoch 25, batch 6500, loss[loss=0.1838, simple_loss=0.2337, pruned_loss=0.0669, over 10778.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2491, pruned_loss=0.06767, over 2584192.15 frames. ], batch size: 304, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:03:11,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=457072.0, ans=0.025 2024-06-21 21:03:11,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2024-06-21 21:03:18,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457072.0, ans=0.1 2024-06-21 21:03:20,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=457090.3333333333, ans=0.1 2024-06-21 21:03:22,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=457090.3333333333, ans=0.0 2024-06-21 21:03:22,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=457090.3333333333, ans=0.0 2024-06-21 21:03:36,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=457127.0, ans=0.2 2024-06-21 21:03:36,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=457127.0, ans=0.125 2024-06-21 21:03:50,685 INFO [train.py:1028] (1/2) Epoch 25, batch 6550, loss[loss=0.1911, simple_loss=0.2569, pruned_loss=0.0627, over 12398.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2503, pruned_loss=0.06768, over 2588892.97 frames. ], batch size: 22, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:03:52,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457163.6666666667, ans=0.1 2024-06-21 21:03:58,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=457182.0, ans=0.125 2024-06-21 21:04:00,511 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.254e+02 2.352e+02 2.591e+02 3.160e+02, threshold=4.705e+02, percent-clipped=0.0 2024-06-21 21:04:11,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=457218.6666666667, ans=0.125 2024-06-21 21:04:14,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=457218.6666666667, ans=0.1 2024-06-21 21:04:16,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.81 vs. limit=15.0 2024-06-21 21:04:23,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=457255.3333333333, ans=0.0 2024-06-21 21:04:23,582 INFO [train.py:1028] (1/2) Epoch 25, batch 6600, loss[loss=0.1914, simple_loss=0.2621, pruned_loss=0.06029, over 13217.00 frames. 
], tot_loss[loss=0.1929, simple_loss=0.2503, pruned_loss=0.06776, over 2590824.63 frames. ], batch size: 72, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:04:32,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457273.6666666667, ans=0.1 2024-06-21 21:04:33,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=457273.6666666667, ans=0.1 2024-06-21 21:04:37,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=457292.0, ans=12.0 2024-06-21 21:04:38,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=457292.0, ans=0.0 2024-06-21 21:04:57,512 INFO [train.py:1028] (1/2) Epoch 25, batch 6650, loss[loss=0.2041, simple_loss=0.2568, pruned_loss=0.07568, over 12944.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2516, pruned_loss=0.06808, over 2584172.70 frames. ], batch size: 158, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:04:57,916 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5 2024-06-21 21:05:03,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=457347.0, ans=0.125 2024-06-21 21:05:06,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=457365.3333333333, ans=0.125 2024-06-21 21:05:07,873 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.274e+02 2.509e+02 2.793e+02 3.537e+02, threshold=5.019e+02, percent-clipped=0.0 2024-06-21 21:05:16,845 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=15.0 2024-06-21 21:05:33,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=457420.3333333333, ans=0.0 2024-06-21 21:05:35,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=457420.3333333333, ans=0.125 2024-06-21 21:05:35,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=457420.3333333333, ans=0.125 2024-06-21 21:05:39,550 INFO [train.py:1028] (1/2) Epoch 25, batch 6700, loss[loss=0.2309, simple_loss=0.2851, pruned_loss=0.08841, over 12625.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2527, pruned_loss=0.06881, over 2582936.73 frames. ], batch size: 176, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:05:51,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=457457.0, ans=0.95 2024-06-21 21:06:13,150 INFO [train.py:1028] (1/2) Epoch 25, batch 6750, loss[loss=0.2455, simple_loss=0.2861, pruned_loss=0.1025, over 12250.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2534, pruned_loss=0.06926, over 2576818.43 frames. 
], batch size: 241, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:06:17,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=457530.3333333333, ans=0.125 2024-06-21 21:06:18,079 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.75 vs. limit=22.5 2024-06-21 21:06:18,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457530.3333333333, ans=0.1 2024-06-21 21:06:22,799 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.346e+02 2.496e+02 2.734e+02 3.276e+02, threshold=4.993e+02, percent-clipped=0.0 2024-06-21 21:06:26,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=457567.0, ans=0.125 2024-06-21 21:06:33,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=457585.3333333333, ans=0.125 2024-06-21 21:06:36,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=457585.3333333333, ans=0.125 2024-06-21 21:06:42,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=457603.6666666667, ans=0.125 2024-06-21 21:06:45,946 INFO [train.py:1028] (1/2) Epoch 25, batch 6800, loss[loss=0.1831, simple_loss=0.2437, pruned_loss=0.06125, over 13229.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2551, pruned_loss=0.0699, over 2578847.17 frames. ], batch size: 67, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:06:52,498 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:06:57,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.50 vs. limit=6.0 2024-06-21 21:07:01,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457658.6666666667, ans=0.1 2024-06-21 21:07:07,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=457677.0, ans=0.0 2024-06-21 21:07:15,582 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.88 vs. limit=10.0 2024-06-21 21:07:19,224 INFO [train.py:1028] (1/2) Epoch 25, batch 6850, loss[loss=0.2023, simple_loss=0.27, pruned_loss=0.06734, over 13259.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2558, pruned_loss=0.06974, over 2583046.11 frames. 
], batch size: 63, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:07:32,397 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.335e+02 2.514e+02 2.756e+02 3.546e+02, threshold=5.027e+02, percent-clipped=0.0 2024-06-21 21:07:33,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=457732.0, ans=0.125 2024-06-21 21:07:45,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457750.3333333333, ans=0.1 2024-06-21 21:07:54,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=457787.0, ans=0.0 2024-06-21 21:07:57,719 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=22.5 2024-06-21 21:07:59,949 INFO [train.py:1028] (1/2) Epoch 25, batch 6900, loss[loss=0.211, simple_loss=0.27, pruned_loss=0.076, over 13263.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2569, pruned_loss=0.07027, over 2585182.62 frames. ], batch size: 49, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:08:02,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.31 vs. limit=10.0 2024-06-21 21:08:03,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=457805.3333333333, ans=0.1 2024-06-21 21:08:13,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=457842.0, ans=0.0 2024-06-21 21:08:30,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=457878.6666666667, ans=0.0 2024-06-21 21:08:33,510 INFO [train.py:1028] (1/2) Epoch 25, batch 6950, loss[loss=0.185, simple_loss=0.2441, pruned_loss=0.06299, over 11617.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2574, pruned_loss=0.0702, over 2579127.79 frames. ], batch size: 16, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:08:33,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2024-06-21 21:08:41,626 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.21 vs. limit=15.0 2024-06-21 21:08:43,114 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.330e+02 2.504e+02 2.691e+02 3.500e+02, threshold=5.008e+02, percent-clipped=0.0 2024-06-21 21:08:46,202 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.37 vs. 
limit=15.0 2024-06-21 21:08:49,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=457933.6666666667, ans=0.125 2024-06-21 21:08:49,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=457933.6666666667, ans=0.09899494936611666 2024-06-21 21:08:50,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=457933.6666666667, ans=0.0 2024-06-21 21:08:56,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=457952.0, ans=0.125 2024-06-21 21:09:06,348 INFO [train.py:1028] (1/2) Epoch 25, batch 7000, loss[loss=0.2097, simple_loss=0.2714, pruned_loss=0.074, over 12929.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2571, pruned_loss=0.06973, over 2574961.99 frames. ], batch size: 158, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:09:08,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=457988.6666666667, ans=0.125 2024-06-21 21:09:13,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=458007.0, ans=0.0 2024-06-21 21:09:15,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=458007.0, ans=0.125 2024-06-21 21:09:23,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=458025.3333333333, ans=0.0 2024-06-21 21:09:24,927 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.33 vs. limit=22.5 2024-06-21 21:09:29,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458043.6666666667, ans=0.1 2024-06-21 21:09:40,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=458062.0, ans=0.07 2024-06-21 21:09:44,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=458062.0, ans=0.125 2024-06-21 21:09:45,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458062.0, ans=0.1 2024-06-21 21:09:47,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=458080.3333333333, ans=0.0 2024-06-21 21:09:47,777 INFO [train.py:1028] (1/2) Epoch 25, batch 7050, loss[loss=0.2136, simple_loss=0.2747, pruned_loss=0.07632, over 12787.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.259, pruned_loss=0.07023, over 2581695.92 frames. 
], batch size: 176, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:09:50,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=458080.3333333333, ans=0.0 2024-06-21 21:09:54,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=458098.6666666667, ans=0.0 2024-06-21 21:09:57,372 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.373e+02 2.536e+02 2.844e+02 4.084e+02, threshold=5.073e+02, percent-clipped=0.0 2024-06-21 21:10:03,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458117.0, ans=0.1 2024-06-21 21:10:10,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458135.3333333333, ans=0.1 2024-06-21 21:10:11,778 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=458135.3333333333, ans=0.0 2024-06-21 21:10:13,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=458153.6666666667, ans=0.0 2024-06-21 21:10:20,308 INFO [train.py:1028] (1/2) Epoch 25, batch 7100, loss[loss=0.2293, simple_loss=0.2893, pruned_loss=0.08461, over 13180.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2599, pruned_loss=0.07084, over 2573978.44 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:10:31,562 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.58 vs. limit=15.0 2024-06-21 21:10:34,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458208.6666666667, ans=0.1 2024-06-21 21:10:39,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=458208.6666666667, ans=0.125 2024-06-21 21:10:44,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=458227.0, ans=0.125 2024-06-21 21:10:45,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=458227.0, ans=0.125 2024-06-21 21:10:48,638 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.43 vs. limit=15.0 2024-06-21 21:10:53,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=458263.6666666667, ans=0.025 2024-06-21 21:10:53,978 INFO [train.py:1028] (1/2) Epoch 25, batch 7150, loss[loss=0.2128, simple_loss=0.2655, pruned_loss=0.08007, over 12523.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2597, pruned_loss=0.07073, over 2572524.77 frames. 
], batch size: 202, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:10:54,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=458263.6666666667, ans=0.125 2024-06-21 21:11:03,836 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.248e+02 2.445e+02 2.669e+02 3.399e+02, threshold=4.891e+02, percent-clipped=0.0 2024-06-21 21:11:14,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2024-06-21 21:11:22,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2024-06-21 21:11:27,346 INFO [train.py:1028] (1/2) Epoch 25, batch 7200, loss[loss=0.2155, simple_loss=0.2753, pruned_loss=0.07787, over 13185.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2606, pruned_loss=0.07097, over 2578793.78 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:11:36,512 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. limit=6.0 2024-06-21 21:11:39,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=458373.6666666667, ans=0.125 2024-06-21 21:12:02,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=458428.6666666667, ans=0.125 2024-06-21 21:12:05,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=458428.6666666667, ans=0.05 2024-06-21 21:12:07,336 INFO [train.py:1028] (1/2) Epoch 25, batch 7250, loss[loss=0.1846, simple_loss=0.2467, pruned_loss=0.06122, over 12961.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2615, pruned_loss=0.07112, over 2579552.62 frames. ], batch size: 36, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:12:15,537 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:12:17,394 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.294e+02 2.462e+02 2.674e+02 3.557e+02, threshold=4.924e+02, percent-clipped=0.0 2024-06-21 21:12:30,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=458502.0, ans=0.1 2024-06-21 21:12:31,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=458502.0, ans=0.95 2024-06-21 21:12:37,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.99 vs. limit=22.5 2024-06-21 21:12:39,945 INFO [train.py:1028] (1/2) Epoch 25, batch 7300, loss[loss=0.1779, simple_loss=0.244, pruned_loss=0.0559, over 12901.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2626, pruned_loss=0.07171, over 2579513.97 frames. 
], batch size: 36, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:12:44,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=458538.6666666667, ans=0.07 2024-06-21 21:12:47,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=458557.0, ans=0.125 2024-06-21 21:12:58,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=458575.3333333333, ans=0.0 2024-06-21 21:13:05,494 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=15.0 2024-06-21 21:13:12,800 INFO [train.py:1028] (1/2) Epoch 25, batch 7350, loss[loss=0.2037, simple_loss=0.2726, pruned_loss=0.06743, over 13303.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.263, pruned_loss=0.07198, over 2581019.07 frames. ], batch size: 46, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:13:14,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=458630.3333333333, ans=0.0 2024-06-21 21:13:16,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=458630.3333333333, ans=0.2 2024-06-21 21:13:19,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=458648.6666666667, ans=0.0 2024-06-21 21:13:22,545 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.351e+02 2.469e+02 2.728e+02 3.618e+02, threshold=4.938e+02, percent-clipped=0.0 2024-06-21 21:13:35,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458685.3333333333, ans=0.1 2024-06-21 21:13:45,855 INFO [train.py:1028] (1/2) Epoch 25, batch 7400, loss[loss=0.2209, simple_loss=0.2896, pruned_loss=0.07611, over 13256.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2635, pruned_loss=0.07212, over 2586363.19 frames. 
], batch size: 63, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:14:00,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=458740.3333333333, ans=0.1 2024-06-21 21:14:01,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458740.3333333333, ans=0.1 2024-06-21 21:14:01,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=458740.3333333333, ans=0.95 2024-06-21 21:14:02,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=458740.3333333333, ans=0.0 2024-06-21 21:14:11,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=458758.6666666667, ans=0.2 2024-06-21 21:14:15,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=458777.0, ans=0.125 2024-06-21 21:14:16,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=458777.0, ans=0.125 2024-06-21 21:14:18,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0 2024-06-21 21:14:20,160 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.33 vs. limit=10.0 2024-06-21 21:14:26,858 INFO [train.py:1028] (1/2) Epoch 25, batch 7450, loss[loss=0.1887, simple_loss=0.2529, pruned_loss=0.06222, over 13058.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.263, pruned_loss=0.07173, over 2581784.75 frames. ], batch size: 30, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:14:27,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.34 vs. limit=15.0 2024-06-21 21:14:33,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=458832.0, ans=0.125 2024-06-21 21:14:37,001 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.318e+02 2.456e+02 2.708e+02 4.154e+02, threshold=4.912e+02, percent-clipped=0.0 2024-06-21 21:14:41,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=458850.3333333333, ans=0.125 2024-06-21 21:14:45,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=458850.3333333333, ans=22.5 2024-06-21 21:14:47,014 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.33 vs. 
limit=15.0 2024-06-21 21:14:48,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=458868.6666666667, ans=0.2 2024-06-21 21:14:52,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=458868.6666666667, ans=0.0 2024-06-21 21:14:57,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=458887.0, ans=0.125 2024-06-21 21:14:58,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2024-06-21 21:15:00,505 INFO [train.py:1028] (1/2) Epoch 25, batch 7500, loss[loss=0.2213, simple_loss=0.2709, pruned_loss=0.08587, over 10747.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2636, pruned_loss=0.07193, over 2578497.28 frames. ], batch size: 303, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:15:03,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=458905.3333333333, ans=0.125 2024-06-21 21:15:11,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=458923.6666666667, ans=0.125 2024-06-21 21:15:15,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=458942.0, ans=0.0 2024-06-21 21:15:26,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=15.0 2024-06-21 21:15:28,620 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.69 vs. limit=6.0 2024-06-21 21:15:31,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=458978.6666666667, ans=0.125 2024-06-21 21:15:33,411 INFO [train.py:1028] (1/2) Epoch 25, batch 7550, loss[loss=0.2086, simple_loss=0.2664, pruned_loss=0.07544, over 12922.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2639, pruned_loss=0.0724, over 2578331.59 frames. ], batch size: 158, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:15:43,307 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.394e+02 2.563e+02 2.773e+02 3.568e+02, threshold=5.125e+02, percent-clipped=0.0 2024-06-21 21:16:12,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=459088.6666666667, ans=0.125 2024-06-21 21:16:13,324 INFO [train.py:1028] (1/2) Epoch 25, batch 7600, loss[loss=0.1942, simple_loss=0.2535, pruned_loss=0.06744, over 13265.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2642, pruned_loss=0.07251, over 2578525.83 frames. ], batch size: 83, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:16:18,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=459088.6666666667, ans=0.1 2024-06-21 21:16:22,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=459107.0, ans=0.04949747468305833 2024-06-21 21:16:46,598 INFO [train.py:1028] (1/2) Epoch 25, batch 7650, loss[loss=0.1998, simple_loss=0.2621, pruned_loss=0.06879, over 12904.00 frames. 
], tot_loss[loss=0.2045, simple_loss=0.2643, pruned_loss=0.07236, over 2573192.52 frames. ], batch size: 33, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:16:49,964 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.66 vs. limit=12.0 2024-06-21 21:16:52,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=459180.3333333333, ans=0.09899494936611666 2024-06-21 21:16:56,681 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.347e+02 2.534e+02 2.831e+02 4.293e+02, threshold=5.068e+02, percent-clipped=0.0 2024-06-21 21:17:02,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=459217.0, ans=0.035 2024-06-21 21:17:05,926 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.60 vs. limit=22.5 2024-06-21 21:17:19,866 INFO [train.py:1028] (1/2) Epoch 25, batch 7700, loss[loss=0.2041, simple_loss=0.2777, pruned_loss=0.06527, over 13270.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2652, pruned_loss=0.07265, over 2569875.01 frames. ], batch size: 63, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:17:24,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.72 vs. limit=15.0 2024-06-21 21:17:24,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=459272.0, ans=0.125 2024-06-21 21:17:26,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=459290.3333333333, ans=0.2 2024-06-21 21:17:58,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=459345.3333333333, ans=0.1 2024-06-21 21:17:59,898 INFO [train.py:1028] (1/2) Epoch 25, batch 7750, loss[loss=0.23, simple_loss=0.289, pruned_loss=0.08549, over 13208.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.266, pruned_loss=0.07342, over 2573229.36 frames. ], batch size: 72, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:18:02,713 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:18:05,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=459363.6666666667, ans=0.125 2024-06-21 21:18:08,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=459382.0, ans=15.0 2024-06-21 21:18:09,773 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.324e+02 2.449e+02 2.639e+02 3.398e+02, threshold=4.899e+02, percent-clipped=0.0 2024-06-21 21:18:25,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=459418.6666666667, ans=0.125 2024-06-21 21:18:26,093 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. 
limit=6.0 2024-06-21 21:18:28,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=459437.0, ans=0.125 2024-06-21 21:18:32,762 INFO [train.py:1028] (1/2) Epoch 25, batch 7800, loss[loss=0.2149, simple_loss=0.2711, pruned_loss=0.07937, over 13131.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2668, pruned_loss=0.07359, over 2578548.67 frames. ], batch size: 95, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:18:45,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=459492.0, ans=0.125 2024-06-21 21:18:54,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=459510.3333333333, ans=0.025 2024-06-21 21:19:05,709 INFO [train.py:1028] (1/2) Epoch 25, batch 7850, loss[loss=0.1622, simple_loss=0.222, pruned_loss=0.05115, over 11036.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2672, pruned_loss=0.07385, over 2572047.37 frames. ], batch size: 16, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:19:07,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=459547.0, ans=0.125 2024-06-21 21:19:08,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=15.0 2024-06-21 21:19:10,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=459547.0, ans=0.125 2024-06-21 21:19:14,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=459565.3333333333, ans=0.07 2024-06-21 21:19:15,604 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.383e+02 2.508e+02 2.750e+02 3.302e+02, threshold=5.017e+02, percent-clipped=0.0 2024-06-21 21:19:16,109 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.67 vs. limit=15.0 2024-06-21 21:19:19,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=459583.6666666667, ans=0.0 2024-06-21 21:19:25,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.27 vs. limit=15.0 2024-06-21 21:19:30,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=459602.0, ans=0.2 2024-06-21 21:19:31,988 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:19:41,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2024-06-21 21:19:45,203 INFO [train.py:1028] (1/2) Epoch 25, batch 7900, loss[loss=0.2247, simple_loss=0.2814, pruned_loss=0.08399, over 13243.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2675, pruned_loss=0.07387, over 2571712.37 frames. 
], batch size: 77, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:19:57,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=459675.3333333333, ans=0.025 2024-06-21 21:20:10,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0 2024-06-21 21:20:11,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2024-06-21 21:20:13,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=459712.0, ans=0.0 2024-06-21 21:20:18,599 INFO [train.py:1028] (1/2) Epoch 25, batch 7950, loss[loss=0.2228, simple_loss=0.2756, pruned_loss=0.08504, over 10676.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2678, pruned_loss=0.07378, over 2575460.35 frames. ], batch size: 304, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:20:20,180 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2024-06-21 21:20:23,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=459730.3333333333, ans=0.125 2024-06-21 21:20:25,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.95 vs. limit=22.5 2024-06-21 21:20:25,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=459748.6666666667, ans=0.125 2024-06-21 21:20:29,506 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.353e+02 2.495e+02 2.656e+02 3.689e+02, threshold=4.991e+02, percent-clipped=0.0 2024-06-21 21:20:35,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=459767.0, ans=0.2 2024-06-21 21:20:39,267 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2024-06-21 21:20:40,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=459785.3333333333, ans=0.035 2024-06-21 21:20:46,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=459803.6666666667, ans=0.0 2024-06-21 21:20:48,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=459803.6666666667, ans=0.2 2024-06-21 21:20:50,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=459803.6666666667, ans=0.0 2024-06-21 21:20:52,354 INFO [train.py:1028] (1/2) Epoch 25, batch 8000, loss[loss=0.1767, simple_loss=0.2406, pruned_loss=0.05638, over 12765.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2683, pruned_loss=0.07387, over 2572299.64 frames. ], batch size: 29, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:20:55,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=459822.0, ans=0.125 2024-06-21 21:20:56,819 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.76 vs. 
limit=5.0 2024-06-21 21:20:59,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=459840.3333333333, ans=0.0 2024-06-21 21:21:23,418 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2024-06-21 21:21:25,824 INFO [train.py:1028] (1/2) Epoch 25, batch 8050, loss[loss=0.2085, simple_loss=0.2729, pruned_loss=0.07212, over 13228.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2683, pruned_loss=0.07397, over 2571877.51 frames. ], batch size: 83, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:21:28,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=459913.6666666667, ans=0.1 2024-06-21 21:21:30,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=459913.6666666667, ans=0.2 2024-06-21 21:21:39,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=459932.0, ans=0.2 2024-06-21 21:21:40,126 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.356e+02 2.506e+02 2.722e+02 3.626e+02, threshold=5.012e+02, percent-clipped=0.0 2024-06-21 21:21:46,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=459950.3333333333, ans=0.125 2024-06-21 21:21:47,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=459950.3333333333, ans=0.125 2024-06-21 21:21:50,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.70 vs. limit=15.0 2024-06-21 21:21:51,148 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.90 vs. limit=22.5 2024-06-21 21:21:51,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=459950.3333333333, ans=0.0 2024-06-21 21:21:56,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=459968.6666666667, ans=0.125 2024-06-21 21:22:01,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.49 vs. limit=10.0 2024-06-21 21:22:04,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=459987.0, ans=0.125 2024-06-21 21:22:05,336 INFO [train.py:1028] (1/2) Epoch 25, batch 8100, loss[loss=0.2056, simple_loss=0.272, pruned_loss=0.0696, over 13196.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2695, pruned_loss=0.07441, over 2576647.82 frames. 
], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:22:23,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=460042.0, ans=0.1 2024-06-21 21:22:25,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=460060.3333333333, ans=0.0 2024-06-21 21:22:28,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=460060.3333333333, ans=0.2 2024-06-21 21:22:39,100 INFO [train.py:1028] (1/2) Epoch 25, batch 8150, loss[loss=0.1859, simple_loss=0.2473, pruned_loss=0.06228, over 13101.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2694, pruned_loss=0.07368, over 2579959.13 frames. ], batch size: 121, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:22:49,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=460115.3333333333, ans=0.2 2024-06-21 21:22:49,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=460115.3333333333, ans=0.05 2024-06-21 21:22:49,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.83 vs. limit=15.0 2024-06-21 21:22:50,134 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.342e+02 2.457e+02 2.669e+02 3.414e+02, threshold=4.913e+02, percent-clipped=0.0 2024-06-21 21:22:51,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460115.3333333333, ans=0.1 2024-06-21 21:22:53,294 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2024-06-21 21:23:12,948 INFO [train.py:1028] (1/2) Epoch 25, batch 8200, loss[loss=0.2033, simple_loss=0.2631, pruned_loss=0.07177, over 13153.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2691, pruned_loss=0.07342, over 2583666.81 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:23:14,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=460188.6666666667, ans=0.0 2024-06-21 21:23:20,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=460207.0, ans=0.125 2024-06-21 21:23:20,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.65 vs. 
limit=22.5 2024-06-21 21:23:25,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460207.0, ans=0.0 2024-06-21 21:23:30,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=460225.3333333333, ans=0.125 2024-06-21 21:23:32,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=460225.3333333333, ans=0.2 2024-06-21 21:23:36,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=460243.6666666667, ans=0.0 2024-06-21 21:23:45,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=460262.0, ans=0.125 2024-06-21 21:23:49,588 INFO [train.py:1028] (1/2) Epoch 25, batch 8250, loss[loss=0.1998, simple_loss=0.2686, pruned_loss=0.06547, over 13260.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2701, pruned_loss=0.07393, over 2582501.14 frames. ], batch size: 52, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:24:03,332 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.331e+02 2.503e+02 2.715e+02 4.068e+02, threshold=5.006e+02, percent-clipped=0.0 2024-06-21 21:24:09,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=460317.0, ans=0.0 2024-06-21 21:24:12,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460335.3333333333, ans=0.0 2024-06-21 21:24:13,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=460335.3333333333, ans=0.125 2024-06-21 21:24:19,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=460353.6666666667, ans=0.125 2024-06-21 21:24:25,361 INFO [train.py:1028] (1/2) Epoch 25, batch 8300, loss[loss=0.2289, simple_loss=0.2815, pruned_loss=0.08821, over 13020.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2691, pruned_loss=0.07346, over 2579661.02 frames. ], batch size: 102, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:24:29,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=460372.0, ans=0.125 2024-06-21 21:24:30,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=460372.0, ans=0.0 2024-06-21 21:24:37,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=460408.6666666667, ans=0.0 2024-06-21 21:24:44,564 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2024-06-21 21:24:45,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460427.0, ans=0.0 2024-06-21 21:24:49,381 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.98 vs. limit=15.0 2024-06-21 21:24:58,466 INFO [train.py:1028] (1/2) Epoch 25, batch 8350, loss[loss=0.2336, simple_loss=0.2861, pruned_loss=0.09052, over 13137.00 frames. 
], tot_loss[loss=0.2079, simple_loss=0.2692, pruned_loss=0.07333, over 2579228.37 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:25:09,296 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.364e+02 2.492e+02 2.693e+02 3.725e+02, threshold=4.984e+02, percent-clipped=0.0 2024-06-21 21:25:09,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=460482.0, ans=0.0 2024-06-21 21:25:21,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=460518.6666666667, ans=0.125 2024-06-21 21:25:24,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=460518.6666666667, ans=0.2 2024-06-21 21:25:28,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.72 vs. limit=6.0 2024-06-21 21:25:32,351 INFO [train.py:1028] (1/2) Epoch 25, batch 8400, loss[loss=0.2003, simple_loss=0.2686, pruned_loss=0.066, over 12934.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2691, pruned_loss=0.07344, over 2576733.42 frames. ], batch size: 39, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:25:37,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2024-06-21 21:25:43,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=460573.6666666667, ans=0.1 2024-06-21 21:25:46,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=460573.6666666667, ans=0.2 2024-06-21 21:25:47,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=460573.6666666667, ans=0.0 2024-06-21 21:26:08,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.61 vs. limit=10.0 2024-06-21 21:26:11,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=460647.0, ans=0.2 2024-06-21 21:26:11,631 INFO [train.py:1028] (1/2) Epoch 25, batch 8450, loss[loss=0.207, simple_loss=0.2629, pruned_loss=0.07557, over 13168.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2699, pruned_loss=0.07405, over 2578735.52 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:26:17,355 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. 
limit=6.0 2024-06-21 21:26:22,116 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.348e+02 2.633e+02 2.920e+02 3.877e+02, threshold=5.267e+02, percent-clipped=0.0 2024-06-21 21:26:29,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=460683.6666666667, ans=0.5 2024-06-21 21:26:34,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=460702.0, ans=0.1 2024-06-21 21:26:40,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=460720.3333333333, ans=0.125 2024-06-21 21:26:44,763 INFO [train.py:1028] (1/2) Epoch 25, batch 8500, loss[loss=0.1894, simple_loss=0.2505, pruned_loss=0.06413, over 12683.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2705, pruned_loss=0.07402, over 2577127.87 frames. ], batch size: 29, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:26:52,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=460757.0, ans=0.125 2024-06-21 21:26:53,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=460757.0, ans=0.025 2024-06-21 21:26:54,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=460757.0, ans=0.0 2024-06-21 21:27:02,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=460775.3333333333, ans=0.1 2024-06-21 21:27:02,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=460775.3333333333, ans=0.0 2024-06-21 21:27:07,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=460793.6666666667, ans=0.125 2024-06-21 21:27:17,881 INFO [train.py:1028] (1/2) Epoch 25, batch 8550, loss[loss=0.1896, simple_loss=0.2503, pruned_loss=0.06448, over 12425.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2702, pruned_loss=0.07399, over 2575665.59 frames. ], batch size: 22, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:27:19,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=460830.3333333333, ans=0.0 2024-06-21 21:27:24,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=460848.6666666667, ans=0.125 2024-06-21 21:27:27,528 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.98 vs. limit=12.0 2024-06-21 21:27:28,309 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.383e+02 2.565e+02 2.842e+02 3.625e+02, threshold=5.131e+02, percent-clipped=0.0 2024-06-21 21:27:43,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=460885.3333333333, ans=0.125 2024-06-21 21:27:51,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=460903.6666666667, ans=0.0 2024-06-21 21:27:57,347 INFO [train.py:1028] (1/2) Epoch 25, batch 8600, loss[loss=0.2044, simple_loss=0.2581, pruned_loss=0.07532, over 13031.00 frames. 
], tot_loss[loss=0.2088, simple_loss=0.2701, pruned_loss=0.07378, over 2572686.40 frames. ], batch size: 121, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:28:09,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=460940.3333333333, ans=0.95 2024-06-21 21:28:09,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=460940.3333333333, ans=0.1 2024-06-21 21:28:10,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=460958.6666666667, ans=0.2 2024-06-21 21:28:30,754 INFO [train.py:1028] (1/2) Epoch 25, batch 8650, loss[loss=0.1893, simple_loss=0.2458, pruned_loss=0.06636, over 13047.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2703, pruned_loss=0.07371, over 2576135.91 frames. ], batch size: 102, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:28:34,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=461013.6666666667, ans=0.0 2024-06-21 21:28:39,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=461032.0, ans=0.0 2024-06-21 21:28:41,145 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.301e+02 2.464e+02 2.674e+02 4.058e+02, threshold=4.927e+02, percent-clipped=0.0 2024-06-21 21:28:44,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=461050.3333333333, ans=0.125 2024-06-21 21:28:47,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=461050.3333333333, ans=0.2 2024-06-21 21:28:53,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=461068.6666666667, ans=0.125 2024-06-21 21:28:54,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=461068.6666666667, ans=0.125 2024-06-21 21:28:56,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=461087.0, ans=0.0 2024-06-21 21:29:03,645 INFO [train.py:1028] (1/2) Epoch 25, batch 8700, loss[loss=0.2034, simple_loss=0.2745, pruned_loss=0.06613, over 13184.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2716, pruned_loss=0.07475, over 2572921.75 frames. ], batch size: 59, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:29:05,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=461105.3333333333, ans=0.125 2024-06-21 21:29:21,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=461142.0, ans=0.125 2024-06-21 21:29:35,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=461178.6666666667, ans=0.125 2024-06-21 21:29:37,059 INFO [train.py:1028] (1/2) Epoch 25, batch 8750, loss[loss=0.2183, simple_loss=0.2755, pruned_loss=0.0806, over 13083.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2719, pruned_loss=0.0749, over 2570725.70 frames. 
], batch size: 121, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:29:43,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=461197.0, ans=0.0 2024-06-21 21:29:44,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=461197.0, ans=0.125 2024-06-21 21:29:52,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=461215.3333333333, ans=0.125 2024-06-21 21:29:54,220 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.404e+02 2.582e+02 2.752e+02 5.680e+02, threshold=5.165e+02, percent-clipped=1.0 2024-06-21 21:29:59,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=461233.6666666667, ans=0.125 2024-06-21 21:30:09,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461252.0, ans=0.1 2024-06-21 21:30:09,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=461252.0, ans=0.025 2024-06-21 21:30:13,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.84 vs. limit=15.0 2024-06-21 21:30:16,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=461288.6666666667, ans=0.0 2024-06-21 21:30:17,545 INFO [train.py:1028] (1/2) Epoch 25, batch 8800, loss[loss=0.1963, simple_loss=0.2675, pruned_loss=0.0626, over 13248.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2719, pruned_loss=0.07463, over 2576063.82 frames. ], batch size: 72, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:30:17,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=461288.6666666667, ans=0.125 2024-06-21 21:30:20,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=461288.6666666667, ans=0.2 2024-06-21 21:30:27,690 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.75 vs. limit=12.0 2024-06-21 21:30:35,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=461325.3333333333, ans=0.125 2024-06-21 21:30:42,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=461343.6666666667, ans=0.125 2024-06-21 21:30:47,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461362.0, ans=0.1 2024-06-21 21:30:49,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.04 vs. limit=15.0 2024-06-21 21:30:51,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=461380.3333333333, ans=0.1 2024-06-21 21:30:52,403 INFO [train.py:1028] (1/2) Epoch 25, batch 8850, loss[loss=0.2205, simple_loss=0.2764, pruned_loss=0.08227, over 12566.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2723, pruned_loss=0.07508, over 2562775.26 frames. 
], batch size: 202, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:31:03,301 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.387e+02 2.507e+02 2.660e+02 3.786e+02, threshold=5.013e+02, percent-clipped=0.0 2024-06-21 21:31:16,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=461435.3333333333, ans=0.0 2024-06-21 21:31:17,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=461435.3333333333, ans=0.0 2024-06-21 21:31:23,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=461453.6666666667, ans=0.125 2024-06-21 21:31:26,602 INFO [train.py:1028] (1/2) Epoch 25, batch 8900, loss[loss=0.2114, simple_loss=0.2763, pruned_loss=0.07326, over 12987.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2727, pruned_loss=0.07532, over 2561395.33 frames. ], batch size: 33, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:31:28,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=461472.0, ans=0.1 2024-06-21 21:31:31,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=461472.0, ans=0.2 2024-06-21 21:31:34,198 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.47 vs. limit=22.5 2024-06-21 21:31:34,616 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:31:36,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=461490.3333333333, ans=0.0 2024-06-21 21:31:49,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=461527.0, ans=0.2 2024-06-21 21:31:54,942 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=8.0 2024-06-21 21:32:05,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2024-06-21 21:32:07,166 INFO [train.py:1028] (1/2) Epoch 25, batch 8950, loss[loss=0.2254, simple_loss=0.2836, pruned_loss=0.08357, over 12537.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2726, pruned_loss=0.07495, over 2560658.21 frames. ], batch size: 202, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:32:17,833 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.356e+02 2.522e+02 2.764e+02 3.436e+02, threshold=5.045e+02, percent-clipped=0.0 2024-06-21 21:32:36,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=461637.0, ans=0.2 2024-06-21 21:32:38,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=461637.0, ans=0.125 2024-06-21 21:32:41,098 INFO [train.py:1028] (1/2) Epoch 25, batch 9000, loss[loss=0.2101, simple_loss=0.2752, pruned_loss=0.07256, over 13302.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2726, pruned_loss=0.07479, over 2566717.36 frames. 
], batch size: 46, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:32:41,099 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 21:32:49,004 INFO [train.py:1060] (1/2) Epoch 25, validation: loss=0.19, simple_loss=0.2509, pruned_loss=0.06457, over 351949.00 frames. 2024-06-21 21:32:49,005 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 21:32:50,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=461655.3333333333, ans=0.0 2024-06-21 21:33:01,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=461673.6666666667, ans=0.2 2024-06-21 21:33:03,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=461692.0, ans=0.2 2024-06-21 21:33:07,773 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.29 vs. limit=15.0 2024-06-21 21:33:21,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=461747.0, ans=0.0 2024-06-21 21:33:22,032 INFO [train.py:1028] (1/2) Epoch 25, batch 9050, loss[loss=0.1693, simple_loss=0.2359, pruned_loss=0.05134, over 11244.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2736, pruned_loss=0.07528, over 2565118.50 frames. ], batch size: 17, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:33:24,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=461747.0, ans=0.5 2024-06-21 21:33:25,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=461747.0, ans=0.1 2024-06-21 21:33:29,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=461765.3333333333, ans=0.2 2024-06-21 21:33:32,178 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.342e+02 2.449e+02 2.640e+02 3.471e+02, threshold=4.898e+02, percent-clipped=0.0 2024-06-21 21:33:34,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=461783.6666666667, ans=0.05 2024-06-21 21:33:41,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=461802.0, ans=0.125 2024-06-21 21:33:43,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=461802.0, ans=0.125 2024-06-21 21:33:44,898 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2024-06-21 21:33:54,079 INFO [train.py:1028] (1/2) Epoch 25, batch 9100, loss[loss=0.2041, simple_loss=0.2664, pruned_loss=0.07084, over 13260.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2735, pruned_loss=0.07468, over 2566527.83 frames. 
], batch size: 72, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:34:00,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=461857.0, ans=0.0 2024-06-21 21:34:12,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=461893.6666666667, ans=0.125 2024-06-21 21:34:12,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.66 vs. limit=15.0 2024-06-21 21:34:20,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-21 21:34:22,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=461912.0, ans=0.0 2024-06-21 21:34:25,783 INFO [train.py:1028] (1/2) Epoch 25, batch 9150, loss[loss=0.2054, simple_loss=0.268, pruned_loss=0.07137, over 13142.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2734, pruned_loss=0.07453, over 2568750.95 frames. ], batch size: 77, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:34:36,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=461948.6666666667, ans=0.2 2024-06-21 21:34:36,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=461948.6666666667, ans=0.0 2024-06-21 21:34:38,772 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.320e+02 2.479e+02 2.654e+02 3.424e+02, threshold=4.957e+02, percent-clipped=0.0 2024-06-21 21:34:52,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=461985.3333333333, ans=0.1 2024-06-21 21:35:08,382 INFO [train.py:1028] (1/2) Epoch 25, batch 9200, loss[loss=0.2084, simple_loss=0.2721, pruned_loss=0.07235, over 12888.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2728, pruned_loss=0.0741, over 2572696.37 frames. ], batch size: 36, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:35:11,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=462022.0, ans=0.2 2024-06-21 21:35:20,741 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.76 vs. limit=15.0 2024-06-21 21:35:23,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=462058.6666666667, ans=0.125 2024-06-21 21:35:30,888 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.03 vs. limit=22.5 2024-06-21 21:35:31,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=462077.0, ans=0.0 2024-06-21 21:35:32,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=462077.0, ans=0.125 2024-06-21 21:35:39,835 INFO [train.py:1028] (1/2) Epoch 25, batch 9250, loss[loss=0.2132, simple_loss=0.2817, pruned_loss=0.0723, over 13200.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2725, pruned_loss=0.07383, over 2575274.15 frames. 
], batch size: 67, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:35:49,870 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.363e+02 2.490e+02 2.609e+02 3.303e+02, threshold=4.981e+02, percent-clipped=0.0 2024-06-21 21:35:51,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=462150.3333333333, ans=0.125 2024-06-21 21:35:52,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=462150.3333333333, ans=0.125 2024-06-21 21:36:11,413 INFO [train.py:1028] (1/2) Epoch 25, batch 9300, loss[loss=0.2114, simple_loss=0.272, pruned_loss=0.07541, over 13308.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2727, pruned_loss=0.07382, over 2571396.96 frames. ], batch size: 40, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:36:13,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.93 vs. limit=15.0 2024-06-21 21:36:30,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=462260.3333333333, ans=0.0 2024-06-21 21:36:38,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=462278.6666666667, ans=0.0 2024-06-21 21:36:42,725 INFO [train.py:1028] (1/2) Epoch 25, batch 9350, loss[loss=0.2074, simple_loss=0.2806, pruned_loss=0.06705, over 12576.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2732, pruned_loss=0.07428, over 2568446.16 frames. ], batch size: 22, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:36:49,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=462315.3333333333, ans=0.04949747468305833 2024-06-21 21:36:52,664 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.359e+02 2.535e+02 2.821e+02 3.794e+02, threshold=5.070e+02, percent-clipped=0.0 2024-06-21 21:37:03,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=462352.0, ans=0.125 2024-06-21 21:37:04,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=462352.0, ans=0.0 2024-06-21 21:37:08,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=462370.3333333333, ans=0.125 2024-06-21 21:37:09,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=462370.3333333333, ans=0.025 2024-06-21 21:37:12,831 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2024-06-21 21:37:13,597 INFO [train.py:1028] (1/2) Epoch 25, batch 9400, loss[loss=0.2131, simple_loss=0.2794, pruned_loss=0.07346, over 13225.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2734, pruned_loss=0.07431, over 2569638.40 frames. 
], batch size: 52, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:37:19,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=462407.0, ans=0.0 2024-06-21 21:37:23,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=462407.0, ans=0.125 2024-06-21 21:37:25,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=462425.3333333333, ans=0.125 2024-06-21 21:37:26,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462425.3333333333, ans=0.1 2024-06-21 21:37:28,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.94 vs. limit=15.0 2024-06-21 21:37:39,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=462462.0, ans=0.125 2024-06-21 21:37:44,267 INFO [train.py:1028] (1/2) Epoch 25, batch 9450, loss[loss=0.1918, simple_loss=0.2503, pruned_loss=0.06663, over 12572.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2733, pruned_loss=0.07442, over 2569506.11 frames. ], batch size: 22, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:37:56,788 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.336e+02 2.451e+02 2.684e+02 4.201e+02, threshold=4.901e+02, percent-clipped=0.0 2024-06-21 21:38:06,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462535.3333333333, ans=0.1 2024-06-21 21:38:08,031 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.72 vs. limit=15.0 2024-06-21 21:38:18,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.55 vs. limit=15.0 2024-06-21 21:38:19,572 INFO [train.py:1028] (1/2) Epoch 25, batch 9500, loss[loss=0.218, simple_loss=0.2859, pruned_loss=0.07501, over 13242.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2729, pruned_loss=0.07405, over 2578275.62 frames. ], batch size: 43, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:38:20,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462572.0, ans=0.1 2024-06-21 21:38:20,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=462572.0, ans=15.0 2024-06-21 21:38:28,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=462590.3333333333, ans=0.125 2024-06-21 21:38:30,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.94 vs. limit=22.5 2024-06-21 21:38:31,314 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. 
limit=6.0 2024-06-21 21:38:35,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=462608.6666666667, ans=0.0 2024-06-21 21:38:36,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=462608.6666666667, ans=15.0 2024-06-21 21:38:50,598 INFO [train.py:1028] (1/2) Epoch 25, batch 9550, loss[loss=0.2039, simple_loss=0.2622, pruned_loss=0.07285, over 13172.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2732, pruned_loss=0.07439, over 2574366.46 frames. ], batch size: 40, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:38:51,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=462663.6666666667, ans=0.0 2024-06-21 21:38:58,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=462682.0, ans=0.07 2024-06-21 21:39:00,548 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.319e+02 2.466e+02 2.704e+02 3.652e+02, threshold=4.931e+02, percent-clipped=0.0 2024-06-21 21:39:00,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=462682.0, ans=0.0 2024-06-21 21:39:03,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2024-06-21 21:39:05,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=462700.3333333333, ans=0.0 2024-06-21 21:39:13,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=462718.6666666667, ans=0.0 2024-06-21 21:39:14,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=462718.6666666667, ans=0.125 2024-06-21 21:39:21,607 INFO [train.py:1028] (1/2) Epoch 25, batch 9600, loss[loss=0.2086, simple_loss=0.2613, pruned_loss=0.07793, over 10395.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2723, pruned_loss=0.07417, over 2572494.76 frames. 
], batch size: 303, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:39:24,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=462755.3333333333, ans=0.2 2024-06-21 21:39:28,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=462773.6666666667, ans=0.125 2024-06-21 21:39:31,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=462773.6666666667, ans=0.125 2024-06-21 21:39:32,844 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=462773.6666666667, ans=0.125 2024-06-21 21:39:46,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=462828.6666666667, ans=0.125 2024-06-21 21:39:47,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=462828.6666666667, ans=0.125 2024-06-21 21:39:51,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=462847.0, ans=0.125 2024-06-21 21:39:52,491 INFO [train.py:1028] (1/2) Epoch 25, batch 9650, loss[loss=0.2111, simple_loss=0.2688, pruned_loss=0.07672, over 13099.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2728, pruned_loss=0.07465, over 2562402.78 frames. ], batch size: 132, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:39:54,656 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.94 vs. limit=15.0 2024-06-21 21:40:01,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=462865.3333333333, ans=0.2 2024-06-21 21:40:02,673 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.399e+02 2.610e+02 2.903e+02 4.224e+02, threshold=5.220e+02, percent-clipped=0.0 2024-06-21 21:40:03,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=462865.3333333333, ans=0.125 2024-06-21 21:40:10,543 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.76 vs. limit=15.0 2024-06-21 21:40:13,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=462902.0, ans=0.015 2024-06-21 21:40:24,894 INFO [train.py:1028] (1/2) Epoch 25, batch 9700, loss[loss=0.2061, simple_loss=0.2632, pruned_loss=0.07447, over 13038.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2724, pruned_loss=0.07456, over 2557326.86 frames. ], batch size: 144, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:40:45,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=462993.6666666667, ans=0.0 2024-06-21 21:40:50,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=463012.0, ans=0.2 2024-06-21 21:40:56,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=463012.0, ans=0.05 2024-06-21 21:40:57,215 INFO [train.py:1028] (1/2) Epoch 25, batch 9750, loss[loss=0.188, simple_loss=0.2456, pruned_loss=0.06521, over 13117.00 frames. 
], tot_loss[loss=0.2092, simple_loss=0.2708, pruned_loss=0.07387, over 2554298.63 frames. ], batch size: 132, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:41:03,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.12 vs. limit=10.0 2024-06-21 21:41:07,846 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.309e+02 2.515e+02 2.704e+02 3.548e+02, threshold=5.030e+02, percent-clipped=0.0 2024-06-21 21:41:14,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=463067.0, ans=0.125 2024-06-21 21:41:15,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=463085.3333333333, ans=0.2 2024-06-21 21:41:16,071 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.82 vs. limit=22.5 2024-06-21 21:41:20,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=463085.3333333333, ans=0.0 2024-06-21 21:41:27,816 INFO [train.py:1028] (1/2) Epoch 25, batch 9800, loss[loss=0.1963, simple_loss=0.2641, pruned_loss=0.0642, over 12922.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2703, pruned_loss=0.07323, over 2547453.54 frames. ], batch size: 39, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:41:29,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=463122.0, ans=0.0 2024-06-21 21:41:30,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=463122.0, ans=0.0 2024-06-21 21:41:36,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=463140.3333333333, ans=0.125 2024-06-21 21:41:51,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=463177.0, ans=0.125 2024-06-21 21:41:55,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=463195.3333333333, ans=0.025 2024-06-21 21:41:58,176 INFO [train.py:1028] (1/2) Epoch 25, batch 9850, loss[loss=0.2162, simple_loss=0.2746, pruned_loss=0.07892, over 13028.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2698, pruned_loss=0.07314, over 2540434.71 frames. 
], batch size: 102, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:42:06,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463232.0, ans=0.1 2024-06-21 21:42:07,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=463232.0, ans=0.025 2024-06-21 21:42:08,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=463232.0, ans=0.0 2024-06-21 21:42:09,559 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.392e+02 2.535e+02 2.772e+02 3.494e+02, threshold=5.070e+02, percent-clipped=0.0 2024-06-21 21:42:10,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=463232.0, ans=0.0 2024-06-21 21:42:15,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=463250.3333333333, ans=0.125 2024-06-21 21:42:21,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.22 vs. limit=15.0 2024-06-21 21:42:23,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463268.6666666667, ans=0.1 2024-06-21 21:42:28,030 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2024-06-21 21:42:30,728 INFO [train.py:1028] (1/2) Epoch 25, batch 9900, loss[loss=0.1738, simple_loss=0.2414, pruned_loss=0.05305, over 13195.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2693, pruned_loss=0.07363, over 2532870.27 frames. ], batch size: 40, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:42:35,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=463305.3333333333, ans=0.2 2024-06-21 21:42:48,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0 2024-06-21 21:42:51,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=463360.3333333333, ans=0.0 2024-06-21 21:42:51,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=463360.3333333333, ans=0.125 2024-06-21 21:42:54,134 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2024-06-21 21:42:57,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=463378.6666666667, ans=0.125 2024-06-21 21:43:01,626 INFO [train.py:1028] (1/2) Epoch 25, batch 9950, loss[loss=0.1995, simple_loss=0.2648, pruned_loss=0.0671, over 12545.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2684, pruned_loss=0.0736, over 2524203.62 frames. 
], batch size: 29, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:43:13,801 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.373e+02 2.506e+02 2.700e+02 3.986e+02, threshold=5.012e+02, percent-clipped=0.0 2024-06-21 21:43:17,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463433.6666666667, ans=0.1 2024-06-21 21:43:20,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=463433.6666666667, ans=0.0 2024-06-21 21:43:27,735 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2024-06-21 21:43:31,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=463470.3333333333, ans=0.1 2024-06-21 21:43:35,027 INFO [train.py:1028] (1/2) Epoch 25, batch 10000, loss[loss=0.2076, simple_loss=0.2734, pruned_loss=0.07084, over 12542.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2686, pruned_loss=0.07376, over 2485686.64 frames. ], batch size: 22, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:43:36,069 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2024-06-21 21:43:45,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463507.0, ans=0.1 2024-06-21 21:43:59,126 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:43:59,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.11 vs. limit=15.0 2024-06-21 21:44:07,422 INFO [train.py:1028] (1/2) Epoch 25, batch 10050, loss[loss=0.2123, simple_loss=0.2731, pruned_loss=0.07572, over 12645.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2693, pruned_loss=0.07466, over 2444649.98 frames. ], batch size: 22, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:44:17,275 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.424e+02 2.644e+02 2.914e+02 3.949e+02, threshold=5.287e+02, percent-clipped=0.0 2024-06-21 21:44:22,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=463617.0, ans=0.125 2024-06-21 21:44:25,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=463635.3333333333, ans=0.2 2024-06-21 21:44:26,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=463635.3333333333, ans=0.07 2024-06-21 21:44:29,358 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2024-06-21 21:44:37,292 INFO [train.py:1028] (1/2) Epoch 25, batch 10100, loss[loss=0.1695, simple_loss=0.227, pruned_loss=0.05604, over 11196.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2687, pruned_loss=0.07376, over 2429672.50 frames. 
], batch size: 16, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:44:45,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=463690.3333333333, ans=0.0 2024-06-21 21:46:45,746 INFO [train.py:1028] (1/2) Epoch 26, batch 0, loss[loss=0.159, simple_loss=0.2212, pruned_loss=0.04841, over 12955.00 frames. ], tot_loss[loss=0.159, simple_loss=0.2212, pruned_loss=0.04841, over 12955.00 frames. ], batch size: 36, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:46:45,746 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 21:46:52,771 INFO [train.py:1060] (1/2) Epoch 26, validation: loss=0.1908, simple_loss=0.2527, pruned_loss=0.06451, over 351949.00 frames. 2024-06-21 21:46:52,772 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 21:46:55,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=463703.1666666667, ans=0.125 2024-06-21 21:46:58,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=463703.1666666667, ans=0.025 2024-06-21 21:47:28,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=463794.8333333333, ans=0.125 2024-06-21 21:47:29,322 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 2.190e+02 2.373e+02 2.531e+02 3.649e+02, threshold=4.746e+02, percent-clipped=0.0 2024-06-21 21:47:29,351 INFO [train.py:1028] (1/2) Epoch 26, batch 50, loss[loss=0.1996, simple_loss=0.2693, pruned_loss=0.06498, over 12711.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.253, pruned_loss=0.06892, over 574904.51 frames. ], batch size: 29, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:47:33,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463794.8333333333, ans=0.1 2024-06-21 21:47:33,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=463794.8333333333, ans=0.2 2024-06-21 21:47:35,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=463813.1666666667, ans=0.2 2024-06-21 21:47:41,338 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2024-06-21 21:47:43,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.62 vs. limit=6.0 2024-06-21 21:47:53,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=463868.1666666667, ans=0.07 2024-06-21 21:47:54,838 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.80 vs. limit=22.5 2024-06-21 21:47:56,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=463868.1666666667, ans=0.0 2024-06-21 21:48:00,998 INFO [train.py:1028] (1/2) Epoch 26, batch 100, loss[loss=0.1873, simple_loss=0.2529, pruned_loss=0.06091, over 13233.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2526, pruned_loss=0.06822, over 1017616.26 frames. 
], batch size: 46, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:48:01,430 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.98 vs. limit=15.0 2024-06-21 21:48:11,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=463904.8333333333, ans=0.0 2024-06-21 21:48:12,370 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:48:12,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463904.8333333333, ans=0.125 2024-06-21 21:48:13,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=463904.8333333333, ans=0.125 2024-06-21 21:48:16,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=463923.1666666667, ans=0.1 2024-06-21 21:48:35,285 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.199e+02 2.354e+02 2.540e+02 3.565e+02, threshold=4.708e+02, percent-clipped=0.0 2024-06-21 21:48:35,313 INFO [train.py:1028] (1/2) Epoch 26, batch 150, loss[loss=0.2142, simple_loss=0.277, pruned_loss=0.07569, over 12703.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2521, pruned_loss=0.06721, over 1365077.44 frames. ], batch size: 29, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:48:40,991 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:48:50,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=464014.8333333333, ans=0.2 2024-06-21 21:48:56,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=464033.1666666667, ans=0.0 2024-06-21 21:49:06,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=464051.5, ans=0.125 2024-06-21 21:49:10,122 INFO [train.py:1028] (1/2) Epoch 26, batch 200, loss[loss=0.2158, simple_loss=0.2697, pruned_loss=0.08091, over 12456.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2515, pruned_loss=0.06692, over 1635748.37 frames. ], batch size: 202, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:49:26,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=464106.5, ans=0.0 2024-06-21 21:49:29,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.09 vs. 
limit=15.0 2024-06-21 21:49:34,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=464124.8333333333, ans=0.0 2024-06-21 21:49:35,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=464124.8333333333, ans=0.125 2024-06-21 21:49:38,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=464143.1666666667, ans=10.0 2024-06-21 21:49:42,632 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.198e+02 2.315e+02 2.428e+02 3.898e+02, threshold=4.630e+02, percent-clipped=0.0 2024-06-21 21:49:42,660 INFO [train.py:1028] (1/2) Epoch 26, batch 250, loss[loss=0.1907, simple_loss=0.2377, pruned_loss=0.07185, over 13024.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2497, pruned_loss=0.06588, over 1846668.98 frames. ], batch size: 144, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:49:50,567 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.12 vs. limit=15.0 2024-06-21 21:49:55,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464198.1666666667, ans=0.1 2024-06-21 21:49:55,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=464198.1666666667, ans=0.0 2024-06-21 21:50:06,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.50 vs. limit=15.0 2024-06-21 21:50:18,017 INFO [train.py:1028] (1/2) Epoch 26, batch 300, loss[loss=0.1872, simple_loss=0.2422, pruned_loss=0.06613, over 13159.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2498, pruned_loss=0.0659, over 2011971.64 frames. ], batch size: 112, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:50:23,657 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.43 vs. limit=10.0 2024-06-21 21:50:33,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464289.8333333333, ans=0.1 2024-06-21 21:50:35,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=464289.8333333333, ans=0.125 2024-06-21 21:50:37,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=464308.1666666667, ans=0.025 2024-06-21 21:50:40,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=464308.1666666667, ans=0.2 2024-06-21 21:50:41,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=464308.1666666667, ans=0.2 2024-06-21 21:50:50,766 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.277e+02 2.408e+02 2.668e+02 4.399e+02, threshold=4.816e+02, percent-clipped=0.0 2024-06-21 21:50:50,793 INFO [train.py:1028] (1/2) Epoch 26, batch 350, loss[loss=0.1946, simple_loss=0.2552, pruned_loss=0.06698, over 12950.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2495, pruned_loss=0.06576, over 2140454.88 frames. 
], batch size: 33, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:50:55,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=464344.8333333333, ans=0.0 2024-06-21 21:50:57,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=464363.1666666667, ans=0.0 2024-06-21 21:51:19,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=464399.8333333333, ans=0.125 2024-06-21 21:51:26,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=464436.5, ans=0.125 2024-06-21 21:51:27,383 INFO [train.py:1028] (1/2) Epoch 26, batch 400, loss[loss=0.184, simple_loss=0.2473, pruned_loss=0.06036, over 13234.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.25, pruned_loss=0.06562, over 2240683.68 frames. ], batch size: 63, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:51:35,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=464454.8333333333, ans=0.125 2024-06-21 21:51:36,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=464454.8333333333, ans=0.04949747468305833 2024-06-21 21:51:40,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=464473.1666666667, ans=0.125 2024-06-21 21:51:43,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=464473.1666666667, ans=0.0 2024-06-21 21:51:55,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=464509.8333333333, ans=0.2 2024-06-21 21:51:58,302 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.194e+02 2.344e+02 2.535e+02 3.433e+02, threshold=4.688e+02, percent-clipped=0.0 2024-06-21 21:51:58,331 INFO [train.py:1028] (1/2) Epoch 26, batch 450, loss[loss=0.1774, simple_loss=0.2421, pruned_loss=0.05636, over 13246.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2501, pruned_loss=0.06575, over 2314136.65 frames. ], batch size: 67, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:52:12,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=464564.8333333333, ans=0.125 2024-06-21 21:52:12,895 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.90 vs. limit=22.5 2024-06-21 21:52:13,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.25 vs. limit=6.0 2024-06-21 21:52:26,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=464583.1666666667, ans=0.2 2024-06-21 21:52:33,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=464619.8333333333, ans=0.05 2024-06-21 21:52:33,861 INFO [train.py:1028] (1/2) Epoch 26, batch 500, loss[loss=0.1967, simple_loss=0.2446, pruned_loss=0.07438, over 13106.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2508, pruned_loss=0.06596, over 2376157.61 frames. 
], batch size: 121, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:52:35,877 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:52:39,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=464638.1666666667, ans=0.0 2024-06-21 21:52:55,578 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.46 vs. limit=10.0 2024-06-21 21:53:05,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=15.0 2024-06-21 21:53:08,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=464711.5, ans=0.1 2024-06-21 21:53:09,060 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.231e+02 2.357e+02 2.477e+02 2.920e+02, threshold=4.714e+02, percent-clipped=0.0 2024-06-21 21:53:09,090 INFO [train.py:1028] (1/2) Epoch 26, batch 550, loss[loss=0.1872, simple_loss=0.2422, pruned_loss=0.06615, over 12952.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2509, pruned_loss=0.06588, over 2420960.88 frames. ], batch size: 158, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:53:24,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=464748.1666666667, ans=0.125 2024-06-21 21:53:27,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=464748.1666666667, ans=0.125 2024-06-21 21:53:41,380 INFO [train.py:1028] (1/2) Epoch 26, batch 600, loss[loss=0.174, simple_loss=0.2236, pruned_loss=0.06223, over 13066.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2502, pruned_loss=0.06555, over 2458800.69 frames. ], batch size: 144, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:53:42,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=464803.1666666667, ans=0.05 2024-06-21 21:53:46,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=464803.1666666667, ans=0.1 2024-06-21 21:53:51,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=464821.5, ans=0.125 2024-06-21 21:53:55,420 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.66 vs. limit=22.5 2024-06-21 21:53:57,244 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2024-06-21 21:54:00,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=464858.1666666667, ans=0.125 2024-06-21 21:54:02,383 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.82 vs. 
limit=15.0 2024-06-21 21:54:07,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=464876.5, ans=0.125 2024-06-21 21:54:08,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=464876.5, ans=0.125 2024-06-21 21:54:14,136 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.203e+02 2.380e+02 2.561e+02 3.662e+02, threshold=4.760e+02, percent-clipped=0.0 2024-06-21 21:54:14,166 INFO [train.py:1028] (1/2) Epoch 26, batch 650, loss[loss=0.1957, simple_loss=0.253, pruned_loss=0.06915, over 13237.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2499, pruned_loss=0.06495, over 2490392.58 frames. ], batch size: 59, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:54:18,091 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2024-06-21 21:54:24,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=464913.1666666667, ans=0.95 2024-06-21 21:54:26,309 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.37 vs. limit=15.0 2024-06-21 21:54:32,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=464931.5, ans=0.0 2024-06-21 21:54:49,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=15.0 2024-06-21 21:54:49,686 INFO [train.py:1028] (1/2) Epoch 26, batch 700, loss[loss=0.217, simple_loss=0.2682, pruned_loss=0.08293, over 13264.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2504, pruned_loss=0.0655, over 2512027.99 frames. ], batch size: 46, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:54:58,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=465004.8333333333, ans=0.125 2024-06-21 21:55:11,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=465041.5, ans=0.0 2024-06-21 21:55:17,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=465059.8333333333, ans=15.0 2024-06-21 21:55:17,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.07 vs. limit=22.5 2024-06-21 21:55:22,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=465059.8333333333, ans=0.0 2024-06-21 21:55:23,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=465059.8333333333, ans=0.125 2024-06-21 21:55:23,643 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.91 vs. 
limit=10.0 2024-06-21 21:55:25,149 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.274e+02 2.382e+02 2.523e+02 3.095e+02, threshold=4.764e+02, percent-clipped=0.0 2024-06-21 21:55:25,178 INFO [train.py:1028] (1/2) Epoch 26, batch 750, loss[loss=0.1935, simple_loss=0.2588, pruned_loss=0.06411, over 13269.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2501, pruned_loss=0.06516, over 2527411.15 frames. ], batch size: 63, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:55:27,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=465078.1666666667, ans=0.0 2024-06-21 21:55:28,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=465078.1666666667, ans=0.125 2024-06-21 21:55:36,299 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.29 vs. limit=15.0 2024-06-21 21:55:37,497 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.54 vs. limit=22.5 2024-06-21 21:55:45,198 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:55:45,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465133.1666666667, ans=0.1 2024-06-21 21:55:47,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465133.1666666667, ans=0.1 2024-06-21 21:55:51,337 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2024-06-21 21:55:57,362 INFO [train.py:1028] (1/2) Epoch 26, batch 800, loss[loss=0.1693, simple_loss=0.232, pruned_loss=0.05328, over 12938.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2504, pruned_loss=0.06554, over 2540515.65 frames. ], batch size: 36, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:55:57,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=465169.8333333333, ans=0.0 2024-06-21 21:56:00,357 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2024-06-21 21:56:01,829 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=22.5 2024-06-21 21:56:02,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=465169.8333333333, ans=0.0 2024-06-21 21:56:09,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=13.73 vs. limit=15.0 2024-06-21 21:56:10,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.87 vs. 
limit=15.0 2024-06-21 21:56:15,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=465206.5, ans=0.0 2024-06-21 21:56:15,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465224.8333333333, ans=0.1 2024-06-21 21:56:23,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=465243.1666666667, ans=0.05 2024-06-21 21:56:34,741 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.308e+02 2.485e+02 2.667e+02 3.454e+02, threshold=4.970e+02, percent-clipped=0.0 2024-06-21 21:56:34,773 INFO [train.py:1028] (1/2) Epoch 26, batch 850, loss[loss=0.1849, simple_loss=0.2485, pruned_loss=0.06072, over 13142.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2501, pruned_loss=0.06523, over 2550804.82 frames. ], batch size: 95, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:56:37,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=12.0 2024-06-21 21:56:38,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465261.5, ans=0.1 2024-06-21 21:56:39,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=465261.5, ans=0.0 2024-06-21 21:56:56,476 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2024-06-21 21:57:03,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=465334.8333333333, ans=0.05 2024-06-21 21:57:03,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=465334.8333333333, ans=0.0 2024-06-21 21:57:05,453 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:57:07,128 INFO [train.py:1028] (1/2) Epoch 26, batch 900, loss[loss=0.1832, simple_loss=0.2496, pruned_loss=0.0584, over 12950.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2499, pruned_loss=0.06544, over 2555478.66 frames. ], batch size: 36, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:57:09,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=465353.1666666667, ans=0.2 2024-06-21 21:57:10,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=465353.1666666667, ans=0.0 2024-06-21 21:57:15,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=465371.5, ans=0.0 2024-06-21 21:57:16,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=465371.5, ans=0.015 2024-06-21 21:57:23,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=465389.8333333333, ans=0.95 2024-06-21 21:57:24,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.56 vs. 
limit=10.0 2024-06-21 21:57:42,920 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.244e+02 2.489e+02 2.752e+02 3.970e+02, threshold=4.978e+02, percent-clipped=0.0 2024-06-21 21:57:42,950 INFO [train.py:1028] (1/2) Epoch 26, batch 950, loss[loss=0.1711, simple_loss=0.2341, pruned_loss=0.05402, over 12969.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2499, pruned_loss=0.06557, over 2558336.15 frames. ], batch size: 39, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:57:47,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.69 vs. limit=15.0 2024-06-21 21:57:49,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=465463.1666666667, ans=0.0 2024-06-21 21:57:50,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=465463.1666666667, ans=0.125 2024-06-21 21:58:00,547 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.93 vs. limit=22.5 2024-06-21 21:58:06,098 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=465499.8333333333, ans=0.1 2024-06-21 21:58:11,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=465518.1666666667, ans=0.125 2024-06-21 21:58:12,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=465518.1666666667, ans=0.125 2024-06-21 21:58:14,749 INFO [train.py:1028] (1/2) Epoch 26, batch 1000, loss[loss=0.197, simple_loss=0.2569, pruned_loss=0.06855, over 13105.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2494, pruned_loss=0.06581, over 2561895.38 frames. ], batch size: 48, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:58:17,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=465536.5, ans=0.2 2024-06-21 21:58:18,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.97 vs. limit=22.5 2024-06-21 21:58:23,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=465554.8333333333, ans=0.0 2024-06-21 21:58:24,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=465554.8333333333, ans=0.125 2024-06-21 21:58:38,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=465591.5, ans=0.0 2024-06-21 21:58:49,548 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.233e+02 2.348e+02 2.500e+02 3.277e+02, threshold=4.697e+02, percent-clipped=0.0 2024-06-21 21:58:49,578 INFO [train.py:1028] (1/2) Epoch 26, batch 1050, loss[loss=0.1715, simple_loss=0.2353, pruned_loss=0.0538, over 13153.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2496, pruned_loss=0.06574, over 2565824.75 frames. 
], batch size: 77, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:58:54,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=465628.1666666667, ans=0.0 2024-06-21 21:59:07,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=465664.8333333333, ans=0.0 2024-06-21 21:59:13,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=465683.1666666667, ans=0.0 2024-06-21 21:59:15,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=465701.5, ans=0.125 2024-06-21 21:59:17,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=465701.5, ans=0.0 2024-06-21 21:59:24,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=465701.5, ans=0.0 2024-06-21 21:59:25,679 INFO [train.py:1028] (1/2) Epoch 26, batch 1100, loss[loss=0.1758, simple_loss=0.2415, pruned_loss=0.05502, over 13205.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2499, pruned_loss=0.06568, over 2570212.82 frames. ], batch size: 52, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:59:30,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465719.8333333333, ans=0.1 2024-06-21 21:59:32,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.30 vs. limit=10.0 2024-06-21 21:59:35,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2024-06-21 21:59:38,415 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2024-06-21 21:59:48,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=465774.8333333333, ans=0.125 2024-06-21 21:59:49,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=465774.8333333333, ans=0.0 2024-06-21 21:59:50,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=465774.8333333333, ans=0.125 2024-06-21 21:59:55,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=465793.1666666667, ans=0.2 2024-06-21 21:59:56,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465793.1666666667, ans=0.1 2024-06-21 21:59:58,431 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.223e+02 2.340e+02 2.467e+02 3.194e+02, threshold=4.679e+02, percent-clipped=0.0 2024-06-21 21:59:58,461 INFO [train.py:1028] (1/2) Epoch 26, batch 1150, loss[loss=0.1814, simple_loss=0.2387, pruned_loss=0.06203, over 13262.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2503, pruned_loss=0.06574, over 2571643.91 frames. 
], batch size: 52, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:59:58,855 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2024-06-21 22:00:02,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=465811.5, ans=10.0 2024-06-21 22:00:05,463 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2024-06-21 22:00:05,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=465829.8333333333, ans=0.025 2024-06-21 22:00:18,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=465866.5, ans=0.0 2024-06-21 22:00:35,120 INFO [train.py:1028] (1/2) Epoch 26, batch 1200, loss[loss=0.2147, simple_loss=0.2719, pruned_loss=0.07877, over 13146.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2508, pruned_loss=0.06603, over 2573664.05 frames. ], batch size: 77, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:00:35,370 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:00:45,407 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:00:53,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=465939.8333333333, ans=0.0 2024-06-21 22:00:55,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=465958.1666666667, ans=0.125 2024-06-21 22:01:00,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=465958.1666666667, ans=0.2 2024-06-21 22:01:08,473 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.277e+02 2.418e+02 2.581e+02 3.681e+02, threshold=4.836e+02, percent-clipped=0.0 2024-06-21 22:01:08,502 INFO [train.py:1028] (1/2) Epoch 26, batch 1250, loss[loss=0.1834, simple_loss=0.2395, pruned_loss=0.0637, over 13194.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2506, pruned_loss=0.06592, over 2583120.84 frames. ], batch size: 112, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:01:24,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=466031.5, ans=0.0 2024-06-21 22:01:26,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.93 vs. limit=15.0 2024-06-21 22:01:45,340 INFO [train.py:1028] (1/2) Epoch 26, batch 1300, loss[loss=0.2031, simple_loss=0.2549, pruned_loss=0.07563, over 12685.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2515, pruned_loss=0.06649, over 2584577.85 frames. 
], batch size: 176, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:01:46,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466086.5, ans=0.1 2024-06-21 22:01:52,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=466104.8333333333, ans=0.125 2024-06-21 22:01:53,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.74 vs. limit=6.0 2024-06-21 22:02:07,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=466141.5, ans=0.0 2024-06-21 22:02:13,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=466159.8333333333, ans=0.125 2024-06-21 22:02:18,983 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.197e+02 2.333e+02 2.561e+02 3.124e+02, threshold=4.666e+02, percent-clipped=0.0 2024-06-21 22:02:19,014 INFO [train.py:1028] (1/2) Epoch 26, batch 1350, loss[loss=0.1925, simple_loss=0.2593, pruned_loss=0.06286, over 13233.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2513, pruned_loss=0.06629, over 2586536.83 frames. ], batch size: 59, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:02:19,729 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:02:37,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=466214.8333333333, ans=0.125 2024-06-21 22:02:40,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=466233.1666666667, ans=0.125 2024-06-21 22:02:46,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=466233.1666666667, ans=0.5 2024-06-21 22:02:46,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=466233.1666666667, ans=0.1 2024-06-21 22:02:52,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=466251.5, ans=0.0 2024-06-21 22:02:55,534 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.98 vs. limit=15.0 2024-06-21 22:02:55,922 INFO [train.py:1028] (1/2) Epoch 26, batch 1400, loss[loss=0.2098, simple_loss=0.2755, pruned_loss=0.07208, over 12425.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.251, pruned_loss=0.06617, over 2587497.60 frames. ], batch size: 25, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:03:03,140 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=15.0 2024-06-21 22:03:21,621 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. 
limit=15.0 2024-06-21 22:03:32,573 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.244e+02 2.394e+02 2.571e+02 3.266e+02, threshold=4.788e+02, percent-clipped=0.0 2024-06-21 22:03:32,608 INFO [train.py:1028] (1/2) Epoch 26, batch 1450, loss[loss=0.1958, simple_loss=0.2461, pruned_loss=0.07276, over 13108.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2507, pruned_loss=0.06631, over 2587267.13 frames. ], batch size: 121, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:03:40,358 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2024-06-21 22:03:40,920 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:03:42,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=466379.8333333333, ans=0.125 2024-06-21 22:03:52,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=466416.5, ans=0.0 2024-06-21 22:04:05,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=466453.1666666667, ans=0.025 2024-06-21 22:04:05,770 INFO [train.py:1028] (1/2) Epoch 26, batch 1500, loss[loss=0.1701, simple_loss=0.2333, pruned_loss=0.05349, over 13208.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2507, pruned_loss=0.06629, over 2589547.43 frames. ], batch size: 83, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:04:08,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=466453.1666666667, ans=0.0 2024-06-21 22:04:12,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=466471.5, ans=0.125 2024-06-21 22:04:35,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. limit=6.0 2024-06-21 22:04:36,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=466526.5, ans=0.2 2024-06-21 22:04:38,694 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.224e+02 2.425e+02 2.566e+02 2.949e+02, threshold=4.850e+02, percent-clipped=0.0 2024-06-21 22:04:38,732 INFO [train.py:1028] (1/2) Epoch 26, batch 1550, loss[loss=0.2046, simple_loss=0.2613, pruned_loss=0.07391, over 13162.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2511, pruned_loss=0.06677, over 2584282.83 frames. ], batch size: 103, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:04:40,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=466544.8333333333, ans=0.125 2024-06-21 22:04:44,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=466544.8333333333, ans=0.2 2024-06-21 22:04:45,505 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. 
limit=15.0 2024-06-21 22:04:46,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=466544.8333333333, ans=0.125 2024-06-21 22:04:47,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=466563.1666666667, ans=0.125 2024-06-21 22:04:53,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.70 vs. limit=15.0 2024-06-21 22:04:56,705 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.20 vs. limit=22.5 2024-06-21 22:04:58,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=466581.5, ans=0.0 2024-06-21 22:05:14,172 INFO [train.py:1028] (1/2) Epoch 26, batch 1600, loss[loss=0.1675, simple_loss=0.2265, pruned_loss=0.05425, over 13235.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.251, pruned_loss=0.06646, over 2580279.09 frames. ], batch size: 77, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:05:16,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=466636.5, ans=0.2 2024-06-21 22:05:23,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=466654.8333333333, ans=0.125 2024-06-21 22:05:31,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=466673.1666666667, ans=0.125 2024-06-21 22:05:41,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466691.5, ans=0.1 2024-06-21 22:05:49,164 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.258e+02 2.358e+02 2.543e+02 3.752e+02, threshold=4.715e+02, percent-clipped=0.0 2024-06-21 22:05:49,195 INFO [train.py:1028] (1/2) Epoch 26, batch 1650, loss[loss=0.2034, simple_loss=0.2537, pruned_loss=0.07657, over 13171.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2511, pruned_loss=0.06628, over 2575961.29 frames. ], batch size: 95, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:05:49,680 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.20 vs. limit=22.5 2024-06-21 22:05:54,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=466728.1666666667, ans=0.2 2024-06-21 22:06:01,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=466764.8333333333, ans=0.5 2024-06-21 22:06:02,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=466764.8333333333, ans=0.2 2024-06-21 22:06:12,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. 
limit=6.0 2024-06-21 22:06:15,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=466801.5, ans=0.0 2024-06-21 22:06:22,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466819.8333333333, ans=0.1 2024-06-21 22:06:22,592 INFO [train.py:1028] (1/2) Epoch 26, batch 1700, loss[loss=0.1751, simple_loss=0.2443, pruned_loss=0.05294, over 12340.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2512, pruned_loss=0.06639, over 2581037.32 frames. ], batch size: 25, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:06:28,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=466819.8333333333, ans=0.125 2024-06-21 22:06:40,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.78 vs. limit=10.0 2024-06-21 22:06:46,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=466874.8333333333, ans=0.125 2024-06-21 22:06:47,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=466874.8333333333, ans=0.2 2024-06-21 22:06:55,524 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=466893.1666666667, ans=0.125 2024-06-21 22:06:59,984 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.224e+02 2.459e+02 2.575e+02 3.376e+02, threshold=4.918e+02, percent-clipped=0.0 2024-06-21 22:07:00,017 INFO [train.py:1028] (1/2) Epoch 26, batch 1750, loss[loss=0.2144, simple_loss=0.2722, pruned_loss=0.07832, over 12551.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2514, pruned_loss=0.0664, over 2581362.16 frames. ], batch size: 22, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:07:06,493 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2024-06-21 22:07:11,660 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=466929.8333333333, ans=0.125 2024-06-21 22:07:22,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=466966.5, ans=0.0 2024-06-21 22:07:23,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=466966.5, ans=0.02 2024-06-21 22:07:28,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.58 vs. limit=22.5 2024-06-21 22:07:34,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.84 vs. limit=15.0 2024-06-21 22:07:35,633 INFO [train.py:1028] (1/2) Epoch 26, batch 1800, loss[loss=0.1757, simple_loss=0.24, pruned_loss=0.05569, over 13257.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2515, pruned_loss=0.06666, over 2581655.63 frames. 
], batch size: 67, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:07:37,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=467003.1666666667, ans=0.125 2024-06-21 22:07:38,715 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.55 vs. limit=15.0 2024-06-21 22:07:39,097 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=467003.1666666667, ans=0.125 2024-06-21 22:07:53,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=467039.8333333333, ans=0.0 2024-06-21 22:07:55,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=467058.1666666667, ans=0.1 2024-06-21 22:07:57,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=467058.1666666667, ans=0.125 2024-06-21 22:07:59,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=467058.1666666667, ans=0.05 2024-06-21 22:08:00,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.53 vs. limit=22.5 2024-06-21 22:08:07,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=467094.8333333333, ans=0.125 2024-06-21 22:08:08,537 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.221e+02 2.327e+02 2.523e+02 3.559e+02, threshold=4.654e+02, percent-clipped=0.0 2024-06-21 22:08:08,565 INFO [train.py:1028] (1/2) Epoch 26, batch 1850, loss[loss=0.2013, simple_loss=0.2599, pruned_loss=0.07133, over 13168.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2517, pruned_loss=0.06653, over 2582370.52 frames. ], batch size: 83, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:08:08,769 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.486e+00 2024-06-21 22:08:09,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=467094.8333333333, ans=0.0 2024-06-21 22:08:16,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=467113.1666666667, ans=0.0 2024-06-21 22:08:26,019 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:08:31,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=467149.8333333333, ans=0.1 2024-06-21 22:08:33,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=467168.1666666667, ans=0.2 2024-06-21 22:08:43,658 INFO [train.py:1028] (1/2) Epoch 26, batch 1900, loss[loss=0.1957, simple_loss=0.2529, pruned_loss=0.06925, over 13153.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2513, pruned_loss=0.06662, over 2585161.14 frames. 
], batch size: 95, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:08:44,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=467186.5, ans=0.0 2024-06-21 22:08:59,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=467223.1666666667, ans=0.0 2024-06-21 22:09:13,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=467259.8333333333, ans=0.125 2024-06-21 22:09:15,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.57 vs. limit=15.0 2024-06-21 22:09:15,667 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.217e+02 2.364e+02 2.500e+02 3.346e+02, threshold=4.728e+02, percent-clipped=0.0 2024-06-21 22:09:15,697 INFO [train.py:1028] (1/2) Epoch 26, batch 1950, loss[loss=0.1955, simple_loss=0.2558, pruned_loss=0.06758, over 13275.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2511, pruned_loss=0.06672, over 2591271.56 frames. ], batch size: 52, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:09:24,342 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.47 vs. limit=15.0 2024-06-21 22:09:26,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2024-06-21 22:09:34,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467314.8333333333, ans=0.1 2024-06-21 22:09:46,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=467351.5, ans=0.125 2024-06-21 22:09:51,449 INFO [train.py:1028] (1/2) Epoch 26, batch 2000, loss[loss=0.1985, simple_loss=0.2598, pruned_loss=0.06863, over 12518.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2505, pruned_loss=0.06657, over 2587426.87 frames. ], batch size: 22, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:09:53,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=467369.8333333333, ans=0.125 2024-06-21 22:09:55,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=467369.8333333333, ans=0.125 2024-06-21 22:09:58,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=467388.1666666667, ans=0.0 2024-06-21 22:10:02,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=467388.1666666667, ans=0.0 2024-06-21 22:10:03,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=467388.1666666667, ans=0.1 2024-06-21 22:10:24,013 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.263e+02 2.384e+02 2.469e+02 3.091e+02, threshold=4.769e+02, percent-clipped=0.0 2024-06-21 22:10:24,042 INFO [train.py:1028] (1/2) Epoch 26, batch 2050, loss[loss=0.1846, simple_loss=0.2423, pruned_loss=0.06346, over 12582.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2508, pruned_loss=0.06665, over 2582956.49 frames. 
], batch size: 29, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:10:29,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.77 vs. limit=15.0 2024-06-21 22:10:39,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=467498.1666666667, ans=0.0 2024-06-21 22:10:40,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.71 vs. limit=10.0 2024-06-21 22:10:41,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=467498.1666666667, ans=0.0 2024-06-21 22:10:49,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=467516.5, ans=0.025 2024-06-21 22:10:56,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=467534.8333333333, ans=0.0 2024-06-21 22:10:59,912 INFO [train.py:1028] (1/2) Epoch 26, batch 2100, loss[loss=0.1937, simple_loss=0.261, pruned_loss=0.06322, over 13213.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2516, pruned_loss=0.06663, over 2585452.31 frames. ], batch size: 59, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:11:02,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=467553.1666666667, ans=0.5 2024-06-21 22:11:06,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=467571.5, ans=0.125 2024-06-21 22:11:18,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=467608.1666666667, ans=0.0 2024-06-21 22:11:34,925 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.240e+02 2.368e+02 2.518e+02 3.356e+02, threshold=4.736e+02, percent-clipped=0.0 2024-06-21 22:11:34,954 INFO [train.py:1028] (1/2) Epoch 26, batch 2150, loss[loss=0.1547, simple_loss=0.2158, pruned_loss=0.04685, over 13200.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2514, pruned_loss=0.06643, over 2587816.58 frames. ], batch size: 52, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:11:37,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.44 vs. limit=15.0 2024-06-21 22:11:41,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=467663.1666666667, ans=0.025 2024-06-21 22:11:56,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=467699.8333333333, ans=0.0 2024-06-21 22:12:01,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=467718.1666666667, ans=0.125 2024-06-21 22:12:05,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=467718.1666666667, ans=0.125 2024-06-21 22:12:07,650 INFO [train.py:1028] (1/2) Epoch 26, batch 2200, loss[loss=0.1931, simple_loss=0.245, pruned_loss=0.07056, over 13204.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2515, pruned_loss=0.06662, over 2588134.96 frames. 
], batch size: 83, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:12:08,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=467736.5, ans=0.125 2024-06-21 22:12:11,202 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.62 vs. limit=15.0 2024-06-21 22:12:22,174 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.68 vs. limit=22.5 2024-06-21 22:12:28,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467791.5, ans=0.1 2024-06-21 22:12:33,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=467809.8333333333, ans=0.125 2024-06-21 22:12:39,803 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.201e+02 2.362e+02 2.509e+02 2.984e+02, threshold=4.725e+02, percent-clipped=0.0 2024-06-21 22:12:39,839 INFO [train.py:1028] (1/2) Epoch 26, batch 2250, loss[loss=0.1968, simple_loss=0.2684, pruned_loss=0.06256, over 13271.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2516, pruned_loss=0.06655, over 2587082.94 frames. ], batch size: 63, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:12:43,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=467828.1666666667, ans=0.2 2024-06-21 22:12:46,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=467846.5, ans=0.0 2024-06-21 22:12:49,888 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.03 vs. limit=15.0 2024-06-21 22:12:50,402 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.28 vs. limit=15.0 2024-06-21 22:12:55,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=467864.8333333333, ans=0.0 2024-06-21 22:13:12,947 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0 2024-06-21 22:13:15,110 INFO [train.py:1028] (1/2) Epoch 26, batch 2300, loss[loss=0.1961, simple_loss=0.2562, pruned_loss=0.06797, over 12880.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2512, pruned_loss=0.06622, over 2581632.53 frames. ], batch size: 33, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:13:32,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.19 vs. limit=22.5 2024-06-21 22:13:34,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=467956.5, ans=0.2 2024-06-21 22:13:37,782 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.70 vs. 
limit=15.0 2024-06-21 22:13:50,151 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.279e+02 2.418e+02 2.670e+02 3.520e+02, threshold=4.836e+02, percent-clipped=0.0 2024-06-21 22:13:50,179 INFO [train.py:1028] (1/2) Epoch 26, batch 2350, loss[loss=0.1982, simple_loss=0.2546, pruned_loss=0.07091, over 13251.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2511, pruned_loss=0.06622, over 2584790.30 frames. ], batch size: 67, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:13:50,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=468011.5, ans=0.125 2024-06-21 22:13:54,312 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5 2024-06-21 22:13:55,178 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2024-06-21 22:13:56,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=468029.8333333333, ans=0.0 2024-06-21 22:13:58,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=468029.8333333333, ans=0.125 2024-06-21 22:14:21,776 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:14:22,250 INFO [train.py:1028] (1/2) Epoch 26, batch 2400, loss[loss=0.1874, simple_loss=0.2491, pruned_loss=0.06284, over 13335.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2509, pruned_loss=0.06622, over 2587778.50 frames. ], batch size: 46, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:14:25,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.70 vs. limit=22.5 2024-06-21 22:14:25,833 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.94 vs. 
limit=15.0 2024-06-21 22:14:28,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=468121.5, ans=0.025 2024-06-21 22:14:28,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=468121.5, ans=0.0 2024-06-21 22:14:31,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=468121.5, ans=0.125 2024-06-21 22:14:31,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=468121.5, ans=0.125 2024-06-21 22:14:32,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=468121.5, ans=0.125 2024-06-21 22:14:34,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=468139.8333333333, ans=0.2 2024-06-21 22:14:41,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=468139.8333333333, ans=0.0 2024-06-21 22:14:41,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=468139.8333333333, ans=0.95 2024-06-21 22:14:44,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=468158.1666666667, ans=0.125 2024-06-21 22:14:56,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=468194.8333333333, ans=0.05 2024-06-21 22:14:56,632 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.233e+02 2.358e+02 2.594e+02 3.691e+02, threshold=4.715e+02, percent-clipped=0.0 2024-06-21 22:14:56,661 INFO [train.py:1028] (1/2) Epoch 26, batch 2450, loss[loss=0.1676, simple_loss=0.2275, pruned_loss=0.05381, over 13304.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2503, pruned_loss=0.06651, over 2584298.40 frames. ], batch size: 63, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:15:06,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=468213.1666666667, ans=0.2 2024-06-21 22:15:11,463 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=468231.5, ans=0.2 2024-06-21 22:15:32,651 INFO [train.py:1028] (1/2) Epoch 26, batch 2500, loss[loss=0.1913, simple_loss=0.249, pruned_loss=0.06686, over 13216.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.249, pruned_loss=0.06593, over 2587471.98 frames. 
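Most ScheduledFloat records in this log trace hyperparameters (skip rates, dropout probabilities, balancer settings) whose printed `ans` is a function of the global `batch_count`, and by this point in training nearly all of them sit at their final values (skip rates at 0.0, const_attention_rate at 0.025, dropout at 0.1). A minimal piecewise-linear schedule with that behavior; the breakpoints below are hypothetical, chosen only to illustrate the mechanism:

```python
from bisect import bisect_right

def scheduled_float(batch_count: float,
                    points: list[tuple[float, float]]) -> float:
    """points: sorted (batch_count, value) breakpoints; the schedule is
    linear between breakpoints and clamped to the end values outside."""
    xs = [x for x, _ in points]
    if batch_count <= xs[0]:
        return points[0][1]
    if batch_count >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, batch_count)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

# A rate decaying 0.5 -> 0.025 over the first 4000 batches (hypothetical
# breakpoints) is long since flat at batch_count=468121.5, hence ans=0.025:
assert scheduled_float(468121.5, [(0.0, 0.5), (4000.0, 0.025)]) == 0.025
```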
], batch size: 83, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:15:37,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=468286.5, ans=0.2 2024-06-21 22:15:37,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=468286.5, ans=0.2 2024-06-21 22:15:42,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=468304.8333333333, ans=0.125 2024-06-21 22:15:48,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=468323.1666666667, ans=0.125 2024-06-21 22:15:50,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=468323.1666666667, ans=0.125 2024-06-21 22:16:00,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=468359.8333333333, ans=0.04949747468305833 2024-06-21 22:16:02,319 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=12.0 2024-06-21 22:16:05,102 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.166e+02 2.320e+02 2.590e+02 3.730e+02, threshold=4.641e+02, percent-clipped=0.0 2024-06-21 22:16:05,131 INFO [train.py:1028] (1/2) Epoch 26, batch 2550, loss[loss=0.19, simple_loss=0.2499, pruned_loss=0.06501, over 12815.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2478, pruned_loss=0.06567, over 2587846.49 frames. ], batch size: 22, lr: 2.23e-03, grad_scale: 64.0 2024-06-21 22:16:19,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468414.8333333333, ans=0.125 2024-06-21 22:16:36,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=468451.5, ans=0.1 2024-06-21 22:16:36,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=468451.5, ans=0.125 2024-06-21 22:16:39,973 INFO [train.py:1028] (1/2) Epoch 26, batch 2600, loss[loss=0.193, simple_loss=0.2551, pruned_loss=0.06551, over 13257.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2465, pruned_loss=0.06532, over 2586951.47 frames. ], batch size: 52, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:16:42,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=468469.8333333333, ans=0.125 2024-06-21 22:16:46,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=468488.1666666667, ans=0.125 2024-06-21 22:16:48,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=468488.1666666667, ans=0.125 2024-06-21 22:16:54,079 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.11 vs. 
limit=15.0 2024-06-21 22:17:13,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=468543.1666666667, ans=0.07 2024-06-21 22:17:16,182 INFO [train.py:1028] (1/2) Epoch 26, batch 2650, loss[loss=0.1885, simple_loss=0.2392, pruned_loss=0.06892, over 12980.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2456, pruned_loss=0.06503, over 2587213.63 frames. ], batch size: 144, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:17:16,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=468561.5, ans=0.125 2024-06-21 22:17:16,835 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.200e+02 2.333e+02 2.444e+02 3.258e+02, threshold=4.665e+02, percent-clipped=0.0 2024-06-21 22:17:21,881 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.96 vs. limit=12.0 2024-06-21 22:17:23,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468579.8333333333, ans=0.1 2024-06-21 22:17:24,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=468579.8333333333, ans=0.125 2024-06-21 22:17:24,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=468579.8333333333, ans=15.0 2024-06-21 22:17:27,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=468579.8333333333, ans=0.0 2024-06-21 22:17:33,741 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:17:38,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=468616.5, ans=0.125 2024-06-21 22:17:44,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=468634.8333333333, ans=0.125 2024-06-21 22:17:49,683 INFO [train.py:1028] (1/2) Epoch 26, batch 2700, loss[loss=0.1853, simple_loss=0.2361, pruned_loss=0.06726, over 13216.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2442, pruned_loss=0.06483, over 2585531.34 frames. ], batch size: 89, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:17:51,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=468653.1666666667, ans=0.2 2024-06-21 22:17:59,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=468671.5, ans=0.0 2024-06-21 22:18:01,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=468671.5, ans=15.0 2024-06-21 22:18:04,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.79 vs. limit=6.0 2024-06-21 22:18:12,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=468708.1666666667, ans=0.125 2024-06-21 22:18:22,274 INFO [train.py:1028] (1/2) Epoch 26, batch 2750, loss[loss=0.1979, simple_loss=0.2575, pruned_loss=0.06916, over 13292.00 frames. 
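Note that grad_scale drops from 64.0 to 32.0 between the batch 2550 and batch 2600 summaries above and then stays there. With fp16 training that pattern is what dynamic loss scaling produces: the scale is halved when a step sees inf/nan gradients and regrown only after a long run of clean steps. A toy sketch of that policy; the constants are the usual AMP-style defaults, used here for illustration rather than read from the run:

```python
class DynamicLossScale:
    """Toy fp16 dynamic loss scaling; not this run's actual optimizer."""
    def __init__(self, scale: float = 64.0, growth_factor: float = 2.0,
                 backoff_factor: float = 0.5, growth_interval: int = 2000):
        self.scale = scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            self.scale *= self.backoff_factor   # e.g. 64.0 -> 32.0
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
```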
], tot_loss[loss=0.186, simple_loss=0.2435, pruned_loss=0.06427, over 2582815.65 frames. ], batch size: 43, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:18:22,828 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.207e+02 2.359e+02 2.626e+02 3.748e+02, threshold=4.717e+02, percent-clipped=0.0 2024-06-21 22:18:23,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=468744.8333333333, ans=0.125 2024-06-21 22:18:27,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=468744.8333333333, ans=0.035 2024-06-21 22:18:28,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=468744.8333333333, ans=0.0 2024-06-21 22:18:29,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=468744.8333333333, ans=0.125 2024-06-21 22:18:32,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=468763.1666666667, ans=0.2 2024-06-21 22:18:38,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=468781.5, ans=0.035 2024-06-21 22:18:40,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=468781.5, ans=0.125 2024-06-21 22:18:59,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468818.1666666667, ans=0.125 2024-06-21 22:19:00,979 INFO [train.py:1028] (1/2) Epoch 26, batch 2800, loss[loss=0.1903, simple_loss=0.2401, pruned_loss=0.0703, over 10753.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2428, pruned_loss=0.0642, over 2580918.46 frames. ], batch size: 305, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:19:11,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=468854.8333333333, ans=0.0 2024-06-21 22:19:13,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468873.1666666667, ans=0.1 2024-06-21 22:19:19,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468891.5, ans=0.125 2024-06-21 22:19:31,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.89 vs. limit=10.0 2024-06-21 22:19:32,927 INFO [train.py:1028] (1/2) Epoch 26, batch 2850, loss[loss=0.1847, simple_loss=0.2512, pruned_loss=0.05913, over 13356.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2421, pruned_loss=0.06396, over 2579061.10 frames. 
], batch size: 49, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:19:33,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=468928.1666666667, ans=0.025 2024-06-21 22:19:33,529 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.225e+02 2.347e+02 2.498e+02 2.969e+02, threshold=4.693e+02, percent-clipped=0.0 2024-06-21 22:19:36,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468928.1666666667, ans=0.1 2024-06-21 22:20:05,510 INFO [train.py:1028] (1/2) Epoch 26, batch 2900, loss[loss=0.1918, simple_loss=0.2542, pruned_loss=0.06471, over 13097.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2402, pruned_loss=0.0632, over 2586675.08 frames. ], batch size: 55, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:20:07,340 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2024-06-21 22:20:27,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.75 vs. limit=15.0 2024-06-21 22:20:34,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=469093.1666666667, ans=0.0 2024-06-21 22:20:42,092 INFO [train.py:1028] (1/2) Epoch 26, batch 2950, loss[loss=0.1851, simple_loss=0.25, pruned_loss=0.06011, over 13248.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2405, pruned_loss=0.06334, over 2580850.13 frames. ], batch size: 43, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:20:42,626 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.224e+02 2.380e+02 2.563e+02 3.217e+02, threshold=4.759e+02, percent-clipped=0.0 2024-06-21 22:20:46,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=469111.5, ans=0.125 2024-06-21 22:20:47,263 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.01 vs. limit=10.0 2024-06-21 22:20:52,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=469129.8333333333, ans=0.125 2024-06-21 22:20:53,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=469129.8333333333, ans=0.125 2024-06-21 22:21:07,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=469166.5, ans=0.125 2024-06-21 22:21:15,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=469184.8333333333, ans=0.0 2024-06-21 22:21:20,261 INFO [train.py:1028] (1/2) Epoch 26, batch 3000, loss[loss=0.1735, simple_loss=0.2345, pruned_loss=0.05621, over 13183.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2396, pruned_loss=0.06313, over 2579241.90 frames. 
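The Whitening records compare a per-module statistic of the activations against a limit (5.0, 10.0, 15.0, 22.5). One standard way to get such a metric, sketched below as an assumption about what is being measured rather than a copy of scaling.py: for the channel covariance with eigenvalues l_i, the ratio mean(l_i^2) / mean(l_i)^2 is exactly 1.0 for perfectly white activations and grows when a few directions dominate.

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels) activations for one whitening group.
    Returns mean(eig^2) / mean(eig)^2 of the channel covariance: 1.0 for
    white features, larger when the covariance is far from isotropic."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                  # (C, C) covariance
    mean_eig = torch.diagonal(cov).mean()           # trace / C
    mean_eig_sq = (cov * cov).sum() / cov.shape[0]  # ||cov||_F^2 / C
    return mean_eig_sq / (mean_eig * mean_eig)
```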
], batch size: 59, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:21:20,261 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 22:21:25,328 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.8775, 4.3979, 4.3302, 3.7442, 3.7126, 4.0812, 4.0917, 3.7899], device='cuda:1') 2024-06-21 22:21:28,193 INFO [train.py:1060] (1/2) Epoch 26, validation: loss=0.19, simple_loss=0.2513, pruned_loss=0.06441, over 351949.00 frames. 2024-06-21 22:21:28,194 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 22:21:37,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=469221.5, ans=0.2 2024-06-21 22:21:37,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=469221.5, ans=0.125 2024-06-21 22:21:40,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469221.5, ans=0.1 2024-06-21 22:21:51,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=469258.1666666667, ans=0.125 2024-06-21 22:21:55,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=469276.5, ans=0.2 2024-06-21 22:22:01,444 INFO [train.py:1028] (1/2) Epoch 26, batch 3050, loss[loss=0.1831, simple_loss=0.2388, pruned_loss=0.06374, over 13303.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2386, pruned_loss=0.0632, over 2578667.75 frames. ], batch size: 46, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:22:02,085 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.239e+02 2.394e+02 2.560e+02 3.892e+02, threshold=4.788e+02, percent-clipped=0.0 2024-06-21 22:22:38,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0 2024-06-21 22:22:41,814 INFO [train.py:1028] (1/2) Epoch 26, batch 3100, loss[loss=0.1653, simple_loss=0.2144, pruned_loss=0.05807, over 12955.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2377, pruned_loss=0.06282, over 2579000.36 frames. ], batch size: 144, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:22:44,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=469386.5, ans=0.0 2024-06-21 22:23:02,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=469423.1666666667, ans=0.125 2024-06-21 22:23:07,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=15.0 2024-06-21 22:23:19,834 INFO [train.py:1028] (1/2) Epoch 26, batch 3150, loss[loss=0.1795, simple_loss=0.2356, pruned_loss=0.06173, over 12947.00 frames. ], tot_loss[loss=0.1808, simple_loss=0.2367, pruned_loss=0.06245, over 2580839.33 frames. 
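The validation pass above also dumps attn_weights_entropy for one self-attention module, plausibly one entropy value per attention head averaged over query positions: values near log(num_keys) mean nearly uniform attention, values near 0 mean hard single-key attention. A sketch of that diagnostic under those assumptions; the shape and naming are illustrative:

```python
import torch

def attn_weights_entropy(attn: torch.Tensor,
                         eps: float = 1e-20) -> torch.Tensor:
    """attn: (num_heads, num_queries, num_keys), each row a distribution.
    Returns one entropy value per head, averaged over query positions."""
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, queries)
    return entropy.mean(dim=-1)                         # (heads,)
```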
], batch size: 158, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:23:20,444 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.203e+02 2.359e+02 2.601e+02 3.409e+02, threshold=4.718e+02, percent-clipped=0.0 2024-06-21 22:23:22,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=469478.1666666667, ans=0.0 2024-06-21 22:23:23,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=469478.1666666667, ans=0.2 2024-06-21 22:23:33,859 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2024-06-21 22:23:35,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=469514.8333333333, ans=0.1 2024-06-21 22:23:43,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=469533.1666666667, ans=0.125 2024-06-21 22:23:43,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=469533.1666666667, ans=0.0 2024-06-21 22:23:44,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=469533.1666666667, ans=0.0 2024-06-21 22:23:49,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=469551.5, ans=0.125 2024-06-21 22:23:52,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=469569.8333333333, ans=0.125 2024-06-21 22:23:52,631 INFO [train.py:1028] (1/2) Epoch 26, batch 3200, loss[loss=0.1622, simple_loss=0.224, pruned_loss=0.05023, over 13216.00 frames. ], tot_loss[loss=0.1804, simple_loss=0.2364, pruned_loss=0.06222, over 2580749.70 frames. ], batch size: 55, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:23:53,478 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=469569.8333333333, ans=0.2 2024-06-21 22:23:56,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2024-06-21 22:24:04,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469588.1666666667, ans=0.125 2024-06-21 22:24:07,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=469606.5, ans=0.035 2024-06-21 22:24:08,799 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.05 vs. 
limit=6.0 2024-06-21 22:24:10,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=469606.5, ans=0.2 2024-06-21 22:24:10,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=469606.5, ans=0.0 2024-06-21 22:24:10,803 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=469606.5, ans=0.025 2024-06-21 22:24:17,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=469624.8333333333, ans=0.125 2024-06-21 22:24:24,605 INFO [train.py:1028] (1/2) Epoch 26, batch 3250, loss[loss=0.1933, simple_loss=0.2565, pruned_loss=0.06506, over 13232.00 frames. ], tot_loss[loss=0.1804, simple_loss=0.2362, pruned_loss=0.06227, over 2584496.12 frames. ], batch size: 72, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:24:25,177 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.136e+02 2.250e+02 2.442e+02 3.025e+02, threshold=4.499e+02, percent-clipped=0.0 2024-06-21 22:24:25,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=469661.5, ans=0.125 2024-06-21 22:24:39,973 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:24:42,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=469698.1666666667, ans=0.04949747468305833 2024-06-21 22:24:45,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=469698.1666666667, ans=0.2 2024-06-21 22:24:46,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=469698.1666666667, ans=0.125 2024-06-21 22:25:06,055 INFO [train.py:1028] (1/2) Epoch 26, batch 3300, loss[loss=0.194, simple_loss=0.2434, pruned_loss=0.07231, over 12695.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.2352, pruned_loss=0.06161, over 2581127.53 frames. ], batch size: 176, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:25:16,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469771.5, ans=0.1 2024-06-21 22:25:24,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469789.8333333333, ans=0.1 2024-06-21 22:25:31,279 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:25:37,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=469826.5, ans=0.0 2024-06-21 22:25:38,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469844.8333333333, ans=0.1 2024-06-21 22:25:38,753 INFO [train.py:1028] (1/2) Epoch 26, batch 3350, loss[loss=0.1626, simple_loss=0.216, pruned_loss=0.05462, over 12880.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2351, pruned_loss=0.06185, over 2576261.22 frames. 
], batch size: 158, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:25:39,433 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.183e+02 2.310e+02 2.482e+02 3.430e+02, threshold=4.620e+02, percent-clipped=0.0 2024-06-21 22:25:39,683 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:25:54,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=469881.5, ans=0.0 2024-06-21 22:25:59,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=469899.8333333333, ans=0.125 2024-06-21 22:26:00,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=469899.8333333333, ans=0.0 2024-06-21 22:26:12,360 INFO [train.py:1028] (1/2) Epoch 26, batch 3400, loss[loss=0.1801, simple_loss=0.2414, pruned_loss=0.05945, over 12446.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.235, pruned_loss=0.06212, over 2573997.85 frames. ], batch size: 22, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:26:16,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=469936.5, ans=0.125 2024-06-21 22:26:17,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=469936.5, ans=0.125 2024-06-21 22:26:17,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=469936.5, ans=0.1 2024-06-21 22:26:18,760 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=469936.5, ans=0.0 2024-06-21 22:26:22,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.92 vs. limit=15.0 2024-06-21 22:26:31,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=469973.1666666667, ans=0.125 2024-06-21 22:26:31,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=469973.1666666667, ans=0.125 2024-06-21 22:26:49,269 INFO [train.py:1028] (1/2) Epoch 26, batch 3450, loss[loss=0.1907, simple_loss=0.2388, pruned_loss=0.07134, over 12702.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.235, pruned_loss=0.06202, over 2576498.38 frames. 
], batch size: 176, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:26:49,821 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.200e+02 2.336e+02 2.535e+02 3.349e+02, threshold=4.673e+02, percent-clipped=0.0 2024-06-21 22:26:54,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=470028.1666666667, ans=0.0 2024-06-21 22:27:13,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=470083.1666666667, ans=0.125 2024-06-21 22:27:15,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=470083.1666666667, ans=0.125 2024-06-21 22:27:17,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=470083.1666666667, ans=0.125 2024-06-21 22:27:20,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=470101.5, ans=0.95 2024-06-21 22:27:26,090 INFO [train.py:1028] (1/2) Epoch 26, batch 3500, loss[loss=0.158, simple_loss=0.2228, pruned_loss=0.04666, over 12918.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.2348, pruned_loss=0.06186, over 2575971.15 frames. ], batch size: 33, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:27:26,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=470119.8333333333, ans=0.125 2024-06-21 22:27:52,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=470193.1666666667, ans=0.1 2024-06-21 22:27:59,151 INFO [train.py:1028] (1/2) Epoch 26, batch 3550, loss[loss=0.1697, simple_loss=0.2189, pruned_loss=0.06022, over 13157.00 frames. ], tot_loss[loss=0.1785, simple_loss=0.2341, pruned_loss=0.06143, over 2576444.99 frames. ], batch size: 95, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:27:59,701 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.178e+02 2.269e+02 2.418e+02 3.289e+02, threshold=4.539e+02, percent-clipped=0.0 2024-06-21 22:27:59,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=470211.5, ans=0.125 2024-06-21 22:28:03,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.69 vs. limit=22.5 2024-06-21 22:28:12,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=470248.1666666667, ans=0.025 2024-06-21 22:28:33,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=470303.1666666667, ans=0.125 2024-06-21 22:28:34,106 INFO [train.py:1028] (1/2) Epoch 26, batch 3600, loss[loss=0.196, simple_loss=0.2485, pruned_loss=0.07171, over 13312.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2337, pruned_loss=0.06131, over 2580039.79 frames. 
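Balancer parameters (min_positive, max_positive, min_abs, max_abs, prob) appear throughout these records; values such as min_positive=0.05 and max_positive=0.95 bound the fraction of positive activations a channel is allowed to settle at, with prob the chance the constraint is applied on a given forward pass. A sketch of the statistic such a module watches; the training-time module acts through the backward pass, which is not reproduced here:

```python
import torch

def positive_fraction(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels); per-channel fraction of positives."""
    return (x > 0).float().mean(dim=0)

def violates_balance(x: torch.Tensor, min_positive: float = 0.05,
                     max_positive: float = 0.95) -> torch.Tensor:
    """Boolean mask of channels outside the allowed positive fraction."""
    frac = positive_fraction(x)
    return (frac < min_positive) | (frac > max_positive)
```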
], batch size: 49, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:28:38,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=470303.1666666667, ans=0.0 2024-06-21 22:28:38,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=470303.1666666667, ans=0.0 2024-06-21 22:28:39,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=470321.5, ans=0.0 2024-06-21 22:28:42,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=470321.5, ans=0.125 2024-06-21 22:28:57,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=470358.1666666667, ans=0.0 2024-06-21 22:29:09,711 INFO [train.py:1028] (1/2) Epoch 26, batch 3650, loss[loss=0.1917, simple_loss=0.2355, pruned_loss=0.07392, over 13071.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2331, pruned_loss=0.06077, over 2578521.23 frames. ], batch size: 102, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:29:10,296 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.131e+02 2.299e+02 2.507e+02 3.321e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-21 22:29:37,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2024-06-21 22:29:40,826 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:29:41,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=470486.5, ans=0.0 2024-06-21 22:29:42,017 INFO [train.py:1028] (1/2) Epoch 26, batch 3700, loss[loss=0.1653, simple_loss=0.2288, pruned_loss=0.0509, over 13243.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2321, pruned_loss=0.06038, over 2583389.86 frames. ], batch size: 72, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:29:48,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=470504.8333333333, ans=0.0 2024-06-21 22:30:00,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=470523.1666666667, ans=0.125 2024-06-21 22:30:05,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=470541.5, ans=0.0 2024-06-21 22:30:09,113 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:30:11,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=470559.8333333333, ans=0.1 2024-06-21 22:30:14,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=470578.1666666667, ans=0.125 2024-06-21 22:30:15,081 INFO [train.py:1028] (1/2) Epoch 26, batch 3750, loss[loss=0.1725, simple_loss=0.2357, pruned_loss=0.0546, over 12566.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.232, pruned_loss=0.0605, over 2585448.74 frames. 
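The many *_skip_rate records (attention_skip_rate, conv_skip_rate, ff2_skip_rate, bypass.skip_rate) schedule probabilities of skipping a sub-module outright during training, a stochastic-depth-style regularizer; by this point in the run they have all decayed to 0.0 or small residual values. A minimal sketch of that mechanism, as assumed here:

```python
import torch

def maybe_skip(module, x: torch.Tensor, skip_rate: float,
               training: bool = True) -> torch.Tensor:
    """With probability skip_rate (training only), bypass the module and
    return the input unchanged; otherwise run the module as usual."""
    if training and float(torch.rand(())) < skip_rate:
        return x
    return module(x)
```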
], batch size: 22, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:30:15,734 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.153e+02 2.246e+02 2.425e+02 3.051e+02, threshold=4.493e+02, percent-clipped=0.0 2024-06-21 22:30:32,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=470614.8333333333, ans=0.125 2024-06-21 22:30:35,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=470614.8333333333, ans=0.125 2024-06-21 22:30:39,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=470633.1666666667, ans=0.0 2024-06-21 22:30:51,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470669.8333333333, ans=0.1 2024-06-21 22:30:52,028 INFO [train.py:1028] (1/2) Epoch 26, batch 3800, loss[loss=0.1664, simple_loss=0.2209, pruned_loss=0.05593, over 13251.00 frames. ], tot_loss[loss=0.1762, simple_loss=0.2317, pruned_loss=0.06036, over 2582209.79 frames. ], batch size: 83, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:30:54,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=470669.8333333333, ans=0.125 2024-06-21 22:30:59,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=470688.1666666667, ans=0.125 2024-06-21 22:31:24,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=470743.1666666667, ans=0.125 2024-06-21 22:31:27,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=470743.1666666667, ans=0.0 2024-06-21 22:31:27,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=470743.1666666667, ans=0.125 2024-06-21 22:31:29,556 INFO [train.py:1028] (1/2) Epoch 26, batch 3850, loss[loss=0.1791, simple_loss=0.227, pruned_loss=0.06564, over 13008.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.231, pruned_loss=0.05998, over 2580743.40 frames. 
], batch size: 144, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:31:30,204 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.893e+02 2.132e+02 2.259e+02 2.452e+02 3.350e+02, threshold=4.518e+02, percent-clipped=0.0 2024-06-21 22:31:35,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=470779.8333333333, ans=0.1 2024-06-21 22:31:37,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=470779.8333333333, ans=0.0 2024-06-21 22:31:45,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=470798.1666666667, ans=0.125 2024-06-21 22:31:46,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=470798.1666666667, ans=0.125 2024-06-21 22:31:52,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=470816.5, ans=0.025 2024-06-21 22:31:54,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=470816.5, ans=0.125 2024-06-21 22:32:00,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=470834.8333333333, ans=0.025 2024-06-21 22:32:00,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=470834.8333333333, ans=0.0 2024-06-21 22:32:02,313 INFO [train.py:1028] (1/2) Epoch 26, batch 3900, loss[loss=0.1681, simple_loss=0.2201, pruned_loss=0.05802, over 13170.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.2308, pruned_loss=0.06032, over 2584337.39 frames. ], batch size: 83, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:32:07,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=470853.1666666667, ans=0.125 2024-06-21 22:32:07,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=470853.1666666667, ans=0.125 2024-06-21 22:32:09,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=470871.5, ans=0.125 2024-06-21 22:32:19,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470889.8333333333, ans=0.1 2024-06-21 22:32:19,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.64 vs. limit=15.0 2024-06-21 22:32:21,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470908.1666666667, ans=0.1 2024-06-21 22:32:30,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=470926.5, ans=0.125 2024-06-21 22:32:35,178 INFO [train.py:1028] (1/2) Epoch 26, batch 3950, loss[loss=0.1606, simple_loss=0.2137, pruned_loss=0.05376, over 13064.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.2302, pruned_loss=0.05971, over 2588139.53 frames. 
], batch size: 132, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:32:39,145 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.738e+02 2.149e+02 2.289e+02 2.472e+02 4.018e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 22:32:39,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=470944.8333333333, ans=0.0 2024-06-21 22:32:45,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=470963.1666666667, ans=0.07 2024-06-21 22:32:48,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2024-06-21 22:33:00,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=470999.8333333333, ans=0.125 2024-06-21 22:33:11,546 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=471018.1666666667, ans=0.0 2024-06-21 22:33:11,813 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.18 vs. limit=12.0 2024-06-21 22:33:14,721 INFO [train.py:1028] (1/2) Epoch 26, batch 4000, loss[loss=0.1895, simple_loss=0.2511, pruned_loss=0.06395, over 13020.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2304, pruned_loss=0.05986, over 2583062.80 frames. ], batch size: 39, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:33:14,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=471036.5, ans=0.125 2024-06-21 22:33:15,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471036.5, ans=0.1 2024-06-21 22:33:16,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=471036.5, ans=0.2 2024-06-21 22:33:19,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=471036.5, ans=0.125 2024-06-21 22:33:29,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=471073.1666666667, ans=0.125 2024-06-21 22:33:41,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=471109.8333333333, ans=0.0 2024-06-21 22:33:42,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=471109.8333333333, ans=0.05 2024-06-21 22:33:43,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=471109.8333333333, ans=0.0 2024-06-21 22:33:48,297 INFO [train.py:1028] (1/2) Epoch 26, batch 4050, loss[loss=0.1951, simple_loss=0.2408, pruned_loss=0.07472, over 11009.00 frames. ], tot_loss[loss=0.175, simple_loss=0.2305, pruned_loss=0.05976, over 2581393.62 frames. 
], batch size: 304, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:33:48,870 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.148e+02 2.250e+02 2.404e+02 3.385e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-21 22:33:51,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=471128.1666666667, ans=0.0 2024-06-21 22:33:53,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=471128.1666666667, ans=0.125 2024-06-21 22:33:55,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.30 vs. limit=15.0 2024-06-21 22:34:03,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=471164.8333333333, ans=0.2 2024-06-21 22:34:09,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471183.1666666667, ans=0.1 2024-06-21 22:34:21,080 INFO [train.py:1028] (1/2) Epoch 26, batch 4100, loss[loss=0.181, simple_loss=0.2266, pruned_loss=0.06773, over 12982.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2303, pruned_loss=0.06, over 2577517.05 frames. ], batch size: 102, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:34:21,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471219.8333333333, ans=0.1 2024-06-21 22:34:24,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=471219.8333333333, ans=0.0 2024-06-21 22:34:26,812 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2024-06-21 22:34:42,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471256.5, ans=0.1 2024-06-21 22:34:57,963 INFO [train.py:1028] (1/2) Epoch 26, batch 4150, loss[loss=0.1594, simple_loss=0.2174, pruned_loss=0.05073, over 13103.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.2298, pruned_loss=0.05984, over 2575911.88 frames. ], batch size: 55, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:34:58,599 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.243e+02 2.353e+02 2.480e+02 3.313e+02, threshold=4.706e+02, percent-clipped=0.0 2024-06-21 22:35:20,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=471366.5, ans=10.0 2024-06-21 22:35:20,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=471366.5, ans=0.0 2024-06-21 22:35:34,243 INFO [train.py:1028] (1/2) Epoch 26, batch 4200, loss[loss=0.1632, simple_loss=0.214, pruned_loss=0.05623, over 13165.00 frames. ], tot_loss[loss=0.1745, simple_loss=0.2292, pruned_loss=0.05991, over 2579508.83 frames. ], batch size: 103, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:35:34,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. 
limit=15.0 2024-06-21 22:35:40,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=471421.5, ans=0.125 2024-06-21 22:35:40,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=471421.5, ans=0.5 2024-06-21 22:35:45,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=471421.5, ans=0.125 2024-06-21 22:35:47,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=471439.8333333333, ans=0.025 2024-06-21 22:35:50,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=471439.8333333333, ans=0.0 2024-06-21 22:35:52,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471458.1666666667, ans=0.1 2024-06-21 22:35:57,054 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:35:57,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471458.1666666667, ans=0.1 2024-06-21 22:36:03,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=471476.5, ans=0.2 2024-06-21 22:36:03,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=471476.5, ans=0.125 2024-06-21 22:36:06,850 INFO [train.py:1028] (1/2) Epoch 26, batch 4250, loss[loss=0.1677, simple_loss=0.2277, pruned_loss=0.05381, over 13249.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2288, pruned_loss=0.05957, over 2581639.55 frames. ], batch size: 46, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:36:07,466 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.178e+02 2.345e+02 2.576e+02 3.363e+02, threshold=4.690e+02, percent-clipped=0.0 2024-06-21 22:36:10,449 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.39 vs. limit=10.0 2024-06-21 22:36:17,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=471513.1666666667, ans=0.125 2024-06-21 22:36:21,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=471531.5, ans=0.025 2024-06-21 22:36:39,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=471568.1666666667, ans=0.0 2024-06-21 22:36:42,526 INFO [train.py:1028] (1/2) Epoch 26, batch 4300, loss[loss=0.169, simple_loss=0.2255, pruned_loss=0.05621, over 13201.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.2284, pruned_loss=0.05945, over 2581662.65 frames. ], batch size: 59, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:36:43,686 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.33 vs. 
limit=6.0 2024-06-21 22:36:54,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=471604.8333333333, ans=0.025 2024-06-21 22:36:58,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=471623.1666666667, ans=0.0 2024-06-21 22:37:04,962 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=471641.5, ans=0.1 2024-06-21 22:37:13,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2024-06-21 22:37:19,357 INFO [train.py:1028] (1/2) Epoch 26, batch 4350, loss[loss=0.1804, simple_loss=0.2346, pruned_loss=0.06313, over 13237.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.2282, pruned_loss=0.05963, over 2586333.72 frames. ], batch size: 59, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:37:19,907 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.146e+02 2.258e+02 2.415e+02 2.906e+02, threshold=4.516e+02, percent-clipped=0.0 2024-06-21 22:37:32,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.61 vs. limit=10.0 2024-06-21 22:37:44,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=471733.1666666667, ans=0.0 2024-06-21 22:37:45,664 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:37:47,837 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2024-06-21 22:37:52,151 INFO [train.py:1028] (1/2) Epoch 26, batch 4400, loss[loss=0.1823, simple_loss=0.2337, pruned_loss=0.06545, over 13219.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2284, pruned_loss=0.05986, over 2586545.48 frames. ], batch size: 83, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:38:21,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.51 vs. limit=5.0 2024-06-21 22:38:23,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=471843.1666666667, ans=0.125 2024-06-21 22:38:24,670 INFO [train.py:1028] (1/2) Epoch 26, batch 4450, loss[loss=0.171, simple_loss=0.2247, pruned_loss=0.05865, over 12843.00 frames. ], tot_loss[loss=0.1745, simple_loss=0.2286, pruned_loss=0.06016, over 2580369.71 frames. 
], batch size: 33, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:38:25,334 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.098e+02 2.208e+02 2.373e+02 2.901e+02, threshold=4.415e+02, percent-clipped=0.0 2024-06-21 22:38:27,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=471861.5, ans=0.125 2024-06-21 22:38:39,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=471879.8333333333, ans=0.2 2024-06-21 22:38:43,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=471898.1666666667, ans=22.5 2024-06-21 22:38:50,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=471916.5, ans=0.1 2024-06-21 22:39:00,360 INFO [train.py:1028] (1/2) Epoch 26, batch 4500, loss[loss=0.1894, simple_loss=0.2435, pruned_loss=0.06766, over 13260.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2279, pruned_loss=0.05968, over 2584515.18 frames. ], batch size: 89, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:39:00,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=471953.1666666667, ans=0.125 2024-06-21 22:39:18,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=471989.8333333333, ans=0.025 2024-06-21 22:39:36,733 INFO [train.py:1028] (1/2) Epoch 26, batch 4550, loss[loss=0.1613, simple_loss=0.2209, pruned_loss=0.05089, over 13235.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2281, pruned_loss=0.0596, over 2588519.44 frames. ], batch size: 52, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:39:37,279 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.134e+02 2.272e+02 2.428e+02 2.828e+02, threshold=4.543e+02, percent-clipped=0.0 2024-06-21 22:39:42,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=472063.1666666667, ans=0.0 2024-06-21 22:39:49,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.01 vs. limit=22.5 2024-06-21 22:39:52,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=472081.5, ans=0.0 2024-06-21 22:39:59,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=472099.8333333333, ans=0.04949747468305833 2024-06-21 22:40:06,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=472118.1666666667, ans=0.0 2024-06-21 22:40:08,632 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2024-06-21 22:40:09,504 INFO [train.py:1028] (1/2) Epoch 26, batch 4600, loss[loss=0.1966, simple_loss=0.2433, pruned_loss=0.07491, over 12588.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2287, pruned_loss=0.0598, over 2585599.81 frames. 
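The dense scaling.py:214 lines print ScheduledFloat values (dropout probabilities, skip rates, balancer bounds) evaluated at the current fractional batch_count. These behave like piecewise-linear schedules over training progress; a minimal sketch under that reading, with illustrative breakpoints rather than the recipe's real ones:

```python
def scheduled_float(batch_count, points):
    """Piecewise-linear schedule over batch count.
    `points` is a sorted list of (batch_count, value) breakpoints
    (illustrative sketch, not icefall's ScheduledFloat class)."""
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# e.g. a skip-rate that decays early in training and then stays flat:
print(scheduled_float(471568.17, [(0.0, 0.1), (20000.0, 0.01), (50000.0, 0.0)]))  # -> 0.0
```

This late in training (batch_count ≈ 4.7e5) nearly every schedule sits at its final constant, which is why values like ans=0.125, ans=0.025 and ans=0.1 repeat unchanged from record to record.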
], batch size: 202, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:40:25,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.61 vs. limit=22.5 2024-06-21 22:40:27,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=472173.1666666667, ans=0.125 2024-06-21 22:40:29,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=472191.5, ans=0.025 2024-06-21 22:40:45,266 INFO [train.py:1028] (1/2) Epoch 26, batch 4650, loss[loss=0.1727, simple_loss=0.22, pruned_loss=0.06271, over 13144.00 frames. ], tot_loss[loss=0.1739, simple_loss=0.2283, pruned_loss=0.05978, over 2588199.64 frames. ], batch size: 132, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:40:45,841 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.125e+02 2.236e+02 2.357e+02 3.184e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 22:40:54,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2024-06-21 22:40:57,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=472246.5, ans=0.125 2024-06-21 22:41:09,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=472283.1666666667, ans=0.025 2024-06-21 22:41:10,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=472283.1666666667, ans=0.125 2024-06-21 22:41:22,025 INFO [train.py:1028] (1/2) Epoch 26, batch 4700, loss[loss=0.1829, simple_loss=0.24, pruned_loss=0.06293, over 12882.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2283, pruned_loss=0.05991, over 2584040.55 frames. ], batch size: 26, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:41:30,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=472338.1666666667, ans=0.125 2024-06-21 22:41:33,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=472338.1666666667, ans=0.125 2024-06-21 22:41:35,298 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.48 vs. limit=10.0 2024-06-21 22:41:44,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=472374.8333333333, ans=0.125 2024-06-21 22:41:44,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=472374.8333333333, ans=0.0 2024-06-21 22:41:54,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.00 vs. limit=15.0 2024-06-21 22:41:54,963 INFO [train.py:1028] (1/2) Epoch 26, batch 4750, loss[loss=0.1851, simple_loss=0.2319, pruned_loss=0.06917, over 12533.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2281, pruned_loss=0.06002, over 2581552.16 frames. 
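The scaling.py:1023 Whitening lines compare a per-module metric against a limit (e.g. metric=20.61 vs. limit=22.5 above). The metric plausibly measures how far the channel covariance of an activation is from isotropic ("white"), with 1.0 meaning perfectly white and larger values meaning a few directions dominate; only metric > limit would trigger a corrective penalty. A sketch of one such diagnostic, under that assumption:

```python
import torch

def whitening_metric(x, num_groups=1):
    """Ratio of mean squared eigenvalue to squared mean eigenvalue of the
    channel covariance: 1.0 for white features, larger when anisotropic.
    (An assumed definition, for illustration only.)"""
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0)
    metrics = []
    for g in range(num_groups):
        cov = x[:, g, :].t() @ x[:, g, :] / num_frames
        eigs = torch.linalg.eigvalsh(cov)
        metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
    return torch.stack(metrics).mean()
```

On that reading, a line like "metric=11.19 vs. limit=15.0" reports an activation within its allowed anisotropy, so no extra gradient is applied there.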
], batch size: 202, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:41:55,745 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.225e+02 2.414e+02 2.680e+02 3.339e+02, threshold=4.827e+02, percent-clipped=0.0 2024-06-21 22:42:03,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=472429.8333333333, ans=0.0 2024-06-21 22:42:17,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=472466.5, ans=0.125 2024-06-21 22:42:27,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=472484.8333333333, ans=0.125 2024-06-21 22:42:28,293 INFO [train.py:1028] (1/2) Epoch 26, batch 4800, loss[loss=0.1531, simple_loss=0.2141, pruned_loss=0.04603, over 13305.00 frames. ], tot_loss[loss=0.1738, simple_loss=0.2279, pruned_loss=0.05982, over 2578121.45 frames. ], batch size: 63, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:42:33,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=472503.1666666667, ans=0.05 2024-06-21 22:42:33,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=472503.1666666667, ans=0.125 2024-06-21 22:42:41,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472521.5, ans=0.1 2024-06-21 22:43:01,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=472576.5, ans=0.125 2024-06-21 22:43:07,637 INFO [train.py:1028] (1/2) Epoch 26, batch 4850, loss[loss=0.1694, simple_loss=0.2192, pruned_loss=0.05976, over 13313.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.2278, pruned_loss=0.0598, over 2574878.66 frames. ], batch size: 89, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:43:08,327 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.124e+02 2.257e+02 2.407e+02 3.075e+02, threshold=4.514e+02, percent-clipped=0.0 2024-06-21 22:43:09,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=472594.8333333333, ans=22.5 2024-06-21 22:43:13,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=472613.1666666667, ans=0.0 2024-06-21 22:43:14,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=472613.1666666667, ans=0.125 2024-06-21 22:43:19,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=472613.1666666667, ans=0.125 2024-06-21 22:43:30,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=472649.8333333333, ans=0.0 2024-06-21 22:43:41,797 INFO [train.py:1028] (1/2) Epoch 26, batch 4900, loss[loss=0.1665, simple_loss=0.2258, pruned_loss=0.05358, over 13215.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2281, pruned_loss=0.0602, over 2575310.28 frames. 
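In each train.py:1028 record the first loss[...] block is the current batch (roughly 11k-13k frames) while tot_loss[...] is a frame-weighted average over a large recent window: the ~2.58M-frame totals drift up and down rather than grow monotonically, suggesting older batches are decayed out rather than accumulated forever. A sketch of such a statistic; the decay factor is an assumption:

```python
class RunningLoss:
    """Frame-weighted running average with exponential forgetting
    (illustrative sketch of how tot_loss could be maintained)."""
    def __init__(self, decay=0.995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss, batch_frames):
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames  # the reported tot_loss

avg = RunningLoss()
avg.update(0.1665, 13215.0)  # e.g. the batch 4900 record above
```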
], batch size: 59, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:43:57,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=472723.1666666667, ans=0.05 2024-06-21 22:43:57,876 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:44:04,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=472741.5, ans=0.2 2024-06-21 22:44:09,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=472759.8333333333, ans=0.025 2024-06-21 22:44:09,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=472759.8333333333, ans=0.125 2024-06-21 22:44:15,387 INFO [train.py:1028] (1/2) Epoch 26, batch 4950, loss[loss=0.1988, simple_loss=0.237, pruned_loss=0.08032, over 11119.00 frames. ], tot_loss[loss=0.1746, simple_loss=0.2286, pruned_loss=0.06034, over 2570070.11 frames. ], batch size: 303, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:44:15,989 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.172e+02 2.333e+02 2.669e+02 3.582e+02, threshold=4.667e+02, percent-clipped=0.0 2024-06-21 22:44:22,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=472796.5, ans=0.125 2024-06-21 22:44:24,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=472796.5, ans=0.0 2024-06-21 22:44:41,184 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:44:53,171 INFO [train.py:1028] (1/2) Epoch 26, batch 5000, loss[loss=0.1744, simple_loss=0.2235, pruned_loss=0.06264, over 13190.00 frames. ], tot_loss[loss=0.1744, simple_loss=0.2284, pruned_loss=0.06025, over 2574976.55 frames. ], batch size: 95, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:45:21,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=472924.8333333333, ans=0.0 2024-06-21 22:45:23,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.32 vs. limit=15.0 2024-06-21 22:45:24,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472943.1666666667, ans=0.1 2024-06-21 22:45:27,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472943.1666666667, ans=0.1 2024-06-21 22:45:30,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=472943.1666666667, ans=0.2 2024-06-21 22:45:31,725 INFO [train.py:1028] (1/2) Epoch 26, batch 5050, loss[loss=0.1694, simple_loss=0.2242, pruned_loss=0.05732, over 12879.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.2281, pruned_loss=0.05965, over 2572479.46 frames. 
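Each record carries three quantities: loss, simple_loss and pruned_loss, as in a pruned-transducer (RNN-T) objective where a cheap "simple" joiner loss and a pruned full-joiner loss are combined with fixed weights. A 0.5 weight on the simple term reproduces the logged numbers throughout this span (the weight is inferred from the records, not read from code):

```python
def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    """Hedged sketch: the reported `loss` as a weighted sum of the
    simple and pruned transducer terms."""
    return simple_loss_scale * simple_loss + pruned_loss

# e.g. the batch 5050 record above:
print(combined_loss(0.2242, 0.05732))  # 0.16942, matching the logged loss=0.1694
```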
], batch size: 36, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:45:32,327 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.820e+02 2.111e+02 2.272e+02 2.459e+02 3.080e+02, threshold=4.543e+02, percent-clipped=0.0 2024-06-21 22:45:35,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=472961.5, ans=0.0 2024-06-21 22:45:35,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=472961.5, ans=0.125 2024-06-21 22:45:42,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=472979.8333333333, ans=0.0 2024-06-21 22:45:42,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=472979.8333333333, ans=0.025 2024-06-21 22:45:51,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=473016.5, ans=0.125 2024-06-21 22:46:00,969 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.73 vs. limit=15.0 2024-06-21 22:46:05,548 INFO [train.py:1028] (1/2) Epoch 26, batch 5100, loss[loss=0.1715, simple_loss=0.2313, pruned_loss=0.05586, over 12904.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2278, pruned_loss=0.05964, over 2569032.09 frames. ], batch size: 39, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:46:13,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=473071.5, ans=0.125 2024-06-21 22:46:21,045 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.88 vs. limit=12.0 2024-06-21 22:46:29,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=473108.1666666667, ans=0.125 2024-06-21 22:46:34,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2024-06-21 22:46:41,339 INFO [train.py:1028] (1/2) Epoch 26, batch 5150, loss[loss=0.1718, simple_loss=0.2281, pruned_loss=0.05774, over 13060.00 frames. ], tot_loss[loss=0.1738, simple_loss=0.2278, pruned_loss=0.05989, over 2570775.36 frames. 
], batch size: 132, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:46:42,016 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.184e+02 2.383e+02 2.587e+02 3.699e+02, threshold=4.767e+02, percent-clipped=0.0 2024-06-21 22:46:43,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=473144.8333333333, ans=0.0 2024-06-21 22:46:44,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473144.8333333333, ans=0.1 2024-06-21 22:46:48,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=473163.1666666667, ans=0.0 2024-06-21 22:46:58,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=473181.5, ans=0.1 2024-06-21 22:47:08,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=473199.8333333333, ans=0.0 2024-06-21 22:47:14,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=473218.1666666667, ans=0.0 2024-06-21 22:47:17,937 INFO [train.py:1028] (1/2) Epoch 26, batch 5200, loss[loss=0.1782, simple_loss=0.2288, pruned_loss=0.06384, over 13170.00 frames. ], tot_loss[loss=0.1739, simple_loss=0.2277, pruned_loss=0.06003, over 2573694.33 frames. ], batch size: 95, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:47:19,436 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=473236.5, ans=0.125 2024-06-21 22:47:22,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=473236.5, ans=0.0 2024-06-21 22:47:28,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=473254.8333333333, ans=0.125 2024-06-21 22:47:33,659 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.26 vs. limit=15.0 2024-06-21 22:47:37,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=473291.5, ans=0.0 2024-06-21 22:47:44,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=473309.8333333333, ans=0.04949747468305833 2024-06-21 22:47:51,352 INFO [train.py:1028] (1/2) Epoch 26, batch 5250, loss[loss=0.1772, simple_loss=0.2315, pruned_loss=0.0614, over 13276.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2279, pruned_loss=0.06005, over 2570840.19 frames. ], batch size: 52, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:47:51,806 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.26 vs. limit=12.0 2024-06-21 22:47:51,952 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.154e+02 2.268e+02 2.455e+02 3.354e+02, threshold=4.537e+02, percent-clipped=0.0 2024-06-21 22:47:52,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.85 vs. 
limit=15.0 2024-06-21 22:47:54,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=473328.1666666667, ans=0.125 2024-06-21 22:47:54,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=473328.1666666667, ans=0.125 2024-06-21 22:48:06,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=473364.8333333333, ans=0.125 2024-06-21 22:48:13,093 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.60 vs. limit=15.0 2024-06-21 22:48:18,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=473401.5, ans=0.125 2024-06-21 22:48:22,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.47 vs. limit=15.0 2024-06-21 22:48:24,264 INFO [train.py:1028] (1/2) Epoch 26, batch 5300, loss[loss=0.1748, simple_loss=0.2251, pruned_loss=0.06221, over 13040.00 frames. ], tot_loss[loss=0.1733, simple_loss=0.2273, pruned_loss=0.05963, over 2567617.79 frames. ], batch size: 144, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:48:29,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=473419.8333333333, ans=0.2 2024-06-21 22:48:32,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2024-06-21 22:48:33,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=473438.1666666667, ans=0.0 2024-06-21 22:48:34,722 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.07 vs. limit=22.5 2024-06-21 22:48:36,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=473438.1666666667, ans=0.125 2024-06-21 22:48:52,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=473474.8333333333, ans=0.125 2024-06-21 22:49:01,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.80 vs. limit=6.0 2024-06-21 22:49:05,223 INFO [train.py:1028] (1/2) Epoch 26, batch 5350, loss[loss=0.1516, simple_loss=0.2189, pruned_loss=0.04218, over 11431.00 frames. ], tot_loss[loss=0.1728, simple_loss=0.2267, pruned_loss=0.05945, over 2573552.58 frames. ], batch size: 16, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:49:05,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.088e+02 2.226e+02 2.386e+02 3.248e+02, threshold=4.451e+02, percent-clipped=0.0 2024-06-21 22:49:14,017 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.78 vs. 
limit=22.5 2024-06-21 22:49:20,148 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=473548.1666666667, ans=0.0 2024-06-21 22:49:30,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=473566.5, ans=0.125 2024-06-21 22:49:34,573 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:49:38,163 INFO [train.py:1028] (1/2) Epoch 26, batch 5400, loss[loss=0.204, simple_loss=0.2492, pruned_loss=0.07937, over 12255.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.2269, pruned_loss=0.05997, over 2566418.34 frames. ], batch size: 240, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:49:38,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473603.1666666667, ans=0.1 2024-06-21 22:49:43,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=473603.1666666667, ans=0.04949747468305833 2024-06-21 22:49:54,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=473639.8333333333, ans=0.07 2024-06-21 22:49:56,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=473639.8333333333, ans=0.025 2024-06-21 22:49:57,148 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.87 vs. limit=15.0 2024-06-21 22:49:59,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=473658.1666666667, ans=0.125 2024-06-21 22:50:06,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=473676.5, ans=0.0 2024-06-21 22:50:06,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=473676.5, ans=0.09899494936611666 2024-06-21 22:50:11,033 INFO [train.py:1028] (1/2) Epoch 26, batch 5450, loss[loss=0.1907, simple_loss=0.2491, pruned_loss=0.06611, over 12406.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2269, pruned_loss=0.05968, over 2570477.02 frames. 
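The scaling.py:1119 WithLoss lines (loss-sum=0.000e+00 for the self_attn_weights modules) look like an auxiliary penalty attached to the attention weights; a zero loss-sum indicates the penalty is inactive at this point in training. The usual way to attach such a penalty without changing the forward value is a custom autograd function; a sketch of that pattern (not icefall's actual class):

```python
import torch

class WithAuxLoss(torch.autograd.Function):
    """Pass `x` through unchanged, but inject `aux_loss` into the
    backward graph (illustrative sketch)."""
    @staticmethod
    def forward(ctx, x, aux_loss):
        ctx.aux_shape = aux_loss.shape
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # d(objective)/d(aux_loss) = 1: the penalty is simply added
        # to whatever loss the output eventually feeds into.
        return grad_output, torch.ones(ctx.aux_shape, device=grad_output.device)

attn = torch.randn(4, 8, requires_grad=True)
penalty = torch.zeros((), requires_grad=True)  # zero here, mirroring loss-sum=0.000e+00
out = WithAuxLoss.apply(attn, penalty)
out.sum().backward()   # penalty.grad is now 1.0; attn gets the usual gradient
```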
], batch size: 25, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:50:11,721 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.165e+02 2.344e+02 2.492e+02 3.563e+02, threshold=4.689e+02, percent-clipped=0.0 2024-06-21 22:50:14,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=473694.8333333333, ans=0.125 2024-06-21 22:50:18,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=473713.1666666667, ans=0.2 2024-06-21 22:50:28,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=473731.5, ans=0.95 2024-06-21 22:50:39,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=473749.8333333333, ans=0.125 2024-06-21 22:50:43,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=473768.1666666667, ans=0.07 2024-06-21 22:50:47,703 INFO [train.py:1028] (1/2) Epoch 26, batch 5500, loss[loss=0.1917, simple_loss=0.2358, pruned_loss=0.07376, over 12159.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2267, pruned_loss=0.05968, over 2563869.03 frames. ], batch size: 241, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:50:54,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=473786.5, ans=0.025 2024-06-21 22:50:57,802 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=15.0 2024-06-21 22:51:00,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=473804.8333333333, ans=0.125 2024-06-21 22:51:19,284 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:51:23,895 INFO [train.py:1028] (1/2) Epoch 26, batch 5550, loss[loss=0.1685, simple_loss=0.2317, pruned_loss=0.05264, over 13293.00 frames. ], tot_loss[loss=0.1723, simple_loss=0.2261, pruned_loss=0.05927, over 2567431.58 frames. ], batch size: 43, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:51:25,211 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.135e+02 2.262e+02 2.450e+02 3.251e+02, threshold=4.524e+02, percent-clipped=0.0 2024-06-21 22:51:29,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.92 vs. limit=15.0 2024-06-21 22:51:34,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=473896.5, ans=0.125 2024-06-21 22:51:37,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.82 vs. 
limit=10.0 2024-06-21 22:51:37,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=473914.8333333333, ans=0.0 2024-06-21 22:51:38,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=473914.8333333333, ans=0.0 2024-06-21 22:51:39,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=473914.8333333333, ans=10.0 2024-06-21 22:51:41,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=473914.8333333333, ans=0.125 2024-06-21 22:51:56,211 INFO [train.py:1028] (1/2) Epoch 26, batch 5600, loss[loss=0.183, simple_loss=0.2274, pruned_loss=0.06928, over 13242.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2259, pruned_loss=0.05927, over 2568825.62 frames. ], batch size: 89, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:51:59,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=473969.8333333333, ans=0.125 2024-06-21 22:51:59,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.30 vs. limit=15.0 2024-06-21 22:52:29,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=474061.5, ans=0.0 2024-06-21 22:52:29,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=474061.5, ans=0.2 2024-06-21 22:52:30,257 INFO [train.py:1028] (1/2) Epoch 26, batch 5650, loss[loss=0.1799, simple_loss=0.2313, pruned_loss=0.06423, over 12516.00 frames. ], tot_loss[loss=0.1721, simple_loss=0.2259, pruned_loss=0.05915, over 2573903.21 frames. ], batch size: 202, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:52:31,504 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.123e+02 2.212e+02 2.361e+02 3.097e+02, threshold=4.424e+02, percent-clipped=0.0 2024-06-21 22:52:33,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=474061.5, ans=0.025 2024-06-21 22:52:34,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=474061.5, ans=0.0 2024-06-21 22:52:59,807 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.15 vs. limit=22.5 2024-06-21 22:53:02,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=474116.5, ans=0.125 2024-06-21 22:53:09,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474134.8333333333, ans=0.1 2024-06-21 22:53:11,188 INFO [train.py:1028] (1/2) Epoch 26, batch 5700, loss[loss=0.1662, simple_loss=0.2202, pruned_loss=0.05607, over 13273.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2258, pruned_loss=0.05925, over 2577638.68 frames. 
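The grad_scale field in the batch records is the dynamic fp16 loss scale: it sits at 64.0 from batch 4600 onward, then drops to 32.0 at batch 5550 above, which is consistent with the scaler halving its scale after encountering non-finite gradients and growing it again after a run of clean steps. The standard PyTorch loop that produces this behaviour (model, optimizer, loss_fn and batch are placeholders):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped if inf/nan gradients are detected
    scaler.update()          # halves the scale on overflow, slowly doubles it otherwise
    return loss.detach(), scaler.get_scale()  # get_scale() is the logged grad_scale
```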
], batch size: 63, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:53:11,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=474153.1666666667, ans=0.125 2024-06-21 22:53:12,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=474153.1666666667, ans=0.125 2024-06-21 22:53:32,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=474208.1666666667, ans=0.125 2024-06-21 22:53:33,914 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=8.0 2024-06-21 22:53:39,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=474226.5, ans=0.2 2024-06-21 22:53:44,148 INFO [train.py:1028] (1/2) Epoch 26, batch 5750, loss[loss=0.1846, simple_loss=0.2337, pruned_loss=0.06776, over 12751.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2267, pruned_loss=0.05962, over 2578649.33 frames. ], batch size: 176, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:53:45,532 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.189e+02 2.312e+02 2.522e+02 2.990e+02, threshold=4.623e+02, percent-clipped=0.0 2024-06-21 22:53:45,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=474244.8333333333, ans=0.025 2024-06-21 22:53:48,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2024-06-21 22:53:49,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=474244.8333333333, ans=0.0 2024-06-21 22:53:59,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=474281.5, ans=0.95 2024-06-21 22:54:07,196 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.58 vs. limit=22.5 2024-06-21 22:54:11,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=474318.1666666667, ans=0.0 2024-06-21 22:54:12,328 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.13 vs. limit=22.5 2024-06-21 22:54:13,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474318.1666666667, ans=0.1 2024-06-21 22:54:16,937 INFO [train.py:1028] (1/2) Epoch 26, batch 5800, loss[loss=0.1806, simple_loss=0.2286, pruned_loss=0.06631, over 12721.00 frames. ], tot_loss[loss=0.1745, simple_loss=0.2282, pruned_loss=0.06044, over 2577574.48 frames. ], batch size: 176, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:54:21,301 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.89 vs. 
limit=15.0 2024-06-21 22:54:23,473 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:54:26,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474354.8333333333, ans=0.1 2024-06-21 22:54:36,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.28 vs. limit=15.0 2024-06-21 22:54:37,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=474373.1666666667, ans=0.04949747468305833 2024-06-21 22:54:38,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=474373.1666666667, ans=0.2 2024-06-21 22:54:39,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=474391.5, ans=10.0 2024-06-21 22:54:52,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=474409.8333333333, ans=0.95 2024-06-21 22:54:56,003 INFO [train.py:1028] (1/2) Epoch 26, batch 5850, loss[loss=0.1876, simple_loss=0.2361, pruned_loss=0.06952, over 12567.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2301, pruned_loss=0.06103, over 2577978.98 frames. ], batch size: 202, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:54:57,308 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.292e+02 2.420e+02 2.667e+02 3.986e+02, threshold=4.839e+02, percent-clipped=0.0 2024-06-21 22:55:01,046 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.39 vs. limit=10.0 2024-06-21 22:55:20,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=474483.1666666667, ans=0.125 2024-06-21 22:55:27,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=474501.5, ans=0.2 2024-06-21 22:55:29,109 INFO [train.py:1028] (1/2) Epoch 26, batch 5900, loss[loss=0.1825, simple_loss=0.2326, pruned_loss=0.06625, over 13073.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.232, pruned_loss=0.06152, over 2578923.26 frames. ], batch size: 121, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:55:44,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474556.5, ans=0.1 2024-06-21 22:55:51,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=474574.8333333333, ans=0.0 2024-06-21 22:55:56,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474593.1666666667, ans=0.1 2024-06-21 22:56:02,431 INFO [train.py:1028] (1/2) Epoch 26, batch 5950, loss[loss=0.1724, simple_loss=0.2235, pruned_loss=0.0606, over 13115.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2335, pruned_loss=0.062, over 2582575.62 frames. 
], batch size: 121, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:56:03,756 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.173e+02 2.381e+02 2.579e+02 3.557e+02, threshold=4.763e+02, percent-clipped=0.0 2024-06-21 22:56:09,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=474629.8333333333, ans=0.125 2024-06-21 22:56:10,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=474629.8333333333, ans=0.0 2024-06-21 22:56:12,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2024-06-21 22:56:29,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=474684.8333333333, ans=0.2 2024-06-21 22:56:38,567 INFO [train.py:1028] (1/2) Epoch 26, batch 6000, loss[loss=0.2038, simple_loss=0.2507, pruned_loss=0.07845, over 12247.00 frames. ], tot_loss[loss=0.179, simple_loss=0.234, pruned_loss=0.06199, over 2575782.77 frames. ], batch size: 240, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:56:38,567 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 22:56:46,501 INFO [train.py:1060] (1/2) Epoch 26, validation: loss=0.1911, simple_loss=0.2517, pruned_loss=0.06525, over 351949.00 frames. 2024-06-21 22:56:46,502 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 22:56:50,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=474703.1666666667, ans=0.2 2024-06-21 22:56:54,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=474721.5, ans=0.04949747468305833 2024-06-21 22:57:01,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=474721.5, ans=0.0 2024-06-21 22:57:04,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=474739.8333333333, ans=0.0 2024-06-21 22:57:05,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.46 vs. limit=22.5 2024-06-21 22:57:12,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=474758.1666666667, ans=0.125 2024-06-21 22:57:13,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1.whitening_limit, batch_count=474758.1666666667, ans=10.0 2024-06-21 22:57:14,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=474758.1666666667, ans=0.0 2024-06-21 22:57:16,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=474758.1666666667, ans=0.125 2024-06-21 22:57:25,190 INFO [train.py:1028] (1/2) Epoch 26, batch 6050, loss[loss=0.1876, simple_loss=0.2468, pruned_loss=0.06419, over 13250.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.2349, pruned_loss=0.06208, over 2578319.04 frames. 
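At batch 6000 above the log pauses training, scores the dev set (validation: loss=0.1911 over 351,949 frames, the same frame count as every validation pass), notes peak memory (17821MB), and resumes. A minimal sketch of such a periodic validation pass; compute_loss and valid_dl stand in for the recipe's own functions:

```python
import torch

def compute_validation_loss(model, valid_dl, compute_loss):
    """Score the dev set in eval/no-grad mode, then resume training
    (illustrative sketch)."""
    model.eval()
    tot, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, batch_frames = compute_loss(model, batch)
            tot += loss.item() * batch_frames
            frames += batch_frames
    model.train()
    return tot / frames  # e.g. the logged validation loss=0.1911
```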
], batch size: 40, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:57:26,544 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.254e+02 2.402e+02 2.605e+02 3.298e+02, threshold=4.803e+02, percent-clipped=0.0 2024-06-21 22:57:36,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=474813.1666666667, ans=0.025 2024-06-21 22:57:51,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=474868.1666666667, ans=0.0 2024-06-21 22:57:57,880 INFO [train.py:1028] (1/2) Epoch 26, batch 6100, loss[loss=0.181, simple_loss=0.232, pruned_loss=0.06499, over 13118.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2362, pruned_loss=0.06295, over 2579737.61 frames. ], batch size: 121, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:58:02,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=474886.5, ans=0.125 2024-06-21 22:58:16,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=474923.1666666667, ans=0.125 2024-06-21 22:58:21,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=474941.5, ans=0.0 2024-06-21 22:58:32,091 INFO [train.py:1028] (1/2) Epoch 26, batch 6150, loss[loss=0.1827, simple_loss=0.2359, pruned_loss=0.06481, over 10773.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2381, pruned_loss=0.0639, over 2578442.81 frames. ], batch size: 303, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:58:33,469 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.248e+02 2.395e+02 2.691e+02 3.822e+02, threshold=4.791e+02, percent-clipped=0.0 2024-06-21 22:58:51,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=475014.8333333333, ans=0.05 2024-06-21 22:58:53,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=475014.8333333333, ans=0.125 2024-06-21 22:58:53,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=475014.8333333333, ans=0.0 2024-06-21 22:59:09,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=475051.5, ans=0.125 2024-06-21 22:59:09,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=475051.5, ans=0.0 2024-06-21 22:59:13,627 INFO [train.py:1028] (1/2) Epoch 26, batch 6200, loss[loss=0.2112, simple_loss=0.2735, pruned_loss=0.07444, over 13235.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2393, pruned_loss=0.06383, over 2575754.56 frames. 
], batch size: 89, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:59:14,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=475069.8333333333, ans=0.2 2024-06-21 22:59:30,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=475106.5, ans=0.125 2024-06-21 22:59:37,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=475124.8333333333, ans=0.125 2024-06-21 22:59:47,515 INFO [train.py:1028] (1/2) Epoch 26, batch 6250, loss[loss=0.1944, simple_loss=0.2497, pruned_loss=0.06953, over 13204.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2404, pruned_loss=0.06438, over 2568511.58 frames. ], batch size: 83, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:59:48,980 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.271e+02 2.465e+02 2.690e+02 3.732e+02, threshold=4.931e+02, percent-clipped=0.0 2024-06-21 22:59:50,142 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2024-06-21 23:00:14,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=475234.8333333333, ans=0.125 2024-06-21 23:00:16,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.25 vs. limit=5.0 2024-06-21 23:00:19,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=475234.8333333333, ans=0.0 2024-06-21 23:00:21,239 INFO [train.py:1028] (1/2) Epoch 26, batch 6300, loss[loss=0.1664, simple_loss=0.232, pruned_loss=0.05041, over 11846.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2421, pruned_loss=0.06514, over 2564004.92 frames. ], batch size: 17, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:00:31,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=475271.5, ans=0.125 2024-06-21 23:00:31,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=475271.5, ans=0.1 2024-06-21 23:00:31,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=475271.5, ans=0.0 2024-06-21 23:00:39,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=475289.8333333333, ans=0.125 2024-06-21 23:00:40,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2024-06-21 23:00:57,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=475326.5, ans=0.125 2024-06-21 23:00:58,858 INFO [train.py:1028] (1/2) Epoch 26, batch 6350, loss[loss=0.2096, simple_loss=0.2599, pruned_loss=0.07971, over 12550.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2434, pruned_loss=0.0652, over 2574188.05 frames. 
], batch size: 202, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:01:00,241 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.394e+02 2.641e+02 2.964e+02 4.205e+02, threshold=5.281e+02, percent-clipped=0.0 2024-06-21 23:01:00,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=475344.8333333333, ans=0.0 2024-06-21 23:01:03,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=475344.8333333333, ans=0.125 2024-06-21 23:01:11,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=475363.1666666667, ans=0.0 2024-06-21 23:01:26,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=475399.8333333333, ans=0.0 2024-06-21 23:01:32,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=475418.1666666667, ans=0.125 2024-06-21 23:01:35,889 INFO [train.py:1028] (1/2) Epoch 26, batch 6400, loss[loss=0.1628, simple_loss=0.2255, pruned_loss=0.05008, over 13228.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2455, pruned_loss=0.06584, over 2576773.58 frames. ], batch size: 67, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:01:41,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=475436.5, ans=0.0 2024-06-21 23:01:41,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=475454.8333333333, ans=0.125 2024-06-21 23:01:43,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=475454.8333333333, ans=0.125 2024-06-21 23:01:47,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=475454.8333333333, ans=0.0 2024-06-21 23:01:47,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=475454.8333333333, ans=0.1 2024-06-21 23:01:50,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=475473.1666666667, ans=0.0 2024-06-21 23:01:54,030 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.46 vs. limit=10.0 2024-06-21 23:01:56,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=475491.5, ans=0.125 2024-06-21 23:02:05,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=475509.8333333333, ans=10.0 2024-06-21 23:02:05,853 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=15.0 2024-06-21 23:02:08,594 INFO [train.py:1028] (1/2) Epoch 26, batch 6450, loss[loss=0.2364, simple_loss=0.2805, pruned_loss=0.09617, over 12584.00 frames. ], tot_loss[loss=0.19, simple_loss=0.247, pruned_loss=0.06647, over 2582196.13 frames. 
], batch size: 202, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:02:09,876 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.309e+02 2.453e+02 2.695e+02 3.835e+02, threshold=4.905e+02, percent-clipped=0.0 2024-06-21 23:02:16,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=475546.5, ans=0.125 2024-06-21 23:02:17,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=475546.5, ans=15.0 2024-06-21 23:02:34,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=475601.5, ans=10.0 2024-06-21 23:02:35,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=475601.5, ans=0.125 2024-06-21 23:02:41,227 INFO [train.py:1028] (1/2) Epoch 26, batch 6500, loss[loss=0.2085, simple_loss=0.2539, pruned_loss=0.08152, over 10880.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2488, pruned_loss=0.06684, over 2584363.66 frames. ], batch size: 304, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:02:44,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=475619.8333333333, ans=0.0 2024-06-21 23:02:44,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.01 vs. limit=22.5 2024-06-21 23:02:45,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=475619.8333333333, ans=0.04949747468305833 2024-06-21 23:02:46,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=475619.8333333333, ans=0.0 2024-06-21 23:02:49,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=475638.1666666667, ans=0.0 2024-06-21 23:03:09,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=475674.8333333333, ans=0.0 2024-06-21 23:03:09,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2024-06-21 23:03:18,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=475693.1666666667, ans=0.0 2024-06-21 23:03:22,913 INFO [train.py:1028] (1/2) Epoch 26, batch 6550, loss[loss=0.191, simple_loss=0.2527, pruned_loss=0.06471, over 12566.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2495, pruned_loss=0.06661, over 2587884.05 frames. 
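Batch sizes in these records swing widely (304 cuts above, 22 right after) while frames per batch stay near 11k-13k, which is what duration-bucketed sampling produces: batches are filled to a fixed total-duration budget, so buckets of short utterances pack many more cuts. A sketch of that batching rule; the 550-second budget and the `duration` attribute are assumptions for illustration:

```python
def make_batches(cuts, max_duration=550.0):
    """Yield batches of cuts whose total duration stays under a budget
    (illustrative sketch; cuts are assumed pre-sorted into buckets of
    similar duration and to expose a `duration` in seconds)."""
    batch, budget = [], 0.0
    for cut in cuts:
        if budget + cut.duration > max_duration and batch:
            yield batch
            batch, budget = [], 0.0
        batch.append(cut)
        budget += cut.duration
    if batch:
        yield batch
```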
], batch size: 22, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:03:24,275 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.416e+02 2.549e+02 2.811e+02 3.833e+02, threshold=5.097e+02, percent-clipped=0.0 2024-06-21 23:03:25,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=475711.5, ans=0.125 2024-06-21 23:03:26,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=475711.5, ans=15.0 2024-06-21 23:03:33,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.89 vs. limit=15.0 2024-06-21 23:03:44,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=475766.5, ans=0.0 2024-06-21 23:03:51,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=475784.8333333333, ans=0.125 2024-06-21 23:03:55,287 INFO [train.py:1028] (1/2) Epoch 26, batch 6600, loss[loss=0.1663, simple_loss=0.2257, pruned_loss=0.05349, over 13288.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2495, pruned_loss=0.06671, over 2590084.80 frames. ], batch size: 72, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:04:06,120 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.53 vs. limit=6.0 2024-06-21 23:04:07,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=475839.8333333333, ans=0.2 2024-06-21 23:04:09,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=475839.8333333333, ans=0.04949747468305833 2024-06-21 23:04:17,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=475858.1666666667, ans=0.125 2024-06-21 23:04:26,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=475876.5, ans=0.1 2024-06-21 23:04:27,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.73 vs. limit=15.0 2024-06-21 23:04:28,792 INFO [train.py:1028] (1/2) Epoch 26, batch 6650, loss[loss=0.2131, simple_loss=0.2704, pruned_loss=0.07793, over 12933.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2508, pruned_loss=0.06726, over 2584189.05 frames. ], batch size: 158, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:04:30,181 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.293e+02 2.453e+02 2.711e+02 3.442e+02, threshold=4.906e+02, percent-clipped=0.0 2024-06-21 23:05:02,548 INFO [train.py:1028] (1/2) Epoch 26, batch 6700, loss[loss=0.2091, simple_loss=0.2625, pruned_loss=0.07783, over 12700.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2518, pruned_loss=0.06749, over 2583703.81 frames. 
], batch size: 176, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:05:09,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=475986.5, ans=0.125 2024-06-21 23:05:14,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=476004.8333333333, ans=10.0 2024-06-21 23:05:31,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=476041.5, ans=0.0 2024-06-21 23:05:35,434 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.43 vs. limit=15.0 2024-06-21 23:05:37,240 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:05:39,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=476059.8333333333, ans=0.0 2024-06-21 23:05:42,905 INFO [train.py:1028] (1/2) Epoch 26, batch 6750, loss[loss=0.222, simple_loss=0.265, pruned_loss=0.08947, over 12212.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2525, pruned_loss=0.0678, over 2576946.10 frames. ], batch size: 240, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:05:44,202 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.325e+02 2.494e+02 2.647e+02 3.717e+02, threshold=4.988e+02, percent-clipped=0.0 2024-06-21 23:05:46,643 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.53 vs. limit=5.0 2024-06-21 23:06:03,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=476133.1666666667, ans=0.0 2024-06-21 23:06:08,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=476151.5, ans=0.125 2024-06-21 23:06:11,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=476151.5, ans=0.125 2024-06-21 23:06:15,971 INFO [train.py:1028] (1/2) Epoch 26, batch 6800, loss[loss=0.1946, simple_loss=0.256, pruned_loss=0.06658, over 13240.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2534, pruned_loss=0.06786, over 2579547.29 frames. ], batch size: 67, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:06:19,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=476169.8333333333, ans=0.1 2024-06-21 23:06:23,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=476188.1666666667, ans=0.125 2024-06-21 23:06:23,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=476188.1666666667, ans=0.025 2024-06-21 23:06:24,721 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2024-06-21 23:06:26,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.21 vs. 
limit=15.0 2024-06-21 23:06:31,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476206.5, ans=0.1 2024-06-21 23:06:35,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=476224.8333333333, ans=0.2 2024-06-21 23:06:39,252 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.67 vs. limit=15.0 2024-06-21 23:06:41,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=476243.1666666667, ans=0.0 2024-06-21 23:06:41,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=476243.1666666667, ans=0.07 2024-06-21 23:06:43,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=476243.1666666667, ans=0.1 2024-06-21 23:06:49,180 INFO [train.py:1028] (1/2) Epoch 26, batch 6850, loss[loss=0.206, simple_loss=0.2745, pruned_loss=0.06879, over 13250.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2542, pruned_loss=0.06786, over 2582635.77 frames. ], batch size: 63, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:06:50,497 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.377e+02 2.593e+02 2.979e+02 4.847e+02, threshold=5.186e+02, percent-clipped=0.0 2024-06-21 23:06:51,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=476261.5, ans=0.125 2024-06-21 23:06:51,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=476261.5, ans=0.125 2024-06-21 23:06:52,872 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.55 vs. limit=15.0 2024-06-21 23:07:11,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=476298.1666666667, ans=0.0 2024-06-21 23:07:30,891 INFO [train.py:1028] (1/2) Epoch 26, batch 6900, loss[loss=0.1892, simple_loss=0.2591, pruned_loss=0.05967, over 13011.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2556, pruned_loss=0.06843, over 2584255.09 frames. 
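A note on the scaling.py:214 lines that dominate this stretch of the log: each one dumps the current value (ans=...) of a ScheduledFloat, a module constant scheduled as a piecewise-linear function of the global batch count. The sketch below is a simplified re-implementation for illustration only; the real class lives in icefall's zipformer/scaling.py, and the breakpoints shown here are made up, not the recipe's.

import bisect

class ScheduledFloatSketch:
    # Piecewise-linear schedule over batch_count, interpolating between
    # (batch_count, value) breakpoints and holding the endpoints flat.
    def __init__(self, *points):
        self.points = sorted(points)

    def __call__(self, batch_count):
        xs = [x for x, _ in self.points]
        i = bisect.bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]
        if i == len(self.points):
            return self.points[-1][1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

# e.g. a skip rate that decays early in training and then stays flat, which
# is why at batch_count ~475k these lines keep printing constant values:
skip_rate = ScheduledFloatSketch((0.0, 0.5), (4000.0, 0.025))
print(skip_rate(475858.2))  # -> 0.025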
], batch size: 48, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:07:35,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=476353.1666666667, ans=0.2 2024-06-21 23:07:36,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=476371.5, ans=0.1 2024-06-21 23:07:43,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=476389.8333333333, ans=0.125 2024-06-21 23:07:49,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=476389.8333333333, ans=0.125 2024-06-21 23:07:55,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=476408.1666666667, ans=0.0 2024-06-21 23:07:58,716 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.33 vs. limit=15.0 2024-06-21 23:08:00,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=476426.5, ans=0.2 2024-06-21 23:08:03,766 INFO [train.py:1028] (1/2) Epoch 26, batch 6950, loss[loss=0.1893, simple_loss=0.251, pruned_loss=0.06376, over 11615.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.256, pruned_loss=0.06836, over 2579871.57 frames. ], batch size: 17, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:08:04,967 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.351e+02 2.512e+02 2.813e+02 3.310e+02, threshold=5.025e+02, percent-clipped=0.0 2024-06-21 23:08:10,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=476463.1666666667, ans=0.125 2024-06-21 23:08:17,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=476481.5, ans=0.2 2024-06-21 23:08:23,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=476499.8333333333, ans=0.0 2024-06-21 23:08:36,770 INFO [train.py:1028] (1/2) Epoch 26, batch 7000, loss[loss=0.2183, simple_loss=0.2626, pruned_loss=0.087, over 12943.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2558, pruned_loss=0.06828, over 2576192.23 frames. ], batch size: 158, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:08:38,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=476536.5, ans=0.2 2024-06-21 23:08:50,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0 2024-06-21 23:08:51,484 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2024-06-21 23:09:09,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=476609.8333333333, ans=0.2 2024-06-21 23:09:10,726 INFO [train.py:1028] (1/2) Epoch 26, batch 7050, loss[loss=0.2237, simple_loss=0.2776, pruned_loss=0.08491, over 12729.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2568, pruned_loss=0.06858, over 2583501.22 frames. 
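The optim.py:487 WARNING lines report the gradient-clipping state: five quantiles (min/25%/50%/75%/max) of recently seen gradient norms, a clipping threshold, and how often clipping fired. Throughout this section the threshold tracks twice the middle quartile (e.g. 2.512e+02 above vs. threshold=5.025e+02), matching Clipping_scale=2.0. The sketch below mirrors only that arithmetic; the real logic is embedded in icefall's ScaledAdam optimizer, and the history length here is an assumption.

import torch
from collections import deque

class GradNormClipperSketch:
    def __init__(self, clipping_scale=2.0, history=400):  # history size assumed
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)

    def clip_(self, parameters):
        parameters = list(parameters)
        # max_norm=inf: measure the total grad norm without rescaling anything
        norm = torch.nn.utils.clip_grad_norm_(parameters, float("inf")).item()
        self.norms.append(norm)
        t = torch.tensor(sorted(self.norms))
        quartiles = [t[int(p * (len(t) - 1))].item()
                     for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]  # 2x the running median
        if norm > threshold:  # counted into the logged percent-clipped
            for p in parameters:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)
        return quartiles, threshold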
], batch size: 176, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:09:12,049 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.318e+02 2.466e+02 2.634e+02 4.011e+02, threshold=4.932e+02, percent-clipped=0.0 2024-06-21 23:09:13,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=476628.1666666667, ans=0.025 2024-06-21 23:09:17,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=476628.1666666667, ans=0.125 2024-06-21 23:09:26,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=476646.5, ans=0.0 2024-06-21 23:09:30,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=476664.8333333333, ans=0.2 2024-06-21 23:09:39,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=476664.8333333333, ans=0.025 2024-06-21 23:09:39,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.07 vs. limit=15.0 2024-06-21 23:09:40,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=476664.8333333333, ans=0.125 2024-06-21 23:09:43,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=476683.1666666667, ans=0.09899494936611666 2024-06-21 23:09:43,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=476683.1666666667, ans=0.2 2024-06-21 23:09:45,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=476683.1666666667, ans=0.025 2024-06-21 23:09:45,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=476683.1666666667, ans=0.125 2024-06-21 23:09:55,351 INFO [train.py:1028] (1/2) Epoch 26, batch 7100, loss[loss=0.2091, simple_loss=0.2663, pruned_loss=0.07599, over 13163.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2575, pruned_loss=0.06906, over 2575538.43 frames. 
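The scaling.py:1023 Whitening lines fire when a layer's output covariance is too anisotropic: metric is a scale-free measure that equals 1.0 for a perfectly white (isotropic) covariance and grows as a few directions dominate, and a penalty is applied when it exceeds the logged limit. The exact formula in icefall's Whiten module may differ; the trace-ratio version below is a hedged stand-in for the num_groups=1 case.

import torch

def whitening_metric_sketch(x):
    # x: (..., num_channels); returns ~1.0 for white features, larger when
    # the channel covariance has a few dominant eigen-directions.
    x = x.reshape(-1, x.shape[-1]).float()
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]
    c = cov.shape[0]
    return (c * (cov * cov).sum() / cov.trace() ** 2).item()

x = torch.randn(2000, 384)          # near-white input
print(whitening_metric_sketch(x))   # ~1.2: comfortably under a limit of 15.0
x[:, 0] *= 20.0                     # one dominant channel
print(whitening_metric_sketch(x))   # jumps far above the limit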
], batch size: 112, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:10:00,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=476719.8333333333, ans=0.125 2024-06-21 23:10:11,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=476756.5, ans=0.125 2024-06-21 23:10:11,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=476756.5, ans=0.1 2024-06-21 23:10:15,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476774.8333333333, ans=0.1 2024-06-21 23:10:16,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476774.8333333333, ans=0.1 2024-06-21 23:10:21,186 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=476774.8333333333, ans=0.2 2024-06-21 23:10:23,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=476793.1666666667, ans=0.09899494936611666 2024-06-21 23:10:27,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=476793.1666666667, ans=0.125 2024-06-21 23:10:28,808 INFO [train.py:1028] (1/2) Epoch 26, batch 7150, loss[loss=0.2402, simple_loss=0.2982, pruned_loss=0.09115, over 12559.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2585, pruned_loss=0.06916, over 2574151.30 frames. ], batch size: 202, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:10:30,173 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.338e+02 2.540e+02 2.719e+02 4.396e+02, threshold=5.080e+02, percent-clipped=0.0 2024-06-21 23:10:58,696 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.58 vs. limit=10.0 2024-06-21 23:11:01,780 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.58 vs. limit=12.0 2024-06-21 23:11:02,038 INFO [train.py:1028] (1/2) Epoch 26, batch 7200, loss[loss=0.207, simple_loss=0.2724, pruned_loss=0.07079, over 13141.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2597, pruned_loss=0.06969, over 2579199.91 frames. 
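The three numbers in each train.py:1028 record are not independent: everywhere in this section the printed loss equals 0.5 * simple_loss + pruned_loss, a fixed-weight combination of the smoothed full-sum transducer loss and the pruned transducer loss once warm-up is long past. Only that arithmetic identity is read off the log; treating 0.5 as a simple_loss_scale hyperparameter, and assuming the weights ramp differently during warm-up as in the pruned-transducer recipes, is an interpretation.

# Consistency check against batch 7200 above:
simple_loss, pruned_loss = 0.2597, 0.06969
loss = 0.5 * simple_loss + 1.0 * pruned_loss   # assumed simple_loss_scale = 0.5
print(round(loss, 4))                          # 0.1995, matching tot_loss[loss=0.1995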
], batch size: 112, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:11:04,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=476903.1666666667, ans=0.125 2024-06-21 23:11:04,799 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:11:16,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=476939.8333333333, ans=0.125 2024-06-21 23:11:19,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=476939.8333333333, ans=0.2 2024-06-21 23:11:34,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=476976.5, ans=0.125 2024-06-21 23:11:34,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.20 vs. limit=22.5 2024-06-21 23:11:44,023 INFO [train.py:1028] (1/2) Epoch 26, batch 7250, loss[loss=0.1816, simple_loss=0.2425, pruned_loss=0.06037, over 12872.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2603, pruned_loss=0.06977, over 2578550.31 frames. ], batch size: 36, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:11:45,229 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.363e+02 2.506e+02 2.774e+02 3.855e+02, threshold=5.012e+02, percent-clipped=0.0 2024-06-21 23:11:46,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=476994.8333333333, ans=0.125 2024-06-21 23:11:48,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476994.8333333333, ans=0.1 2024-06-21 23:11:49,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=476994.8333333333, ans=0.2 2024-06-21 23:11:55,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=477013.1666666667, ans=0.125 2024-06-21 23:11:57,090 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.82 vs. limit=15.0 2024-06-21 23:12:00,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=477031.5, ans=0.09899494936611666 2024-06-21 23:12:13,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.98 vs. limit=15.0 2024-06-21 23:12:13,900 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-21 23:12:17,244 INFO [train.py:1028] (1/2) Epoch 26, batch 7300, loss[loss=0.1851, simple_loss=0.2527, pruned_loss=0.05876, over 13056.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2616, pruned_loss=0.07054, over 2578952.13 frames. ], batch size: 36, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:12:21,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.12 vs. 
limit=15.0 2024-06-21 23:12:25,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477104.8333333333, ans=0.1 2024-06-21 23:12:26,885 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=477104.8333333333, ans=10.0 2024-06-21 23:12:35,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.64 vs. limit=10.0 2024-06-21 23:12:45,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=477159.8333333333, ans=0.125 2024-06-21 23:12:50,730 INFO [train.py:1028] (1/2) Epoch 26, batch 7350, loss[loss=0.2047, simple_loss=0.2649, pruned_loss=0.07225, over 13324.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2627, pruned_loss=0.07114, over 2580180.10 frames. ], batch size: 46, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:12:52,026 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.395e+02 2.551e+02 2.719e+02 3.986e+02, threshold=5.103e+02, percent-clipped=0.0 2024-06-21 23:12:52,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=477178.1666666667, ans=0.0 2024-06-21 23:12:53,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2024-06-21 23:12:56,128 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.90 vs. limit=12.0 2024-06-21 23:13:23,359 INFO [train.py:1028] (1/2) Epoch 26, batch 7400, loss[loss=0.2139, simple_loss=0.2812, pruned_loss=0.07329, over 13287.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2629, pruned_loss=0.07126, over 2587096.36 frames. 
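The per-batch loss[...] and running tot_loss[...] differ in their frame counts for a reason: tot_loss is aggregated over roughly the last couple of hundred batches, frame-weighted, so its "over ~2.59e6 frames" is about 200 times the ~13k frames of a single batch. A plausible reconstruction is an exponentially decayed sum, sketched below; the decay rule and the window of 200 are assumptions consistent with the logged frame counts, not something the log states.

class RunningLossSketch:
    def __init__(self, reset_interval=200):   # window assumed from frame counts
        self.decay = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0    # decayed sum of loss * frames
        self.frames = 0.0      # decayed sum of frames

    def update(self, loss, num_frames):
        self.loss_sum = self.loss_sum * self.decay + loss * num_frames
        self.frames = self.frames * self.decay + num_frames
        return self.loss_sum / self.frames     # printed as tot_loss[loss=...]

tracker = RunningLossSketch()
for step in range(1000):
    tracker.update(0.2, 13000.0)
print(tracker.frames)   # ~2.6e6, the scale of 'over 2587096.36 frames.'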
], batch size: 63, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:13:24,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=477269.8333333333, ans=0.05 2024-06-21 23:13:26,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=477269.8333333333, ans=0.1 2024-06-21 23:13:26,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=477269.8333333333, ans=0.0 2024-06-21 23:13:34,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=477288.1666666667, ans=0.0 2024-06-21 23:13:36,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=477306.5, ans=0.125 2024-06-21 23:13:38,799 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=477306.5, ans=0.125 2024-06-21 23:13:43,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=477306.5, ans=0.1 2024-06-21 23:13:46,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=477324.8333333333, ans=0.125 2024-06-21 23:13:50,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=477324.8333333333, ans=0.2 2024-06-21 23:13:54,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=477324.8333333333, ans=0.0 2024-06-21 23:13:55,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=477324.8333333333, ans=0.0 2024-06-21 23:14:03,324 INFO [train.py:1028] (1/2) Epoch 26, batch 7450, loss[loss=0.1776, simple_loss=0.245, pruned_loss=0.05512, over 12678.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2624, pruned_loss=0.0708, over 2580242.77 frames. ], batch size: 29, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:14:04,750 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.389e+02 2.604e+02 2.860e+02 4.095e+02, threshold=5.207e+02, percent-clipped=0.0 2024-06-21 23:14:07,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=477361.5, ans=0.125 2024-06-21 23:14:09,945 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.93 vs. limit=15.0 2024-06-21 23:14:21,603 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.28 vs. limit=12.0 2024-06-21 23:14:28,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=477416.5, ans=0.125 2024-06-21 23:14:37,106 INFO [train.py:1028] (1/2) Epoch 26, batch 7500, loss[loss=0.2086, simple_loss=0.2677, pruned_loss=0.0748, over 10716.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2629, pruned_loss=0.07099, over 2577825.52 frames. 
], batch size: 303, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:14:48,915 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.38 vs. limit=10.0 2024-06-21 23:14:50,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=477489.8333333333, ans=0.1 2024-06-21 23:14:52,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=477489.8333333333, ans=0.05 2024-06-21 23:14:54,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=477489.8333333333, ans=10.0 2024-06-21 23:14:55,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=477489.8333333333, ans=0.1 2024-06-21 23:14:56,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=477508.1666666667, ans=0.025 2024-06-21 23:14:59,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=477508.1666666667, ans=0.0 2024-06-21 23:15:05,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2024-06-21 23:15:10,130 INFO [train.py:1028] (1/2) Epoch 26, batch 7550, loss[loss=0.1986, simple_loss=0.2565, pruned_loss=0.07034, over 12926.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2639, pruned_loss=0.07195, over 2577218.51 frames. ], batch size: 158, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:15:11,389 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.368e+02 2.531e+02 2.781e+02 3.441e+02, threshold=5.062e+02, percent-clipped=0.0 2024-06-21 23:15:15,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.27 vs. limit=22.5 2024-06-21 23:15:16,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2024-06-21 23:15:27,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=477581.5, ans=0.125 2024-06-21 23:15:40,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=477618.1666666667, ans=0.025 2024-06-21 23:15:42,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=477618.1666666667, ans=0.0 2024-06-21 23:15:51,110 INFO [train.py:1028] (1/2) Epoch 26, batch 7600, loss[loss=0.2102, simple_loss=0.2668, pruned_loss=0.0768, over 13247.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2642, pruned_loss=0.07213, over 2576622.11 frames. 
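The wild swings in "batch size" (17 and 303 within a few hundred batches, while per-batch frame counts stay near ~13k) are what duration bucketing looks like: batches are packed to a fixed audio budget, not a fixed sentence count. At 100 feature frames per second with 4x subsampling (typical fbank settings, assumed here), ~13,000 output frames is roughly 520-530 s of audio per batch. The toy arithmetic below is illustrative, not lhotse's actual sampler.

def cuts_per_batch(duration_budget_s, avg_cut_s):
    # How many utterances fit a fixed duration budget (rough integer model).
    return int(duration_budget_s // avg_cut_s)

print(cuts_per_batch(530, 1.75))   # ~302: a bucket of very short cuts
print(cuts_per_batch(530, 31.0))   # ~17: a bucket of very long cuts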
], batch size: 83, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:15:52,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=477636.5, ans=0.0 2024-06-21 23:15:55,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=477636.5, ans=0.0 2024-06-21 23:15:57,477 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.75 vs. limit=10.0 2024-06-21 23:15:58,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477654.8333333333, ans=0.1 2024-06-21 23:16:03,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=477673.1666666667, ans=0.07 2024-06-21 23:16:10,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=477691.5, ans=0.1 2024-06-21 23:16:20,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477709.8333333333, ans=0.1 2024-06-21 23:16:21,426 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=477709.8333333333, ans=0.0 2024-06-21 23:16:22,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=477709.8333333333, ans=0.07 2024-06-21 23:16:24,631 INFO [train.py:1028] (1/2) Epoch 26, batch 7650, loss[loss=0.1932, simple_loss=0.2611, pruned_loss=0.06259, over 12947.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2647, pruned_loss=0.07225, over 2573153.70 frames. ], batch size: 33, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:16:26,045 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.373e+02 2.530e+02 2.713e+02 3.577e+02, threshold=5.061e+02, percent-clipped=0.0 2024-06-21 23:16:37,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=477764.8333333333, ans=0.125 2024-06-21 23:16:43,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=477764.8333333333, ans=0.0 2024-06-21 23:16:51,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=477801.5, ans=0.05 2024-06-21 23:16:55,133 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=477801.5, ans=0.125 2024-06-21 23:16:58,272 INFO [train.py:1028] (1/2) Epoch 26, batch 7700, loss[loss=0.2238, simple_loss=0.2889, pruned_loss=0.07938, over 13258.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2651, pruned_loss=0.0724, over 2568761.91 frames. 
], batch size: 63, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:16:58,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=477819.8333333333, ans=0.2 2024-06-21 23:17:00,330 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:17:23,805 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2024-06-21 23:17:24,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=477893.1666666667, ans=0.125 2024-06-21 23:17:36,194 INFO [train.py:1028] (1/2) Epoch 26, batch 7750, loss[loss=0.2056, simple_loss=0.2722, pruned_loss=0.06948, over 13264.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2662, pruned_loss=0.07301, over 2572776.33 frames. ], batch size: 72, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:17:37,455 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.427e+02 2.634e+02 2.796e+02 3.850e+02, threshold=5.267e+02, percent-clipped=0.0 2024-06-21 23:17:50,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2024-06-21 23:18:02,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=477966.5, ans=0.0 2024-06-21 23:18:09,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=477984.8333333333, ans=15.0 2024-06-21 23:18:13,394 INFO [train.py:1028] (1/2) Epoch 26, batch 7800, loss[loss=0.1994, simple_loss=0.2668, pruned_loss=0.06599, over 13183.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2668, pruned_loss=0.07275, over 2577038.22 frames. ], batch size: 95, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:18:19,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=478021.5, ans=0.025 2024-06-21 23:18:22,503 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=478021.5, ans=0.1 2024-06-21 23:18:28,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=478039.8333333333, ans=0.125 2024-06-21 23:18:35,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.52 vs. limit=15.0 2024-06-21 23:18:41,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=478076.5, ans=0.125 2024-06-21 23:18:46,165 INFO [train.py:1028] (1/2) Epoch 26, batch 7850, loss[loss=0.2237, simple_loss=0.2798, pruned_loss=0.08376, over 11400.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2673, pruned_loss=0.07289, over 2573318.38 frames. 
], batch size: 17, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:18:47,342 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.363e+02 2.534e+02 2.788e+02 3.432e+02, threshold=5.068e+02, percent-clipped=0.0 2024-06-21 23:18:51,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=478094.8333333333, ans=0.125 2024-06-21 23:19:02,179 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.70 vs. limit=6.0 2024-06-21 23:19:03,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=478131.5, ans=0.125 2024-06-21 23:19:16,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=478168.1666666667, ans=0.2 2024-06-21 23:19:18,959 INFO [train.py:1028] (1/2) Epoch 26, batch 7900, loss[loss=0.2043, simple_loss=0.2722, pruned_loss=0.0682, over 13167.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2669, pruned_loss=0.07275, over 2573037.35 frames. ], batch size: 77, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:19:25,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=478186.5, ans=0.125 2024-06-21 23:19:58,229 INFO [train.py:1028] (1/2) Epoch 26, batch 7950, loss[loss=0.2218, simple_loss=0.2696, pruned_loss=0.08704, over 10549.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.267, pruned_loss=0.07263, over 2575634.15 frames. ], batch size: 303, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:19:59,685 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.386e+02 2.666e+02 2.885e+02 3.534e+02, threshold=5.332e+02, percent-clipped=0.0 2024-06-21 23:19:59,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=478278.1666666667, ans=0.125 2024-06-21 23:20:11,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=478296.5, ans=0.1 2024-06-21 23:20:14,304 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=478314.8333333333, ans=0.125 2024-06-21 23:20:24,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=478333.1666666667, ans=0.0 2024-06-21 23:20:32,749 INFO [train.py:1028] (1/2) Epoch 26, batch 8000, loss[loss=0.2283, simple_loss=0.2871, pruned_loss=0.08478, over 12564.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2678, pruned_loss=0.07297, over 2571860.66 frames. ], batch size: 29, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:20:34,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=478369.8333333333, ans=0.025 2024-06-21 23:21:02,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=478443.1666666667, ans=0.0 2024-06-21 23:21:04,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=478443.1666666667, ans=0.125 2024-06-21 23:21:06,762 INFO [train.py:1028] (1/2) Epoch 26, batch 8050, loss[loss=0.2247, simple_loss=0.2796, pruned_loss=0.0849, over 13227.00 frames. 
], tot_loss[loss=0.206, simple_loss=0.2672, pruned_loss=0.07244, over 2572139.40 frames. ], batch size: 83, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:21:08,081 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.407e+02 2.576e+02 2.935e+02 4.611e+02, threshold=5.153e+02, percent-clipped=0.0 2024-06-21 23:21:12,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=478479.8333333333, ans=0.0 2024-06-21 23:21:19,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=478498.1666666667, ans=0.125 2024-06-21 23:21:27,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=478516.5, ans=0.0 2024-06-21 23:21:34,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=478534.8333333333, ans=0.0 2024-06-21 23:21:40,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=478534.8333333333, ans=0.0 2024-06-21 23:21:43,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=478534.8333333333, ans=0.1 2024-06-21 23:21:45,170 INFO [train.py:1028] (1/2) Epoch 26, batch 8100, loss[loss=0.203, simple_loss=0.2628, pruned_loss=0.07166, over 13136.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2678, pruned_loss=0.07291, over 2576519.34 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:21:45,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478553.1666666667, ans=0.1 2024-06-21 23:21:47,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=478553.1666666667, ans=0.025 2024-06-21 23:21:48,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=478553.1666666667, ans=0.125 2024-06-21 23:21:48,818 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=15.0 2024-06-21 23:21:49,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.09 vs. limit=12.0 2024-06-21 23:21:51,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=478571.5, ans=0.2 2024-06-21 23:22:07,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=478608.1666666667, ans=0.5 2024-06-21 23:22:10,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478626.5, ans=0.1 2024-06-21 23:22:13,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.58 vs. limit=22.5 2024-06-21 23:22:14,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=478626.5, ans=0.0 2024-06-21 23:22:17,521 INFO [train.py:1028] (1/2) Epoch 26, batch 8150, loss[loss=0.2074, simple_loss=0.2691, pruned_loss=0.07287, over 13127.00 frames. 
], tot_loss[loss=0.2065, simple_loss=0.2679, pruned_loss=0.07253, over 2579496.96 frames. ], batch size: 121, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:22:18,714 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.432e+02 2.535e+02 2.797e+02 3.241e+02, threshold=5.069e+02, percent-clipped=0.0 2024-06-21 23:22:23,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.49 vs. limit=15.0 2024-06-21 23:22:28,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=478663.1666666667, ans=0.125 2024-06-21 23:22:35,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=478681.5, ans=0.0 2024-06-21 23:22:43,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=478718.1666666667, ans=0.2 2024-06-21 23:22:50,281 INFO [train.py:1028] (1/2) Epoch 26, batch 8200, loss[loss=0.22, simple_loss=0.2808, pruned_loss=0.07963, over 13124.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2681, pruned_loss=0.07248, over 2583735.60 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:23:21,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478809.8333333333, ans=0.1 2024-06-21 23:23:23,618 INFO [train.py:1028] (1/2) Epoch 26, batch 8250, loss[loss=0.2034, simple_loss=0.2698, pruned_loss=0.06847, over 13325.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2676, pruned_loss=0.07219, over 2584605.46 frames. ], batch size: 52, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:23:24,882 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.399e+02 2.555e+02 2.806e+02 3.763e+02, threshold=5.109e+02, percent-clipped=0.0 2024-06-21 23:23:28,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=478828.1666666667, ans=0.125 2024-06-21 23:23:39,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=478846.5, ans=15.0 2024-06-21 23:23:44,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478864.8333333333, ans=0.1 2024-06-21 23:23:51,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=478883.1666666667, ans=0.0 2024-06-21 23:23:54,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=478883.1666666667, ans=0.125 2024-06-21 23:23:54,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=478883.1666666667, ans=10.0 2024-06-21 23:23:55,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.09 vs. limit=15.0 2024-06-21 23:23:58,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=478901.5, ans=0.125 2024-06-21 23:24:02,124 INFO [train.py:1028] (1/2) Epoch 26, batch 8300, loss[loss=0.2146, simple_loss=0.2734, pruned_loss=0.07792, over 13016.00 frames. 
], tot_loss[loss=0.2057, simple_loss=0.2673, pruned_loss=0.07204, over 2582467.15 frames. ], batch size: 102, lr: 2.21e-03, grad_scale: 16.0 2024-06-21 23:24:04,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=478919.8333333333, ans=0.1 2024-06-21 23:24:27,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=478993.1666666667, ans=0.125 2024-06-21 23:24:34,814 INFO [train.py:1028] (1/2) Epoch 26, batch 8350, loss[loss=0.2138, simple_loss=0.2631, pruned_loss=0.08223, over 13165.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2676, pruned_loss=0.07205, over 2582075.76 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 16.0 2024-06-21 23:24:35,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479011.5, ans=0.1 2024-06-21 23:24:37,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=479011.5, ans=0.0 2024-06-21 23:24:37,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=479011.5, ans=0.125 2024-06-21 23:24:37,558 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.355e+02 2.485e+02 2.698e+02 3.793e+02, threshold=4.971e+02, percent-clipped=0.0 2024-06-21 23:24:39,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=479011.5, ans=0.125 2024-06-21 23:24:41,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=479029.8333333333, ans=10.0 2024-06-21 23:24:46,476 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.47 vs. limit=6.0 2024-06-21 23:24:46,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=479029.8333333333, ans=0.2 2024-06-21 23:24:48,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479048.1666666667, ans=0.1 2024-06-21 23:24:53,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=479048.1666666667, ans=0.125 2024-06-21 23:25:00,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=479084.8333333333, ans=0.0 2024-06-21 23:25:07,788 INFO [train.py:1028] (1/2) Epoch 26, batch 8400, loss[loss=0.2149, simple_loss=0.2817, pruned_loss=0.07402, over 12958.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2686, pruned_loss=0.07269, over 2577730.82 frames. ], batch size: 39, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:25:11,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=479103.1666666667, ans=0.125 2024-06-21 23:25:14,876 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.21 vs. 
limit=22.5 2024-06-21 23:25:15,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=479121.5, ans=0.125 2024-06-21 23:25:19,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=479121.5, ans=0.0 2024-06-21 23:25:35,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=479158.1666666667, ans=0.125 2024-06-21 23:25:42,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=479176.5, ans=0.0 2024-06-21 23:25:45,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2024-06-21 23:25:47,342 INFO [train.py:1028] (1/2) Epoch 26, batch 8450, loss[loss=0.2131, simple_loss=0.2742, pruned_loss=0.07602, over 13172.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2691, pruned_loss=0.07271, over 2579226.13 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:25:49,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=479194.8333333333, ans=0.0 2024-06-21 23:25:49,846 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.434e+02 2.540e+02 2.817e+02 3.923e+02, threshold=5.080e+02, percent-clipped=0.0 2024-06-21 23:25:53,410 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2024-06-21 23:26:13,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=479268.1666666667, ans=0.0 2024-06-21 23:26:15,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=479268.1666666667, ans=0.125 2024-06-21 23:26:16,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479268.1666666667, ans=0.1 2024-06-21 23:26:19,903 INFO [train.py:1028] (1/2) Epoch 26, batch 8500, loss[loss=0.2034, simple_loss=0.2714, pruned_loss=0.0677, over 12592.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2702, pruned_loss=0.07314, over 2577414.05 frames. ], batch size: 29, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:26:33,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=479323.1666666667, ans=0.125 2024-06-21 23:26:37,566 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=15.0 2024-06-21 23:26:43,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=479341.5, ans=0.2 2024-06-21 23:26:46,746 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:26:52,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=479378.1666666667, ans=0.0 2024-06-21 23:26:53,177 INFO [train.py:1028] (1/2) Epoch 26, batch 8550, loss[loss=0.2121, simple_loss=0.2707, pruned_loss=0.07674, over 12466.00 frames. 
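grad_scale is fp16 dynamic loss scaling: it doubled from 32 to 64 around batch 7550, collapsed from 64 to 16 just before batch 8300 (two consecutive halvings, i.e. two overflowing steps), and was back at 32 by batch 8400. The doubling and halving are exactly torch.cuda.amp.GradScaler behavior, sketched below with torch's default growth/backoff factors; the quick 16 -> 32 recovery suggests the recipe also re-grows small scales faster than the default growth_interval, which is an assumption on my part.

import torch

if torch.cuda.is_available():
    model = torch.nn.Linear(10, 1).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000)
    for step in range(3):
        opt.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(torch.randn(8, 10, device="cuda")).pow(2).mean()
        scaler.scale(loss).backward()
        scaler.step(opt)        # silently skipped if grads contain inf/nan
        scaler.update()         # doubles after growth_interval clean steps,
                                # halves on each overflow: 64 -> 32 -> 16
        print(scaler.get_scale())   # the number logged as grad_scale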
], tot_loss[loss=0.2074, simple_loss=0.2695, pruned_loss=0.07258, over 2576649.63 frames. ], batch size: 22, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:26:55,809 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.456e+02 2.626e+02 2.937e+02 4.705e+02, threshold=5.251e+02, percent-clipped=0.0 2024-06-21 23:27:03,688 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.41 vs. limit=12.0 2024-06-21 23:27:26,429 INFO [train.py:1028] (1/2) Epoch 26, batch 8600, loss[loss=0.209, simple_loss=0.2626, pruned_loss=0.0777, over 13087.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2697, pruned_loss=0.07261, over 2575060.67 frames. ], batch size: 121, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:27:49,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=479506.5, ans=10.0 2024-06-21 23:28:06,538 INFO [train.py:1028] (1/2) Epoch 26, batch 8650, loss[loss=0.1826, simple_loss=0.2429, pruned_loss=0.0612, over 13043.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.269, pruned_loss=0.07207, over 2577385.35 frames. ], batch size: 102, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:28:07,760 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=12.0 2024-06-21 23:28:09,197 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.285e+02 2.452e+02 2.622e+02 3.077e+02, threshold=4.904e+02, percent-clipped=0.0 2024-06-21 23:28:15,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=479579.8333333333, ans=0.0 2024-06-21 23:28:15,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=479579.8333333333, ans=0.0 2024-06-21 23:28:19,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479598.1666666667, ans=0.1 2024-06-21 23:28:25,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=479616.5, ans=0.125 2024-06-21 23:28:32,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=479634.8333333333, ans=0.0 2024-06-21 23:28:39,680 INFO [train.py:1028] (1/2) Epoch 26, batch 8700, loss[loss=0.2032, simple_loss=0.282, pruned_loss=0.06219, over 13196.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2702, pruned_loss=0.07292, over 2574273.48 frames. ], batch size: 59, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:28:41,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=479653.1666666667, ans=0.0 2024-06-21 23:28:50,989 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.86 vs. limit=15.0 2024-06-21 23:29:04,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=479708.1666666667, ans=0.0 2024-06-21 23:29:05,650 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.94 vs. 
limit=15.0 2024-06-21 23:29:10,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=479726.5, ans=0.125 2024-06-21 23:29:10,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=479726.5, ans=0.0 2024-06-21 23:29:13,131 INFO [train.py:1028] (1/2) Epoch 26, batch 8750, loss[loss=0.2151, simple_loss=0.2741, pruned_loss=0.07803, over 13073.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2704, pruned_loss=0.07322, over 2570596.59 frames. ], batch size: 121, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:29:16,061 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.430e+02 2.527e+02 2.709e+02 3.613e+02, threshold=5.055e+02, percent-clipped=0.0 2024-06-21 23:29:20,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=479763.1666666667, ans=0.125 2024-06-21 23:29:21,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479763.1666666667, ans=0.1 2024-06-21 23:29:45,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=479799.8333333333, ans=0.0 2024-06-21 23:29:48,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.42 vs. limit=15.0 2024-06-21 23:29:50,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=479818.1666666667, ans=0.125 2024-06-21 23:29:54,475 INFO [train.py:1028] (1/2) Epoch 26, batch 8800, loss[loss=0.1957, simple_loss=0.271, pruned_loss=0.06019, over 13212.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2706, pruned_loss=0.07315, over 2575126.26 frames. ], batch size: 72, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:29:54,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=479836.5, ans=0.0 2024-06-21 23:29:56,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=479836.5, ans=0.125 2024-06-21 23:30:01,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=479854.8333333333, ans=0.125 2024-06-21 23:30:12,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479873.1666666667, ans=0.1 2024-06-21 23:30:14,823 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=479891.5, ans=0.125 2024-06-21 23:30:17,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=479891.5, ans=10.0 2024-06-21 23:30:28,936 INFO [train.py:1028] (1/2) Epoch 26, batch 8850, loss[loss=0.2214, simple_loss=0.2742, pruned_loss=0.0843, over 12574.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2707, pruned_loss=0.07348, over 2565014.16 frames. 
], batch size: 202, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:30:31,847 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.391e+02 2.542e+02 2.731e+02 3.641e+02, threshold=5.084e+02, percent-clipped=0.0 2024-06-21 23:30:37,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=479946.5, ans=0.125 2024-06-21 23:30:49,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=479983.1666666667, ans=0.125 2024-06-21 23:30:58,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480001.5, ans=0.1 2024-06-21 23:31:01,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=480001.5, ans=0.125 2024-06-21 23:31:02,912 INFO [train.py:1028] (1/2) Epoch 26, batch 8900, loss[loss=0.214, simple_loss=0.2748, pruned_loss=0.07654, over 12889.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2717, pruned_loss=0.07417, over 2562973.47 frames. ], batch size: 33, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:31:06,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=15.0 2024-06-21 23:31:06,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=480019.8333333333, ans=0.0 2024-06-21 23:31:07,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=480019.8333333333, ans=0.2 2024-06-21 23:31:10,518 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.67 vs. limit=10.0 2024-06-21 23:31:12,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=480038.1666666667, ans=0.09899494936611666 2024-06-21 23:31:13,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480038.1666666667, ans=0.1 2024-06-21 23:31:14,984 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=480038.1666666667, ans=0.0 2024-06-21 23:31:15,293 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.36 vs. limit=15.0 2024-06-21 23:31:19,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=480056.5, ans=0.0 2024-06-21 23:31:39,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=480093.1666666667, ans=0.125 2024-06-21 23:31:42,976 INFO [train.py:1028] (1/2) Epoch 26, batch 8950, loss[loss=0.232, simple_loss=0.2883, pruned_loss=0.08787, over 12563.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2717, pruned_loss=0.07346, over 2562549.55 frames. 
], batch size: 202, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:31:44,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=480111.5, ans=0.0 2024-06-21 23:31:45,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=480111.5, ans=0.1 2024-06-21 23:31:45,693 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.385e+02 2.524e+02 2.691e+02 4.301e+02, threshold=5.048e+02, percent-clipped=0.0 2024-06-21 23:31:49,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=480129.8333333333, ans=0.0 2024-06-21 23:32:08,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=480166.5, ans=0.025 2024-06-21 23:32:16,644 INFO [train.py:1028] (1/2) Epoch 26, batch 9000, loss[loss=0.1974, simple_loss=0.2642, pruned_loss=0.06535, over 13273.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2718, pruned_loss=0.0732, over 2568830.32 frames. ], batch size: 46, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:32:16,645 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 23:32:24,673 INFO [train.py:1060] (1/2) Epoch 26, validation: loss=0.1911, simple_loss=0.2514, pruned_loss=0.0654, over 351949.00 frames. 2024-06-21 23:32:24,673 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 23:32:25,694 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.99 vs. limit=15.0 2024-06-21 23:32:29,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=480203.1666666667, ans=0.125 2024-06-21 23:32:30,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=480203.1666666667, ans=0.125 2024-06-21 23:32:33,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=480221.5, ans=0.2 2024-06-21 23:32:42,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=480239.8333333333, ans=0.2 2024-06-21 23:32:43,156 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2024-06-21 23:32:49,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480258.1666666667, ans=0.1 2024-06-21 23:32:50,978 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2024-06-21 23:32:54,797 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=480276.5, ans=0.125 2024-06-21 23:32:55,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=480276.5, ans=0.125 2024-06-21 23:32:58,679 INFO [train.py:1028] (1/2) Epoch 26, batch 9050, loss[loss=0.1613, simple_loss=0.2248, pruned_loss=0.04896, over 11517.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2726, pruned_loss=0.07385, over 2566955.13 frames. 
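The batch 9000 record above interleaves a validation pass: training pauses (train.py:1051), the dev set is swept without gradients, the frame-weighted average is printed (train.py:1060, over the fixed dev set of 351949.00 frames), and peak GPU memory is reported (train.py:1061). Below is a hedged sketch of that sequence; compute_loss, model and valid_loader are hypothetical stand-ins for the recipe's actual objects.

import torch

def compute_validation_loss_sketch(model, valid_loader, device="cuda:1"):
    model.eval()
    tot, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = compute_loss(model, batch)  # hypothetical helper
            tot += loss * num_frames
            frames += num_frames
    model.train()
    print(f"validation: loss={tot / frames:.4f}, over {frames:.2f} frames.")
    print(f"Maximum memory allocated so far is "
          f"{torch.cuda.max_memory_allocated(device) // 2**20}MB")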
], batch size: 16, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:32:59,114 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.60 vs. limit=6.0 2024-06-21 23:33:00,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=480294.8333333333, ans=0.125 2024-06-21 23:33:01,219 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.428e+02 2.600e+02 2.750e+02 3.494e+02, threshold=5.200e+02, percent-clipped=0.0 2024-06-21 23:33:09,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=480313.1666666667, ans=0.125 2024-06-21 23:33:15,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=480331.5, ans=0.125 2024-06-21 23:33:16,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480331.5, ans=0.1 2024-06-21 23:33:31,394 INFO [train.py:1028] (1/2) Epoch 26, batch 9100, loss[loss=0.2007, simple_loss=0.2729, pruned_loss=0.0643, over 13266.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2721, pruned_loss=0.07369, over 2566274.65 frames. ], batch size: 72, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:33:39,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480404.8333333333, ans=0.1 2024-06-21 23:33:48,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=480423.1666666667, ans=0.0 2024-06-21 23:33:48,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=480423.1666666667, ans=0.2 2024-06-21 23:33:49,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=480423.1666666667, ans=0.125 2024-06-21 23:33:50,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=480441.5, ans=0.125 2024-06-21 23:33:51,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=480441.5, ans=0.125 2024-06-21 23:33:55,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=480441.5, ans=0.125 2024-06-21 23:34:03,502 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.94 vs. limit=15.0 2024-06-21 23:34:03,768 INFO [train.py:1028] (1/2) Epoch 26, batch 9150, loss[loss=0.2209, simple_loss=0.2892, pruned_loss=0.07624, over 13213.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2725, pruned_loss=0.07385, over 2567719.10 frames. ], batch size: 77, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:34:06,280 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.386e+02 2.534e+02 2.682e+02 3.276e+02, threshold=5.068e+02, percent-clipped=0.0 2024-06-21 23:34:07,384 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.45 vs. 
limit=15.0 2024-06-21 23:34:21,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=480514.8333333333, ans=0.2 2024-06-21 23:34:29,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=480551.5, ans=0.0 2024-06-21 23:34:29,695 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:34:29,961 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.94 vs. limit=10.0 2024-06-21 23:34:31,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=480551.5, ans=0.025 2024-06-21 23:34:35,984 INFO [train.py:1028] (1/2) Epoch 26, batch 9200, loss[loss=0.1873, simple_loss=0.2619, pruned_loss=0.05635, over 12964.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2722, pruned_loss=0.07338, over 2571192.27 frames. ], batch size: 36, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:34:36,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480569.8333333333, ans=0.1 2024-06-21 23:34:38,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=480569.8333333333, ans=0.125 2024-06-21 23:34:38,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=480569.8333333333, ans=0.125 2024-06-21 23:35:02,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=480624.8333333333, ans=0.0 2024-06-21 23:35:02,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=15.0 2024-06-21 23:35:12,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=480643.1666666667, ans=0.2 2024-06-21 23:35:14,258 INFO [train.py:1028] (1/2) Epoch 26, batch 9250, loss[loss=0.1934, simple_loss=0.2628, pruned_loss=0.06202, over 13226.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2723, pruned_loss=0.07318, over 2572584.24 frames. 
], batch size: 67, lr: 2.20e-03, grad_scale: 16.0 2024-06-21 23:35:17,693 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.388e+02 2.505e+02 2.711e+02 3.177e+02, threshold=5.010e+02, percent-clipped=0.0 2024-06-21 23:35:19,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=480661.5, ans=0.125 2024-06-21 23:35:21,737 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:35:23,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=480679.8333333333, ans=0.0 2024-06-21 23:35:24,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480679.8333333333, ans=0.1 2024-06-21 23:35:27,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=480698.1666666667, ans=0.125 2024-06-21 23:35:32,237 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:35:33,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=480716.5, ans=0.125 2024-06-21 23:35:40,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=480734.8333333333, ans=0.2 2024-06-21 23:35:41,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2024-06-21 23:35:44,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2024-06-21 23:35:45,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480734.8333333333, ans=0.1 2024-06-21 23:35:46,967 INFO [train.py:1028] (1/2) Epoch 26, batch 9300, loss[loss=0.1931, simple_loss=0.2488, pruned_loss=0.06866, over 12926.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2721, pruned_loss=0.0731, over 2569341.82 frames. ], batch size: 39, lr: 2.20e-03, grad_scale: 16.0 2024-06-21 23:35:47,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480753.1666666667, ans=0.1 2024-06-21 23:35:49,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=480753.1666666667, ans=0.125 2024-06-21 23:35:56,219 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.04 vs. limit=22.5 2024-06-21 23:36:11,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2024-06-21 23:36:14,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.14 vs. limit=22.5 2024-06-21 23:36:16,231 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.22 vs. 
limit=15.0 2024-06-21 23:36:18,284 INFO [train.py:1028] (1/2) Epoch 26, batch 9350, loss[loss=0.2083, simple_loss=0.2805, pruned_loss=0.06799, over 12672.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2721, pruned_loss=0.07302, over 2566233.65 frames. ], batch size: 22, lr: 2.20e-03, grad_scale: 16.0 2024-06-21 23:36:21,207 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.425e+02 2.540e+02 2.739e+02 4.025e+02, threshold=5.080e+02, percent-clipped=0.0 2024-06-21 23:36:23,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=480863.1666666667, ans=0.2 2024-06-21 23:36:26,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=480863.1666666667, ans=0.125 2024-06-21 23:36:38,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=480899.8333333333, ans=0.0 2024-06-21 23:36:40,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=480899.8333333333, ans=0.125 2024-06-21 23:36:40,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=480899.8333333333, ans=0.125 2024-06-21 23:36:42,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=480918.1666666667, ans=0.0 2024-06-21 23:36:49,080 INFO [train.py:1028] (1/2) Epoch 26, batch 9400, loss[loss=0.2243, simple_loss=0.2911, pruned_loss=0.07876, over 13257.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2716, pruned_loss=0.07308, over 2566314.18 frames. ], batch size: 52, lr: 2.20e-03, grad_scale: 16.0 2024-06-21 23:36:54,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=480954.8333333333, ans=0.0 2024-06-21 23:37:03,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.33 vs. limit=15.0 2024-06-21 23:37:07,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=480991.5, ans=0.125 2024-06-21 23:37:14,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.15 vs. limit=10.0 2024-06-21 23:37:16,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=481009.8333333333, ans=0.125 2024-06-21 23:37:19,870 INFO [train.py:1028] (1/2) Epoch 26, batch 9450, loss[loss=0.2123, simple_loss=0.2804, pruned_loss=0.07207, over 12786.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2726, pruned_loss=0.07369, over 2567196.50 frames. 
], batch size: 22, lr: 2.20e-03, grad_scale: 16.0 2024-06-21 23:37:21,840 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:37:23,087 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.436e+02 2.620e+02 2.964e+02 4.092e+02, threshold=5.241e+02, percent-clipped=0.0 2024-06-21 23:37:27,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=481046.5, ans=0.09899494936611666 2024-06-21 23:37:32,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=481064.8333333333, ans=0.125 2024-06-21 23:37:37,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=481083.1666666667, ans=0.125 2024-06-21 23:37:43,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=481083.1666666667, ans=0.125 2024-06-21 23:37:43,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=481101.5, ans=0.0 2024-06-21 23:37:48,782 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:37:52,858 INFO [train.py:1028] (1/2) Epoch 26, batch 9500, loss[loss=0.2033, simple_loss=0.274, pruned_loss=0.06628, over 13234.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2725, pruned_loss=0.07323, over 2575756.55 frames. ], batch size: 43, lr: 2.20e-03, grad_scale: 16.0 2024-06-21 23:38:02,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=481138.1666666667, ans=0.0 2024-06-21 23:38:09,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481156.5, ans=0.1 2024-06-21 23:38:22,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=481193.1666666667, ans=0.0 2024-06-21 23:38:24,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=481193.1666666667, ans=0.2 2024-06-21 23:38:26,523 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.24 vs. limit=15.0 2024-06-21 23:38:26,954 INFO [train.py:1028] (1/2) Epoch 26, batch 9550, loss[loss=0.1886, simple_loss=0.2619, pruned_loss=0.05764, over 12799.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2728, pruned_loss=0.07361, over 2570448.35 frames. ], batch size: 39, lr: 2.20e-03, grad_scale: 16.0 2024-06-21 23:38:28,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.19 vs. limit=12.0 2024-06-21 23:38:30,519 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.407e+02 2.602e+02 2.851e+02 3.547e+02, threshold=5.205e+02, percent-clipped=0.0 2024-06-21 23:38:33,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=481229.8333333333, ans=0.2 2024-06-21 23:38:38,907 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.40 vs. 
limit=6.0 2024-06-21 23:38:39,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=481248.1666666667, ans=0.0 2024-06-21 23:38:42,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=481248.1666666667, ans=0.125 2024-06-21 23:38:46,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=481266.5, ans=0.0 2024-06-21 23:38:47,678 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2024-06-21 23:38:50,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.28 vs. limit=22.5 2024-06-21 23:38:57,946 INFO [train.py:1028] (1/2) Epoch 26, batch 9600, loss[loss=0.2183, simple_loss=0.2659, pruned_loss=0.08537, over 10452.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2724, pruned_loss=0.07348, over 2568530.78 frames. ], batch size: 303, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:39:06,375 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0 2024-06-21 23:39:07,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=481321.5, ans=0.125 2024-06-21 23:39:19,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=481358.1666666667, ans=0.125 2024-06-21 23:39:28,309 INFO [train.py:1028] (1/2) Epoch 26, batch 9650, loss[loss=0.2091, simple_loss=0.2618, pruned_loss=0.07826, over 13059.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2725, pruned_loss=0.07392, over 2559097.82 frames. ], batch size: 132, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:39:31,363 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.366e+02 2.510e+02 2.692e+02 3.858e+02, threshold=5.019e+02, percent-clipped=0.0 2024-06-21 23:39:32,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=481394.8333333333, ans=0.2 2024-06-21 23:39:39,376 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:39:46,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=481449.8333333333, ans=0.0 2024-06-21 23:39:47,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=481449.8333333333, ans=0.2 2024-06-21 23:39:51,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=481449.8333333333, ans=0.125 2024-06-21 23:39:59,054 INFO [train.py:1028] (1/2) Epoch 26, batch 9700, loss[loss=0.211, simple_loss=0.2697, pruned_loss=0.07614, over 13034.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2719, pruned_loss=0.07385, over 2554083.75 frames. 
], batch size: 144, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:39:59,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481486.5, ans=0.1 2024-06-21 23:40:12,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.16 vs. limit=15.0 2024-06-21 23:40:22,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=481541.5, ans=0.2 2024-06-21 23:40:22,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.07 vs. limit=15.0 2024-06-21 23:40:27,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=481559.8333333333, ans=0.1 2024-06-21 23:40:32,254 INFO [train.py:1028] (1/2) Epoch 26, batch 9750, loss[loss=0.1911, simple_loss=0.2544, pruned_loss=0.06389, over 13077.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2708, pruned_loss=0.07297, over 2549383.67 frames. ], batch size: 132, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:40:35,389 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.390e+02 2.551e+02 2.783e+02 3.640e+02, threshold=5.103e+02, percent-clipped=0.0 2024-06-21 23:40:42,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.94 vs. limit=15.0 2024-06-21 23:40:49,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=481614.8333333333, ans=0.025 2024-06-21 23:41:03,621 INFO [train.py:1028] (1/2) Epoch 26, batch 9800, loss[loss=0.2157, simple_loss=0.2812, pruned_loss=0.07509, over 12901.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2705, pruned_loss=0.07268, over 2542683.62 frames. ], batch size: 39, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:41:08,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.44 vs. limit=6.0 2024-06-21 23:41:14,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=481688.1666666667, ans=0.025 2024-06-21 23:41:15,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=481706.5, ans=0.125 2024-06-21 23:41:19,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=481706.5, ans=0.0 2024-06-21 23:41:23,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2024-06-21 23:41:24,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=481724.8333333333, ans=0.0 2024-06-21 23:41:33,926 INFO [train.py:1028] (1/2) Epoch 26, batch 9850, loss[loss=0.2104, simple_loss=0.2692, pruned_loss=0.0758, over 13031.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2701, pruned_loss=0.07269, over 2535385.78 frames. 
], batch size: 102, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:41:37,026 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.400e+02 2.540e+02 2.759e+02 3.633e+02, threshold=5.080e+02, percent-clipped=0.0 2024-06-21 23:41:51,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=481798.1666666667, ans=0.035 2024-06-21 23:42:07,204 INFO [train.py:1028] (1/2) Epoch 26, batch 9900, loss[loss=0.2139, simple_loss=0.2762, pruned_loss=0.07576, over 12979.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2696, pruned_loss=0.07291, over 2528805.11 frames. ], batch size: 39, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:42:07,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=481853.1666666667, ans=10.0 2024-06-21 23:42:25,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=481889.8333333333, ans=0.035 2024-06-21 23:42:25,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481889.8333333333, ans=0.1 2024-06-21 23:42:27,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=481908.1666666667, ans=0.0 2024-06-21 23:42:35,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=481926.5, ans=0.125 2024-06-21 23:42:38,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=481944.8333333333, ans=0.1 2024-06-21 23:42:39,062 INFO [train.py:1028] (1/2) Epoch 26, batch 9950, loss[loss=0.2367, simple_loss=0.2959, pruned_loss=0.08875, over 12664.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.268, pruned_loss=0.0727, over 2521735.27 frames. ], batch size: 29, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:42:42,171 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 2.405e+02 2.511e+02 2.748e+02 3.618e+02, threshold=5.023e+02, percent-clipped=0.0 2024-06-21 23:42:52,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=481981.5, ans=0.0 2024-06-21 23:42:53,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=481981.5, ans=0.0 2024-06-21 23:43:00,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.80 vs. limit=10.0 2024-06-21 23:43:06,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.29 vs. limit=22.5 2024-06-21 23:43:11,228 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=15.0 2024-06-21 23:43:12,056 INFO [train.py:1028] (1/2) Epoch 26, batch 10000, loss[loss=0.2076, simple_loss=0.2767, pruned_loss=0.06926, over 12581.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2692, pruned_loss=0.07327, over 2483117.98 frames. 
], batch size: 22, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:43:21,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2024-06-21 23:43:25,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=482073.1666666667, ans=0.0 2024-06-21 23:43:27,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=482073.1666666667, ans=0.1 2024-06-21 23:43:33,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=482091.5, ans=0.025 2024-06-21 23:43:38,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=482109.8333333333, ans=0.125 2024-06-21 23:43:43,219 INFO [train.py:1028] (1/2) Epoch 26, batch 10050, loss[loss=0.2269, simple_loss=0.2944, pruned_loss=0.07972, over 12531.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2697, pruned_loss=0.07432, over 2440925.78 frames. ], batch size: 22, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:43:43,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=482128.1666666667, ans=0.015 2024-06-21 23:43:43,871 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:43:46,109 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.426e+02 2.531e+02 2.669e+02 3.402e+02, threshold=5.061e+02, percent-clipped=0.0 2024-06-21 23:43:50,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=482146.5, ans=0.0 2024-06-21 23:43:58,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=482164.8333333333, ans=0.125 2024-06-21 23:44:03,069 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482183.1666666667, ans=0.1 2024-06-21 23:44:04,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=482183.1666666667, ans=0.04949747468305833 2024-06-21 23:44:05,592 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.23 vs. limit=10.0 2024-06-21 23:44:07,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=482201.5, ans=0.025 2024-06-21 23:44:13,528 INFO [train.py:1028] (1/2) Epoch 26, batch 10100, loss[loss=0.1687, simple_loss=0.2348, pruned_loss=0.05132, over 11097.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2684, pruned_loss=0.07362, over 2423213.76 frames. 
], batch size: 16, lr: 2.20e-03, grad_scale: 32.0 2024-06-21 23:44:19,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=482238.1666666667, ans=0.015 2024-06-21 23:44:19,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482238.1666666667, ans=0.1 2024-06-21 23:44:20,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=482238.1666666667, ans=0.0 2024-06-21 23:44:21,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=482238.1666666667, ans=0.125 2024-06-21 23:46:22,415 INFO [train.py:1028] (1/2) Epoch 27, batch 0, loss[loss=0.1684, simple_loss=0.2368, pruned_loss=0.05003, over 12911.00 frames. ], tot_loss[loss=0.1684, simple_loss=0.2368, pruned_loss=0.05003, over 12911.00 frames. ], batch size: 36, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:46:22,415 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-21 23:46:29,651 INFO [train.py:1060] (1/2) Epoch 27, validation: loss=0.192, simple_loss=0.2529, pruned_loss=0.06551, over 351949.00 frames. 2024-06-21 23:46:29,652 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-21 23:46:51,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=482306.0, ans=0.0 2024-06-21 23:46:53,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=482306.0, ans=0.125 2024-06-21 23:46:55,717 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.271e+02 2.447e+02 2.740e+02 3.969e+02, threshold=4.894e+02, percent-clipped=0.0 2024-06-21 23:47:03,923 INFO [train.py:1028] (1/2) Epoch 27, batch 50, loss[loss=0.1921, simple_loss=0.2562, pruned_loss=0.06401, over 12691.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2504, pruned_loss=0.06623, over 574230.21 frames. ], batch size: 29, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:47:04,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=482342.6666666667, ans=0.025 2024-06-21 23:47:07,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=482342.6666666667, ans=0.125 2024-06-21 23:47:07,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=482342.6666666667, ans=0.2 2024-06-21 23:47:25,394 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.082e+01 2024-06-21 23:47:38,115 INFO [train.py:1028] (1/2) Epoch 27, batch 100, loss[loss=0.1687, simple_loss=0.2362, pruned_loss=0.05063, over 13337.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2493, pruned_loss=0.06556, over 1016812.63 frames. 
], batch size: 46, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:47:38,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=482434.3333333333, ans=0.125 2024-06-21 23:47:51,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=482452.6666666667, ans=0.05 2024-06-21 23:47:52,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=482452.6666666667, ans=0.125 2024-06-21 23:47:57,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482471.0, ans=0.1 2024-06-21 23:47:59,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=482471.0, ans=0.125 2024-06-21 23:48:07,854 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.198e+02 2.353e+02 2.553e+02 3.563e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 23:48:15,325 INFO [train.py:1028] (1/2) Epoch 27, batch 150, loss[loss=0.1883, simple_loss=0.2546, pruned_loss=0.06099, over 12584.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2489, pruned_loss=0.06483, over 1364537.70 frames. ], batch size: 29, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:48:27,200 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.65 vs. limit=15.0 2024-06-21 23:48:30,407 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482562.6666666667, ans=0.1 2024-06-21 23:48:34,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=482581.0, ans=0.04949747468305833 2024-06-21 23:48:35,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482581.0, ans=0.1 2024-06-21 23:48:47,596 INFO [train.py:1028] (1/2) Epoch 27, batch 200, loss[loss=0.2148, simple_loss=0.274, pruned_loss=0.07776, over 12539.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2496, pruned_loss=0.06495, over 1634735.21 frames. ], batch size: 202, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:48:49,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.27 vs. limit=15.0 2024-06-21 23:48:51,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=482617.6666666667, ans=0.1 2024-06-21 23:48:52,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482617.6666666667, ans=0.1 2024-06-21 23:48:57,753 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=482636.0, ans=0.0 2024-06-21 23:49:11,151 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.206e+02 2.368e+02 2.540e+02 3.047e+02, threshold=4.736e+02, percent-clipped=0.0 2024-06-21 23:49:12,125 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.03 vs. 
limit=15.0 2024-06-21 23:49:19,140 INFO [train.py:1028] (1/2) Epoch 27, batch 250, loss[loss=0.199, simple_loss=0.2509, pruned_loss=0.07352, over 13054.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.249, pruned_loss=0.06471, over 1846206.45 frames. ], batch size: 144, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:49:19,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=482709.3333333333, ans=0.2 2024-06-21 23:49:24,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=482709.3333333333, ans=0.125 2024-06-21 23:49:26,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=482709.3333333333, ans=0.015 2024-06-21 23:49:34,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=482727.6666666667, ans=0.0 2024-06-21 23:49:35,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=482746.0, ans=0.0 2024-06-21 23:49:37,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=482746.0, ans=0.2 2024-06-21 23:49:40,710 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.38 vs. limit=15.0 2024-06-21 23:49:55,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2024-06-21 23:49:59,463 INFO [train.py:1028] (1/2) Epoch 27, batch 300, loss[loss=0.2014, simple_loss=0.2527, pruned_loss=0.07509, over 13167.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2498, pruned_loss=0.06516, over 2008427.09 frames. ], batch size: 112, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:50:05,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=482819.3333333333, ans=0.125 2024-06-21 23:50:16,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=482837.6666666667, ans=0.1 2024-06-21 23:50:16,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=482837.6666666667, ans=0.2 2024-06-21 23:50:17,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=482856.0, ans=0.1 2024-06-21 23:50:22,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=482856.0, ans=0.0 2024-06-21 23:50:23,407 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.208e+02 2.288e+02 2.435e+02 3.423e+02, threshold=4.577e+02, percent-clipped=0.0 2024-06-21 23:50:30,945 INFO [train.py:1028] (1/2) Epoch 27, batch 350, loss[loss=0.1862, simple_loss=0.2475, pruned_loss=0.06239, over 12937.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2493, pruned_loss=0.06504, over 2137686.77 frames. 
], batch size: 33, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:50:35,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=482892.6666666667, ans=0.125 2024-06-21 23:50:35,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=482892.6666666667, ans=0.0 2024-06-21 23:50:36,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=482892.6666666667, ans=0.125 2024-06-21 23:50:40,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=482911.0, ans=0.025 2024-06-21 23:50:40,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=482911.0, ans=0.2 2024-06-21 23:50:43,033 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=482911.0, ans=0.2 2024-06-21 23:50:52,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=482947.6666666667, ans=0.025 2024-06-21 23:50:54,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=482947.6666666667, ans=0.2 2024-06-21 23:51:02,804 INFO [train.py:1028] (1/2) Epoch 27, batch 400, loss[loss=0.1799, simple_loss=0.2395, pruned_loss=0.06021, over 13245.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2494, pruned_loss=0.06468, over 2238984.42 frames. ], batch size: 63, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:51:10,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=483002.6666666667, ans=0.125 2024-06-21 23:51:23,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483039.3333333333, ans=0.1 2024-06-21 23:51:30,024 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.288e+02 2.459e+02 2.692e+02 3.203e+02, threshold=4.918e+02, percent-clipped=0.0 2024-06-21 23:51:30,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483057.6666666667, ans=0.1 2024-06-21 23:51:37,904 INFO [train.py:1028] (1/2) Epoch 27, batch 450, loss[loss=0.1822, simple_loss=0.2439, pruned_loss=0.06027, over 13275.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2493, pruned_loss=0.06466, over 2312584.85 frames. ], batch size: 67, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:51:40,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=483076.0, ans=0.125 2024-06-21 23:52:14,795 INFO [train.py:1028] (1/2) Epoch 27, batch 500, loss[loss=0.2051, simple_loss=0.258, pruned_loss=0.0761, over 13155.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2499, pruned_loss=0.06448, over 2375612.41 frames. ], batch size: 121, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:52:17,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=483167.6666666667, ans=0.125 2024-06-21 23:52:18,295 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.17 vs. 
limit=22.5 2024-06-21 23:52:22,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2024-06-21 23:52:28,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483204.3333333333, ans=0.1 2024-06-21 23:52:29,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=483204.3333333333, ans=0.0 2024-06-21 23:52:38,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=483222.6666666667, ans=0.2 2024-06-21 23:52:39,407 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.247e+02 2.361e+02 2.623e+02 4.004e+02, threshold=4.722e+02, percent-clipped=0.0 2024-06-21 23:52:40,224 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:52:47,186 INFO [train.py:1028] (1/2) Epoch 27, batch 550, loss[loss=0.1782, simple_loss=0.2334, pruned_loss=0.06152, over 12955.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2493, pruned_loss=0.06449, over 2419055.76 frames. ], batch size: 158, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:52:50,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=483259.3333333333, ans=0.125 2024-06-21 23:52:58,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=483277.6666666667, ans=0.2 2024-06-21 23:53:03,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=483296.0, ans=0.125 2024-06-21 23:53:16,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=483332.6666666667, ans=0.125 2024-06-21 23:53:19,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=483351.0, ans=10.0 2024-06-21 23:53:19,546 INFO [train.py:1028] (1/2) Epoch 27, batch 600, loss[loss=0.1721, simple_loss=0.2275, pruned_loss=0.05839, over 13051.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2487, pruned_loss=0.06424, over 2457644.60 frames. ], batch size: 144, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:53:20,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=483351.0, ans=0.125 2024-06-21 23:53:46,738 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.213e+02 2.306e+02 2.466e+02 3.302e+02, threshold=4.612e+02, percent-clipped=0.0 2024-06-21 23:53:53,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=483424.3333333333, ans=0.0 2024-06-21 23:53:54,399 INFO [train.py:1028] (1/2) Epoch 27, batch 650, loss[loss=0.1689, simple_loss=0.231, pruned_loss=0.05341, over 13171.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2484, pruned_loss=0.06385, over 2488472.37 frames. 
], batch size: 59, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:53:56,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=483442.6666666667, ans=0.125 2024-06-21 23:54:15,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=483479.3333333333, ans=0.125 2024-06-21 23:54:22,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2024-06-21 23:54:29,114 INFO [train.py:1028] (1/2) Epoch 27, batch 700, loss[loss=0.1956, simple_loss=0.2565, pruned_loss=0.06732, over 13264.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2484, pruned_loss=0.06405, over 2512505.88 frames. ], batch size: 46, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:54:36,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=483552.6666666667, ans=0.025 2024-06-21 23:54:36,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=483552.6666666667, ans=0.125 2024-06-21 23:54:40,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=483552.6666666667, ans=0.125 2024-06-21 23:54:50,833 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2024-06-21 23:54:53,753 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.264e+02 2.360e+02 2.555e+02 3.306e+02, threshold=4.720e+02, percent-clipped=0.0 2024-06-21 23:55:01,300 INFO [train.py:1028] (1/2) Epoch 27, batch 750, loss[loss=0.17, simple_loss=0.2362, pruned_loss=0.05193, over 13268.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2484, pruned_loss=0.06378, over 2528738.40 frames. ], batch size: 63, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:55:04,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=483626.0, ans=0.125 2024-06-21 23:55:17,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.01 vs. limit=15.0 2024-06-21 23:55:25,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=483681.0, ans=0.125 2024-06-21 23:55:27,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.91 vs. limit=10.0 2024-06-21 23:55:27,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=483699.3333333333, ans=0.5 2024-06-21 23:55:33,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=483717.6666666667, ans=0.09899494936611666 2024-06-21 23:55:33,561 INFO [train.py:1028] (1/2) Epoch 27, batch 800, loss[loss=0.1691, simple_loss=0.2389, pruned_loss=0.04961, over 12858.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2488, pruned_loss=0.06407, over 2541894.07 frames. 
], batch size: 36, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:55:39,531 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2024-06-21 23:55:45,815 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.56 vs. limit=15.0 2024-06-21 23:55:47,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=483736.0, ans=0.125 2024-06-21 23:55:51,831 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=483754.3333333333, ans=0.2 2024-06-21 23:55:55,358 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.14 vs. limit=22.5 2024-06-21 23:55:57,121 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.70 vs. limit=15.0 2024-06-21 23:55:58,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=483772.6666666667, ans=0.125 2024-06-21 23:56:01,176 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.196e+02 2.314e+02 2.476e+02 3.202e+02, threshold=4.628e+02, percent-clipped=0.0 2024-06-21 23:56:04,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=483791.0, ans=0.125 2024-06-21 23:56:08,012 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483791.0, ans=0.1 2024-06-21 23:56:09,356 INFO [train.py:1028] (1/2) Epoch 27, batch 850, loss[loss=0.2105, simple_loss=0.257, pruned_loss=0.08195, over 13101.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2483, pruned_loss=0.06399, over 2552306.41 frames. ], batch size: 95, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:56:25,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=483827.6666666667, ans=0.125 2024-06-21 23:56:28,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=483846.0, ans=0.125 2024-06-21 23:56:39,316 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:56:39,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=483882.6666666667, ans=0.125 2024-06-21 23:56:44,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=483882.6666666667, ans=0.0 2024-06-21 23:56:46,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=483901.0, ans=0.125 2024-06-21 23:56:46,669 INFO [train.py:1028] (1/2) Epoch 27, batch 900, loss[loss=0.1902, simple_loss=0.2501, pruned_loss=0.06516, over 12847.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2486, pruned_loss=0.06429, over 2557068.67 frames. 
], batch size: 36, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:56:48,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483901.0, ans=0.1 2024-06-21 23:56:50,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.88 vs. limit=10.0 2024-06-21 23:56:58,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=483937.6666666667, ans=0.125 2024-06-21 23:57:04,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=483937.6666666667, ans=0.0 2024-06-21 23:57:07,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=483956.0, ans=0.02 2024-06-21 23:57:11,689 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.316e+02 2.441e+02 2.622e+02 3.299e+02, threshold=4.882e+02, percent-clipped=0.0 2024-06-21 23:57:14,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=483974.3333333333, ans=0.125 2024-06-21 23:57:19,534 INFO [train.py:1028] (1/2) Epoch 27, batch 950, loss[loss=0.1811, simple_loss=0.2439, pruned_loss=0.0591, over 12890.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2485, pruned_loss=0.06409, over 2559613.05 frames. ], batch size: 39, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:57:19,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=483992.6666666667, ans=0.125 2024-06-21 23:57:30,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=484011.0, ans=0.2 2024-06-21 23:57:30,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484011.0, ans=0.1 2024-06-21 23:57:31,865 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=12.0 2024-06-21 23:57:40,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484029.3333333333, ans=0.1 2024-06-21 23:57:55,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=484066.0, ans=0.125 2024-06-21 23:57:57,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=484066.0, ans=0.0 2024-06-21 23:57:58,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=484084.3333333333, ans=0.0 2024-06-21 23:57:58,914 INFO [train.py:1028] (1/2) Epoch 27, batch 1000, loss[loss=0.2096, simple_loss=0.2729, pruned_loss=0.07313, over 13279.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2481, pruned_loss=0.06419, over 2561373.24 frames. 
], batch size: 49, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:58:06,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=484102.6666666667, ans=0.2 2024-06-21 23:58:25,577 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.272e+02 2.402e+02 2.595e+02 3.426e+02, threshold=4.804e+02, percent-clipped=0.0 2024-06-21 23:58:28,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=484157.6666666667, ans=0.0 2024-06-21 23:58:30,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=484157.6666666667, ans=0.04949747468305833 2024-06-21 23:58:33,263 INFO [train.py:1028] (1/2) Epoch 27, batch 1050, loss[loss=0.175, simple_loss=0.2456, pruned_loss=0.05222, over 13139.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2489, pruned_loss=0.06471, over 2564451.58 frames. ], batch size: 77, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:58:34,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=484176.0, ans=0.125 2024-06-21 23:58:34,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=484176.0, ans=0.2 2024-06-21 23:58:45,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=484194.3333333333, ans=0.1 2024-06-21 23:58:48,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=484212.6666666667, ans=0.125 2024-06-21 23:58:54,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=484231.0, ans=0.1 2024-06-21 23:58:59,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=484249.3333333333, ans=0.125 2024-06-21 23:59:05,947 INFO [train.py:1028] (1/2) Epoch 27, batch 1100, loss[loss=0.1813, simple_loss=0.2468, pruned_loss=0.05791, over 13263.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2488, pruned_loss=0.06438, over 2568685.61 frames. ], batch size: 52, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:59:10,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2024-06-21 23:59:30,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=484322.6666666667, ans=0.125 2024-06-21 23:59:30,738 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.265e+02 2.353e+02 2.568e+02 3.425e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 23:59:34,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.61 vs. limit=12.0 2024-06-21 23:59:34,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=484341.0, ans=0.125 2024-06-21 23:59:42,517 INFO [train.py:1028] (1/2) Epoch 27, batch 1150, loss[loss=0.1857, simple_loss=0.2517, pruned_loss=0.05987, over 13300.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2491, pruned_loss=0.06484, over 2570754.08 frames. 
2024-06-21 23:59:44,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=484359.3333333333, ans=0.04949747468305833
2024-06-21 23:59:51,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484377.6666666667, ans=0.1
2024-06-22 00:00:11,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=484432.6666666667, ans=0.0
2024-06-22 00:00:18,600 INFO [train.py:1028] (1/2) Epoch 27, batch 1200, loss[loss=0.1805, simple_loss=0.2501, pruned_loss=0.05549, over 13160.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.249, pruned_loss=0.06493, over 2572543.42 frames. ], batch size: 77, lr: 2.15e-03, grad_scale: 64.0
2024-06-22 00:00:22,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=484451.0, ans=0.05
2024-06-22 00:00:26,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=484469.3333333333, ans=0.125
2024-06-22 00:00:28,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=484469.3333333333, ans=0.125
2024-06-22 00:00:28,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=484469.3333333333, ans=0.0
2024-06-22 00:00:37,468 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=484487.6666666667, ans=0.0
2024-06-22 00:00:38,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=484506.0, ans=0.125
2024-06-22 00:00:39,651 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.72 vs. limit=22.5
2024-06-22 00:00:43,751 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.303e+02 2.410e+02 2.566e+02 3.458e+02, threshold=4.820e+02, percent-clipped=0.0
2024-06-22 00:00:51,466 INFO [train.py:1028] (1/2) Epoch 27, batch 1250, loss[loss=0.1774, simple_loss=0.232, pruned_loss=0.06146, over 13180.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2489, pruned_loss=0.06483, over 2582162.65 frames. ], batch size: 112, lr: 2.15e-03, grad_scale: 64.0
2024-06-22 00:01:00,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484561.0, ans=0.1
2024-06-22 00:01:05,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=484579.3333333333, ans=0.0
2024-06-22 00:01:07,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=484579.3333333333, ans=0.2
2024-06-22 00:01:19,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=484616.0, ans=0.125
2024-06-22 00:01:24,060 INFO [train.py:1028] (1/2) Epoch 27, batch 1300, loss[loss=0.2052, simple_loss=0.2597, pruned_loss=0.07534, over 12739.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2493, pruned_loss=0.0649, over 2582522.17 frames. ], batch size: 176, lr: 2.15e-03, grad_scale: 64.0
2024-06-22 00:01:25,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=484634.3333333333, ans=0.2
2024-06-22 00:01:31,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.99 vs. limit=15.0
2024-06-22 00:01:37,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=484671.0, ans=0.0
2024-06-22 00:01:38,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=484671.0, ans=0.125
2024-06-22 00:01:39,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=484671.0, ans=0.2
2024-06-22 00:01:49,022 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.317e+02 2.469e+02 2.739e+02 3.690e+02, threshold=4.938e+02, percent-clipped=0.0
2024-06-22 00:02:01,387 INFO [train.py:1028] (1/2) Epoch 27, batch 1350, loss[loss=0.1732, simple_loss=0.2466, pruned_loss=0.04987, over 13251.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2487, pruned_loss=0.06427, over 2584599.63 frames. ], batch size: 59, lr: 2.15e-03, grad_scale: 64.0
2024-06-22 00:02:07,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=484726.0, ans=0.125
2024-06-22 00:02:29,298 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:02:30,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=484799.3333333333, ans=0.05
2024-06-22 00:02:33,071 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=484799.3333333333, ans=0.2
2024-06-22 00:02:35,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=484799.3333333333, ans=0.0
2024-06-22 00:02:37,399 INFO [train.py:1028] (1/2) Epoch 27, batch 1400, loss[loss=0.1883, simple_loss=0.2538, pruned_loss=0.06136, over 12467.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2491, pruned_loss=0.06459, over 2586535.34 frames. ], batch size: 25, lr: 2.15e-03, grad_scale: 64.0
2024-06-22 00:02:43,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.86 vs. limit=22.5
2024-06-22 00:02:51,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=484854.3333333333, ans=0.125
2024-06-22 00:02:53,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0
2024-06-22 00:03:02,073 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.301e+02 2.446e+02 2.727e+02 3.466e+02, threshold=4.893e+02, percent-clipped=0.0
2024-06-22 00:03:09,908 INFO [train.py:1028] (1/2) Epoch 27, batch 1450, loss[loss=0.1793, simple_loss=0.2323, pruned_loss=0.06313, over 13131.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2488, pruned_loss=0.06462, over 2585865.49 frames. ], batch size: 121, lr: 2.15e-03, grad_scale: 64.0
2024-06-22 00:03:11,558 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.09 vs. limit=15.0
2024-06-22 00:03:34,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=484964.3333333333, ans=0.025
2024-06-22 00:03:40,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484982.6666666667, ans=0.1
2024-06-22 00:03:42,168 INFO [train.py:1028] (1/2) Epoch 27, batch 1500, loss[loss=0.2007, simple_loss=0.2602, pruned_loss=0.07058, over 13192.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2488, pruned_loss=0.0648, over 2588806.58 frames. ], batch size: 83, lr: 2.15e-03, grad_scale: 64.0
2024-06-22 00:03:49,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=485001.0, ans=0.1
2024-06-22 00:03:51,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=485019.3333333333, ans=0.125
2024-06-22 00:04:10,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.287e+02 2.420e+02 2.674e+02 3.535e+02, threshold=4.841e+02, percent-clipped=0.0
2024-06-22 00:04:18,452 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.24 vs. limit=15.0
2024-06-22 00:04:20,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=485074.3333333333, ans=0.0
2024-06-22 00:04:22,432 INFO [train.py:1028] (1/2) Epoch 27, batch 1550, loss[loss=0.1882, simple_loss=0.2406, pruned_loss=0.06786, over 13232.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2494, pruned_loss=0.06518, over 2584527.04 frames. ], batch size: 103, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:04:27,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=485092.6666666667, ans=0.0
2024-06-22 00:04:32,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=485111.0, ans=0.0
2024-06-22 00:04:32,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0
2024-06-22 00:04:47,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=485147.6666666667, ans=0.0
2024-06-22 00:04:55,288 INFO [train.py:1028] (1/2) Epoch 27, batch 1600, loss[loss=0.1767, simple_loss=0.2389, pruned_loss=0.05724, over 13141.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.249, pruned_loss=0.06479, over 2579665.34 frames. ], batch size: 77, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:05:00,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=485184.3333333333, ans=0.125
2024-06-22 00:05:04,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=485202.6666666667, ans=0.0
2024-06-22 00:05:07,116 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.41 vs. limit=12.0
2024-06-22 00:05:09,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=485221.0, ans=0.0
2024-06-22 00:05:20,424 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.330e+02 2.446e+02 2.662e+02 3.887e+02, threshold=4.892e+02, percent-clipped=0.0
2024-06-22 00:05:27,712 INFO [train.py:1028] (1/2) Epoch 27, batch 1650, loss[loss=0.2047, simple_loss=0.2487, pruned_loss=0.08033, over 13151.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2492, pruned_loss=0.06521, over 2576927.19 frames. ], batch size: 95, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:05:29,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=485276.0, ans=0.025
2024-06-22 00:05:33,190 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.39 vs. limit=12.0
2024-06-22 00:05:39,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=485294.3333333333, ans=0.0
2024-06-22 00:05:41,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=485312.6666666667, ans=0.0
2024-06-22 00:05:42,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=485312.6666666667, ans=0.125
2024-06-22 00:05:45,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=485312.6666666667, ans=0.125
2024-06-22 00:05:54,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=485331.0, ans=0.125
2024-06-22 00:05:54,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=485331.0, ans=0.125
2024-06-22 00:06:02,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=485367.6666666667, ans=0.0
2024-06-22 00:06:02,623 INFO [train.py:1028] (1/2) Epoch 27, batch 1700, loss[loss=0.1822, simple_loss=0.2465, pruned_loss=0.05893, over 12659.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2491, pruned_loss=0.06493, over 2581533.89 frames. ], batch size: 26, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:06:31,554 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.228e+02 2.384e+02 2.554e+02 3.363e+02, threshold=4.768e+02, percent-clipped=0.0
2024-06-22 00:06:34,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=485441.0, ans=0.125
2024-06-22 00:06:38,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=485459.3333333333, ans=0.0
2024-06-22 00:06:38,668 INFO [train.py:1028] (1/2) Epoch 27, batch 1750, loss[loss=0.2013, simple_loss=0.2649, pruned_loss=0.06887, over 12661.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2493, pruned_loss=0.06488, over 2582488.29 frames. ], batch size: 22, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:07:04,106 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0
2024-06-22 00:07:06,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=485532.6666666667, ans=0.0
2024-06-22 00:07:08,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=485532.6666666667, ans=0.125
2024-06-22 00:07:11,728 INFO [train.py:1028] (1/2) Epoch 27, batch 1800, loss[loss=0.1826, simple_loss=0.243, pruned_loss=0.06107, over 13265.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2498, pruned_loss=0.06496, over 2583057.36 frames. ], batch size: 67, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:07:13,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485551.0, ans=0.1
2024-06-22 00:07:26,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=485587.6666666667, ans=0.125
2024-06-22 00:07:37,077 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.269e+02 2.412e+02 2.546e+02 3.108e+02, threshold=4.824e+02, percent-clipped=0.0
2024-06-22 00:07:44,216 INFO [train.py:1028] (1/2) Epoch 27, batch 1850, loss[loss=0.1797, simple_loss=0.2312, pruned_loss=0.06415, over 13208.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2495, pruned_loss=0.06481, over 2583524.17 frames. ], batch size: 83, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:07:46,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.53 vs. limit=15.0
2024-06-22 00:07:46,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.02 vs. limit=15.0
2024-06-22 00:07:50,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=485642.6666666667, ans=0.125
2024-06-22 00:07:52,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=485642.6666666667, ans=0.2
2024-06-22 00:07:58,791 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:08:06,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.99 vs. limit=15.0
2024-06-22 00:08:08,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=485697.6666666667, ans=0.0
2024-06-22 00:08:22,436 INFO [train.py:1028] (1/2) Epoch 27, batch 1900, loss[loss=0.1879, simple_loss=0.2382, pruned_loss=0.06875, over 13118.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.249, pruned_loss=0.06488, over 2586379.47 frames. ], batch size: 95, lr: 2.15e-03, grad_scale: 32.0
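In every optim.py warning in this stretch of the log, the printed threshold equals Clipping_scale times the middle of the five grad-norm statistics (e.g. 2.0 * 2.412e+02 = 4.824e+02 in the warning above), so the clipping threshold appears to track twice the median gradient norm. A minimal sketch of that relation, assuming the five printed values are the min/25%/median/75%/max of recently observed gradient norms (an assumption, not the optimizer's actual code):

    import torch

    def clipping_threshold(recent_grad_norms, clipping_scale=2.0):
        # Five quantiles of the recent gradient norms, plus a threshold
        # equal to clipping_scale times the median -- the relation the
        # logged numbers satisfy.
        norms = torch.tensor(recent_grad_norms)
        q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        return q, clipping_scale * q[2]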
2024-06-22 00:08:30,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=485752.6666666667, ans=0.0
2024-06-22 00:08:35,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=485771.0, ans=0.0
2024-06-22 00:08:47,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=485789.3333333333, ans=0.125
2024-06-22 00:08:48,185 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.304e+02 2.481e+02 2.722e+02 3.607e+02, threshold=4.961e+02, percent-clipped=0.0
2024-06-22 00:08:55,059 INFO [train.py:1028] (1/2) Epoch 27, batch 1950, loss[loss=0.1699, simple_loss=0.2324, pruned_loss=0.05373, over 13244.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2485, pruned_loss=0.06499, over 2592483.25 frames. ], batch size: 52, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:08:58,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=485826.0, ans=0.125
2024-06-22 00:09:11,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=485862.6666666667, ans=0.125
2024-06-22 00:09:17,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=485881.0, ans=0.125
2024-06-22 00:09:21,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=485899.3333333333, ans=0.125
2024-06-22 00:09:26,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=485899.3333333333, ans=0.05
2024-06-22 00:09:28,039 INFO [train.py:1028] (1/2) Epoch 27, batch 2000, loss[loss=0.1843, simple_loss=0.2514, pruned_loss=0.05862, over 12626.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2484, pruned_loss=0.06495, over 2588527.34 frames. ], batch size: 22, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:09:31,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=485917.6666666667, ans=0.125
2024-06-22 00:09:32,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=485917.6666666667, ans=0.125
2024-06-22 00:09:39,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=485936.0, ans=0.0
2024-06-22 00:09:44,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=485954.3333333333, ans=0.0
2024-06-22 00:09:51,497 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:09:52,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=485972.6666666667, ans=0.125
2024-06-22 00:09:53,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=485972.6666666667, ans=0.125
2024-06-22 00:09:56,674 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.331e+02 2.448e+02 2.583e+02 3.438e+02, threshold=4.896e+02, percent-clipped=0.0
2024-06-22 00:09:58,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=485991.0, ans=0.125
2024-06-22 00:10:01,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=485991.0, ans=0.95
2024-06-22 00:10:03,667 INFO [train.py:1028] (1/2) Epoch 27, batch 2050, loss[loss=0.1751, simple_loss=0.2331, pruned_loss=0.05854, over 12712.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2491, pruned_loss=0.06523, over 2583685.96 frames. ], batch size: 29, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:10:04,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=486009.3333333333, ans=0.125
2024-06-22 00:10:04,713 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=15.0
2024-06-22 00:10:05,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=486009.3333333333, ans=0.125
2024-06-22 00:10:23,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=486046.0, ans=0.2
2024-06-22 00:10:31,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0
2024-06-22 00:10:37,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486101.0, ans=0.1
2024-06-22 00:10:38,329 INFO [train.py:1028] (1/2) Epoch 27, batch 2100, loss[loss=0.1787, simple_loss=0.244, pruned_loss=0.05677, over 13240.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2495, pruned_loss=0.06485, over 2585602.17 frames. ], batch size: 59, lr: 2.15e-03, grad_scale: 32.0
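The scaling.py:214 lines that dominate the log report scheduled hyperparameters (dropout probabilities, skip rates, balancer limits) evaluated at the current batch_count. A minimal sketch of one plausible ScheduledFloat-style schedule (an illustrative assumption, not icefall's actual implementation): a value that is piecewise-linear in batch_count between breakpoints and held constant outside them:

    def scheduled_float(batch_count, points):
        # points: [(batch_count, value), ...] sorted by batch_count.
        if batch_count <= points[0][0]:
            return points[0][1]
        for (b0, v0), (b1, v1) in zip(points, points[1:]):
            if b0 <= batch_count <= b1:
                return v0 + (batch_count - b0) / (b1 - b0) * (v1 - v0)
        return points[-1][1]

    # e.g. a hypothetical dropout_p that decays from 0.3 to 0.1 over the
    # first 20000 batches and then stays at 0.1:
    scheduled_float(486000.0, [(0.0, 0.3), (20000.0, 0.1)])  # -> 0.1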
2024-06-22 00:10:50,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=486119.3333333333, ans=0.0
2024-06-22 00:10:57,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=486156.0, ans=0.125
2024-06-22 00:10:57,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=486156.0, ans=0.0
2024-06-22 00:11:03,783 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.303e+02 2.446e+02 2.668e+02 4.274e+02, threshold=4.892e+02, percent-clipped=0.0
2024-06-22 00:11:03,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=486174.3333333333, ans=0.125
2024-06-22 00:11:04,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.04 vs. limit=10.0
2024-06-22 00:11:10,982 INFO [train.py:1028] (1/2) Epoch 27, batch 2150, loss[loss=0.1742, simple_loss=0.2386, pruned_loss=0.05489, over 13273.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2496, pruned_loss=0.06483, over 2588432.35 frames. ], batch size: 52, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:11:22,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=486211.0, ans=0.125
2024-06-22 00:11:34,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=486247.6666666667, ans=0.125
2024-06-22 00:11:38,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=486266.0, ans=0.125
2024-06-22 00:11:40,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=486266.0, ans=0.125
2024-06-22 00:11:44,280 INFO [train.py:1028] (1/2) Epoch 27, batch 2200, loss[loss=0.2018, simple_loss=0.2568, pruned_loss=0.07339, over 13226.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.25, pruned_loss=0.06511, over 2588536.56 frames. ], batch size: 83, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:11:50,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=486284.3333333333, ans=0.0
2024-06-22 00:11:53,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486302.6666666667, ans=0.1
2024-06-22 00:11:55,096 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.05 vs. limit=10.0
2024-06-22 00:12:04,300 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0
2024-06-22 00:12:12,373 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.238e+02 2.351e+02 2.504e+02 3.721e+02, threshold=4.703e+02, percent-clipped=0.0
2024-06-22 00:12:24,370 INFO [train.py:1028] (1/2) Epoch 27, batch 2250, loss[loss=0.2008, simple_loss=0.2591, pruned_loss=0.07123, over 13260.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2498, pruned_loss=0.06506, over 2587762.47 frames. ], batch size: 63, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:12:31,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=486394.3333333333, ans=0.125
2024-06-22 00:12:33,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0
2024-06-22 00:12:38,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=486412.6666666667, ans=0.2
2024-06-22 00:12:40,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=486412.6666666667, ans=0.0
2024-06-22 00:12:46,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486431.0, ans=0.1
2024-06-22 00:12:52,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=486449.3333333333, ans=0.125
2024-06-22 00:12:54,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=486449.3333333333, ans=0.1
2024-06-22 00:12:56,173 INFO [train.py:1028] (1/2) Epoch 27, batch 2300, loss[loss=0.1725, simple_loss=0.2318, pruned_loss=0.05657, over 12879.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2496, pruned_loss=0.06487, over 2582181.66 frames. ], batch size: 33, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:13:00,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=486467.6666666667, ans=0.125
2024-06-22 00:13:03,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=486486.0, ans=0.0
2024-06-22 00:13:03,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=486486.0, ans=0.125
2024-06-22 00:13:22,679 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.240e+02 2.403e+02 2.568e+02 3.306e+02, threshold=4.805e+02, percent-clipped=0.0
2024-06-22 00:13:27,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=486541.0, ans=0.2
2024-06-22 00:13:28,934 INFO [train.py:1028] (1/2) Epoch 27, batch 2350, loss[loss=0.1844, simple_loss=0.2443, pruned_loss=0.06225, over 13242.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2506, pruned_loss=0.06531, over 2585699.39 frames. ], batch size: 67, lr: 2.15e-03, grad_scale: 16.0
2024-06-22 00:13:38,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=486577.6666666667, ans=0.125
2024-06-22 00:13:40,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=486577.6666666667, ans=0.125
2024-06-22 00:13:52,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=486614.3333333333, ans=0.0
2024-06-22 00:14:05,525 INFO [train.py:1028] (1/2) Epoch 27, batch 2400, loss[loss=0.2044, simple_loss=0.2656, pruned_loss=0.07162, over 13298.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2496, pruned_loss=0.0651, over 2588641.93 frames. ], batch size: 46, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:14:09,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=486651.0, ans=0.0
2024-06-22 00:14:20,309 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0
2024-06-22 00:14:25,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=486687.6666666667, ans=0.0
2024-06-22 00:14:28,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486706.0, ans=0.1
2024-06-22 00:14:29,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.13 vs. limit=15.0
2024-06-22 00:14:33,804 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.240e+02 2.342e+02 2.480e+02 2.978e+02, threshold=4.685e+02, percent-clipped=0.0
2024-06-22 00:14:34,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486724.3333333333, ans=0.1
2024-06-22 00:14:39,988 INFO [train.py:1028] (1/2) Epoch 27, batch 2450, loss[loss=0.1745, simple_loss=0.2334, pruned_loss=0.0578, over 13305.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2488, pruned_loss=0.06514, over 2584111.10 frames. ], batch size: 63, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:14:54,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=486779.3333333333, ans=0.125
2024-06-22 00:15:04,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=486797.6666666667, ans=0.025
2024-06-22 00:15:10,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=15.0
2024-06-22 00:15:12,572 INFO [train.py:1028] (1/2) Epoch 27, batch 2500, loss[loss=0.1853, simple_loss=0.2407, pruned_loss=0.06494, over 13230.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2481, pruned_loss=0.06499, over 2587977.98 frames. ], batch size: 83, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:15:17,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=486834.3333333333, ans=0.125
2024-06-22 00:15:18,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=486852.6666666667, ans=0.2
2024-06-22 00:15:19,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=486852.6666666667, ans=0.125
2024-06-22 00:15:41,455 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.221e+02 2.326e+02 2.555e+02 3.079e+02, threshold=4.652e+02, percent-clipped=0.0
2024-06-22 00:15:47,938 INFO [train.py:1028] (1/2) Epoch 27, batch 2550, loss[loss=0.1835, simple_loss=0.2497, pruned_loss=0.05868, over 12405.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2475, pruned_loss=0.06479, over 2587543.18 frames. ], batch size: 22, lr: 2.15e-03, grad_scale: 32.0
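The grad_scale field in the batch reports moves in powers of two (32 -> 64 -> 32 -> 16 -> 32 across this stretch), the signature of dynamic loss scaling under fp16 training. A minimal sketch of the usual recipe (assumed to be GradScaler-like behaviour, not copied from the run's actual scaler):

    def update_grad_scale(scale, found_inf, good_steps, growth_interval=1000):
        # Halve the scale when non-finite gradients appear; double it
        # after growth_interval consecutive overflow-free steps.
        if found_inf:
            return scale / 2.0, 0
        good_steps += 1
        if good_steps >= growth_interval:
            return scale * 2.0, 0
        return scale, good_steps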
2024-06-22 00:15:55,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=486944.3333333333, ans=0.125
2024-06-22 00:15:55,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=486944.3333333333, ans=0.125
2024-06-22 00:16:00,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486962.6666666667, ans=0.1
2024-06-22 00:16:03,493 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.79 vs. limit=15.0
2024-06-22 00:16:08,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=486981.0, ans=0.1
2024-06-22 00:16:23,385 INFO [train.py:1028] (1/2) Epoch 27, batch 2600, loss[loss=0.1724, simple_loss=0.2404, pruned_loss=0.0522, over 13241.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.246, pruned_loss=0.06436, over 2588005.31 frames. ], batch size: 52, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:16:29,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=487036.0, ans=0.2
2024-06-22 00:16:30,176 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.31 vs. limit=15.0
2024-06-22 00:16:31,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=487036.0, ans=0.0
2024-06-22 00:16:33,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=487036.0, ans=0.125
2024-06-22 00:16:37,254 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.59 vs. limit=22.5
2024-06-22 00:16:38,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=487054.3333333333, ans=12.0
2024-06-22 00:16:41,696 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:16:46,859 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=487072.6666666667, ans=0.2
2024-06-22 00:16:49,386 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.217e+02 2.364e+02 2.495e+02 3.597e+02, threshold=4.728e+02, percent-clipped=0.0
2024-06-22 00:16:56,094 INFO [train.py:1028] (1/2) Epoch 27, batch 2650, loss[loss=0.1919, simple_loss=0.2446, pruned_loss=0.06963, over 13008.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2451, pruned_loss=0.06422, over 2587691.61 frames. ], batch size: 144, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:17:14,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=487146.0, ans=0.0
2024-06-22 00:17:19,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.03 vs. limit=15.0
2024-06-22 00:17:27,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=487182.6666666667, ans=0.125
2024-06-22 00:17:31,285 INFO [train.py:1028] (1/2) Epoch 27, batch 2700, loss[loss=0.178, simple_loss=0.2347, pruned_loss=0.06062, over 13186.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2438, pruned_loss=0.06397, over 2586182.38 frames. ], batch size: 89, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:17:35,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=487201.0, ans=0.0
2024-06-22 00:17:48,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=487237.6666666667, ans=0.5
2024-06-22 00:17:51,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=487256.0, ans=0.125
2024-06-22 00:17:55,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=487256.0, ans=0.2
2024-06-22 00:17:57,025 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.313e+02 2.445e+02 2.618e+02 3.487e+02, threshold=4.890e+02, percent-clipped=0.0
2024-06-22 00:17:58,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.97 vs. limit=10.0
2024-06-22 00:18:07,194 INFO [train.py:1028] (1/2) Epoch 27, batch 2750, loss[loss=0.2026, simple_loss=0.2615, pruned_loss=0.07184, over 13276.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2425, pruned_loss=0.06319, over 2582953.42 frames. ], batch size: 43, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:18:09,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=487292.6666666667, ans=0.0
2024-06-22 00:18:18,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487311.0, ans=0.1
2024-06-22 00:18:21,746 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:18:25,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=487329.3333333333, ans=0.07
2024-06-22 00:18:35,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=487366.0, ans=0.0
2024-06-22 00:18:40,506 INFO [train.py:1028] (1/2) Epoch 27, batch 2800, loss[loss=0.1841, simple_loss=0.2336, pruned_loss=0.06731, over 10803.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2417, pruned_loss=0.06299, over 2580069.46 frames. ], batch size: 304, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:18:41,520 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0
2024-06-22 00:18:43,863 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:18:44,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=487384.3333333333, ans=0.0
2024-06-22 00:18:46,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=487402.6666666667, ans=0.125
2024-06-22 00:18:50,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=487402.6666666667, ans=0.125
2024-06-22 00:18:53,154 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.13 vs. limit=15.0
2024-06-22 00:19:00,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.17 vs. limit=15.0
2024-06-22 00:19:05,980 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.299e+02 2.434e+02 2.749e+02 3.524e+02, threshold=4.867e+02, percent-clipped=0.0
2024-06-22 00:19:08,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=487457.6666666667, ans=0.125
2024-06-22 00:19:12,255 INFO [train.py:1028] (1/2) Epoch 27, batch 2850, loss[loss=0.1726, simple_loss=0.2279, pruned_loss=0.05859, over 13006.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2413, pruned_loss=0.06317, over 2578671.48 frames. ], batch size: 48, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:19:19,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=487476.0, ans=0.125
2024-06-22 00:19:30,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=487512.6666666667, ans=0.0
2024-06-22 00:19:30,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.61 vs. limit=22.5
2024-06-22 00:19:42,674 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:19:44,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=487549.3333333333, ans=0.1
2024-06-22 00:19:46,727 INFO [train.py:1028] (1/2) Epoch 27, batch 2900, loss[loss=0.178, simple_loss=0.2384, pruned_loss=0.05883, over 13177.00 frames. ], tot_loss[loss=0.1826, simple_loss=0.2399, pruned_loss=0.06264, over 2587113.43 frames. ], batch size: 55, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:19:58,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=487586.0, ans=0.025
2024-06-22 00:20:00,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487586.0, ans=0.1
2024-06-22 00:20:10,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487622.6666666667, ans=0.1
2024-06-22 00:20:15,584 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.234e+02 2.347e+02 2.565e+02 3.663e+02, threshold=4.694e+02, percent-clipped=0.0
2024-06-22 00:20:16,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=487641.0, ans=0.0
2024-06-22 00:20:18,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=487641.0, ans=0.2
2024-06-22 00:20:18,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=487641.0, ans=0.125
2024-06-22 00:20:22,203 INFO [train.py:1028] (1/2) Epoch 27, batch 2950, loss[loss=0.1658, simple_loss=0.2237, pruned_loss=0.05401, over 13258.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2393, pruned_loss=0.06232, over 2581893.48 frames. ], batch size: 43, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:20:25,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.85 vs. limit=15.0
2024-06-22 00:20:35,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=487696.0, ans=0.125
2024-06-22 00:20:38,641 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.57 vs. limit=6.0
2024-06-22 00:20:44,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=487714.3333333333, ans=0.0
2024-06-22 00:20:55,992 INFO [train.py:1028] (1/2) Epoch 27, batch 3000, loss[loss=0.1921, simple_loss=0.2521, pruned_loss=0.066, over 13253.00 frames. ], tot_loss[loss=0.1814, simple_loss=0.2386, pruned_loss=0.06205, over 2579602.69 frames. ], batch size: 59, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:20:55,993 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-22 00:21:03,890 INFO [train.py:1060] (1/2) Epoch 27, validation: loss=0.1903, simple_loss=0.2509, pruned_loss=0.06482, over 351949.00 frames.
2024-06-22 00:21:03,891 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-22 00:21:05,011 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0
2024-06-22 00:21:34,039 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.298e+02 2.509e+02 2.774e+02 3.609e+02, threshold=5.018e+02, percent-clipped=0.0
2024-06-22 00:21:36,981 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0
2024-06-22 00:21:39,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=487824.3333333333, ans=0.125
2024-06-22 00:21:40,638 INFO [train.py:1028] (1/2) Epoch 27, batch 3050, loss[loss=0.1746, simple_loss=0.2314, pruned_loss=0.05884, over 13323.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2386, pruned_loss=0.06252, over 2579628.70 frames. ], batch size: 46, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:21:47,366 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.86 vs. limit=15.0
2024-06-22 00:21:51,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0
2024-06-22 00:21:55,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=487861.0, ans=0.2
2024-06-22 00:21:58,264 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.48 vs. limit=15.0
2024-06-22 00:22:05,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=487897.6666666667, ans=0.025
2024-06-22 00:22:14,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.47 vs. limit=15.0
2024-06-22 00:22:16,121 INFO [train.py:1028] (1/2) Epoch 27, batch 3100, loss[loss=0.1783, simple_loss=0.2271, pruned_loss=0.06478, over 13019.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2374, pruned_loss=0.0618, over 2580265.48 frames. ], batch size: 144, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:22:17,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.60 vs. limit=12.0
2024-06-22 00:22:22,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=487952.6666666667, ans=0.2
2024-06-22 00:22:23,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=487952.6666666667, ans=0.0
2024-06-22 00:22:26,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487952.6666666667, ans=0.1
2024-06-22 00:22:41,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=487989.3333333333, ans=0.125
2024-06-22 00:22:42,651 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.167e+02 2.307e+02 2.523e+02 3.204e+02, threshold=4.615e+02, percent-clipped=0.0
2024-06-22 00:22:44,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488007.6666666667, ans=0.1
2024-06-22 00:22:45,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=488007.6666666667, ans=0.0
2024-06-22 00:22:49,582 INFO [train.py:1028] (1/2) Epoch 27, batch 3150, loss[loss=0.1756, simple_loss=0.2266, pruned_loss=0.0623, over 12975.00 frames. ], tot_loss[loss=0.1798, simple_loss=0.2367, pruned_loss=0.06149, over 2582205.32 frames. ], batch size: 158, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:22:50,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488026.0, ans=0.1
2024-06-22 00:22:53,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=488026.0, ans=0.125
2024-06-22 00:23:01,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=488044.3333333333, ans=0.0
2024-06-22 00:23:07,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=488062.6666666667, ans=0.125
2024-06-22 00:23:07,917 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:23:12,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=488081.0, ans=0.125
2024-06-22 00:23:15,271 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=15.0
2024-06-22 00:23:21,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=488099.3333333333, ans=0.125
2024-06-22 00:23:25,951 INFO [train.py:1028] (1/2) Epoch 27, batch 3200, loss[loss=0.1605, simple_loss=0.2198, pruned_loss=0.05056, over 13173.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2364, pruned_loss=0.06133, over 2583927.07 frames. ], batch size: 55, lr: 2.15e-03, grad_scale: 32.0
2024-06-22 00:23:29,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=488117.6666666667, ans=0.025
2024-06-22 00:23:30,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=488117.6666666667, ans=0.2
2024-06-22 00:23:37,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488136.0, ans=0.1
2024-06-22 00:23:39,123 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0
2024-06-22 00:23:42,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488154.3333333333, ans=0.1
2024-06-22 00:23:45,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=488172.6666666667, ans=0.025
2024-06-22 00:23:46,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=488172.6666666667, ans=0.125
2024-06-22 00:23:51,828 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.269e+02 2.369e+02 2.509e+02 2.933e+02, threshold=4.738e+02, percent-clipped=0.0
2024-06-22 00:23:51,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488191.0, ans=0.1
2024-06-22 00:23:59,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0
2024-06-22 00:24:01,413 INFO [train.py:1028] (1/2) Epoch 27, batch 3250, loss[loss=0.1579, simple_loss=0.2206, pruned_loss=0.04756, over 13231.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2356, pruned_loss=0.06128, over 2588353.29 frames. ], batch size: 72, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:24:03,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=488209.3333333333, ans=0.05
2024-06-22 00:24:11,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=488227.6666666667, ans=0.125
2024-06-22 00:24:20,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=488246.0, ans=0.025
2024-06-22 00:24:34,894 INFO [train.py:1028] (1/2) Epoch 27, batch 3300, loss[loss=0.1757, simple_loss=0.2277, pruned_loss=0.06184, over 12733.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2347, pruned_loss=0.06073, over 2584360.32 frames. ], batch size: 176, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:24:44,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488319.3333333333, ans=0.1
2024-06-22 00:25:00,399 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.224e+02 2.386e+02 2.522e+02 3.680e+02, threshold=4.772e+02, percent-clipped=0.0
2024-06-22 00:25:11,742 INFO [train.py:1028] (1/2) Epoch 27, batch 3350, loss[loss=0.1884, simple_loss=0.2392, pruned_loss=0.06881, over 12987.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2345, pruned_loss=0.06087, over 2579511.28 frames. ], batch size: 158, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:25:16,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.54 vs. limit=15.0
2024-06-22 00:25:18,481 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=488411.0, ans=0.2
2024-06-22 00:25:22,243 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0
2024-06-22 00:25:28,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=488429.3333333333, ans=0.125
2024-06-22 00:25:32,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=488447.6666666667, ans=0.0
2024-06-22 00:25:34,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488447.6666666667, ans=0.1
2024-06-22 00:25:48,467 INFO [train.py:1028] (1/2) Epoch 27, batch 3400, loss[loss=0.1712, simple_loss=0.2345, pruned_loss=0.05392, over 12643.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2341, pruned_loss=0.06109, over 2576508.80 frames. ], batch size: 22, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:25:58,700 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.27 vs. limit=15.0
2024-06-22 00:26:07,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=488539.3333333333, ans=0.125
2024-06-22 00:26:14,900 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.261e+02 2.571e+02 2.867e+02 3.936e+02, threshold=5.142e+02, percent-clipped=0.0
2024-06-22 00:26:15,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=488557.6666666667, ans=0.125
2024-06-22 00:26:20,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=488557.6666666667, ans=0.125
2024-06-22 00:26:21,673 INFO [train.py:1028] (1/2) Epoch 27, batch 3450, loss[loss=0.1905, simple_loss=0.2381, pruned_loss=0.07144, over 12828.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2335, pruned_loss=0.06073, over 2576883.95 frames. ], batch size: 176, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:26:23,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=488576.0, ans=0.125
2024-06-22 00:26:24,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=488576.0, ans=0.0
2024-06-22 00:26:29,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=488594.3333333333, ans=0.0
2024-06-22 00:26:53,946 INFO [train.py:1028] (1/2) Epoch 27, batch 3500, loss[loss=0.1511, simple_loss=0.2093, pruned_loss=0.04651, over 12940.00 frames. ], tot_loss[loss=0.1767, simple_loss=0.2329, pruned_loss=0.06025, over 2577156.45 frames. ], batch size: 33, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:27:00,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=488686.0, ans=0.025
2024-06-22 00:27:15,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=488704.3333333333, ans=0.09899494936611666
2024-06-22 00:27:15,309 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0
2024-06-22 00:27:15,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=488704.3333333333, ans=0.09899494936611666
2024-06-22 00:27:17,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=488722.6666666667, ans=0.0
2024-06-22 00:27:19,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=488722.6666666667, ans=0.2
2024-06-22 00:27:23,146 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.171e+02 2.339e+02 2.506e+02 3.144e+02, threshold=4.678e+02, percent-clipped=0.0
2024-06-22 00:27:29,659 INFO [train.py:1028] (1/2) Epoch 27, batch 3550, loss[loss=0.1779, simple_loss=0.2257, pruned_loss=0.06504, over 13208.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2322, pruned_loss=0.05998, over 2577865.53 frames. ], batch size: 95, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:27:31,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=488759.3333333333, ans=0.1
2024-06-22 00:27:33,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=488759.3333333333, ans=0.125
2024-06-22 00:27:34,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.89 vs. limit=10.0
2024-06-22 00:27:35,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=488759.3333333333, ans=15.0
2024-06-22 00:27:35,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.68 vs. limit=22.5
2024-06-22 00:27:38,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=488777.6666666667, ans=0.125
2024-06-22 00:27:38,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=488777.6666666667, ans=0.0
2024-06-22 00:27:40,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=488777.6666666667, ans=0.125
2024-06-22 00:27:50,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=488796.0, ans=0.125
2024-06-22 00:27:52,469 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.79 vs. limit=10.0
2024-06-22 00:28:03,663 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=488832.6666666667, ans=0.125
2024-06-22 00:28:06,141 INFO [train.py:1028] (1/2) Epoch 27, batch 3600, loss[loss=0.1606, simple_loss=0.2171, pruned_loss=0.05203, over 13327.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2316, pruned_loss=0.05972, over 2581595.35 frames. ], batch size: 49, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:28:10,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=488851.0, ans=0.0
2024-06-22 00:28:11,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=488851.0, ans=0.125
2024-06-22 00:28:24,018 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=488887.6666666667, ans=0.125
2024-06-22 00:28:32,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=488924.3333333333, ans=0.0
2024-06-22 00:28:32,922 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.161e+02 2.277e+02 2.417e+02 3.501e+02, threshold=4.554e+02, percent-clipped=0.0
2024-06-22 00:28:35,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. limit=15.0
2024-06-22 00:28:39,524 INFO [train.py:1028] (1/2) Epoch 27, batch 3650, loss[loss=0.1709, simple_loss=0.2269, pruned_loss=0.05743, over 13038.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2311, pruned_loss=0.05937, over 2579072.20 frames. ], batch size: 102, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:28:41,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=488942.6666666667, ans=0.02
2024-06-22 00:28:45,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=488961.0, ans=0.0
2024-06-22 00:28:48,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=488961.0, ans=0.0
2024-06-22 00:28:50,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488961.0, ans=0.1
2024-06-22 00:28:50,586 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:28:57,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=488979.3333333333, ans=22.5
2024-06-22 00:29:04,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=489016.0, ans=0.035
2024-06-22 00:29:06,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=489016.0, ans=0.2
2024-06-22 00:29:14,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=489016.0, ans=0.125
2024-06-22 00:29:15,414 INFO [train.py:1028] (1/2) Epoch 27, batch 3700, loss[loss=0.1679, simple_loss=0.2326, pruned_loss=0.05162, over 13230.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2305, pruned_loss=0.05903, over 2584842.34 frames. ], batch size: 72, lr: 2.14e-03, grad_scale: 32.0
2024-06-22 00:29:16,115 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 00:29:19,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=489034.3333333333, ans=0.0
2024-06-22 00:29:21,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=489052.6666666667, ans=0.0
2024-06-22 00:29:24,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=489052.6666666667, ans=0.2
2024-06-22 00:29:25,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=489052.6666666667, ans=0.1
2024-06-22 00:29:27,412 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=489052.6666666667, ans=0.125
2024-06-22 00:29:28,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=489071.0, ans=0.2
2024-06-22 00:29:41,349 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.140e+02 2.266e+02 2.443e+02 3.403e+02, threshold=4.532e+02, percent-clipped=0.0
2024-06-22 00:29:46,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.29 vs.
limit=10.0 2024-06-22 00:29:48,072 INFO [train.py:1028] (1/2) Epoch 27, batch 3750, loss[loss=0.1587, simple_loss=0.2145, pruned_loss=0.05147, over 12750.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2301, pruned_loss=0.05903, over 2586486.96 frames. ], batch size: 22, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:29:48,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=489126.0, ans=0.125 2024-06-22 00:29:54,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=489126.0, ans=0.125 2024-06-22 00:30:00,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=489144.3333333333, ans=0.125 2024-06-22 00:30:16,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=489181.0, ans=0.07 2024-06-22 00:30:16,786 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:30:20,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489199.3333333333, ans=0.1 2024-06-22 00:30:25,345 INFO [train.py:1028] (1/2) Epoch 27, batch 3800, loss[loss=0.1757, simple_loss=0.2292, pruned_loss=0.0611, over 13244.00 frames. ], tot_loss[loss=0.1739, simple_loss=0.23, pruned_loss=0.0589, over 2585063.88 frames. ], batch size: 83, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:30:27,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=489217.6666666667, ans=0.125 2024-06-22 00:30:30,045 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=489217.6666666667, ans=0.5 2024-06-22 00:30:37,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=489236.0, ans=0.125 2024-06-22 00:30:40,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=489254.3333333333, ans=0.2 2024-06-22 00:30:41,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=489254.3333333333, ans=0.025 2024-06-22 00:30:46,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=489272.6666666667, ans=0.025 2024-06-22 00:30:52,178 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.160e+02 2.283e+02 2.492e+02 3.279e+02, threshold=4.566e+02, percent-clipped=0.0 2024-06-22 00:30:58,974 INFO [train.py:1028] (1/2) Epoch 27, batch 3850, loss[loss=0.1574, simple_loss=0.2124, pruned_loss=0.05123, over 13046.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.23, pruned_loss=0.05876, over 2583235.38 frames. ], batch size: 144, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:31:09,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.11 vs. 
limit=15.0 2024-06-22 00:31:15,689 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=489346.0, ans=0.125 2024-06-22 00:31:16,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=489346.0, ans=0.0 2024-06-22 00:31:21,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=22.5 2024-06-22 00:31:22,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=489364.3333333333, ans=0.125 2024-06-22 00:31:25,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=489364.3333333333, ans=0.125 2024-06-22 00:31:27,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=489364.3333333333, ans=0.0 2024-06-22 00:31:33,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=489382.6666666667, ans=0.1 2024-06-22 00:31:33,563 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=489382.6666666667, ans=0.125 2024-06-22 00:31:34,783 INFO [train.py:1028] (1/2) Epoch 27, batch 3900, loss[loss=0.1656, simple_loss=0.2198, pruned_loss=0.05573, over 13247.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2301, pruned_loss=0.05905, over 2586328.37 frames. ], batch size: 83, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:31:52,448 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.91 vs. limit=22.5 2024-06-22 00:31:54,090 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=489456.0, ans=0.0 2024-06-22 00:31:59,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=489456.0, ans=0.0 2024-06-22 00:32:01,617 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.195e+02 2.372e+02 2.567e+02 3.418e+02, threshold=4.743e+02, percent-clipped=0.0 2024-06-22 00:32:03,058 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=489474.3333333333, ans=0.025 2024-06-22 00:32:10,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.96 vs. limit=15.0 2024-06-22 00:32:10,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=489474.3333333333, ans=0.125 2024-06-22 00:32:11,805 INFO [train.py:1028] (1/2) Epoch 27, batch 3950, loss[loss=0.1505, simple_loss=0.1977, pruned_loss=0.05171, over 13077.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2292, pruned_loss=0.05862, over 2588353.36 frames. 
], batch size: 132, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:32:14,925 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:32:16,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=489492.6666666667, ans=0.125 2024-06-22 00:32:24,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=489511.0, ans=0.125 2024-06-22 00:32:27,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=489529.3333333333, ans=0.0 2024-06-22 00:32:29,260 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.24 vs. limit=15.0 2024-06-22 00:32:35,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=489547.6666666667, ans=0.04949747468305833 2024-06-22 00:32:35,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=489547.6666666667, ans=0.2 2024-06-22 00:32:45,630 INFO [train.py:1028] (1/2) Epoch 27, batch 4000, loss[loss=0.1849, simple_loss=0.243, pruned_loss=0.06337, over 12935.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2294, pruned_loss=0.05889, over 2583315.69 frames. ], batch size: 39, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:32:51,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489584.3333333333, ans=0.1 2024-06-22 00:32:52,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=489602.6666666667, ans=0.2 2024-06-22 00:32:58,922 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:32:59,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=489621.0, ans=0.09899494936611666 2024-06-22 00:33:12,092 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.199e+02 2.291e+02 2.486e+02 3.722e+02, threshold=4.582e+02, percent-clipped=0.0 2024-06-22 00:33:21,792 INFO [train.py:1028] (1/2) Epoch 27, batch 4050, loss[loss=0.1651, simple_loss=0.2096, pruned_loss=0.06027, over 10855.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2296, pruned_loss=0.0592, over 2581606.66 frames. ], batch size: 304, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:33:23,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2024-06-22 00:33:24,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=489676.0, ans=0.0 2024-06-22 00:33:33,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=489694.3333333333, ans=0.0 2024-06-22 00:33:45,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=489731.0, ans=0.025 2024-06-22 00:33:55,234 INFO [train.py:1028] (1/2) Epoch 27, batch 4100, loss[loss=0.1687, simple_loss=0.2153, pruned_loss=0.06099, over 13031.00 frames. 
], tot_loss[loss=0.1746, simple_loss=0.23, pruned_loss=0.05954, over 2577526.65 frames. ], batch size: 102, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:33:55,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=489767.6666666667, ans=0.2 2024-06-22 00:33:57,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=489767.6666666667, ans=0.2 2024-06-22 00:34:09,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489786.0, ans=0.1 2024-06-22 00:34:20,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=489822.6666666667, ans=0.125 2024-06-22 00:34:24,951 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:34:25,441 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.250e+02 2.436e+02 2.619e+02 3.709e+02, threshold=4.872e+02, percent-clipped=0.0 2024-06-22 00:34:26,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=489841.0, ans=0.125 2024-06-22 00:34:29,877 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:34:32,515 INFO [train.py:1028] (1/2) Epoch 27, batch 4150, loss[loss=0.1789, simple_loss=0.2334, pruned_loss=0.06221, over 13084.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2298, pruned_loss=0.05937, over 2575068.04 frames. ], batch size: 55, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:35:04,998 INFO [train.py:1028] (1/2) Epoch 27, batch 4200, loss[loss=0.1721, simple_loss=0.2169, pruned_loss=0.06366, over 13148.00 frames. ], tot_loss[loss=0.1735, simple_loss=0.2289, pruned_loss=0.05899, over 2578042.54 frames. ], batch size: 103, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:35:07,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=489951.0, ans=0.125 2024-06-22 00:35:10,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=489951.0, ans=0.125 2024-06-22 00:35:16,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=489969.3333333333, ans=0.125 2024-06-22 00:35:22,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=489987.6666666667, ans=0.1 2024-06-22 00:35:28,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=490006.0, ans=0.125 2024-06-22 00:35:34,532 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.151e+02 2.269e+02 2.470e+02 3.623e+02, threshold=4.538e+02, percent-clipped=0.0 2024-06-22 00:35:41,288 INFO [train.py:1028] (1/2) Epoch 27, batch 4250, loss[loss=0.1447, simple_loss=0.2074, pruned_loss=0.04104, over 13258.00 frames. ], tot_loss[loss=0.1728, simple_loss=0.2285, pruned_loss=0.05861, over 2581106.69 frames. 
], batch size: 46, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:35:55,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=490079.3333333333, ans=0.05 2024-06-22 00:35:55,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=490079.3333333333, ans=0.125 2024-06-22 00:35:57,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=490079.3333333333, ans=0.125 2024-06-22 00:36:05,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=490097.6666666667, ans=0.04949747468305833 2024-06-22 00:36:09,919 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.95 vs. limit=12.0 2024-06-22 00:36:15,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=490116.0, ans=0.0 2024-06-22 00:36:16,796 INFO [train.py:1028] (1/2) Epoch 27, batch 4300, loss[loss=0.1604, simple_loss=0.2123, pruned_loss=0.05426, over 13226.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2279, pruned_loss=0.05823, over 2580113.85 frames. ], batch size: 59, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:36:22,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=490152.6666666667, ans=0.0 2024-06-22 00:36:23,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=490152.6666666667, ans=0.125 2024-06-22 00:36:24,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=490152.6666666667, ans=0.0 2024-06-22 00:36:30,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=490171.0, ans=0.0 2024-06-22 00:36:34,321 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:36:42,304 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.164e+02 2.277e+02 2.472e+02 3.951e+02, threshold=4.554e+02, percent-clipped=0.0 2024-06-22 00:36:46,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.60 vs. limit=6.0 2024-06-22 00:36:48,720 INFO [train.py:1028] (1/2) Epoch 27, batch 4350, loss[loss=0.2035, simple_loss=0.2543, pruned_loss=0.0764, over 13240.00 frames. ], tot_loss[loss=0.1719, simple_loss=0.2273, pruned_loss=0.05826, over 2584749.29 frames. ], batch size: 59, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:36:49,484 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=490226.0, ans=0.125 2024-06-22 00:36:51,752 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. 
limit=15.0 2024-06-22 00:36:55,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=490244.3333333333, ans=0.2 2024-06-22 00:36:58,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490244.3333333333, ans=0.1 2024-06-22 00:37:15,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=490299.3333333333, ans=0.0 2024-06-22 00:37:23,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=490299.3333333333, ans=0.125 2024-06-22 00:37:24,936 INFO [train.py:1028] (1/2) Epoch 27, batch 4400, loss[loss=0.1579, simple_loss=0.2157, pruned_loss=0.0501, over 13215.00 frames. ], tot_loss[loss=0.1719, simple_loss=0.2271, pruned_loss=0.05828, over 2584042.65 frames. ], batch size: 83, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:37:30,352 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.54 vs. limit=22.5 2024-06-22 00:37:51,403 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.175e+02 2.331e+02 2.533e+02 3.142e+02, threshold=4.662e+02, percent-clipped=0.0 2024-06-22 00:37:57,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=490409.3333333333, ans=0.0 2024-06-22 00:37:58,037 INFO [train.py:1028] (1/2) Epoch 27, batch 4450, loss[loss=0.1492, simple_loss=0.2085, pruned_loss=0.04495, over 12968.00 frames. ], tot_loss[loss=0.1719, simple_loss=0.227, pruned_loss=0.05842, over 2578104.02 frames. ], batch size: 33, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:38:01,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=490409.3333333333, ans=0.0 2024-06-22 00:38:21,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=490464.3333333333, ans=0.125 2024-06-22 00:38:26,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=490464.3333333333, ans=0.125 2024-06-22 00:38:35,107 INFO [train.py:1028] (1/2) Epoch 27, batch 4500, loss[loss=0.1815, simple_loss=0.2297, pruned_loss=0.06667, over 13230.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.2268, pruned_loss=0.05835, over 2582881.79 frames. ], batch size: 89, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:38:45,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=490519.3333333333, ans=0.025 2024-06-22 00:38:52,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=490537.6666666667, ans=0.0 2024-06-22 00:38:53,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=490537.6666666667, ans=0.0 2024-06-22 00:38:59,481 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. 
limit=15.0 2024-06-22 00:39:01,741 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.179e+02 2.263e+02 2.441e+02 3.064e+02, threshold=4.527e+02, percent-clipped=0.0 2024-06-22 00:39:08,474 INFO [train.py:1028] (1/2) Epoch 27, batch 4550, loss[loss=0.1699, simple_loss=0.2304, pruned_loss=0.05467, over 13297.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2264, pruned_loss=0.05819, over 2587085.79 frames. ], batch size: 52, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:39:31,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=490647.6666666667, ans=0.125 2024-06-22 00:39:38,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=490666.0, ans=0.125 2024-06-22 00:39:41,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=490666.0, ans=0.125 2024-06-22 00:39:45,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=490684.3333333333, ans=0.0 2024-06-22 00:39:45,643 INFO [train.py:1028] (1/2) Epoch 27, batch 4600, loss[loss=0.1735, simple_loss=0.2258, pruned_loss=0.06055, over 12568.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2265, pruned_loss=0.05817, over 2583525.50 frames. ], batch size: 202, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:39:45,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=490684.3333333333, ans=0.025 2024-06-22 00:39:48,848 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.26 vs. limit=6.0 2024-06-22 00:39:49,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490684.3333333333, ans=0.1 2024-06-22 00:39:52,032 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.37 vs. limit=15.0 2024-06-22 00:40:03,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=490721.0, ans=0.125 2024-06-22 00:40:05,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=490739.3333333333, ans=0.2 2024-06-22 00:40:05,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=490739.3333333333, ans=0.125 2024-06-22 00:40:10,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.94 vs. limit=10.0 2024-06-22 00:40:14,793 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.179e+02 2.352e+02 2.573e+02 3.426e+02, threshold=4.703e+02, percent-clipped=0.0 2024-06-22 00:40:20,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=490757.6666666667, ans=0.125 2024-06-22 00:40:21,288 INFO [train.py:1028] (1/2) Epoch 27, batch 4650, loss[loss=0.1623, simple_loss=0.2207, pruned_loss=0.05196, over 13100.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2264, pruned_loss=0.0581, over 2586595.34 frames. 
], batch size: 132, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:40:31,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=490794.3333333333, ans=0.0 2024-06-22 00:40:36,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=490812.6666666667, ans=0.2 2024-06-22 00:40:50,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490849.3333333333, ans=0.1 2024-06-22 00:40:51,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=490849.3333333333, ans=0.125 2024-06-22 00:40:54,450 INFO [train.py:1028] (1/2) Epoch 27, batch 4700, loss[loss=0.1588, simple_loss=0.218, pruned_loss=0.04976, over 12679.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2267, pruned_loss=0.05811, over 2583482.06 frames. ], batch size: 26, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:40:55,713 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2024-06-22 00:40:58,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.28 vs. limit=15.0 2024-06-22 00:41:00,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=490867.6666666667, ans=0.0 2024-06-22 00:41:14,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=490922.6666666667, ans=0.0 2024-06-22 00:41:23,340 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.50 vs. limit=12.0 2024-06-22 00:41:23,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=490922.6666666667, ans=0.125 2024-06-22 00:41:26,419 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.130e+02 2.290e+02 2.558e+02 3.458e+02, threshold=4.581e+02, percent-clipped=0.0 2024-06-22 00:41:32,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=490959.3333333333, ans=0.0 2024-06-22 00:41:33,096 INFO [train.py:1028] (1/2) Epoch 27, batch 4750, loss[loss=0.1905, simple_loss=0.2352, pruned_loss=0.07293, over 12505.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2263, pruned_loss=0.05835, over 2580593.46 frames. ], batch size: 202, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:41:36,815 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=490959.3333333333, ans=0.125 2024-06-22 00:41:40,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.25 vs. limit=15.0 2024-06-22 00:41:42,200 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=12.0 2024-06-22 00:41:44,320 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. 
limit=15.0 2024-06-22 00:41:46,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=490996.0, ans=0.125 2024-06-22 00:41:55,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=491014.3333333333, ans=0.2 2024-06-22 00:41:59,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2024-06-22 00:42:06,835 INFO [train.py:1028] (1/2) Epoch 27, batch 4800, loss[loss=0.1777, simple_loss=0.2396, pruned_loss=0.05786, over 13242.00 frames. ], tot_loss[loss=0.1716, simple_loss=0.2263, pruned_loss=0.05842, over 2576599.96 frames. ], batch size: 63, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:42:22,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=491069.3333333333, ans=0.125 2024-06-22 00:42:37,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=491124.3333333333, ans=0.0 2024-06-22 00:42:38,264 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.189e+02 2.350e+02 2.488e+02 3.716e+02, threshold=4.699e+02, percent-clipped=0.0 2024-06-22 00:42:38,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=491124.3333333333, ans=0.125 2024-06-22 00:42:44,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=491142.6666666667, ans=0.5 2024-06-22 00:42:44,858 INFO [train.py:1028] (1/2) Epoch 27, batch 4850, loss[loss=0.1706, simple_loss=0.2246, pruned_loss=0.05832, over 13204.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2258, pruned_loss=0.0582, over 2574649.51 frames. ], batch size: 89, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:42:50,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=491142.6666666667, ans=22.5 2024-06-22 00:43:11,663 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2024-06-22 00:43:15,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=491216.0, ans=0.125 2024-06-22 00:43:22,423 INFO [train.py:1028] (1/2) Epoch 27, batch 4900, loss[loss=0.1665, simple_loss=0.2247, pruned_loss=0.05414, over 13196.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2259, pruned_loss=0.05812, over 2575564.52 frames. 
], batch size: 59, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:43:40,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=491271.0, ans=0.125 2024-06-22 00:43:42,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=491289.3333333333, ans=6.0 2024-06-22 00:43:44,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=491289.3333333333, ans=0.125 2024-06-22 00:43:46,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=491289.3333333333, ans=0.1 2024-06-22 00:43:48,753 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.194e+02 2.336e+02 2.481e+02 3.240e+02, threshold=4.673e+02, percent-clipped=0.0 2024-06-22 00:43:50,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=491307.6666666667, ans=0.125 2024-06-22 00:43:55,034 INFO [train.py:1028] (1/2) Epoch 27, batch 4950, loss[loss=0.1803, simple_loss=0.2194, pruned_loss=0.0706, over 11127.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.226, pruned_loss=0.05843, over 2569071.85 frames. ], batch size: 304, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:44:21,524 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2024-06-22 00:44:25,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=491381.0, ans=0.2 2024-06-22 00:44:34,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=491399.3333333333, ans=0.1 2024-06-22 00:44:34,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=491399.3333333333, ans=0.125 2024-06-22 00:44:37,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=491417.6666666667, ans=0.1 2024-06-22 00:44:37,599 INFO [train.py:1028] (1/2) Epoch 27, batch 5000, loss[loss=0.1599, simple_loss=0.2131, pruned_loss=0.05336, over 13140.00 frames. ], tot_loss[loss=0.1709, simple_loss=0.2254, pruned_loss=0.0582, over 2574874.44 frames. 
], batch size: 95, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:44:50,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=491436.0, ans=0.0 2024-06-22 00:44:56,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=491454.3333333333, ans=0.125 2024-06-22 00:45:02,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=491472.6666666667, ans=0.125 2024-06-22 00:45:03,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=491472.6666666667, ans=0.125 2024-06-22 00:45:03,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=491472.6666666667, ans=0.0 2024-06-22 00:45:05,558 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.136e+02 2.249e+02 2.409e+02 3.064e+02, threshold=4.499e+02, percent-clipped=0.0 2024-06-22 00:45:11,373 INFO [train.py:1028] (1/2) Epoch 27, batch 5050, loss[loss=0.1494, simple_loss=0.2087, pruned_loss=0.04504, over 12882.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2251, pruned_loss=0.05796, over 2571967.79 frames. ], batch size: 36, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:45:11,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=491509.3333333333, ans=0.125 2024-06-22 00:45:23,708 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. limit=6.0 2024-06-22 00:45:40,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=491582.6666666667, ans=0.125 2024-06-22 00:45:40,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=491582.6666666667, ans=0.09899494936611666 2024-06-22 00:45:47,461 INFO [train.py:1028] (1/2) Epoch 27, batch 5100, loss[loss=0.175, simple_loss=0.2391, pruned_loss=0.05543, over 12980.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2257, pruned_loss=0.05859, over 2568477.73 frames. ], batch size: 39, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:45:50,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=491601.0, ans=0.125 2024-06-22 00:45:59,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.43 vs. limit=15.0 2024-06-22 00:46:09,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=491656.0, ans=0.125 2024-06-22 00:46:14,312 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:46:18,190 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.175e+02 2.347e+02 2.592e+02 3.756e+02, threshold=4.695e+02, percent-clipped=0.0 2024-06-22 00:46:20,510 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.89 vs. 
limit=10.0 2024-06-22 00:46:23,869 INFO [train.py:1028] (1/2) Epoch 27, batch 5150, loss[loss=0.1695, simple_loss=0.2178, pruned_loss=0.06058, over 13145.00 frames. ], tot_loss[loss=0.1721, simple_loss=0.2262, pruned_loss=0.05898, over 2570201.24 frames. ], batch size: 132, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:46:27,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=491692.6666666667, ans=0.125 2024-06-22 00:46:30,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=491711.0, ans=0.0 2024-06-22 00:46:31,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=491711.0, ans=0.2 2024-06-22 00:46:32,000 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.99 vs. limit=22.5 2024-06-22 00:46:34,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=491711.0, ans=0.125 2024-06-22 00:46:45,302 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.51 vs. limit=22.5 2024-06-22 00:46:48,153 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.87 vs. limit=6.0 2024-06-22 00:46:50,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=491766.0, ans=0.125 2024-06-22 00:46:56,771 INFO [train.py:1028] (1/2) Epoch 27, batch 5200, loss[loss=0.1901, simple_loss=0.2407, pruned_loss=0.06976, over 13111.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2258, pruned_loss=0.05861, over 2573319.91 frames. ], batch size: 95, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:47:05,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=491802.6666666667, ans=0.125 2024-06-22 00:47:25,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2024-06-22 00:47:27,360 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.136e+02 2.273e+02 2.409e+02 3.380e+02, threshold=4.545e+02, percent-clipped=0.0 2024-06-22 00:47:33,130 INFO [train.py:1028] (1/2) Epoch 27, batch 5250, loss[loss=0.1669, simple_loss=0.2236, pruned_loss=0.05512, over 13304.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2258, pruned_loss=0.05843, over 2568190.28 frames. ], batch size: 52, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:47:35,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=491876.0, ans=0.125 2024-06-22 00:47:46,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=491912.6666666667, ans=0.0 2024-06-22 00:47:58,050 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.10 vs. 
limit=15.0 2024-06-22 00:47:59,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=491949.3333333333, ans=0.125 2024-06-22 00:48:00,404 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:48:06,330 INFO [train.py:1028] (1/2) Epoch 27, batch 5300, loss[loss=0.1782, simple_loss=0.2221, pruned_loss=0.06713, over 13040.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2258, pruned_loss=0.05836, over 2564854.04 frames. ], batch size: 144, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:48:06,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=491967.6666666667, ans=15.0 2024-06-22 00:48:10,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=491967.6666666667, ans=0.0 2024-06-22 00:48:36,452 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.243e+02 2.365e+02 2.589e+02 3.156e+02, threshold=4.730e+02, percent-clipped=0.0 2024-06-22 00:48:37,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=492041.0, ans=0.125 2024-06-22 00:48:38,785 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs. limit=6.0 2024-06-22 00:48:39,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492041.0, ans=0.1 2024-06-22 00:48:40,328 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.87 vs. limit=22.5 2024-06-22 00:48:42,842 INFO [train.py:1028] (1/2) Epoch 27, batch 5350, loss[loss=0.1508, simple_loss=0.2248, pruned_loss=0.03838, over 10929.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2258, pruned_loss=0.05828, over 2571602.34 frames. ], batch size: 16, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:48:45,432 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.23 vs. limit=22.5 2024-06-22 00:48:47,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=492059.3333333333, ans=0.2 2024-06-22 00:49:03,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=492114.3333333333, ans=0.025 2024-06-22 00:49:03,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=492114.3333333333, ans=0.0 2024-06-22 00:49:04,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=492114.3333333333, ans=0.125 2024-06-22 00:49:09,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=492132.6666666667, ans=10.0 2024-06-22 00:49:18,780 INFO [train.py:1028] (1/2) Epoch 27, batch 5400, loss[loss=0.1765, simple_loss=0.2174, pruned_loss=0.0678, over 12128.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2258, pruned_loss=0.05843, over 2564788.42 frames. 
], batch size: 240, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:49:20,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=492151.0, ans=0.125 2024-06-22 00:49:27,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=492169.3333333333, ans=0.2 2024-06-22 00:49:34,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=492187.6666666667, ans=0.1 2024-06-22 00:49:43,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492206.0, ans=0.1 2024-06-22 00:49:45,510 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.142e+02 2.275e+02 2.452e+02 3.433e+02, threshold=4.550e+02, percent-clipped=0.0 2024-06-22 00:49:51,401 INFO [train.py:1028] (1/2) Epoch 27, batch 5450, loss[loss=0.1822, simple_loss=0.2392, pruned_loss=0.06258, over 12903.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2258, pruned_loss=0.05827, over 2569181.39 frames. ], batch size: 26, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:50:07,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=492279.3333333333, ans=0.0 2024-06-22 00:50:18,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=15.0 2024-06-22 00:50:26,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=492316.0, ans=0.0 2024-06-22 00:50:27,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=492334.3333333333, ans=0.125 2024-06-22 00:50:28,063 INFO [train.py:1028] (1/2) Epoch 27, batch 5500, loss[loss=0.1861, simple_loss=0.2313, pruned_loss=0.07043, over 12212.00 frames. ], tot_loss[loss=0.1707, simple_loss=0.2255, pruned_loss=0.05801, over 2562568.04 frames. ], batch size: 241, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:50:30,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=492334.3333333333, ans=0.05 2024-06-22 00:50:48,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=492389.3333333333, ans=0.09899494936611666 2024-06-22 00:50:52,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=492389.3333333333, ans=0.125 2024-06-22 00:50:55,049 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.149e+02 2.269e+02 2.464e+02 3.927e+02, threshold=4.538e+02, percent-clipped=0.0 2024-06-22 00:50:58,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492407.6666666667, ans=0.1 2024-06-22 00:51:01,155 INFO [train.py:1028] (1/2) Epoch 27, batch 5550, loss[loss=0.1598, simple_loss=0.221, pruned_loss=0.04928, over 13256.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2248, pruned_loss=0.05747, over 2566012.76 frames. 
], batch size: 43, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:51:21,601 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:51:23,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=492481.0, ans=0.2 2024-06-22 00:51:27,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=492481.0, ans=0.125 2024-06-22 00:51:36,806 INFO [train.py:1028] (1/2) Epoch 27, batch 5600, loss[loss=0.1703, simple_loss=0.2238, pruned_loss=0.05843, over 13182.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2252, pruned_loss=0.05773, over 2568082.32 frames. ], batch size: 89, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:51:47,726 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2024-06-22 00:51:54,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.75 vs. limit=15.0 2024-06-22 00:51:58,560 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=492572.6666666667, ans=0.125 2024-06-22 00:51:59,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=492572.6666666667, ans=0.125 2024-06-22 00:52:04,964 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.142e+02 2.278e+02 2.413e+02 3.066e+02, threshold=4.557e+02, percent-clipped=0.0 2024-06-22 00:52:13,970 INFO [train.py:1028] (1/2) Epoch 27, batch 5650, loss[loss=0.1798, simple_loss=0.231, pruned_loss=0.06432, over 12513.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2252, pruned_loss=0.05766, over 2574029.76 frames. ], batch size: 202, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:52:15,009 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.02 vs. limit=15.0 2024-06-22 00:52:15,489 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:52:20,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=492627.6666666667, ans=0.125 2024-06-22 00:52:20,429 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.98 vs. limit=10.0 2024-06-22 00:52:23,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.19 vs. limit=12.0 2024-06-22 00:52:34,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=492664.3333333333, ans=0.2 2024-06-22 00:52:43,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=492682.6666666667, ans=0.125 2024-06-22 00:52:46,902 INFO [train.py:1028] (1/2) Epoch 27, batch 5700, loss[loss=0.163, simple_loss=0.2213, pruned_loss=0.05233, over 13253.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2254, pruned_loss=0.05781, over 2577924.89 frames. 
], batch size: 63, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:52:51,506 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:52:55,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=492719.3333333333, ans=0.125 2024-06-22 00:52:59,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2024-06-22 00:53:00,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.80 vs. limit=12.0 2024-06-22 00:53:04,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=492737.6666666667, ans=0.0 2024-06-22 00:53:04,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=492737.6666666667, ans=0.025 2024-06-22 00:53:17,046 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.211e+02 2.385e+02 2.602e+02 3.510e+02, threshold=4.770e+02, percent-clipped=0.0 2024-06-22 00:53:17,769 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:53:22,952 INFO [train.py:1028] (1/2) Epoch 27, batch 5750, loss[loss=0.1748, simple_loss=0.228, pruned_loss=0.06083, over 12738.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2258, pruned_loss=0.05809, over 2578437.27 frames. ], batch size: 176, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:53:24,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=492792.6666666667, ans=0.125 2024-06-22 00:53:37,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=492829.3333333333, ans=0.1 2024-06-22 00:53:45,987 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.64 vs. limit=15.0 2024-06-22 00:53:54,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.67 vs. limit=22.5 2024-06-22 00:53:55,407 INFO [train.py:1028] (1/2) Epoch 27, batch 5800, loss[loss=0.1817, simple_loss=0.2325, pruned_loss=0.06544, over 12793.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2271, pruned_loss=0.05915, over 2577927.20 frames. ], batch size: 177, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:54:04,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=492902.6666666667, ans=0.125 2024-06-22 00:54:09,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=492902.6666666667, ans=0.04949747468305833 2024-06-22 00:54:09,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=492902.6666666667, ans=0.125 2024-06-22 00:54:10,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. 
2024-06-22 00:54:15,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492921.0, ans=0.1 2024-06-22 00:54:18,404 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.88 vs. limit=15.0 2024-06-22 00:54:19,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=492939.3333333333, ans=0.0 2024-06-22 00:54:25,867 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.240e+02 2.321e+02 2.510e+02 3.563e+02, threshold=4.643e+02, percent-clipped=0.0 2024-06-22 00:54:30,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=492957.6666666667, ans=0.0 2024-06-22 00:54:31,688 INFO [train.py:1028] (1/2) Epoch 27, batch 5850, loss[loss=0.201, simple_loss=0.2475, pruned_loss=0.07721, over 12570.00 frames. ], tot_loss[loss=0.1745, simple_loss=0.2292, pruned_loss=0.05985, over 2575195.58 frames. ], batch size: 202, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:54:37,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=492976.0, ans=0.2 2024-06-22 00:55:05,019 INFO [train.py:1028] (1/2) Epoch 27, batch 5900, loss[loss=0.1652, simple_loss=0.215, pruned_loss=0.05773, over 13143.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2308, pruned_loss=0.06023, over 2574944.29 frames. ], batch size: 121, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:55:15,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=493086.0, ans=0.125 2024-06-22 00:55:22,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=493104.3333333333, ans=0.0 2024-06-22 00:55:28,471 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.53 vs. limit=22.5 2024-06-22 00:55:29,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=493122.6666666667, ans=0.125 2024-06-22 00:55:35,092 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.219e+02 2.360e+02 2.571e+02 3.960e+02, threshold=4.720e+02, percent-clipped=0.0 2024-06-22 00:55:41,084 INFO [train.py:1028] (1/2) Epoch 27, batch 5950, loss[loss=0.1674, simple_loss=0.2109, pruned_loss=0.06191, over 13103.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2315, pruned_loss=0.06023, over 2579459.22 frames. ], batch size: 121, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:55:52,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=493177.6666666667, ans=0.2 2024-06-22 00:56:17,021 INFO [train.py:1028] (1/2) Epoch 27, batch 6000, loss[loss=0.218, simple_loss=0.2644, pruned_loss=0.08576, over 12189.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.2326, pruned_loss=0.0606, over 2573050.62 frames. ], batch size: 241, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:56:17,022 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 00:56:24,913 INFO [train.py:1060] (1/2) Epoch 27, validation: loss=0.192, simple_loss=0.2517, pruned_loss=0.06616, over 351949.00 frames.
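The "Computing validation loss" / "Epoch 27, validation: ..." pair above comes from a periodic pause in training: the model is switched to eval mode, the dev loader is swept once, and losses are averaged with the number of acoustic frames as weights (hence "over 351949.00 frames"). A minimal sketch under the assumption that the loss function returns a summed loss tensor and a frame count; the function names here are illustrative, not the actual train.py API:

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, valid_loader, compute_loss):
        # Frame-weighted average over the whole dev set.
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        for batch in valid_loader:
            loss_sum, num_frames = compute_loss(model, batch)  # assumed contract
            tot_loss += loss_sum.item()
            tot_frames += num_frames
        model.train()
        return tot_loss / tot_frames, tot_frames  # e.g. loss=0.192 over 351949 frames

The running tot_loss[... over ~2.57e+06 frames. ] figures in the per-batch lines are the same idea applied online: a frame-weighted aggregate over recent training batches rather than a plain mean over batches.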
2024-06-22 00:56:24,914 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 00:56:29,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=493251.0, ans=0.0 2024-06-22 00:56:35,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=493269.3333333333, ans=0.125 2024-06-22 00:56:37,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=493287.6666666667, ans=0.1 2024-06-22 00:56:43,419 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=493287.6666666667, ans=0.02 2024-06-22 00:56:51,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=493324.3333333333, ans=0.125 2024-06-22 00:56:52,595 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.247e+02 2.417e+02 2.618e+02 3.248e+02, threshold=4.834e+02, percent-clipped=0.0 2024-06-22 00:56:58,367 INFO [train.py:1028] (1/2) Epoch 27, batch 6050, loss[loss=0.1852, simple_loss=0.2503, pruned_loss=0.06007, over 12979.00 frames. ], tot_loss[loss=0.1784, simple_loss=0.2346, pruned_loss=0.06112, over 2576181.72 frames. ], batch size: 39, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:57:03,570 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2024-06-22 00:57:31,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2024-06-22 00:57:34,693 INFO [train.py:1028] (1/2) Epoch 27, batch 6100, loss[loss=0.1892, simple_loss=0.2405, pruned_loss=0.06895, over 13105.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.2354, pruned_loss=0.06147, over 2579371.30 frames. ], batch size: 121, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:57:43,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=493452.6666666667, ans=0.0 2024-06-22 00:57:46,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=493452.6666666667, ans=0.125 2024-06-22 00:57:49,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=493471.0, ans=0.125 2024-06-22 00:57:50,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.69 vs. limit=15.0 2024-06-22 00:58:02,087 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.230e+02 2.423e+02 2.675e+02 3.878e+02, threshold=4.846e+02, percent-clipped=0.0 2024-06-22 00:58:07,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=493526.0, ans=0.0 2024-06-22 00:58:07,886 INFO [train.py:1028] (1/2) Epoch 27, batch 6150, loss[loss=0.1862, simple_loss=0.2326, pruned_loss=0.0699, over 11065.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.2372, pruned_loss=0.06216, over 2577623.48 frames. 
], batch size: 304, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:58:21,354 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:58:21,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=493544.3333333333, ans=0.125 2024-06-22 00:58:31,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493581.0, ans=0.1 2024-06-22 00:58:32,775 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:58:41,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=493599.3333333333, ans=0.125 2024-06-22 00:58:44,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=493617.6666666667, ans=0.0 2024-06-22 00:58:45,063 INFO [train.py:1028] (1/2) Epoch 27, batch 6200, loss[loss=0.2073, simple_loss=0.2616, pruned_loss=0.07647, over 13249.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2384, pruned_loss=0.06263, over 2574499.04 frames. ], batch size: 89, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:58:47,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493617.6666666667, ans=0.1 2024-06-22 00:58:53,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=493636.0, ans=0.125 2024-06-22 00:59:00,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493654.3333333333, ans=0.1 2024-06-22 00:59:02,938 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.54 vs. limit=15.0 2024-06-22 00:59:05,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=493672.6666666667, ans=0.125 2024-06-22 00:59:17,525 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.292e+02 2.478e+02 2.849e+02 3.917e+02, threshold=4.956e+02, percent-clipped=0.0 2024-06-22 00:59:22,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=493691.0, ans=0.0 2024-06-22 00:59:23,802 INFO [train.py:1028] (1/2) Epoch 27, batch 6250, loss[loss=0.1581, simple_loss=0.2172, pruned_loss=0.04946, over 13196.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2397, pruned_loss=0.06307, over 2566930.06 frames. ], batch size: 83, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:59:28,522 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. 
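limit=15.0

The scaling.py "Whitening: ... metric=M vs. limit=L" lines report how far a layer's channel covariance is from a multiple of the identity: the metric is 1.0 for perfectly "white" activations and grows as the eigenvalue spread widens, and the whitening penalty only activates once the metric exceeds its (scheduled) limit. A hedged reconstruction of the metric; the actual Whiten module in icefall's scaling.py differs in details such as mean handling and the gradient path:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # Ratio between the mean squared eigenvalue of the channel covariance
        # and the squared mean eigenvalue; equals 1.0 iff the covariance is a
        # multiple of the identity (larger = less white). Zero mean is assumed.
        x = x.reshape(-1, x.shape[-1])            # (frames, channels)
        ch = x.shape[1] // num_groups
        metrics = []
        for g in range(num_groups):
            xg = x[:, g * ch:(g + 1) * ch]
            cov = xg.t() @ xg / xg.shape[0]       # (ch, ch) covariance estimate
            mean_eig = torch.diagonal(cov).mean()            # trace / ch
            mean_sq_eig = (cov * cov).sum() / ch             # trace(cov^2) / ch
            metrics.append(mean_sq_eig / mean_eig.clamp(min=1e-20) ** 2)
        return torch.stack(metrics).mean()

    x = torch.randn(1000, 384)       # slightly above 1.0 here (sampling noise)
    print(float(whitening_metric(x)))

Most entries in this stretch sit well under their limits (e.g. metric=3.16 vs. limit=15.0), so they read as healthy sampled diagnostics rather than warnings.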
2024-06-22 00:59:42,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=493746.0, ans=0.0 2024-06-22 00:59:43,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=493764.3333333333, ans=0.125 2024-06-22 00:59:45,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=493764.3333333333, ans=0.0 2024-06-22 00:59:53,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493782.6666666667, ans=0.1 2024-06-22 00:59:56,218 INFO [train.py:1028] (1/2) Epoch 27, batch 6300, loss[loss=0.1787, simple_loss=0.2288, pruned_loss=0.06427, over 11073.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2409, pruned_loss=0.06331, over 2561733.58 frames. ], batch size: 16, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:00:03,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=493819.3333333333, ans=0.5 2024-06-22 01:00:04,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.40 vs. limit=15.0 2024-06-22 01:00:06,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.63 vs. limit=12.0 2024-06-22 01:00:09,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=493837.6666666667, ans=0.0 2024-06-22 01:00:19,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=493856.0, ans=0.0 2024-06-22 01:00:23,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=493856.0, ans=0.0 2024-06-22 01:00:24,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=493856.0, ans=0.125 2024-06-22 01:00:26,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=493874.3333333333, ans=0.125 2024-06-22 01:00:27,218 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.390e+02 2.552e+02 2.867e+02 4.454e+02, threshold=5.104e+02, percent-clipped=0.0 2024-06-22 01:00:27,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=493874.3333333333, ans=0.05 2024-06-22 01:00:33,451 INFO [train.py:1028] (1/2) Epoch 27, batch 6350, loss[loss=0.2272, simple_loss=0.2759, pruned_loss=0.08925, over 12588.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2425, pruned_loss=0.06357, over 2571501.28 frames. ], batch size: 202, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:00:41,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.53 vs. limit=15.0 2024-06-22 01:00:47,455 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:00:53,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.37 vs.
limit=22.5 2024-06-22 01:00:53,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493947.6666666667, ans=0.1 2024-06-22 01:01:04,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=493966.0, ans=0.125 2024-06-22 01:01:07,291 INFO [train.py:1028] (1/2) Epoch 27, batch 6400, loss[loss=0.1821, simple_loss=0.2501, pruned_loss=0.05706, over 13212.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2445, pruned_loss=0.06424, over 2573113.22 frames. ], batch size: 67, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:01:08,997 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=493984.3333333333, ans=0.125 2024-06-22 01:01:12,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=493984.3333333333, ans=0.125 2024-06-22 01:01:18,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=494002.6666666667, ans=0.0 2024-06-22 01:01:27,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=494021.0, ans=10.0 2024-06-22 01:01:31,764 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494021.0, ans=0.1 2024-06-22 01:01:35,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=494039.3333333333, ans=0.125 2024-06-22 01:01:38,303 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.12 vs. limit=15.0 2024-06-22 01:01:39,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2024-06-22 01:01:39,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.299e+02 2.451e+02 2.641e+02 5.215e+02, threshold=4.902e+02, percent-clipped=1.0 2024-06-22 01:01:42,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=494057.6666666667, ans=0.125 2024-06-22 01:01:42,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=494057.6666666667, ans=0.125 2024-06-22 01:01:45,962 INFO [train.py:1028] (1/2) Epoch 27, batch 6450, loss[loss=0.2262, simple_loss=0.2774, pruned_loss=0.08749, over 12605.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2459, pruned_loss=0.06455, over 2578819.21 frames. ], batch size: 202, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:01:54,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=494094.3333333333, ans=0.1 2024-06-22 01:02:06,160 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=494131.0, ans=0.0 2024-06-22 01:02:15,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=494149.3333333333, ans=0.04949747468305833 2024-06-22 01:02:19,698 INFO [train.py:1028] (1/2) Epoch 27, batch 6500, loss[loss=0.2117, simple_loss=0.2569, pruned_loss=0.08323, over 10748.00 frames. 
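], tot_loss[loss=0.1892, simple_loss=0.2479, pruned_loss=0.06525, over 2582504.29 frames. ], batch size: 303, lr: 2.13e-03, grad_scale: 32.0

The ubiquitous "ScheduledFloat: name=..., batch_count=..., ans=..." lines print hyperparameters that are functions of training progress rather than constants: dropout rates, skip probabilities, balancer bounds and bypass scales are all annealed against a duration-adjusted batch count. A minimal sketch of the idea as piecewise-linear interpolation over (batch_count, value) breakpoints; the breakpoints below are assumptions for illustration, and the real ScheduledFloat in icefall's scaling.py carries more machinery:

    class ScheduledFloat:
        # Piecewise-linear schedule over batch_count. This late in training
        # (batch_count ~ 4.9e5) every schedule sits on its final value, which
        # is why the 'ans=' fields in the log are flat.
        def __init__(self, *points):
            self.points = sorted(points)     # [(batch_count, value), ...]
            self.batch_count = 0.0           # updated by the training loop

        def __float__(self):
            pts = self.points
            if self.batch_count <= pts[0][0]:
                return float(pts[0][1])
            if self.batch_count >= pts[-1][0]:
                return float(pts[-1][1])
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= self.batch_count <= x1:
                    t = (self.batch_count - x0) / (x1 - x0)
                    return float(y0 + t * (y1 - y0))  # linear interpolation
            return float(pts[-1][1])         # unreachable fallback

    dropout = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))  # assumed breakpoints
    dropout.batch_count = 494094.33   # cf. encoder_embed.dropout.p above
    print(float(dropout))             # -> 0.1, matching 'ans=0.1'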
2024-06-22 01:02:21,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=494167.6666666667, ans=0.0 2024-06-22 01:02:22,349 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.28 vs. limit=15.0 2024-06-22 01:02:37,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=494204.3333333333, ans=0.125 2024-06-22 01:02:37,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=494204.3333333333, ans=0.125 2024-06-22 01:02:50,258 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.298e+02 2.509e+02 2.796e+02 4.179e+02, threshold=5.018e+02, percent-clipped=0.0 2024-06-22 01:02:56,232 INFO [train.py:1028] (1/2) Epoch 27, batch 6550, loss[loss=0.1759, simple_loss=0.2348, pruned_loss=0.05849, over 12546.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2483, pruned_loss=0.06504, over 2586057.22 frames. ], batch size: 22, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:03:01,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=494259.3333333333, ans=0.125 2024-06-22 01:03:02,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494277.6666666667, ans=0.1 2024-06-22 01:03:07,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=494277.6666666667, ans=0.2 2024-06-22 01:03:10,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2024-06-22 01:03:10,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=494296.0, ans=0.2 2024-06-22 01:03:12,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=494296.0, ans=0.125 2024-06-22 01:03:23,672 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:03:32,613 INFO [train.py:1028] (1/2) Epoch 27, batch 6600, loss[loss=0.1777, simple_loss=0.2403, pruned_loss=0.05751, over 13207.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2486, pruned_loss=0.06528, over 2588841.42 frames. ], batch size: 72, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:03:34,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.63 vs.
limit=15.0 2024-06-22 01:03:39,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=494369.3333333333, ans=0.2 2024-06-22 01:03:41,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=494369.3333333333, ans=0.1 2024-06-22 01:03:42,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=494369.3333333333, ans=10.0 2024-06-22 01:03:48,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494387.6666666667, ans=0.1 2024-06-22 01:03:56,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=494406.0, ans=0.2 2024-06-22 01:04:00,571 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.338e+02 2.524e+02 2.736e+02 4.038e+02, threshold=5.047e+02, percent-clipped=0.0 2024-06-22 01:04:00,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=494424.3333333333, ans=0.125 2024-06-22 01:04:00,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=494424.3333333333, ans=0.2 2024-06-22 01:04:00,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=494424.3333333333, ans=0.09899494936611666 2024-06-22 01:04:06,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=494442.6666666667, ans=0.05 2024-06-22 01:04:06,455 INFO [train.py:1028] (1/2) Epoch 27, batch 6650, loss[loss=0.1944, simple_loss=0.2505, pruned_loss=0.06916, over 12946.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2503, pruned_loss=0.06598, over 2582615.36 frames. ], batch size: 158, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:04:06,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494442.6666666667, ans=0.1 2024-06-22 01:04:23,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=494479.3333333333, ans=0.0 2024-06-22 01:04:24,052 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.45 vs. limit=15.0 2024-06-22 01:04:44,734 INFO [train.py:1028] (1/2) Epoch 27, batch 6700, loss[loss=0.2008, simple_loss=0.2532, pruned_loss=0.07419, over 12712.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2509, pruned_loss=0.06602, over 2583088.12 frames. 
], batch size: 176, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:04:46,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=494534.3333333333, ans=0.125 2024-06-22 01:04:47,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=494534.3333333333, ans=0.025 2024-06-22 01:05:12,742 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.336e+02 2.507e+02 2.782e+02 3.537e+02, threshold=5.014e+02, percent-clipped=0.0 2024-06-22 01:05:14,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=494607.6666666667, ans=0.125 2024-06-22 01:05:14,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494607.6666666667, ans=0.1 2024-06-22 01:05:19,062 INFO [train.py:1028] (1/2) Epoch 27, batch 6750, loss[loss=0.2401, simple_loss=0.2924, pruned_loss=0.09395, over 12230.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2515, pruned_loss=0.06662, over 2578180.18 frames. ], batch size: 240, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:05:19,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=494626.0, ans=0.0 2024-06-22 01:05:19,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494626.0, ans=0.1 2024-06-22 01:05:20,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=494626.0, ans=0.125 2024-06-22 01:05:21,296 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.40 vs. limit=22.5 2024-06-22 01:05:23,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=494626.0, ans=0.5 2024-06-22 01:05:27,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=494644.3333333333, ans=0.125 2024-06-22 01:05:44,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=494681.0, ans=0.125 2024-06-22 01:05:55,152 INFO [train.py:1028] (1/2) Epoch 27, batch 6800, loss[loss=0.1823, simple_loss=0.2426, pruned_loss=0.06105, over 13162.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2532, pruned_loss=0.06715, over 2579684.03 frames. ], batch size: 67, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:06:08,257 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=15.0 2024-06-22 01:06:10,636 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.45 vs. limit=22.5 2024-06-22 01:06:12,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. 
limit=12.0 2024-06-22 01:06:17,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=494772.6666666667, ans=0.125 2024-06-22 01:06:22,544 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.335e+02 2.508e+02 2.798e+02 3.827e+02, threshold=5.017e+02, percent-clipped=0.0 2024-06-22 01:06:28,682 INFO [train.py:1028] (1/2) Epoch 27, batch 6850, loss[loss=0.208, simple_loss=0.2759, pruned_loss=0.07004, over 13301.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2538, pruned_loss=0.06722, over 2583529.34 frames. ], batch size: 63, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:06:29,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=494809.3333333333, ans=0.0 2024-06-22 01:06:46,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=494846.0, ans=0.125 2024-06-22 01:06:48,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=494846.0, ans=0.125 2024-06-22 01:06:50,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=494846.0, ans=0.2 2024-06-22 01:06:51,963 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=494864.3333333333, ans=0.2 2024-06-22 01:06:56,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=494864.3333333333, ans=0.125 2024-06-22 01:07:00,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2024-06-22 01:07:06,127 INFO [train.py:1028] (1/2) Epoch 27, batch 6900, loss[loss=0.2265, simple_loss=0.2878, pruned_loss=0.08263, over 13024.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2549, pruned_loss=0.06767, over 2585495.47 frames. ], batch size: 48, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:07:06,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=494901.0, ans=0.125 2024-06-22 01:07:08,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=494901.0, ans=0.0 2024-06-22 01:07:08,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=494901.0, ans=0.0 2024-06-22 01:07:22,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=494937.6666666667, ans=0.0 2024-06-22 01:07:28,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=494956.0, ans=0.125 2024-06-22 01:07:33,224 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.376e+02 2.560e+02 2.867e+02 3.923e+02, threshold=5.120e+02, percent-clipped=0.0 2024-06-22 01:07:42,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=494974.3333333333, ans=22.5 2024-06-22 01:07:43,763 INFO [train.py:1028] (1/2) Epoch 27, batch 6950, loss[loss=0.1752, simple_loss=0.2372, pruned_loss=0.05666, over 10941.00 frames. 
], tot_loss[loss=0.1949, simple_loss=0.2549, pruned_loss=0.06745, over 2579362.85 frames. ], batch size: 16, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:07:57,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=495029.3333333333, ans=0.125 2024-06-22 01:08:02,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=495029.3333333333, ans=0.125 2024-06-22 01:08:05,911 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.80 vs. limit=15.0 2024-06-22 01:08:16,539 INFO [train.py:1028] (1/2) Epoch 27, batch 7000, loss[loss=0.201, simple_loss=0.2578, pruned_loss=0.07209, over 12986.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2552, pruned_loss=0.0674, over 2577248.51 frames. ], batch size: 158, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:08:38,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=495139.3333333333, ans=0.125 2024-06-22 01:08:44,896 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.423e+02 2.589e+02 2.850e+02 4.194e+02, threshold=5.178e+02, percent-clipped=0.0 2024-06-22 01:08:54,082 INFO [train.py:1028] (1/2) Epoch 27, batch 7050, loss[loss=0.2171, simple_loss=0.2718, pruned_loss=0.08122, over 12807.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2563, pruned_loss=0.06759, over 2584120.31 frames. ], batch size: 176, lr: 2.13e-03, grad_scale: 64.0 2024-06-22 01:09:07,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=495212.6666666667, ans=0.125 2024-06-22 01:09:13,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=495231.0, ans=0.125 2024-06-22 01:09:18,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=495231.0, ans=0.125 2024-06-22 01:09:20,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=495249.3333333333, ans=0.125 2024-06-22 01:09:23,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=495249.3333333333, ans=0.125 2024-06-22 01:09:25,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.45 vs. limit=10.0 2024-06-22 01:09:26,580 INFO [train.py:1028] (1/2) Epoch 27, batch 7100, loss[loss=0.2152, simple_loss=0.2751, pruned_loss=0.07769, over 13223.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2569, pruned_loss=0.06786, over 2574788.64 frames. 
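], batch size: 112, lr: 2.13e-03, grad_scale: 32.0

grad_scale in the per-batch lines is the dynamic loss scale of fp16 mixed-precision training. In the entries above it is doubled from 32.0 to 64.0 around batch 7050 and is back at 32.0 by batch 7100, which is the standard dynamic-scaling behavior: grow the scale after a stretch of overflow-free steps, halve it as soon as an inf/nan gradient appears. A generic PyTorch AMP skeleton showing where those numbers come from (init_scale and growth_interval here are illustrative, not the values used in train.py):

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

    def train_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()   # gradients carry the grad_scale factor
        scaler.step(optimizer)          # unscales; skips the step on inf/nan
        scaler.update()                 # x2 after growth_interval clean steps,
                                        # x0.5 after an overflow
        return loss.detach(), scaler.get_scale()   # cf. 'grad_scale: 32.0'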
2024-06-22 01:09:29,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=495267.6666666667, ans=0.0 2024-06-22 01:09:44,495 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:09:57,618 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.355e+02 2.476e+02 2.674e+02 3.483e+02, threshold=4.952e+02, percent-clipped=0.0 2024-06-22 01:10:01,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=495341.0, ans=0.0 2024-06-22 01:10:01,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=495341.0, ans=0.0 2024-06-22 01:10:02,808 INFO [train.py:1028] (1/2) Epoch 27, batch 7150, loss[loss=0.2474, simple_loss=0.299, pruned_loss=0.09795, over 12581.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2569, pruned_loss=0.06765, over 2573118.78 frames. ], batch size: 202, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:10:16,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=495396.0, ans=0.2 2024-06-22 01:10:22,144 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=495414.3333333333, ans=0.125 2024-06-22 01:10:35,506 INFO [train.py:1028] (1/2) Epoch 27, batch 7200, loss[loss=0.2112, simple_loss=0.279, pruned_loss=0.07174, over 13214.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2583, pruned_loss=0.06803, over 2578289.40 frames. ], batch size: 112, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:10:36,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=495451.0, ans=0.0 2024-06-22 01:10:42,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=495469.3333333333, ans=0.0 2024-06-22 01:10:43,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=495469.3333333333, ans=0.125 2024-06-22 01:10:52,159 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=495487.6666666667, ans=0.2 2024-06-22 01:10:59,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=495506.0, ans=0.125 2024-06-22 01:11:05,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.75 vs. limit=10.0 2024-06-22 01:11:08,524 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.366e+02 2.504e+02 2.631e+02 3.462e+02, threshold=5.009e+02, percent-clipped=0.0 2024-06-22 01:11:14,197 INFO [train.py:1028] (1/2) Epoch 27, batch 7250, loss[loss=0.2066, simple_loss=0.2657, pruned_loss=0.07378, over 12874.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2594, pruned_loss=0.06849, over 2579403.71 frames.
], batch size: 36, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:11:18,206 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:11:20,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=495561.0, ans=0.125 2024-06-22 01:11:25,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=495561.0, ans=0.1 2024-06-22 01:11:28,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=495579.3333333333, ans=0.2 2024-06-22 01:11:32,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=495579.3333333333, ans=0.5 2024-06-22 01:11:44,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=495616.0, ans=0.5 2024-06-22 01:11:46,854 INFO [train.py:1028] (1/2) Epoch 27, batch 7300, loss[loss=0.1871, simple_loss=0.2612, pruned_loss=0.0565, over 12867.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2602, pruned_loss=0.06878, over 2579848.63 frames. ], batch size: 36, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:11:54,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495634.3333333333, ans=0.1 2024-06-22 01:12:01,220 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2024-06-22 01:12:05,341 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2024-06-22 01:12:18,285 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.453e+02 2.623e+02 2.936e+02 4.004e+02, threshold=5.246e+02, percent-clipped=0.0 2024-06-22 01:12:21,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=495707.6666666667, ans=0.0 2024-06-22 01:12:23,727 INFO [train.py:1028] (1/2) Epoch 27, batch 7350, loss[loss=0.215, simple_loss=0.284, pruned_loss=0.07306, over 13302.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2611, pruned_loss=0.06924, over 2582592.06 frames. ], batch size: 46, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:12:27,862 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:12:34,756 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2024-06-22 01:12:36,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=6.0 2024-06-22 01:12:39,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=495762.6666666667, ans=0.2 2024-06-22 01:12:57,212 INFO [train.py:1028] (1/2) Epoch 27, batch 7400, loss[loss=0.1926, simple_loss=0.2606, pruned_loss=0.0623, over 13240.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2612, pruned_loss=0.06893, over 2587456.60 frames. 
], batch size: 63, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:13:11,402 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.73 vs. limit=22.5 2024-06-22 01:13:15,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=495854.3333333333, ans=0.0 2024-06-22 01:13:27,412 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=15.0 2024-06-22 01:13:28,984 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.415e+02 2.574e+02 2.907e+02 5.217e+02, threshold=5.147e+02, percent-clipped=0.0 2024-06-22 01:13:29,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=495891.0, ans=0.0 2024-06-22 01:13:34,440 INFO [train.py:1028] (1/2) Epoch 27, batch 7450, loss[loss=0.1839, simple_loss=0.2436, pruned_loss=0.06213, over 12595.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2608, pruned_loss=0.06861, over 2581381.60 frames. ], batch size: 29, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:13:42,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=495927.6666666667, ans=0.09899494936611666 2024-06-22 01:13:48,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=495946.0, ans=0.025 2024-06-22 01:13:49,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=495946.0, ans=0.0 2024-06-22 01:13:55,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=495964.3333333333, ans=0.125 2024-06-22 01:13:58,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=495964.3333333333, ans=0.125 2024-06-22 01:14:06,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=495982.6666666667, ans=0.0 2024-06-22 01:14:08,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=495982.6666666667, ans=0.0 2024-06-22 01:14:08,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=495982.6666666667, ans=0.025 2024-06-22 01:14:11,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496001.0, ans=0.125 2024-06-22 01:14:12,156 INFO [train.py:1028] (1/2) Epoch 27, batch 7500, loss[loss=0.1855, simple_loss=0.2397, pruned_loss=0.06564, over 10540.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2616, pruned_loss=0.06922, over 2578088.21 frames. 
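], batch size: 304, lr: 2.13e-03, grad_scale: 32.0

The lr field decays smoothly along two axes, batches and epochs, which is characteristic of the Eden scheduler paired with ScaledAdam in icefall. A hedged reconstruction: the duration-based batch-count adjustment and warmup are omitted, and the batch index below is only a plausible value, since the scheduler's own count differs from the batch_count printed by the ScheduledFloat lines:

    def eden_lr(base_lr: float, batch: float, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        # Two inverse power-law decays, one in batches and one in epochs.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # Late in training the curve is nearly flat, hence the constant lr: 2.13e-03:
    print(f"{eden_lr(0.035, batch=270000, epoch=27):.2e}")   # -> 2.09e-03

This also explains the slow drift visible earlier in the log from lr: 2.14e-03 to 2.13e-03: both factors change only fractionally per thousand batches at this point.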
2024-06-22 01:14:16,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=496001.0, ans=0.125 2024-06-22 01:14:20,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=496019.3333333333, ans=0.0 2024-06-22 01:14:26,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496037.6666666667, ans=0.125 2024-06-22 01:14:29,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=496037.6666666667, ans=0.125 2024-06-22 01:14:38,721 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:14:39,244 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.391e+02 2.544e+02 2.733e+02 3.554e+02, threshold=5.088e+02, percent-clipped=0.0 2024-06-22 01:14:39,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.63 vs. limit=5.0 2024-06-22 01:14:40,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=496074.3333333333, ans=0.09899494936611666 2024-06-22 01:14:42,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=496074.3333333333, ans=0.125 2024-06-22 01:14:42,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496074.3333333333, ans=0.1 2024-06-22 01:14:44,407 INFO [train.py:1028] (1/2) Epoch 27, batch 7550, loss[loss=0.2005, simple_loss=0.2511, pruned_loss=0.07491, over 12939.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2623, pruned_loss=0.06996, over 2577150.85 frames. ], batch size: 158, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:14:49,856 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.70 vs. limit=15.0 2024-06-22 01:14:55,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=496111.0, ans=0.125 2024-06-22 01:14:56,329 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.54 vs. limit=15.0 2024-06-22 01:15:15,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496166.0, ans=0.1 2024-06-22 01:15:17,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=496166.0, ans=0.125 2024-06-22 01:15:20,651 INFO [train.py:1028] (1/2) Epoch 27, batch 7600, loss[loss=0.2211, simple_loss=0.281, pruned_loss=0.08064, over 13212.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2627, pruned_loss=0.07005, over 2576634.87 frames.
], batch size: 83, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:15:25,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=496184.3333333333, ans=0.025 2024-06-22 01:15:37,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=496221.0, ans=0.0 2024-06-22 01:15:48,966 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.392e+02 2.581e+02 2.879e+02 3.969e+02, threshold=5.163e+02, percent-clipped=0.0 2024-06-22 01:15:57,672 INFO [train.py:1028] (1/2) Epoch 27, batch 7650, loss[loss=0.1774, simple_loss=0.2415, pruned_loss=0.05663, over 12903.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2629, pruned_loss=0.07014, over 2572627.60 frames. ], batch size: 33, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:16:02,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=496276.0, ans=0.125 2024-06-22 01:16:12,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=496312.6666666667, ans=0.0 2024-06-22 01:16:14,617 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2024-06-22 01:16:18,858 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.38 vs. limit=15.0 2024-06-22 01:16:22,155 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.15 vs. limit=15.0 2024-06-22 01:16:23,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=496331.0, ans=0.0 2024-06-22 01:16:30,850 INFO [train.py:1028] (1/2) Epoch 27, batch 7700, loss[loss=0.2104, simple_loss=0.281, pruned_loss=0.06985, over 13251.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.263, pruned_loss=0.07024, over 2570456.59 frames. ], batch size: 63, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:16:33,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496367.6666666667, ans=0.1 2024-06-22 01:16:45,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=496404.3333333333, ans=0.04949747468305833 2024-06-22 01:16:46,505 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:16:53,909 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.22 vs. limit=15.0 2024-06-22 01:17:01,877 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.466e+02 2.691e+02 3.034e+02 3.925e+02, threshold=5.382e+02, percent-clipped=0.0 2024-06-22 01:17:05,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=496441.0, ans=0.05 2024-06-22 01:17:07,023 INFO [train.py:1028] (1/2) Epoch 27, batch 7750, loss[loss=0.186, simple_loss=0.2467, pruned_loss=0.06268, over 13284.00 frames. 
], tot_loss[loss=0.202, simple_loss=0.2631, pruned_loss=0.07043, over 2574243.42 frames. ], batch size: 72, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:17:07,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=15.0 2024-06-22 01:17:19,283 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.16 vs. limit=10.0 2024-06-22 01:17:20,574 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2024-06-22 01:17:26,525 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.56 vs. limit=15.0 2024-06-22 01:17:27,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=496514.3333333333, ans=0.125 2024-06-22 01:17:27,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=496514.3333333333, ans=0.09899494936611666 2024-06-22 01:17:31,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=496514.3333333333, ans=0.025 2024-06-22 01:17:36,474 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.66 vs. limit=22.5 2024-06-22 01:17:38,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=496532.6666666667, ans=0.0 2024-06-22 01:17:40,065 INFO [train.py:1028] (1/2) Epoch 27, batch 7800, loss[loss=0.2011, simple_loss=0.2646, pruned_loss=0.06878, over 13208.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2634, pruned_loss=0.07035, over 2578830.99 frames. ], batch size: 95, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:17:40,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=496551.0, ans=0.09899494936611666 2024-06-22 01:17:58,075 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2024-06-22 01:17:59,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=496587.6666666667, ans=0.0 2024-06-22 01:17:59,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=496587.6666666667, ans=0.125 2024-06-22 01:18:06,498 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:18:08,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.65 vs. limit=22.5 2024-06-22 01:18:10,702 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.71 vs. 
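limit=15.0

Many of the scheduled values above are skip rates (bypass.skip_rate, conv_skip_rate, attention_skip_rate, ff2_skip_rate, ...): with the scheduled probability, a sub-module's contribution is dropped during training, a regularizer in the stochastic-depth family. A deliberately simplified sketch follows; judging by the bypass.scale_min / bypass_mid.scale_min entries, the actual Bypass module in zipformer interpolates with a learned per-channel scale clamped from below rather than hard-skipping, so treat this as illustrative only:

    import torch

    class SkippableResidual(torch.nn.Module):
        # With probability skip_rate the wrapped module is bypassed entirely;
        # at eval time it always runs. skip_rate would be a ScheduledFloat in
        # the real model (e.g. ans=0.09899... for bypass.skip_rate above).
        def __init__(self, module: torch.nn.Module, skip_rate: float = 0.099):
            super().__init__()
            self.module = module
            self.skip_rate = skip_rate

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if self.training and torch.rand(()) < self.skip_rate:
                return x                   # skip: identity bypass
            return x + self.module(x)      # normal residual path

    layer = SkippableResidual(torch.nn.Linear(16, 16), skip_rate=0.09899)
    print(layer(torch.randn(4, 16)).shape)    # torch.Size([4, 16])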
2024-06-22 01:18:11,589 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.380e+02 2.564e+02 2.778e+02 3.612e+02, threshold=5.127e+02, percent-clipped=0.0 2024-06-22 01:18:12,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=496624.3333333333, ans=0.09899494936611666 2024-06-22 01:18:16,726 INFO [train.py:1028] (1/2) Epoch 27, batch 7850, loss[loss=0.1687, simple_loss=0.2405, pruned_loss=0.04846, over 11632.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.264, pruned_loss=0.07058, over 2574086.49 frames. ], batch size: 17, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:18:16,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=496642.6666666667, ans=0.125 2024-06-22 01:18:24,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=496661.0, ans=0.125 2024-06-22 01:18:27,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=496661.0, ans=0.125 2024-06-22 01:18:28,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496661.0, ans=0.1 2024-06-22 01:18:28,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=496661.0, ans=0.125 2024-06-22 01:18:40,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=496697.6666666667, ans=0.125 2024-06-22 01:18:54,771 INFO [train.py:1028] (1/2) Epoch 27, batch 7900, loss[loss=0.2145, simple_loss=0.2722, pruned_loss=0.07841, over 13168.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2642, pruned_loss=0.07071, over 2573163.34 frames. ], batch size: 77, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:19:02,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=496752.6666666667, ans=0.0 2024-06-22 01:19:08,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496771.0, ans=0.1 2024-06-22 01:19:14,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=496789.3333333333, ans=0.125 2024-06-22 01:19:20,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=496807.6666666667, ans=0.0 2024-06-22 01:19:21,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=496807.6666666667, ans=0.125 2024-06-22 01:19:22,391 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.414e+02 2.634e+02 2.877e+02 3.351e+02, threshold=5.267e+02, percent-clipped=0.0 2024-06-22 01:19:27,337 INFO [train.py:1028] (1/2) Epoch 27, batch 7950, loss[loss=0.2154, simple_loss=0.267, pruned_loss=0.08186, over 10380.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2652, pruned_loss=0.07103, over 2575711.41 frames.
], batch size: 303, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:19:32,382 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=496826.0, ans=0.125 2024-06-22 01:19:41,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=496862.6666666667, ans=0.0 2024-06-22 01:19:55,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=496881.0, ans=0.2 2024-06-22 01:20:00,575 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=496899.3333333333, ans=0.125 2024-06-22 01:20:04,000 INFO [train.py:1028] (1/2) Epoch 27, batch 8000, loss[loss=0.2049, simple_loss=0.2699, pruned_loss=0.06994, over 13064.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2669, pruned_loss=0.07177, over 2573027.54 frames. ], batch size: 30, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:20:04,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=496917.6666666667, ans=0.125 2024-06-22 01:20:09,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=496917.6666666667, ans=0.0 2024-06-22 01:20:19,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=496954.3333333333, ans=0.125 2024-06-22 01:20:19,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=496954.3333333333, ans=0.5 2024-06-22 01:20:26,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=496972.6666666667, ans=0.0 2024-06-22 01:20:31,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=496991.0, ans=0.2 2024-06-22 01:20:31,720 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.446e+02 2.616e+02 2.837e+02 3.387e+02, threshold=5.231e+02, percent-clipped=0.0 2024-06-22 01:20:34,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496991.0, ans=0.125 2024-06-22 01:20:37,302 INFO [train.py:1028] (1/2) Epoch 27, batch 8050, loss[loss=0.2065, simple_loss=0.2598, pruned_loss=0.07658, over 13258.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2663, pruned_loss=0.07142, over 2573317.95 frames. ], batch size: 83, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:20:44,115 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. 
limit=6.0 2024-06-22 01:20:45,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=497027.6666666667, ans=0.2 2024-06-22 01:20:59,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=497064.3333333333, ans=0.125 2024-06-22 01:21:04,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=497064.3333333333, ans=0.125 2024-06-22 01:21:05,145 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.648e+00 2024-06-22 01:21:06,040 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2024-06-22 01:21:12,450 INFO [train.py:1028] (1/2) Epoch 27, batch 8100, loss[loss=0.2063, simple_loss=0.2683, pruned_loss=0.07218, over 13148.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2663, pruned_loss=0.07123, over 2577377.35 frames. ], batch size: 112, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:21:26,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=497137.6666666667, ans=0.125 2024-06-22 01:21:29,649 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.49 vs. limit=22.5 2024-06-22 01:21:31,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=497137.6666666667, ans=0.125 2024-06-22 01:21:40,430 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.375e+02 2.483e+02 2.663e+02 3.373e+02, threshold=4.967e+02, percent-clipped=0.0 2024-06-22 01:21:45,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=497174.3333333333, ans=0.0 2024-06-22 01:21:46,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=497174.3333333333, ans=0.0 2024-06-22 01:21:49,357 INFO [train.py:1028] (1/2) Epoch 27, batch 8150, loss[loss=0.2088, simple_loss=0.2662, pruned_loss=0.07568, over 13148.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2666, pruned_loss=0.07079, over 2580068.93 frames. ], batch size: 121, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:21:51,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=497192.6666666667, ans=0.125 2024-06-22 01:22:05,018 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.63 vs. limit=10.0 2024-06-22 01:22:18,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=497266.0, ans=0.125 2024-06-22 01:22:19,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0 2024-06-22 01:22:22,716 INFO [train.py:1028] (1/2) Epoch 27, batch 8200, loss[loss=0.2144, simple_loss=0.2746, pruned_loss=0.07707, over 13152.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2668, pruned_loss=0.07079, over 2583486.85 frames. 
], batch size: 112, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:22:24,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=497284.3333333333, ans=0.1 2024-06-22 01:22:35,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=497302.6666666667, ans=0.0 2024-06-22 01:22:35,337 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=497302.6666666667, ans=0.125 2024-06-22 01:22:40,576 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=497321.0, ans=0.2 2024-06-22 01:22:41,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=497321.0, ans=0.0 2024-06-22 01:22:51,188 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.462e+02 2.652e+02 2.896e+02 3.986e+02, threshold=5.303e+02, percent-clipped=0.0 2024-06-22 01:22:51,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=497357.6666666667, ans=0.125 2024-06-22 01:23:00,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.46 vs. limit=15.0 2024-06-22 01:23:01,686 INFO [train.py:1028] (1/2) Epoch 27, batch 8250, loss[loss=0.1932, simple_loss=0.2599, pruned_loss=0.0633, over 13284.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2672, pruned_loss=0.07124, over 2583324.65 frames. ], batch size: 52, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:23:02,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=497376.0, ans=0.0 2024-06-22 01:23:06,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=497376.0, ans=0.0 2024-06-22 01:23:07,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=497376.0, ans=0.0 2024-06-22 01:23:11,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=497394.3333333333, ans=0.125 2024-06-22 01:23:12,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=497394.3333333333, ans=0.2 2024-06-22 01:23:18,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=497412.6666666667, ans=0.125 2024-06-22 01:23:19,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=497412.6666666667, ans=0.125 2024-06-22 01:23:22,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=497431.0, ans=0.035 2024-06-22 01:23:22,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=497431.0, ans=0.2 2024-06-22 01:23:22,943 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=497431.0, ans=0.0 2024-06-22 01:23:24,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, 
batch_count=497431.0, ans=0.0 2024-06-22 01:23:30,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=497449.3333333333, ans=0.0 2024-06-22 01:23:33,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=497467.6666666667, ans=0.025 2024-06-22 01:23:34,313 INFO [train.py:1028] (1/2) Epoch 27, batch 8300, loss[loss=0.2302, simple_loss=0.2871, pruned_loss=0.08665, over 13028.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2664, pruned_loss=0.07076, over 2580778.98 frames. ], batch size: 102, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:23:54,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=497504.3333333333, ans=0.2 2024-06-22 01:24:05,181 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.383e+02 2.532e+02 2.763e+02 3.503e+02, threshold=5.063e+02, percent-clipped=0.0 2024-06-22 01:24:08,535 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=497541.0, ans=0.125 2024-06-22 01:24:10,264 INFO [train.py:1028] (1/2) Epoch 27, batch 8350, loss[loss=0.2057, simple_loss=0.2589, pruned_loss=0.0763, over 13206.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2665, pruned_loss=0.07074, over 2579608.38 frames. ], batch size: 112, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:24:14,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=497559.3333333333, ans=0.1 2024-06-22 01:24:15,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=497559.3333333333, ans=0.125 2024-06-22 01:24:25,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2024-06-22 01:24:26,920 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.07 vs. limit=15.0 2024-06-22 01:24:32,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=497614.3333333333, ans=0.2 2024-06-22 01:24:40,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=497632.6666666667, ans=0.0 2024-06-22 01:24:40,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=497632.6666666667, ans=0.125 2024-06-22 01:24:43,725 INFO [train.py:1028] (1/2) Epoch 27, batch 8400, loss[loss=0.2078, simple_loss=0.272, pruned_loss=0.07176, over 12865.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2671, pruned_loss=0.07105, over 2576631.11 frames. ], batch size: 39, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:24:49,711 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.72 vs. limit=6.0 2024-06-22 01:25:05,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=497687.6666666667, ans=0.125 2024-06-22 01:25:08,470 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.48 vs. 
limit=22.5 2024-06-22 01:25:12,121 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=497706.0, ans=0.0 2024-06-22 01:25:15,444 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.394e+02 2.531e+02 2.698e+02 3.224e+02, threshold=5.061e+02, percent-clipped=0.0 2024-06-22 01:25:18,297 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=497724.3333333333, ans=15.0 2024-06-22 01:25:20,574 INFO [train.py:1028] (1/2) Epoch 27, batch 8450, loss[loss=0.2183, simple_loss=0.281, pruned_loss=0.07787, over 13153.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2678, pruned_loss=0.07103, over 2578460.89 frames. ], batch size: 112, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:25:21,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=497742.6666666667, ans=0.125 2024-06-22 01:25:37,651 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=497779.3333333333, ans=0.125 2024-06-22 01:25:38,393 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=497779.3333333333, ans=0.07 2024-06-22 01:25:49,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=497797.6666666667, ans=0.2 2024-06-22 01:25:57,701 INFO [train.py:1028] (1/2) Epoch 27, batch 8500, loss[loss=0.2086, simple_loss=0.2716, pruned_loss=0.0728, over 12575.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2686, pruned_loss=0.0715, over 2578191.93 frames. ], batch size: 29, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:25:57,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=497834.3333333333, ans=0.125 2024-06-22 01:26:03,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=497852.6666666667, ans=0.0 2024-06-22 01:26:04,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=497852.6666666667, ans=0.125 2024-06-22 01:26:10,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=497871.0, ans=0.125 2024-06-22 01:26:14,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=497871.0, ans=0.125 2024-06-22 01:26:17,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497889.3333333333, ans=0.1 2024-06-22 01:26:24,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=497907.6666666667, ans=0.125 2024-06-22 01:26:26,133 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.403e+02 2.536e+02 2.770e+02 3.755e+02, threshold=5.072e+02, percent-clipped=0.0 2024-06-22 01:26:30,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=497907.6666666667, ans=0.125 2024-06-22 01:26:30,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.89 vs. 
limit=15.0 2024-06-22 01:26:31,338 INFO [train.py:1028] (1/2) Epoch 27, batch 8550, loss[loss=0.1765, simple_loss=0.2501, pruned_loss=0.05141, over 12563.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2685, pruned_loss=0.07128, over 2576485.53 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:26:31,749 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.83 vs. limit=15.0 2024-06-22 01:26:32,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=497926.0, ans=0.015 2024-06-22 01:26:38,689 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2024-06-22 01:26:54,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.14 vs. limit=15.0 2024-06-22 01:27:01,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=497999.3333333333, ans=0.1 2024-06-22 01:27:08,281 INFO [train.py:1028] (1/2) Epoch 27, batch 8600, loss[loss=0.2229, simple_loss=0.2786, pruned_loss=0.08358, over 13128.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2687, pruned_loss=0.07154, over 2573321.74 frames. ], batch size: 121, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:27:33,204 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.955e+00 2024-06-22 01:27:35,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=498091.0, ans=0.0 2024-06-22 01:27:36,309 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.397e+02 2.555e+02 2.719e+02 3.362e+02, threshold=5.111e+02, percent-clipped=0.0 2024-06-22 01:27:41,808 INFO [train.py:1028] (1/2) Epoch 27, batch 8650, loss[loss=0.1965, simple_loss=0.2572, pruned_loss=0.06795, over 13089.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2687, pruned_loss=0.07122, over 2577184.76 frames. ], batch size: 103, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:27:56,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=498127.6666666667, ans=0.1 2024-06-22 01:28:09,772 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=498164.3333333333, ans=0.1 2024-06-22 01:28:15,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=498182.6666666667, ans=0.125 2024-06-22 01:28:18,511 INFO [train.py:1028] (1/2) Epoch 27, batch 8700, loss[loss=0.2126, simple_loss=0.282, pruned_loss=0.07157, over 13205.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2689, pruned_loss=0.07142, over 2573675.81 frames. ], batch size: 59, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:28:25,002 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. 
limit=15.0 2024-06-22 01:28:30,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=498219.3333333333, ans=0.125 2024-06-22 01:28:35,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=498237.6666666667, ans=0.2 2024-06-22 01:28:36,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=498237.6666666667, ans=0.0 2024-06-22 01:28:44,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=498256.0, ans=0.0 2024-06-22 01:28:45,857 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.61 vs. limit=12.0 2024-06-22 01:28:46,876 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.454e+02 2.639e+02 2.856e+02 4.545e+02, threshold=5.278e+02, percent-clipped=0.0 2024-06-22 01:28:47,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=498274.3333333333, ans=0.125 2024-06-22 01:28:52,230 INFO [train.py:1028] (1/2) Epoch 27, batch 8750, loss[loss=0.1787, simple_loss=0.2368, pruned_loss=0.06033, over 13085.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.269, pruned_loss=0.07166, over 2568995.52 frames. ], batch size: 121, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:29:05,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=498311.0, ans=0.0 2024-06-22 01:29:06,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=498311.0, ans=0.1 2024-06-22 01:29:10,584 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.17 vs. limit=15.0 2024-06-22 01:29:18,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=498347.6666666667, ans=0.125 2024-06-22 01:29:29,215 INFO [train.py:1028] (1/2) Epoch 27, batch 8800, loss[loss=0.2089, simple_loss=0.2764, pruned_loss=0.07073, over 13238.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2692, pruned_loss=0.07181, over 2573936.12 frames. ], batch size: 72, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:29:38,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=498402.6666666667, ans=0.125 2024-06-22 01:29:51,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=498421.0, ans=0.0 2024-06-22 01:29:55,214 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:30:00,922 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.370e+02 2.522e+02 2.746e+02 3.463e+02, threshold=5.043e+02, percent-clipped=0.0 2024-06-22 01:30:02,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=498457.6666666667, ans=0.0 2024-06-22 01:30:04,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.87 vs. 
limit=15.0 2024-06-22 01:30:06,514 INFO [train.py:1028] (1/2) Epoch 27, batch 8850, loss[loss=0.2087, simple_loss=0.266, pruned_loss=0.07573, over 12504.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2688, pruned_loss=0.07204, over 2561790.81 frames. ], batch size: 202, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:30:17,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=498494.3333333333, ans=0.0 2024-06-22 01:30:19,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=498494.3333333333, ans=0.125 2024-06-22 01:30:20,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=498512.6666666667, ans=0.2 2024-06-22 01:30:28,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=498531.0, ans=0.125 2024-06-22 01:30:40,969 INFO [train.py:1028] (1/2) Epoch 27, batch 8900, loss[loss=0.2112, simple_loss=0.2792, pruned_loss=0.07166, over 12945.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2699, pruned_loss=0.07252, over 2561298.06 frames. ], batch size: 33, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:30:41,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=498567.6666666667, ans=0.125 2024-06-22 01:30:46,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=498567.6666666667, ans=0.015 2024-06-22 01:30:47,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=498586.0, ans=0.2 2024-06-22 01:30:48,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=498586.0, ans=0.125 2024-06-22 01:30:49,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=498586.0, ans=0.125 2024-06-22 01:30:53,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=498604.3333333333, ans=0.1 2024-06-22 01:31:07,684 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:31:14,464 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.525e+02 2.729e+02 2.955e+02 3.918e+02, threshold=5.458e+02, percent-clipped=0.0 2024-06-22 01:31:17,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=498641.0, ans=0.2 2024-06-22 01:31:19,987 INFO [train.py:1028] (1/2) Epoch 27, batch 8950, loss[loss=0.2183, simple_loss=0.2816, pruned_loss=0.0775, over 12539.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2699, pruned_loss=0.07244, over 2560138.86 frames. 
], batch size: 202, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:31:28,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=498659.3333333333, ans=0.125 2024-06-22 01:31:28,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=498659.3333333333, ans=0.2 2024-06-22 01:31:32,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=498677.6666666667, ans=0.125 2024-06-22 01:31:47,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=498714.3333333333, ans=0.0 2024-06-22 01:31:48,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2024-06-22 01:32:02,438 INFO [train.py:1028] (1/2) Epoch 27, batch 9000, loss[loss=0.2085, simple_loss=0.2775, pruned_loss=0.06971, over 13291.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2699, pruned_loss=0.07193, over 2566612.71 frames. ], batch size: 46, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:32:02,439 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 01:32:09,426 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.5547, 4.1267, 3.0052, 4.3720], device='cuda:1') 2024-06-22 01:32:10,389 INFO [train.py:1060] (1/2) Epoch 27, validation: loss=0.1926, simple_loss=0.2522, pruned_loss=0.06649, over 351949.00 frames. 2024-06-22 01:32:10,390 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 01:32:14,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=498751.0, ans=0.125 2024-06-22 01:32:16,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=498751.0, ans=0.125 2024-06-22 01:32:16,134 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:32:18,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=498769.3333333333, ans=0.1 2024-06-22 01:32:20,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=498769.3333333333, ans=0.1 2024-06-22 01:32:22,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=498769.3333333333, ans=0.2 2024-06-22 01:32:29,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=498787.6666666667, ans=0.0 2024-06-22 01:32:38,286 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 2.536e+02 2.681e+02 2.945e+02 4.161e+02, threshold=5.362e+02, percent-clipped=0.0 2024-06-22 01:32:43,634 INFO [train.py:1028] (1/2) Epoch 27, batch 9050, loss[loss=0.1963, simple_loss=0.2491, pruned_loss=0.07174, over 11204.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2709, pruned_loss=0.07238, over 2566143.79 frames. 
], batch size: 16, lr: 2.12e-03, grad_scale: 32.0 2024-06-22 01:32:46,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=498842.6666666667, ans=0.125 2024-06-22 01:32:52,182 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2024-06-22 01:32:55,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=498861.0, ans=0.0 2024-06-22 01:32:56,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=498879.3333333333, ans=0.0 2024-06-22 01:32:59,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=498879.3333333333, ans=0.125 2024-06-22 01:33:00,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=498879.3333333333, ans=0.0 2024-06-22 01:33:06,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=498897.6666666667, ans=0.125 2024-06-22 01:33:16,540 INFO [train.py:1028] (1/2) Epoch 27, batch 9100, loss[loss=0.1788, simple_loss=0.2493, pruned_loss=0.05416, over 13222.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2708, pruned_loss=0.07191, over 2568225.12 frames. ], batch size: 72, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:33:17,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=498934.3333333333, ans=0.125 2024-06-22 01:33:17,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.79 vs. limit=10.0 2024-06-22 01:33:32,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=498971.0, ans=0.02 2024-06-22 01:33:36,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=498989.3333333333, ans=0.0 2024-06-22 01:33:43,274 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.400e+02 2.503e+02 2.699e+02 4.162e+02, threshold=5.007e+02, percent-clipped=0.0 2024-06-22 01:33:44,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=499007.6666666667, ans=0.025 2024-06-22 01:33:44,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=499007.6666666667, ans=0.2 2024-06-22 01:33:48,103 INFO [train.py:1028] (1/2) Epoch 27, batch 9150, loss[loss=0.1986, simple_loss=0.2645, pruned_loss=0.06631, over 13166.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2711, pruned_loss=0.07229, over 2569613.04 frames. 
], batch size: 77, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:33:48,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=499026.0, ans=0.0 2024-06-22 01:33:55,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=499044.3333333333, ans=0.1 2024-06-22 01:34:11,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=499081.0, ans=0.0 2024-06-22 01:34:22,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=499099.3333333333, ans=0.2 2024-06-22 01:34:23,459 INFO [train.py:1028] (1/2) Epoch 27, batch 9200, loss[loss=0.2033, simple_loss=0.264, pruned_loss=0.07133, over 12876.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2709, pruned_loss=0.07178, over 2572903.13 frames. ], batch size: 36, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:34:31,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=499136.0, ans=0.2 2024-06-22 01:34:32,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.69 vs. limit=12.0 2024-06-22 01:34:43,687 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=15.0 2024-06-22 01:34:49,614 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.443e+02 2.548e+02 2.707e+02 3.723e+02, threshold=5.097e+02, percent-clipped=0.0 2024-06-22 01:34:50,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=499191.0, ans=0.125 2024-06-22 01:34:51,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.97 vs. limit=15.0 2024-06-22 01:34:54,573 INFO [train.py:1028] (1/2) Epoch 27, batch 9250, loss[loss=0.208, simple_loss=0.281, pruned_loss=0.06754, over 13223.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2711, pruned_loss=0.07177, over 2573912.78 frames. 
], batch size: 67, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:35:02,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=499227.6666666667, ans=0.2 2024-06-22 01:35:03,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=499227.6666666667, ans=0.0 2024-06-22 01:35:04,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=499227.6666666667, ans=0.125 2024-06-22 01:35:06,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=499227.6666666667, ans=22.5 2024-06-22 01:35:15,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=499264.3333333333, ans=0.0 2024-06-22 01:35:17,737 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=499264.3333333333, ans=10.0 2024-06-22 01:35:26,848 INFO [train.py:1028] (1/2) Epoch 27, batch 9300, loss[loss=0.1851, simple_loss=0.255, pruned_loss=0.0576, over 12876.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2705, pruned_loss=0.07164, over 2569569.54 frames. ], batch size: 39, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:35:27,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=499301.0, ans=0.0 2024-06-22 01:35:35,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2024-06-22 01:35:40,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=499319.3333333333, ans=0.0 2024-06-22 01:35:54,435 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.44 vs. limit=6.0 2024-06-22 01:35:55,751 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 2.423e+02 2.579e+02 2.808e+02 3.719e+02, threshold=5.158e+02, percent-clipped=0.0 2024-06-22 01:35:57,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=499374.3333333333, ans=0.05 2024-06-22 01:36:00,809 INFO [train.py:1028] (1/2) Epoch 27, batch 9350, loss[loss=0.206, simple_loss=0.2732, pruned_loss=0.06937, over 12591.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2705, pruned_loss=0.07168, over 2566639.90 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:36:05,908 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.42 vs. limit=5.0 2024-06-22 01:36:16,951 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=499429.3333333333, ans=0.0 2024-06-22 01:36:21,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=499447.6666666667, ans=0.0 2024-06-22 01:36:29,444 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.10 vs. 
limit=15.0 2024-06-22 01:36:32,026 INFO [train.py:1028] (1/2) Epoch 27, batch 9400, loss[loss=0.2079, simple_loss=0.2754, pruned_loss=0.07022, over 13239.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2702, pruned_loss=0.07165, over 2567116.68 frames. ], batch size: 52, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:36:36,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=499484.3333333333, ans=0.0 2024-06-22 01:36:49,323 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2024-06-22 01:36:57,398 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.388e+02 2.549e+02 2.800e+02 3.883e+02, threshold=5.097e+02, percent-clipped=0.0 2024-06-22 01:36:57,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=499557.6666666667, ans=0.0 2024-06-22 01:37:02,411 INFO [train.py:1028] (1/2) Epoch 27, batch 9450, loss[loss=0.2077, simple_loss=0.2677, pruned_loss=0.07387, over 12620.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2711, pruned_loss=0.07203, over 2568955.75 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:37:12,041 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=15.0 2024-06-22 01:37:23,793 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:37:33,800 INFO [train.py:1028] (1/2) Epoch 27, batch 9500, loss[loss=0.2021, simple_loss=0.2613, pruned_loss=0.07143, over 13257.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2706, pruned_loss=0.07136, over 2577590.53 frames. ], batch size: 43, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:37:40,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=499667.6666666667, ans=0.025 2024-06-22 01:37:42,492 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.35 vs. limit=15.0 2024-06-22 01:38:00,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=499722.6666666667, ans=0.025 2024-06-22 01:38:03,254 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.417e+02 2.583e+02 2.820e+02 3.896e+02, threshold=5.166e+02, percent-clipped=0.0 2024-06-22 01:38:06,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=499741.0, ans=0.025 2024-06-22 01:38:08,381 INFO [train.py:1028] (1/2) Epoch 27, batch 9550, loss[loss=0.1877, simple_loss=0.2551, pruned_loss=0.06019, over 12895.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2709, pruned_loss=0.07166, over 2573737.36 frames. 
], batch size: 39, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:38:15,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=499777.6666666667, ans=0.125 2024-06-22 01:38:17,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=499777.6666666667, ans=10.0 2024-06-22 01:38:20,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=499796.0, ans=0.1 2024-06-22 01:38:41,221 INFO [train.py:1028] (1/2) Epoch 27, batch 9600, loss[loss=0.2133, simple_loss=0.2638, pruned_loss=0.0814, over 10470.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2704, pruned_loss=0.07156, over 2572230.99 frames. ], batch size: 304, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:38:47,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499869.3333333333, ans=0.1 2024-06-22 01:38:48,360 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.16 vs. limit=22.5 2024-06-22 01:38:51,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=499869.3333333333, ans=0.0 2024-06-22 01:38:55,233 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=15.0 2024-06-22 01:38:56,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.76 vs. limit=15.0 2024-06-22 01:39:07,320 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.414e+02 2.589e+02 2.883e+02 3.902e+02, threshold=5.179e+02, percent-clipped=0.0 2024-06-22 01:39:12,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=499942.6666666667, ans=0.125 2024-06-22 01:39:12,523 INFO [train.py:1028] (1/2) Epoch 27, batch 9650, loss[loss=0.207, simple_loss=0.264, pruned_loss=0.07497, over 13100.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2706, pruned_loss=0.07214, over 2562741.07 frames. ], batch size: 132, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:39:14,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=499942.6666666667, ans=0.0 2024-06-22 01:39:26,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=499979.3333333333, ans=0.125 2024-06-22 01:39:37,310 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. limit=6.0 2024-06-22 01:39:43,638 INFO [train.py:1028] (1/2) Epoch 27, batch 9700, loss[loss=0.2088, simple_loss=0.2679, pruned_loss=0.07487, over 13051.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2703, pruned_loss=0.07204, over 2557411.59 frames. ], batch size: 144, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:39:45,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.04 vs. 
limit=15.0 2024-06-22 01:40:00,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=500071.0, ans=0.125 2024-06-22 01:40:01,598 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.19 vs. limit=22.5 2024-06-22 01:40:11,504 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.436e+02 2.633e+02 2.894e+02 3.808e+02, threshold=5.266e+02, percent-clipped=0.0 2024-06-22 01:40:12,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=500107.6666666667, ans=0.1 2024-06-22 01:40:16,561 INFO [train.py:1028] (1/2) Epoch 27, batch 9750, loss[loss=0.2133, simple_loss=0.2749, pruned_loss=0.07582, over 13083.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2688, pruned_loss=0.07114, over 2553386.32 frames. ], batch size: 132, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:40:17,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.02 vs. limit=15.0 2024-06-22 01:40:32,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=500162.6666666667, ans=0.0 2024-06-22 01:40:33,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=500162.6666666667, ans=0.125 2024-06-22 01:40:35,164 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.14 vs. limit=10.0 2024-06-22 01:40:35,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=500162.6666666667, ans=0.125 2024-06-22 01:40:38,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=500181.0, ans=0.1 2024-06-22 01:40:48,947 INFO [train.py:1028] (1/2) Epoch 27, batch 9800, loss[loss=0.2044, simple_loss=0.2748, pruned_loss=0.067, over 12927.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2688, pruned_loss=0.071, over 2545911.30 frames. ], batch size: 39, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:40:52,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=500217.6666666667, ans=0.125 2024-06-22 01:41:01,317 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=500254.3333333333, ans=0.0 2024-06-22 01:41:01,356 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=500254.3333333333, ans=0.125 2024-06-22 01:41:01,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.12 vs. 
limit=22.5 2024-06-22 01:41:03,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500254.3333333333, ans=0.1 2024-06-22 01:41:12,964 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=500291.0, ans=0.125 2024-06-22 01:41:14,727 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.409e+02 2.575e+02 2.745e+02 3.684e+02, threshold=5.150e+02, percent-clipped=0.0 2024-06-22 01:41:18,193 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2024-06-22 01:41:19,631 INFO [train.py:1028] (1/2) Epoch 27, batch 9850, loss[loss=0.1952, simple_loss=0.2535, pruned_loss=0.06846, over 12995.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2686, pruned_loss=0.07095, over 2537909.60 frames. ], batch size: 102, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:41:42,214 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:41:43,681 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. limit=5.0 2024-06-22 01:41:48,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=500382.6666666667, ans=0.0 2024-06-22 01:41:51,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=500382.6666666667, ans=0.125 2024-06-22 01:41:52,862 INFO [train.py:1028] (1/2) Epoch 27, batch 9900, loss[loss=0.198, simple_loss=0.2627, pruned_loss=0.06659, over 12885.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2683, pruned_loss=0.07133, over 2532175.59 frames. ], batch size: 39, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:41:55,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=500401.0, ans=0.0 2024-06-22 01:42:03,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=500419.3333333333, ans=22.5 2024-06-22 01:42:05,578 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.16 vs. limit=10.0 2024-06-22 01:42:08,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=500437.6666666667, ans=0.0 2024-06-22 01:42:18,937 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.432e+02 2.597e+02 2.830e+02 3.579e+02, threshold=5.194e+02, percent-clipped=0.0 2024-06-22 01:42:21,030 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=500474.3333333333, ans=0.125 2024-06-22 01:42:23,866 INFO [train.py:1028] (1/2) Epoch 27, batch 9950, loss[loss=0.2059, simple_loss=0.2707, pruned_loss=0.07057, over 12703.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2671, pruned_loss=0.07124, over 2525398.54 frames. ], batch size: 29, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:42:29,748 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.85 vs. 
limit=15.0 2024-06-22 01:42:32,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=500511.0, ans=0.1 2024-06-22 01:42:37,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=500529.3333333333, ans=0.125 2024-06-22 01:42:40,158 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.75 vs. limit=10.0 2024-06-22 01:42:41,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2024-06-22 01:42:50,413 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.90 vs. limit=15.0 2024-06-22 01:42:53,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=500566.0, ans=0.0 2024-06-22 01:42:55,485 INFO [train.py:1028] (1/2) Epoch 27, batch 10000, loss[loss=0.1831, simple_loss=0.2523, pruned_loss=0.05702, over 12674.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2678, pruned_loss=0.07196, over 2486281.88 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:43:05,060 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2024-06-22 01:43:07,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500602.6666666667, ans=0.1 2024-06-22 01:43:10,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=500621.0, ans=0.2 2024-06-22 01:43:17,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=500639.3333333333, ans=0.2 2024-06-22 01:43:21,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2024-06-22 01:43:21,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=15.0 2024-06-22 01:43:23,688 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.457e+02 2.572e+02 2.687e+02 3.633e+02, threshold=5.144e+02, percent-clipped=0.0 2024-06-22 01:43:25,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=500657.6666666667, ans=0.125 2024-06-22 01:43:26,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=500657.6666666667, ans=0.125 2024-06-22 01:43:28,362 INFO [train.py:1028] (1/2) Epoch 27, batch 10050, loss[loss=0.228, simple_loss=0.2882, pruned_loss=0.08397, over 12480.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2673, pruned_loss=0.07216, over 2443506.42 frames. 
], batch size: 22, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:43:31,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=500676.0, ans=0.125 2024-06-22 01:43:35,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=500694.3333333333, ans=0.5 2024-06-22 01:43:56,897 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.60 vs. limit=12.0 2024-06-22 01:43:58,447 INFO [train.py:1028] (1/2) Epoch 27, batch 10100, loss[loss=0.1943, simple_loss=0.2569, pruned_loss=0.06586, over 10955.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2671, pruned_loss=0.07182, over 2424182.41 frames. ], batch size: 16, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:43:59,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=500767.6666666667, ans=0.025 2024-06-22 01:46:10,917 INFO [train.py:1028] (1/2) Epoch 28, batch 0, loss[loss=0.1805, simple_loss=0.2572, pruned_loss=0.05185, over 12861.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2572, pruned_loss=0.05185, over 12861.00 frames. ], batch size: 36, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:46:10,918 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 01:46:17,954 INFO [train.py:1060] (1/2) Epoch 28, validation: loss=0.1929, simple_loss=0.2534, pruned_loss=0.06623, over 351949.00 frames. 2024-06-22 01:46:17,954 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 01:46:29,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=500817.1666666667, ans=0.125 2024-06-22 01:46:31,506 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.78 vs. limit=10.0 2024-06-22 01:46:35,117 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.334e+02 2.545e+02 2.791e+02 4.268e+02, threshold=5.090e+02, percent-clipped=0.0 2024-06-22 01:46:51,510 INFO [train.py:1028] (1/2) Epoch 28, batch 50, loss[loss=0.1979, simple_loss=0.2702, pruned_loss=0.06279, over 12619.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.25, pruned_loss=0.0651, over 574639.45 frames. ], batch size: 29, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:46:55,513 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=500890.5, ans=0.07 2024-06-22 01:46:57,137 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.10 vs. limit=15.0 2024-06-22 01:47:05,114 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=500927.1666666667, ans=0.125 2024-06-22 01:47:12,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=500945.5, ans=0.125 2024-06-22 01:47:13,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.97 vs. 
limit=12.0 2024-06-22 01:47:20,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=500963.8333333333, ans=0.04949747468305833 2024-06-22 01:47:21,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=500963.8333333333, ans=0.125 2024-06-22 01:47:26,123 INFO [train.py:1028] (1/2) Epoch 28, batch 100, loss[loss=0.1877, simple_loss=0.2546, pruned_loss=0.06041, over 13313.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2491, pruned_loss=0.06473, over 1017871.47 frames. ], batch size: 46, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:47:29,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=500982.1666666667, ans=0.0 2024-06-22 01:47:41,453 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.236e+02 2.322e+02 2.548e+02 3.879e+02, threshold=4.645e+02, percent-clipped=0.0 2024-06-22 01:47:47,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=501037.1666666667, ans=0.07 2024-06-22 01:47:48,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=501037.1666666667, ans=0.125 2024-06-22 01:47:51,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=501037.1666666667, ans=0.05 2024-06-22 01:47:52,360 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=501037.1666666667, ans=0.0 2024-06-22 01:48:00,480 INFO [train.py:1028] (1/2) Epoch 28, batch 150, loss[loss=0.1739, simple_loss=0.2348, pruned_loss=0.05653, over 12437.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2479, pruned_loss=0.06369, over 1364195.73 frames. ], batch size: 29, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:48:13,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.71 vs. limit=15.0 2024-06-22 01:48:22,605 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.69 vs. limit=15.0 2024-06-22 01:48:24,456 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.67 vs. limit=22.5 2024-06-22 01:48:32,551 INFO [train.py:1028] (1/2) Epoch 28, batch 200, loss[loss=0.1997, simple_loss=0.2554, pruned_loss=0.07201, over 12497.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2489, pruned_loss=0.06422, over 1633939.72 frames. ], batch size: 202, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:48:36,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=501165.5, ans=0.0 2024-06-22 01:48:37,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=501165.5, ans=0.05 2024-06-22 01:48:48,623 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.301e+02 2.414e+02 2.577e+02 3.322e+02, threshold=4.828e+02, percent-clipped=0.0 2024-06-22 01:49:04,860 INFO [train.py:1028] (1/2) Epoch 28, batch 250, loss[loss=0.1891, simple_loss=0.2454, pruned_loss=0.06642, over 13039.00 frames. 
], tot_loss[loss=0.1884, simple_loss=0.2487, pruned_loss=0.06405, over 1845966.45 frames. ], batch size: 144, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:49:07,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=501257.1666666667, ans=0.125 2024-06-22 01:49:09,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=501257.1666666667, ans=0.1 2024-06-22 01:49:16,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=501275.5, ans=0.125 2024-06-22 01:49:21,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=501293.8333333333, ans=0.0 2024-06-22 01:49:29,774 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=501312.1666666667, ans=0.2 2024-06-22 01:49:31,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=501312.1666666667, ans=0.5 2024-06-22 01:49:38,191 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=501330.5, ans=0.0 2024-06-22 01:49:39,735 INFO [train.py:1028] (1/2) Epoch 28, batch 300, loss[loss=0.1811, simple_loss=0.2388, pruned_loss=0.06166, over 13151.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2494, pruned_loss=0.06441, over 2009206.54 frames. ], batch size: 112, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:49:46,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=501348.8333333333, ans=0.0 2024-06-22 01:49:50,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=501367.1666666667, ans=0.2 2024-06-22 01:49:58,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.370e+02 2.513e+02 2.810e+02 3.407e+02, threshold=5.026e+02, percent-clipped=0.0 2024-06-22 01:50:09,474 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.07 vs. limit=22.5 2024-06-22 01:50:09,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=501422.1666666667, ans=0.2 2024-06-22 01:50:15,037 INFO [train.py:1028] (1/2) Epoch 28, batch 350, loss[loss=0.1835, simple_loss=0.2468, pruned_loss=0.0601, over 12989.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2481, pruned_loss=0.06392, over 2138375.62 frames. ], batch size: 33, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:50:16,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=501440.5, ans=0.0 2024-06-22 01:50:16,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=501440.5, ans=0.0 2024-06-22 01:50:18,651 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. 
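
Note: in the optim.py warnings above, the logged threshold is exactly Clipping_scale times the logged median grad norm (e.g. 2.0 * 2.545e+02 = 5.090e+02, and 2.0 * 2.322e+02 = 4.645e+02), and percent-clipped reports how often that threshold fired. A sketch of that bookkeeping over a window of recent gradient norms (the exact history handling inside the optimizer may differ):

    import collections
    import torch

    class GradNormClipper:
        def __init__(self, clipping_scale=2.0, window=2000):
            self.clipping_scale = clipping_scale
            self.norms = collections.deque(maxlen=window)
            self.num_clipped = 0
            self.num_seen = 0

        def clip_(self, params):
            grads = [p.grad.detach().flatten() for p in params if p.grad is not None]
            norm = torch.cat(grads).norm().item()
            self.norms.append(norm)
            self.num_seen += 1
            hist = torch.tensor(list(self.norms))
            q = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * q[2].item()    # scale * median
            if norm > threshold:
                self.num_clipped += 1
                for p in params:
                    if p.grad is not None:
                        p.grad.mul_(threshold / norm)
            print(f"grad-norm quartiles {q.tolist()}, threshold={threshold:.3e}, "
                  f"percent-clipped={100.0 * self.num_clipped / self.num_seen:.1f}")
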
limit=12.0 2024-06-22 01:50:28,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=501477.1666666667, ans=0.125 2024-06-22 01:50:35,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2024-06-22 01:50:36,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=501495.5, ans=0.0 2024-06-22 01:50:38,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=501495.5, ans=0.0 2024-06-22 01:50:46,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=501532.1666666667, ans=0.07 2024-06-22 01:50:47,059 INFO [train.py:1028] (1/2) Epoch 28, batch 400, loss[loss=0.1843, simple_loss=0.2457, pruned_loss=0.06145, over 13266.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2483, pruned_loss=0.06378, over 2239252.38 frames. ], batch size: 63, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:50:50,777 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2024-06-22 01:50:51,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=501532.1666666667, ans=0.0 2024-06-22 01:50:52,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=501532.1666666667, ans=0.0 2024-06-22 01:50:58,119 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=501550.5, ans=0.125 2024-06-22 01:50:59,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=501568.8333333333, ans=0.2 2024-06-22 01:51:03,080 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.283e+02 2.420e+02 2.702e+02 3.554e+02, threshold=4.840e+02, percent-clipped=0.0 2024-06-22 01:51:16,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=501605.5, ans=0.1 2024-06-22 01:51:18,651 INFO [train.py:1028] (1/2) Epoch 28, batch 450, loss[loss=0.1818, simple_loss=0.2509, pruned_loss=0.05633, over 13228.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2478, pruned_loss=0.06354, over 2313577.94 frames. ], batch size: 67, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:51:18,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=501623.8333333333, ans=0.0 2024-06-22 01:51:19,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=501623.8333333333, ans=0.2 2024-06-22 01:51:27,117 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=501642.1666666667, ans=0.125 2024-06-22 01:51:30,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=501642.1666666667, ans=0.0 2024-06-22 01:51:33,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.64 vs. 
limit=15.0 2024-06-22 01:51:36,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501660.5, ans=0.1 2024-06-22 01:51:43,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=501678.8333333333, ans=0.125 2024-06-22 01:51:47,876 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.63 vs. limit=22.5 2024-06-22 01:51:52,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=501697.1666666667, ans=0.025 2024-06-22 01:51:53,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=501697.1666666667, ans=0.125 2024-06-22 01:51:56,881 INFO [train.py:1028] (1/2) Epoch 28, batch 500, loss[loss=0.1712, simple_loss=0.2317, pruned_loss=0.05529, over 13074.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2483, pruned_loss=0.06343, over 2375994.46 frames. ], batch size: 121, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:51:59,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=501715.5, ans=0.125 2024-06-22 01:52:06,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=501733.8333333333, ans=0.07 2024-06-22 01:52:12,275 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.337e+02 2.480e+02 2.653e+02 4.092e+02, threshold=4.960e+02, percent-clipped=0.0 2024-06-22 01:52:18,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2024-06-22 01:52:24,617 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:52:28,110 INFO [train.py:1028] (1/2) Epoch 28, batch 550, loss[loss=0.1846, simple_loss=0.2424, pruned_loss=0.06341, over 12943.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.248, pruned_loss=0.06322, over 2420495.72 frames. ], batch size: 158, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:52:48,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=501862.1666666667, ans=0.05 2024-06-22 01:52:52,825 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=501880.5, ans=0.1 2024-06-22 01:52:55,532 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.52 vs. limit=15.0 2024-06-22 01:52:57,143 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=501880.5, ans=0.125 2024-06-22 01:52:59,358 INFO [train.py:1028] (1/2) Epoch 28, batch 600, loss[loss=0.177, simple_loss=0.2232, pruned_loss=0.06544, over 13003.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2477, pruned_loss=0.06331, over 2458178.60 frames. 
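
Note: the three numbers in each loss record are consistent with the total being reported as loss = 0.5 * simple_loss + pruned_loss (i.e. a simple-loss scale of 0.5 on the pruned-RNNT objective). For the batch 550 aggregate above: 0.5 * 0.248 + 0.06322 = 0.18722, matching the logged 0.1872 to rounding. A quick check against several logged totals:

    # Logged (loss, simple_loss, pruned_loss) triples from the records above;
    # rounding in the log limits agreement to ~5e-4.
    records = [
        (0.1876, 0.2483, 0.06343),  # epoch 28, batch 500, tot_loss
        (0.1872, 0.2480, 0.06322),  # epoch 28, batch 550, tot_loss
        (0.1871, 0.2481, 0.06305),  # epoch 28, batch 700, tot_loss
    ]
    for loss, simple, pruned in records:
        assert abs(loss - (0.5 * simple + pruned)) < 5e-4, (loss, simple, pruned)
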
], batch size: 144, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:53:01,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=501898.8333333333, ans=0.125 2024-06-22 01:53:06,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=501917.1666666667, ans=0.125 2024-06-22 01:53:13,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=501935.5, ans=0.125 2024-06-22 01:53:15,040 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.289e+02 2.442e+02 2.671e+02 3.387e+02, threshold=4.883e+02, percent-clipped=0.0 2024-06-22 01:53:17,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.02 vs. limit=15.0 2024-06-22 01:53:20,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=501953.8333333333, ans=0.125 2024-06-22 01:53:25,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=501972.1666666667, ans=0.025 2024-06-22 01:53:34,297 INFO [train.py:1028] (1/2) Epoch 28, batch 650, loss[loss=0.1865, simple_loss=0.2528, pruned_loss=0.06011, over 13236.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2485, pruned_loss=0.06316, over 2489273.79 frames. ], batch size: 59, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:53:35,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=501990.5, ans=0.125 2024-06-22 01:53:45,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=502008.8333333333, ans=0.05 2024-06-22 01:53:50,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.17 vs. limit=15.0 2024-06-22 01:53:57,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=502045.5, ans=0.95 2024-06-22 01:53:59,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.23 vs. limit=15.0 2024-06-22 01:54:09,615 INFO [train.py:1028] (1/2) Epoch 28, batch 700, loss[loss=0.1881, simple_loss=0.2516, pruned_loss=0.06227, over 13240.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2481, pruned_loss=0.06305, over 2511786.92 frames. 
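
Note: the "over N frames" counts in the tot_loss records grow like an exponentially-decayed window, not a full-epoch sum: successive 50-batch increments (443,232 then 346,324 then 269,744 ...) shrink by a factor of ~0.78 ~ 0.995^50, and the plateau near 2.59M frames is ~200x the per-batch frame count, exactly what a per-batch decay of 1 - 1/200 = 0.995 would give. A simplified tracker with that assumed decay:

    class DecayedLossTracker:
        """Exponentially-decayed aggregate of (loss * frames, frames)."""
        def __init__(self, decay=0.995):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float):
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def tot_loss(self) -> float:
            return self.loss_sum / self.frames

    t = DecayedLossTracker()
    for _ in range(10100):                 # one epoch's worth of batches
        t.update(0.19, 12900.0)            # ~12.9k frames per batch (assumed)
    print(round(t.frames))                 # ~2_580_000, the scale seen above
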
], batch size: 46, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:54:15,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=502100.5, ans=0.125 2024-06-22 01:54:21,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=502118.8333333333, ans=0.125 2024-06-22 01:54:24,863 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.304e+02 2.413e+02 2.613e+02 3.461e+02, threshold=4.827e+02, percent-clipped=0.0 2024-06-22 01:54:32,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=502137.1666666667, ans=0.125 2024-06-22 01:54:32,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=502137.1666666667, ans=0.125 2024-06-22 01:54:35,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502155.5, ans=0.1 2024-06-22 01:54:38,037 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.23 vs. limit=22.5 2024-06-22 01:54:39,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.27 vs. limit=6.0 2024-06-22 01:54:40,863 INFO [train.py:1028] (1/2) Epoch 28, batch 750, loss[loss=0.1769, simple_loss=0.2533, pruned_loss=0.05023, over 13269.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2483, pruned_loss=0.06295, over 2528279.09 frames. ], batch size: 63, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:54:43,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502173.8333333333, ans=0.1 2024-06-22 01:54:49,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502192.1666666667, ans=0.1 2024-06-22 01:54:49,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502192.1666666667, ans=0.1 2024-06-22 01:54:55,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502210.5, ans=0.1 2024-06-22 01:55:01,785 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2024-06-22 01:55:03,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=502228.8333333333, ans=0.2 2024-06-22 01:55:12,567 INFO [train.py:1028] (1/2) Epoch 28, batch 800, loss[loss=0.1897, simple_loss=0.2557, pruned_loss=0.06187, over 12865.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2485, pruned_loss=0.06321, over 2541451.17 frames. ], batch size: 36, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:55:16,779 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.44 vs. limit=6.0 2024-06-22 01:55:17,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.66 vs. 
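
Note: the balancer parameters in the names above (min_positive, max_positive, min_abs, max_abs) read as constraints on per-channel activation statistics. The real Balancer enforces them by shaping gradients; the sketch below only measures the statistics those constraints refer to, as a diagnostic (our reading, not the library's code):

    import torch

    def balancer_stats(x: torch.Tensor):
        # x: (num_frames, num_channels)
        frac_positive = (x > 0).float().mean(dim=0)   # per-channel fraction > 0
        mean_abs = x.abs().mean(dim=0)                # per-channel mean |x|
        return frac_positive, mean_abs

    x = torch.randn(1000, 384)
    fp, ma = balancer_stats(x)
    # count channels violating min_positive=0.05 / max_abs=10.0 style bounds
    print((fp < 0.05).sum().item(), (ma > 10.0).sum().item())  # expected: 0 0
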
limit=15.0 2024-06-22 01:55:20,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=502283.8333333333, ans=0.07 2024-06-22 01:55:24,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=502302.1666666667, ans=0.025 2024-06-22 01:55:28,486 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.289e+02 2.382e+02 2.523e+02 3.078e+02, threshold=4.765e+02, percent-clipped=0.0 2024-06-22 01:55:28,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=502302.1666666667, ans=0.025 2024-06-22 01:55:47,874 INFO [train.py:1028] (1/2) Epoch 28, batch 850, loss[loss=0.1898, simple_loss=0.2497, pruned_loss=0.06496, over 13159.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2478, pruned_loss=0.06277, over 2551757.30 frames. ], batch size: 95, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:55:50,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.45 vs. limit=10.0 2024-06-22 01:55:55,390 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.20 vs. limit=12.0 2024-06-22 01:55:56,401 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=502375.5, ans=0.0 2024-06-22 01:56:10,200 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.27 vs. limit=22.5 2024-06-22 01:56:14,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=502412.1666666667, ans=0.0 2024-06-22 01:56:19,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=502430.5, ans=0.125 2024-06-22 01:56:22,629 INFO [train.py:1028] (1/2) Epoch 28, batch 900, loss[loss=0.1812, simple_loss=0.2461, pruned_loss=0.05819, over 12891.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2474, pruned_loss=0.06293, over 2557180.24 frames. ], batch size: 36, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:56:24,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.02 vs. 
limit=10.0 2024-06-22 01:56:28,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=502467.1666666667, ans=0.05 2024-06-22 01:56:32,937 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:56:33,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=502467.1666666667, ans=0.0 2024-06-22 01:56:36,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502485.5, ans=0.1 2024-06-22 01:56:38,444 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.258e+02 2.367e+02 2.508e+02 3.108e+02, threshold=4.733e+02, percent-clipped=0.0 2024-06-22 01:56:39,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=502485.5, ans=0.125 2024-06-22 01:56:48,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=502522.1666666667, ans=0.125 2024-06-22 01:56:48,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=502522.1666666667, ans=0.0 2024-06-22 01:56:52,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502522.1666666667, ans=0.1 2024-06-22 01:56:54,526 INFO [train.py:1028] (1/2) Epoch 28, batch 950, loss[loss=0.1982, simple_loss=0.2578, pruned_loss=0.06928, over 12956.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2474, pruned_loss=0.06285, over 2560041.01 frames. ], batch size: 39, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:56:57,243 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=502540.5, ans=0.125 2024-06-22 01:57:01,528 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:57:03,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=502558.8333333333, ans=0.2 2024-06-22 01:57:07,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=502577.1666666667, ans=0.0 2024-06-22 01:57:10,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=502577.1666666667, ans=0.0 2024-06-22 01:57:16,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=502595.5, ans=0.1 2024-06-22 01:57:18,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=502595.5, ans=0.0 2024-06-22 01:57:18,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=502595.5, ans=0.125 2024-06-22 01:57:18,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=502595.5, ans=0.2 2024-06-22 01:57:25,641 INFO [train.py:1028] (1/2) Epoch 28, batch 1000, loss[loss=0.1862, simple_loss=0.2518, pruned_loss=0.06025, over 13249.00 frames. 
], tot_loss[loss=0.187, simple_loss=0.2476, pruned_loss=0.06318, over 2562071.38 frames. ], batch size: 49, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:57:30,067 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=502632.1666666667, ans=0.125 2024-06-22 01:57:42,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=502668.8333333333, ans=0.2 2024-06-22 01:57:44,991 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.323e+02 2.464e+02 2.669e+02 3.275e+02, threshold=4.928e+02, percent-clipped=0.0 2024-06-22 01:57:49,003 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.01 vs. limit=15.0 2024-06-22 01:58:03,744 INFO [train.py:1028] (1/2) Epoch 28, batch 1050, loss[loss=0.1876, simple_loss=0.262, pruned_loss=0.05657, over 13173.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.248, pruned_loss=0.06323, over 2565283.22 frames. ], batch size: 77, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 01:58:04,704 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.24 vs. limit=15.0 2024-06-22 01:58:09,093 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=502723.8333333333, ans=0.025 2024-06-22 01:58:09,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=502742.1666666667, ans=0.125 2024-06-22 01:58:13,054 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.41 vs. limit=10.0 2024-06-22 01:58:19,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=502760.5, ans=0.125 2024-06-22 01:58:27,644 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:58:33,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502797.1666666667, ans=0.1 2024-06-22 01:58:33,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=502797.1666666667, ans=0.0 2024-06-22 01:58:36,258 INFO [train.py:1028] (1/2) Epoch 28, batch 1100, loss[loss=0.1864, simple_loss=0.2493, pruned_loss=0.06171, over 13286.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2484, pruned_loss=0.06335, over 2571076.76 frames. ], batch size: 52, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 01:58:46,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=502833.8333333333, ans=0.0 2024-06-22 01:58:50,579 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. 
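
Note: the logged lr drifts from 2.08e-03 to 2.07e-03 around batch 1050 of epoch 28, consistent with a schedule that decays smoothly in both the global step and the epoch. A sketch assuming an Eden-style rule (formula and constants are assumptions; the printed value also depends on the true step count and warmup handling):

    def eden_lr(base_lr: float, step: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        # Assumed Eden-style schedule:
        #   lr = base_lr * ((step^2 + B^2)/B^2)^-0.25 * ((epoch^2 + E^2)/E^2)^-0.25
        step_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * step_factor * epoch_factor

    # ~27 epochs x ~10100 batches of training before this point:
    print(eden_lr(0.035, step=274_000, epoch=28))  # ~2.0e-03, the scale logged here
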
limit=15.0 2024-06-22 01:58:52,793 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.297e+02 2.410e+02 2.550e+02 3.088e+02, threshold=4.820e+02, percent-clipped=0.0 2024-06-22 01:58:54,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=502870.5, ans=0.0 2024-06-22 01:58:58,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=502870.5, ans=0.0 2024-06-22 01:59:01,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=502888.8333333333, ans=0.125 2024-06-22 01:59:03,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=502888.8333333333, ans=0.125 2024-06-22 01:59:08,374 INFO [train.py:1028] (1/2) Epoch 28, batch 1150, loss[loss=0.2042, simple_loss=0.2667, pruned_loss=0.07082, over 13274.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2483, pruned_loss=0.06335, over 2571336.81 frames. ], batch size: 52, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 01:59:12,778 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0 2024-06-22 01:59:21,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=502943.8333333333, ans=0.2 2024-06-22 01:59:23,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=502943.8333333333, ans=0.0 2024-06-22 01:59:36,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=502980.5, ans=0.0 2024-06-22 01:59:40,087 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:59:42,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0 2024-06-22 01:59:42,957 INFO [train.py:1028] (1/2) Epoch 28, batch 1200, loss[loss=0.1877, simple_loss=0.2545, pruned_loss=0.06044, over 13165.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2479, pruned_loss=0.0634, over 2573831.59 frames. ], batch size: 77, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 01:59:59,364 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.48 vs. limit=6.0 2024-06-22 02:00:00,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.85 vs. limit=15.0 2024-06-22 02:00:02,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503035.5, ans=0.1 2024-06-22 02:00:03,226 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.299e+02 2.448e+02 2.591e+02 3.320e+02, threshold=4.896e+02, percent-clipped=0.0 2024-06-22 02:00:15,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=503072.1666666667, ans=15.0 2024-06-22 02:00:19,064 INFO [train.py:1028] (1/2) Epoch 28, batch 1250, loss[loss=0.1906, simple_loss=0.2472, pruned_loss=0.06703, over 13221.00 frames. 
], tot_loss[loss=0.1863, simple_loss=0.247, pruned_loss=0.06281, over 2583165.66 frames. ], batch size: 112, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:00:24,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=503090.5, ans=0.025 2024-06-22 02:00:26,061 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=503108.8333333333, ans=0.95 2024-06-22 02:00:28,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=503108.8333333333, ans=0.0 2024-06-22 02:00:51,924 INFO [train.py:1028] (1/2) Epoch 28, batch 1300, loss[loss=0.2029, simple_loss=0.2558, pruned_loss=0.07501, over 12783.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2482, pruned_loss=0.06362, over 2584049.37 frames. ], batch size: 176, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:00:53,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=503182.1666666667, ans=0.125 2024-06-22 02:01:02,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=503200.5, ans=0.2 2024-06-22 02:01:04,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=503200.5, ans=0.1 2024-06-22 02:01:06,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=503218.8333333333, ans=0.0 2024-06-22 02:01:09,359 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.263e+02 2.471e+02 2.669e+02 3.773e+02, threshold=4.941e+02, percent-clipped=0.0 2024-06-22 02:01:12,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=503237.1666666667, ans=0.0 2024-06-22 02:01:13,504 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2024-06-22 02:01:18,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=503255.5, ans=0.125 2024-06-22 02:01:25,336 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2024-06-22 02:01:25,643 INFO [train.py:1028] (1/2) Epoch 28, batch 1350, loss[loss=0.1667, simple_loss=0.2338, pruned_loss=0.04978, over 13129.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2482, pruned_loss=0.06347, over 2584622.78 frames. 
], batch size: 59, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:01:25,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=503273.8333333333, ans=0.125 2024-06-22 02:01:29,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=503273.8333333333, ans=0.1 2024-06-22 02:01:38,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=503310.5, ans=0.2 2024-06-22 02:01:54,743 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:01:57,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=503347.1666666667, ans=0.125 2024-06-22 02:01:58,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=503347.1666666667, ans=0.125 2024-06-22 02:02:04,745 INFO [train.py:1028] (1/2) Epoch 28, batch 1400, loss[loss=0.1796, simple_loss=0.2403, pruned_loss=0.05945, over 12826.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2488, pruned_loss=0.06394, over 2585867.73 frames. ], batch size: 26, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:02:15,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=503383.8333333333, ans=0.0 2024-06-22 02:02:15,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=503383.8333333333, ans=0.125 2024-06-22 02:02:19,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=503402.1666666667, ans=0.125 2024-06-22 02:02:21,472 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.278e+02 2.378e+02 2.560e+02 3.214e+02, threshold=4.757e+02, percent-clipped=0.0 2024-06-22 02:02:22,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=503402.1666666667, ans=0.125 2024-06-22 02:02:29,586 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2024-06-22 02:02:31,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=503438.8333333333, ans=0.0 2024-06-22 02:02:32,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=503438.8333333333, ans=0.0 2024-06-22 02:02:33,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=503438.8333333333, ans=0.1 2024-06-22 02:02:37,779 INFO [train.py:1028] (1/2) Epoch 28, batch 1450, loss[loss=0.1655, simple_loss=0.2244, pruned_loss=0.05332, over 13095.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2486, pruned_loss=0.06387, over 2585940.03 frames. 
], batch size: 121, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:02:51,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=503493.8333333333, ans=0.05 2024-06-22 02:03:03,461 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:03:07,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.21 vs. limit=10.0 2024-06-22 02:03:10,276 INFO [train.py:1028] (1/2) Epoch 28, batch 1500, loss[loss=0.1752, simple_loss=0.231, pruned_loss=0.05967, over 13182.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2484, pruned_loss=0.06399, over 2589663.68 frames. ], batch size: 83, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:03:18,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=503567.1666666667, ans=0.0 2024-06-22 02:03:26,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=503585.5, ans=0.125 2024-06-22 02:03:26,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=503585.5, ans=0.125 2024-06-22 02:03:27,360 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.300e+02 2.420e+02 2.576e+02 3.434e+02, threshold=4.840e+02, percent-clipped=0.0 2024-06-22 02:03:28,059 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=503585.5, ans=0.125 2024-06-22 02:03:29,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=15.0 2024-06-22 02:03:48,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=503622.1666666667, ans=0.125 2024-06-22 02:03:49,371 INFO [train.py:1028] (1/2) Epoch 28, batch 1550, loss[loss=0.1769, simple_loss=0.2286, pruned_loss=0.0626, over 13025.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2477, pruned_loss=0.06365, over 2583709.37 frames. ], batch size: 102, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:04:09,155 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=503695.5, ans=0.0 2024-06-22 02:04:13,907 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=503695.5, ans=0.125 2024-06-22 02:04:16,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=503713.8333333333, ans=0.2 2024-06-22 02:04:19,634 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5 2024-06-22 02:04:20,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=503713.8333333333, ans=0.125 2024-06-22 02:04:22,262 INFO [train.py:1028] (1/2) Epoch 28, batch 1600, loss[loss=0.1713, simple_loss=0.2329, pruned_loss=0.05489, over 13202.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2479, pruned_loss=0.06342, over 2579093.55 frames. 
], batch size: 77, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:04:23,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=503732.1666666667, ans=0.04949747468305833 2024-06-22 02:04:25,252 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2024-06-22 02:04:33,077 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:04:35,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2024-06-22 02:04:36,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=503768.8333333333, ans=0.0 2024-06-22 02:04:38,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=503768.8333333333, ans=0.1 2024-06-22 02:04:39,228 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.294e+02 2.484e+02 2.746e+02 3.743e+02, threshold=4.968e+02, percent-clipped=0.0 2024-06-22 02:04:54,315 INFO [train.py:1028] (1/2) Epoch 28, batch 1650, loss[loss=0.1949, simple_loss=0.246, pruned_loss=0.07191, over 13155.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2478, pruned_loss=0.06361, over 2575258.41 frames. ], batch size: 95, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:05:00,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=503842.1666666667, ans=0.0 2024-06-22 02:05:07,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=503860.5, ans=0.0 2024-06-22 02:05:10,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=503860.5, ans=0.0 2024-06-22 02:05:11,692 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.26 vs. limit=6.0 2024-06-22 02:05:19,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=503878.8333333333, ans=0.0 2024-06-22 02:05:23,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=503897.1666666667, ans=0.0 2024-06-22 02:05:27,721 INFO [train.py:1028] (1/2) Epoch 28, batch 1700, loss[loss=0.1838, simple_loss=0.2539, pruned_loss=0.0568, over 12498.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2479, pruned_loss=0.06338, over 2579864.31 frames. 
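
Note: grad_scale drops from 64.0 to 32.0 between batches 1550 and 1600 above. That halving is the standard fp16 loss-scaling reaction to an overflow: when any scaled gradient comes back inf/nan, the step is skipped and the scale is backed off. icefall manages its scaler inside train.py; the generic PyTorch pattern looks like this (a sketch, with assumed scaler settings):

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=64.0, growth_interval=4000)

    def training_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = loss_fn(model(batch))
        scaler.scale(loss).backward()
        scaler.step(optimizer)       # skipped if any grad is inf/nan...
        scaler.update()              # ...in which case the scale is halved
        return scaler.get_scale()    # e.g. 64.0 -> 32.0 after an overflow
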
], batch size: 25, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:05:31,820 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=503915.5, ans=0.0 2024-06-22 02:05:40,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=503952.1666666667, ans=0.0 2024-06-22 02:05:40,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=503952.1666666667, ans=0.125 2024-06-22 02:05:51,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=503952.1666666667, ans=0.125 2024-06-22 02:05:51,524 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.223e+02 2.351e+02 2.494e+02 4.179e+02, threshold=4.703e+02, percent-clipped=0.0 2024-06-22 02:05:56,267 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=503970.5, ans=0.2 2024-06-22 02:06:06,753 INFO [train.py:1028] (1/2) Epoch 28, batch 1750, loss[loss=0.2034, simple_loss=0.2674, pruned_loss=0.06975, over 12566.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2485, pruned_loss=0.06358, over 2580984.05 frames. ], batch size: 22, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:06:09,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=504007.1666666667, ans=0.0 2024-06-22 02:06:12,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=504025.5, ans=0.0 2024-06-22 02:06:30,428 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.42 vs. limit=22.5 2024-06-22 02:06:39,334 INFO [train.py:1028] (1/2) Epoch 28, batch 1800, loss[loss=0.1732, simple_loss=0.2399, pruned_loss=0.0532, over 13248.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2484, pruned_loss=0.06358, over 2581698.46 frames. ], batch size: 67, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:06:46,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=504117.1666666667, ans=0.125 2024-06-22 02:06:50,728 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=504117.1666666667, ans=0.035 2024-06-22 02:06:56,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504135.5, ans=0.1 2024-06-22 02:06:57,303 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.300e+02 2.459e+02 2.635e+02 3.621e+02, threshold=4.917e+02, percent-clipped=0.0 2024-06-22 02:06:58,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=504135.5, ans=0.125 2024-06-22 02:07:00,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=504153.8333333333, ans=0.025 2024-06-22 02:07:06,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2024-06-22 02:07:09,913 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.62 vs. 
limit=15.0 2024-06-22 02:07:11,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504172.1666666667, ans=0.1 2024-06-22 02:07:12,743 INFO [train.py:1028] (1/2) Epoch 28, batch 1850, loss[loss=0.1853, simple_loss=0.24, pruned_loss=0.06532, over 13209.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2482, pruned_loss=0.06351, over 2582819.05 frames. ], batch size: 83, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:07:20,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=504208.8333333333, ans=0.07 2024-06-22 02:07:20,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=504208.8333333333, ans=0.1 2024-06-22 02:07:28,751 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.45 vs. limit=22.5 2024-06-22 02:07:36,119 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.02 vs. limit=22.5 2024-06-22 02:07:45,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=504263.8333333333, ans=0.1 2024-06-22 02:07:48,302 INFO [train.py:1028] (1/2) Epoch 28, batch 1900, loss[loss=0.183, simple_loss=0.2417, pruned_loss=0.06213, over 13102.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2483, pruned_loss=0.06377, over 2584912.70 frames. ], batch size: 95, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:07:57,758 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.62 vs. limit=10.0 2024-06-22 02:07:59,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=504300.5, ans=0.125 2024-06-22 02:08:09,070 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.310e+02 2.489e+02 2.635e+02 3.817e+02, threshold=4.977e+02, percent-clipped=0.0 2024-06-22 02:08:20,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=504355.5, ans=0.125 2024-06-22 02:08:24,258 INFO [train.py:1028] (1/2) Epoch 28, batch 1950, loss[loss=0.1825, simple_loss=0.2508, pruned_loss=0.05714, over 13248.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2479, pruned_loss=0.06385, over 2591640.33 frames. 
], batch size: 52, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:08:29,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=504373.8333333333, ans=0.125 2024-06-22 02:08:42,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=504410.5, ans=0.025 2024-06-22 02:08:42,849 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504410.5, ans=0.1 2024-06-22 02:08:51,320 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:08:52,649 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=504447.1666666667, ans=0.07 2024-06-22 02:08:54,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=504447.1666666667, ans=0.0 2024-06-22 02:08:56,830 INFO [train.py:1028] (1/2) Epoch 28, batch 2000, loss[loss=0.2038, simple_loss=0.2711, pruned_loss=0.06824, over 12442.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.248, pruned_loss=0.06407, over 2587863.29 frames. ], batch size: 22, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:09:00,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=504465.5, ans=0.0 2024-06-22 02:09:03,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=504483.8333333333, ans=0.125 2024-06-22 02:09:04,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=504483.8333333333, ans=0.09899494936611666 2024-06-22 02:09:05,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.84 vs. limit=22.5 2024-06-22 02:09:09,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=504502.1666666667, ans=0.125 2024-06-22 02:09:15,328 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.316e+02 2.431e+02 2.600e+02 4.030e+02, threshold=4.862e+02, percent-clipped=0.0 2024-06-22 02:09:17,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.38 vs. limit=15.0 2024-06-22 02:09:23,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.07 vs. limit=15.0 2024-06-22 02:09:25,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=504538.8333333333, ans=0.125 2024-06-22 02:09:27,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=504538.8333333333, ans=0.0 2024-06-22 02:09:28,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=504538.8333333333, ans=0.125 2024-06-22 02:09:30,253 INFO [train.py:1028] (1/2) Epoch 28, batch 2050, loss[loss=0.1799, simple_loss=0.2417, pruned_loss=0.05904, over 12704.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2476, pruned_loss=0.06384, over 2584431.04 frames. 
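
Note: the WithLoss records report the sum of an auxiliary penalty attached to the attention weights; loss-sum=0.000e+00 means the penalty is currently inactive. A conceptual sketch of the mechanism as we read it (the limit and the penalty rule below are illustrative, not the library's): an op that passes the tensor through unchanged while returning a side loss that is nonzero only when a constraint is violated:

    import torch

    def with_loss(x: torch.Tensor, limit: float = 25.0):
        # Pass x through unchanged; penalty is nonzero only where |x| > limit.
        penalty = (x.abs() - limit).clamp(min=0.0).sum()
        return x, penalty

    attn_weights = torch.randn(4, 8, 100, 100, requires_grad=True)
    attn_weights, aux = with_loss(attn_weights)
    print(f"loss-sum={aux.item():.3e}")  # 0.000e+00 while weights stay in range
    # total = main_loss + aux  ->  gradients flow back through the penalty term
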
], batch size: 29, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:09:35,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=504557.1666666667, ans=0.0 2024-06-22 02:09:37,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504557.1666666667, ans=0.1 2024-06-22 02:09:37,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=504557.1666666667, ans=0.0 2024-06-22 02:09:39,000 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504575.5, ans=0.1 2024-06-22 02:09:51,677 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=504593.8333333333, ans=0.025 2024-06-22 02:10:08,344 INFO [train.py:1028] (1/2) Epoch 28, batch 2100, loss[loss=0.1785, simple_loss=0.2408, pruned_loss=0.0581, over 13190.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2479, pruned_loss=0.06365, over 2586844.50 frames. ], batch size: 59, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:10:24,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=504685.5, ans=0.0 2024-06-22 02:10:26,388 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.319e+02 2.485e+02 2.702e+02 3.516e+02, threshold=4.970e+02, percent-clipped=0.0 2024-06-22 02:10:27,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=504703.8333333333, ans=0.125 2024-06-22 02:10:31,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=504703.8333333333, ans=0.035 2024-06-22 02:10:31,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504703.8333333333, ans=0.1 2024-06-22 02:10:40,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=504740.5, ans=0.2 2024-06-22 02:10:40,728 INFO [train.py:1028] (1/2) Epoch 28, batch 2150, loss[loss=0.1816, simple_loss=0.2456, pruned_loss=0.05884, over 13244.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2483, pruned_loss=0.06364, over 2589643.55 frames. ], batch size: 52, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:10:53,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=504777.1666666667, ans=0.04949747468305833 2024-06-22 02:11:02,397 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=504795.5, ans=0.04949747468305833 2024-06-22 02:11:08,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=504813.8333333333, ans=0.125 2024-06-22 02:11:11,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504813.8333333333, ans=0.1 2024-06-22 02:11:12,860 INFO [train.py:1028] (1/2) Epoch 28, batch 2200, loss[loss=0.1849, simple_loss=0.2359, pruned_loss=0.06695, over 13267.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2487, pruned_loss=0.06394, over 2589247.30 frames. 
], batch size: 83, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:11:13,646 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:11:15,386 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2024-06-22 02:11:26,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=504868.8333333333, ans=0.125 2024-06-22 02:11:29,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=504868.8333333333, ans=0.0 2024-06-22 02:11:30,824 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.292e+02 2.464e+02 2.683e+02 3.210e+02, threshold=4.929e+02, percent-clipped=0.0 2024-06-22 02:11:39,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=504905.5, ans=0.125 2024-06-22 02:11:44,670 INFO [train.py:1028] (1/2) Epoch 28, batch 2250, loss[loss=0.1653, simple_loss=0.2313, pruned_loss=0.04968, over 13242.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2488, pruned_loss=0.0639, over 2587578.94 frames. ], batch size: 63, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:11:46,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=504923.8333333333, ans=0.2 2024-06-22 02:11:51,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.72 vs. limit=15.0 2024-06-22 02:12:03,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.98 vs. limit=15.0 2024-06-22 02:12:08,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=504960.5, ans=0.025 2024-06-22 02:12:18,508 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.81 vs. limit=15.0 2024-06-22 02:12:23,875 INFO [train.py:1028] (1/2) Epoch 28, batch 2300, loss[loss=0.1837, simple_loss=0.2537, pruned_loss=0.05683, over 12880.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2484, pruned_loss=0.06342, over 2582263.98 frames. ], batch size: 33, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:12:36,453 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=505052.1666666667, ans=10.0 2024-06-22 02:12:40,992 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=505052.1666666667, ans=0.025 2024-06-22 02:12:41,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=505052.1666666667, ans=0.125 2024-06-22 02:12:41,989 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.301e+02 2.440e+02 2.653e+02 3.442e+02, threshold=4.879e+02, percent-clipped=0.0 2024-06-22 02:12:48,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.35 vs. 
limit=22.5 2024-06-22 02:12:56,748 INFO [train.py:1028] (1/2) Epoch 28, batch 2350, loss[loss=0.1789, simple_loss=0.2447, pruned_loss=0.05654, over 13228.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2482, pruned_loss=0.06367, over 2585966.12 frames. ], batch size: 67, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:13:03,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=505125.5, ans=0.1 2024-06-22 02:13:07,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=505125.5, ans=0.0 2024-06-22 02:13:16,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=505162.1666666667, ans=0.125 2024-06-22 02:13:25,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=505180.5, ans=0.125 2024-06-22 02:13:25,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=505180.5, ans=0.0 2024-06-22 02:13:28,896 INFO [train.py:1028] (1/2) Epoch 28, batch 2400, loss[loss=0.1847, simple_loss=0.2471, pruned_loss=0.06113, over 13347.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2475, pruned_loss=0.06351, over 2589059.00 frames. ], batch size: 46, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:13:29,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=505198.8333333333, ans=0.0 2024-06-22 02:13:37,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=505198.8333333333, ans=0.125 2024-06-22 02:13:50,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=505235.5, ans=0.125 2024-06-22 02:13:52,539 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.297e+02 2.459e+02 2.575e+02 3.877e+02, threshold=4.919e+02, percent-clipped=0.0 2024-06-22 02:14:03,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=505272.1666666667, ans=0.0 2024-06-22 02:14:03,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505272.1666666667, ans=0.1 2024-06-22 02:14:04,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=505272.1666666667, ans=0.0 2024-06-22 02:14:06,856 INFO [train.py:1028] (1/2) Epoch 28, batch 2450, loss[loss=0.1876, simple_loss=0.2457, pruned_loss=0.06473, over 13236.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2468, pruned_loss=0.06369, over 2585852.68 frames. 
], batch size: 63, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:14:06,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=505290.5, ans=0.1 2024-06-22 02:14:07,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505290.5, ans=0.1 2024-06-22 02:14:12,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=505290.5, ans=0.0 2024-06-22 02:14:14,762 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=505308.8333333333, ans=0.04949747468305833 2024-06-22 02:14:16,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=505308.8333333333, ans=0.125 2024-06-22 02:14:17,891 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=505308.8333333333, ans=0.0 2024-06-22 02:14:19,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=505327.1666666667, ans=0.125 2024-06-22 02:14:19,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=505327.1666666667, ans=0.0 2024-06-22 02:14:26,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505345.5, ans=0.1 2024-06-22 02:14:39,662 INFO [train.py:1028] (1/2) Epoch 28, batch 2500, loss[loss=0.1738, simple_loss=0.2314, pruned_loss=0.05808, over 13190.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2468, pruned_loss=0.06361, over 2588761.06 frames. ], batch size: 83, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:14:43,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=505382.1666666667, ans=0.0 2024-06-22 02:14:44,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=505382.1666666667, ans=0.5 2024-06-22 02:14:57,645 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.291e+02 2.416e+02 2.656e+02 3.694e+02, threshold=4.831e+02, percent-clipped=0.0 2024-06-22 02:15:03,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.40 vs. limit=15.0 2024-06-22 02:15:12,188 INFO [train.py:1028] (1/2) Epoch 28, batch 2550, loss[loss=0.1771, simple_loss=0.2444, pruned_loss=0.05492, over 12600.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2454, pruned_loss=0.06306, over 2587470.81 frames. ], batch size: 22, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:15:12,916 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=505473.8333333333, ans=0.125 2024-06-22 02:15:40,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505528.8333333333, ans=0.1 2024-06-22 02:15:53,804 INFO [train.py:1028] (1/2) Epoch 28, batch 2600, loss[loss=0.1664, simple_loss=0.229, pruned_loss=0.0519, over 13184.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2443, pruned_loss=0.06289, over 2587339.93 frames. 
], batch size: 52, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:15:57,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=505565.5, ans=0.2 2024-06-22 02:16:11,744 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.253e+02 2.365e+02 2.575e+02 3.401e+02, threshold=4.730e+02, percent-clipped=0.0 2024-06-22 02:16:11,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=505602.1666666667, ans=0.125 2024-06-22 02:16:20,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=505638.8333333333, ans=0.1 2024-06-22 02:16:26,841 INFO [train.py:1028] (1/2) Epoch 28, batch 2650, loss[loss=0.1889, simple_loss=0.2429, pruned_loss=0.06744, over 13028.00 frames. ], tot_loss[loss=0.184, simple_loss=0.243, pruned_loss=0.0625, over 2587505.60 frames. ], batch size: 144, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:16:34,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=505675.5, ans=0.0 2024-06-22 02:16:37,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.60 vs. limit=15.0 2024-06-22 02:16:40,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.46 vs. limit=12.0 2024-06-22 02:16:41,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=505693.8333333333, ans=0.0 2024-06-22 02:16:43,074 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=505693.8333333333, ans=0.125 2024-06-22 02:16:59,664 INFO [train.py:1028] (1/2) Epoch 28, batch 2700, loss[loss=0.1933, simple_loss=0.2456, pruned_loss=0.07056, over 13291.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2421, pruned_loss=0.06234, over 2586677.36 frames. ], batch size: 89, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:17:00,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505748.8333333333, ans=0.1 2024-06-22 02:17:18,145 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.255e+02 2.367e+02 2.561e+02 3.307e+02, threshold=4.735e+02, percent-clipped=0.0 2024-06-22 02:17:21,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=505803.8333333333, ans=0.5 2024-06-22 02:17:26,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=505822.1666666667, ans=0.125 2024-06-22 02:17:35,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=505822.1666666667, ans=0.125 2024-06-22 02:17:36,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=505822.1666666667, ans=0.05 2024-06-22 02:17:38,573 INFO [train.py:1028] (1/2) Epoch 28, batch 2750, loss[loss=0.1815, simple_loss=0.2406, pruned_loss=0.06118, over 13250.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2413, pruned_loss=0.06184, over 2582536.69 frames. 
], batch size: 43, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:17:44,307 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=505840.5, ans=0.125 2024-06-22 02:17:46,572 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. limit=6.0 2024-06-22 02:17:50,249 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=505858.8333333333, ans=0.0 2024-06-22 02:17:52,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505877.1666666667, ans=0.1 2024-06-22 02:17:54,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=505877.1666666667, ans=0.125 2024-06-22 02:17:56,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=505877.1666666667, ans=0.125 2024-06-22 02:17:59,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.77 vs. limit=10.0 2024-06-22 02:18:11,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=505932.1666666667, ans=0.07 2024-06-22 02:18:11,795 INFO [train.py:1028] (1/2) Epoch 28, batch 2800, loss[loss=0.2085, simple_loss=0.2528, pruned_loss=0.08206, over 10710.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2411, pruned_loss=0.06194, over 2579590.49 frames. ], batch size: 303, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:18:20,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=505950.5, ans=0.0 2024-06-22 02:18:29,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=505968.8333333333, ans=0.025 2024-06-22 02:18:30,268 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.237e+02 2.352e+02 2.551e+02 3.481e+02, threshold=4.703e+02, percent-clipped=0.0 2024-06-22 02:18:31,996 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=505987.1666666667, ans=0.125 2024-06-22 02:18:41,971 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.77 vs. limit=22.5 2024-06-22 02:18:44,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=506005.5, ans=0.125 2024-06-22 02:18:45,599 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.72 vs. limit=15.0 2024-06-22 02:18:49,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=506023.8333333333, ans=0.2 2024-06-22 02:18:49,742 INFO [train.py:1028] (1/2) Epoch 28, batch 2850, loss[loss=0.1702, simple_loss=0.2277, pruned_loss=0.05633, over 13308.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2402, pruned_loss=0.06167, over 2577274.61 frames. 
], batch size: 49, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:19:10,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2024-06-22 02:19:24,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=506097.1666666667, ans=0.125 2024-06-22 02:19:25,596 INFO [train.py:1028] (1/2) Epoch 28, batch 2900, loss[loss=0.1849, simple_loss=0.246, pruned_loss=0.06192, over 13152.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2384, pruned_loss=0.06129, over 2585603.68 frames. ], batch size: 55, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:19:47,877 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.300e+02 2.460e+02 2.673e+02 3.769e+02, threshold=4.919e+02, percent-clipped=0.0 2024-06-22 02:19:54,658 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=506170.5, ans=0.125 2024-06-22 02:20:02,325 INFO [train.py:1028] (1/2) Epoch 28, batch 2950, loss[loss=0.1712, simple_loss=0.2363, pruned_loss=0.0531, over 13180.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2378, pruned_loss=0.06105, over 2579140.49 frames. ], batch size: 43, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:20:06,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=506207.1666666667, ans=0.0 2024-06-22 02:20:35,778 INFO [train.py:1028] (1/2) Epoch 28, batch 3000, loss[loss=0.1736, simple_loss=0.2354, pruned_loss=0.0559, over 13239.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.237, pruned_loss=0.06067, over 2578695.65 frames. ], batch size: 59, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:20:35,779 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 02:20:40,121 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([4.2800, 3.4023, 3.6790, 3.3968, 2.8119, 3.4232, 3.6808, 3.5931], device='cuda:1') 2024-06-22 02:20:43,602 INFO [train.py:1060] (1/2) Epoch 28, validation: loss=0.1921, simple_loss=0.2514, pruned_loss=0.06643, over 351949.00 frames. 
2024-06-22 02:20:43,603 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 02:20:52,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=506317.1666666667, ans=0.125 2024-06-22 02:20:55,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=506317.1666666667, ans=0.0 2024-06-22 02:20:57,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506335.5, ans=0.1 2024-06-22 02:21:02,217 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.264e+02 2.400e+02 2.674e+02 3.342e+02, threshold=4.801e+02, percent-clipped=0.0 2024-06-22 02:21:09,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=506353.8333333333, ans=0.125 2024-06-22 02:21:12,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=506372.1666666667, ans=0.0 2024-06-22 02:21:16,948 INFO [train.py:1028] (1/2) Epoch 28, batch 3050, loss[loss=0.1744, simple_loss=0.2321, pruned_loss=0.05839, over 13283.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2368, pruned_loss=0.06101, over 2578588.81 frames. ], batch size: 46, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:21:32,813 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.94 vs. limit=22.5 2024-06-22 02:21:34,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=506408.8333333333, ans=0.125 2024-06-22 02:21:41,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=506427.1666666667, ans=0.125 2024-06-22 02:21:48,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=506445.5, ans=0.125 2024-06-22 02:21:56,419 INFO [train.py:1028] (1/2) Epoch 28, batch 3100, loss[loss=0.1662, simple_loss=0.2248, pruned_loss=0.05385, over 13037.00 frames. ], tot_loss[loss=0.1785, simple_loss=0.236, pruned_loss=0.06055, over 2579973.93 frames. ], batch size: 144, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:22:11,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=506518.8333333333, ans=0.125 2024-06-22 02:22:14,704 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.312e+02 2.436e+02 2.687e+02 3.466e+02, threshold=4.872e+02, percent-clipped=0.0 2024-06-22 02:22:29,383 INFO [train.py:1028] (1/2) Epoch 28, batch 3150, loss[loss=0.1869, simple_loss=0.2387, pruned_loss=0.06751, over 12903.00 frames. ], tot_loss[loss=0.178, simple_loss=0.2353, pruned_loss=0.06032, over 2582344.64 frames. 
], batch size: 158, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:22:35,176 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=506573.8333333333, ans=0.125 2024-06-22 02:22:38,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=506592.1666666667, ans=0.05 2024-06-22 02:22:39,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=506592.1666666667, ans=0.125 2024-06-22 02:22:41,631 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=506592.1666666667, ans=0.0 2024-06-22 02:22:45,466 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.67 vs. limit=15.0 2024-06-22 02:22:48,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=506610.5, ans=0.125 2024-06-22 02:22:48,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=506610.5, ans=0.125 2024-06-22 02:22:52,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=506628.8333333333, ans=15.0 2024-06-22 02:23:03,005 INFO [train.py:1028] (1/2) Epoch 28, batch 3200, loss[loss=0.1708, simple_loss=0.2295, pruned_loss=0.05603, over 13110.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2345, pruned_loss=0.05975, over 2583056.51 frames. ], batch size: 55, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:23:03,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=506665.5, ans=0.125 2024-06-22 02:23:20,910 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.220e+02 2.369e+02 2.514e+02 3.098e+02, threshold=4.738e+02, percent-clipped=0.0 2024-06-22 02:23:29,611 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.77 vs. limit=15.0 2024-06-22 02:23:33,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=506720.5, ans=0.2 2024-06-22 02:23:36,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=506738.8333333333, ans=0.1 2024-06-22 02:23:41,219 INFO [train.py:1028] (1/2) Epoch 28, batch 3250, loss[loss=0.1559, simple_loss=0.213, pruned_loss=0.04945, over 13245.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2336, pruned_loss=0.05961, over 2586418.16 frames. 
], batch size: 72, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:23:43,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=506757.1666666667, ans=0.125 2024-06-22 02:23:52,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=506775.5, ans=0.0 2024-06-22 02:23:54,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=506793.8333333333, ans=0.2 2024-06-22 02:23:58,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=506793.8333333333, ans=0.015 2024-06-22 02:24:05,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=506812.1666666667, ans=0.0 2024-06-22 02:24:06,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=506812.1666666667, ans=0.125 2024-06-22 02:24:11,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506830.5, ans=0.1 2024-06-22 02:24:14,399 INFO [train.py:1028] (1/2) Epoch 28, batch 3300, loss[loss=0.1802, simple_loss=0.2265, pruned_loss=0.06698, over 12776.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2326, pruned_loss=0.05923, over 2582795.59 frames. ], batch size: 177, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:24:20,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=506867.1666666667, ans=0.0 2024-06-22 02:24:27,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=506885.5, ans=0.125 2024-06-22 02:24:32,098 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.270e+02 2.420e+02 2.594e+02 3.768e+02, threshold=4.839e+02, percent-clipped=0.0 2024-06-22 02:24:36,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=506903.8333333333, ans=0.125 2024-06-22 02:24:42,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=506922.1666666667, ans=0.0 2024-06-22 02:24:46,170 INFO [train.py:1028] (1/2) Epoch 28, batch 3350, loss[loss=0.1899, simple_loss=0.244, pruned_loss=0.06786, over 12893.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.233, pruned_loss=0.05987, over 2577831.11 frames. ], batch size: 158, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:24:50,106 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506940.5, ans=0.1 2024-06-22 02:25:01,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=506977.1666666667, ans=0.0 2024-06-22 02:25:06,815 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2024-06-22 02:25:20,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=507013.8333333333, ans=10.0 2024-06-22 02:25:23,732 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.46 vs. 
limit=10.0 2024-06-22 02:25:27,206 INFO [train.py:1028] (1/2) Epoch 28, batch 3400, loss[loss=0.1775, simple_loss=0.2393, pruned_loss=0.05784, over 12655.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2335, pruned_loss=0.06034, over 2575452.55 frames. ], batch size: 22, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:25:34,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=507050.5, ans=0.2 2024-06-22 02:25:36,731 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.74 vs. limit=15.0 2024-06-22 02:25:37,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507050.5, ans=0.1 2024-06-22 02:25:41,637 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.79 vs. limit=10.0 2024-06-22 02:25:45,524 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.242e+02 2.401e+02 2.630e+02 3.658e+02, threshold=4.802e+02, percent-clipped=0.0 2024-06-22 02:26:00,377 INFO [train.py:1028] (1/2) Epoch 28, batch 3450, loss[loss=0.1967, simple_loss=0.2444, pruned_loss=0.0745, over 12744.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.2329, pruned_loss=0.05999, over 2578069.41 frames. ], batch size: 177, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:26:04,940 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=507123.8333333333, ans=0.0 2024-06-22 02:26:12,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=507160.5, ans=0.2 2024-06-22 02:26:21,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=507178.8333333333, ans=0.125 2024-06-22 02:26:29,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=507197.1666666667, ans=0.125 2024-06-22 02:26:31,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.81 vs. limit=22.5 2024-06-22 02:26:32,607 INFO [train.py:1028] (1/2) Epoch 28, batch 3500, loss[loss=0.1827, simple_loss=0.2456, pruned_loss=0.05986, over 12960.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2323, pruned_loss=0.0595, over 2577922.83 frames. ], batch size: 33, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:26:50,019 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.29 vs. 
limit=10.0 2024-06-22 02:26:50,777 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.178e+02 2.309e+02 2.461e+02 3.259e+02, threshold=4.618e+02, percent-clipped=0.0 2024-06-22 02:26:57,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=507288.8333333333, ans=0.125 2024-06-22 02:26:57,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=507288.8333333333, ans=0.125 2024-06-22 02:26:58,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=507288.8333333333, ans=0.125 2024-06-22 02:27:05,049 INFO [train.py:1028] (1/2) Epoch 28, batch 3550, loss[loss=0.1743, simple_loss=0.2306, pruned_loss=0.05903, over 13204.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2318, pruned_loss=0.05932, over 2578773.75 frames. ], batch size: 95, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:27:05,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=507307.1666666667, ans=0.125 2024-06-22 02:27:22,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=507325.5, ans=0.05 2024-06-22 02:27:22,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=507325.5, ans=0.125 2024-06-22 02:27:24,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=507343.8333333333, ans=0.125 2024-06-22 02:27:26,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=507343.8333333333, ans=0.0 2024-06-22 02:27:31,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=507362.1666666667, ans=0.125 2024-06-22 02:27:36,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=507380.5, ans=0.025 2024-06-22 02:27:43,106 INFO [train.py:1028] (1/2) Epoch 28, batch 3600, loss[loss=0.1579, simple_loss=0.2283, pruned_loss=0.04372, over 12981.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2314, pruned_loss=0.05921, over 2581619.33 frames. 
], batch size: 48, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:27:46,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=507398.8333333333, ans=0.125 2024-06-22 02:27:47,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=507398.8333333333, ans=0.0 2024-06-22 02:28:01,593 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.281e+02 2.430e+02 2.623e+02 3.597e+02, threshold=4.860e+02, percent-clipped=0.0 2024-06-22 02:28:02,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=507453.8333333333, ans=0.125 2024-06-22 02:28:09,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=507472.1666666667, ans=0.125 2024-06-22 02:28:16,036 INFO [train.py:1028] (1/2) Epoch 28, batch 3650, loss[loss=0.1744, simple_loss=0.2309, pruned_loss=0.05898, over 13055.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2311, pruned_loss=0.05879, over 2579605.53 frames. ], batch size: 102, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:28:18,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=507490.5, ans=0.125 2024-06-22 02:28:36,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=507545.5, ans=0.125 2024-06-22 02:28:42,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=507563.8333333333, ans=0.0 2024-06-22 02:28:43,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.89 vs. limit=10.0 2024-06-22 02:28:49,327 INFO [train.py:1028] (1/2) Epoch 28, batch 3700, loss[loss=0.1733, simple_loss=0.2386, pruned_loss=0.054, over 13193.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2305, pruned_loss=0.05876, over 2584375.39 frames. ], batch size: 72, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:28:51,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.84 vs. limit=22.5 2024-06-22 02:28:55,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=507600.5, ans=0.0 2024-06-22 02:28:55,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=507600.5, ans=0.0 2024-06-22 02:29:02,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=507618.8333333333, ans=0.125 2024-06-22 02:29:07,496 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.190e+02 2.378e+02 2.549e+02 3.471e+02, threshold=4.756e+02, percent-clipped=0.0 2024-06-22 02:29:24,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=507673.8333333333, ans=0.125 2024-06-22 02:29:25,029 INFO [train.py:1028] (1/2) Epoch 28, batch 3750, loss[loss=0.1773, simple_loss=0.2375, pruned_loss=0.05857, over 12379.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2298, pruned_loss=0.0583, over 2586945.04 frames. 
], batch size: 22, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:29:36,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507692.1666666667, ans=0.1 2024-06-22 02:29:37,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507692.1666666667, ans=0.1 2024-06-22 02:29:58,348 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=15.0 2024-06-22 02:29:59,702 INFO [train.py:1028] (1/2) Epoch 28, batch 3800, loss[loss=0.1666, simple_loss=0.2221, pruned_loss=0.05552, over 13162.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2298, pruned_loss=0.05826, over 2584373.75 frames. ], batch size: 83, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:30:00,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.03 vs. limit=15.0 2024-06-22 02:30:18,665 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.200e+02 2.344e+02 2.553e+02 3.068e+02, threshold=4.689e+02, percent-clipped=0.0 2024-06-22 02:30:31,795 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=507838.8333333333, ans=0.04949747468305833 2024-06-22 02:30:34,310 INFO [train.py:1028] (1/2) Epoch 28, batch 3850, loss[loss=0.1762, simple_loss=0.229, pruned_loss=0.06173, over 13053.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2296, pruned_loss=0.05811, over 2584601.33 frames. ], batch size: 144, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:30:34,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=507857.1666666667, ans=0.0 2024-06-22 02:30:40,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=507875.5, ans=0.0 2024-06-22 02:30:42,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=507875.5, ans=10.0 2024-06-22 02:30:56,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=507912.1666666667, ans=0.0 2024-06-22 02:31:06,761 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.50 vs. limit=15.0 2024-06-22 02:31:06,957 INFO [train.py:1028] (1/2) Epoch 28, batch 3900, loss[loss=0.1827, simple_loss=0.2359, pruned_loss=0.06477, over 13200.00 frames. ], tot_loss[loss=0.1728, simple_loss=0.2293, pruned_loss=0.05819, over 2587437.57 frames. 
], batch size: 83, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:31:07,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=507948.8333333333, ans=0.0 2024-06-22 02:31:17,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=507967.1666666667, ans=0.0 2024-06-22 02:31:20,950 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:31:28,424 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.168e+02 2.320e+02 2.490e+02 3.133e+02, threshold=4.639e+02, percent-clipped=0.0 2024-06-22 02:31:42,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=508022.1666666667, ans=0.025 2024-06-22 02:31:45,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=508040.5, ans=0.125 2024-06-22 02:31:46,305 INFO [train.py:1028] (1/2) Epoch 28, batch 3950, loss[loss=0.1607, simple_loss=0.2077, pruned_loss=0.05688, over 13088.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2287, pruned_loss=0.05789, over 2588374.25 frames. ], batch size: 132, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:31:56,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=508058.8333333333, ans=0.025 2024-06-22 02:31:57,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=508058.8333333333, ans=10.0 2024-06-22 02:32:01,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=508077.1666666667, ans=10.0 2024-06-22 02:32:14,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=508113.8333333333, ans=0.95 2024-06-22 02:32:18,994 INFO [train.py:1028] (1/2) Epoch 28, batch 4000, loss[loss=0.1753, simple_loss=0.2307, pruned_loss=0.05989, over 12921.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2275, pruned_loss=0.05746, over 2582467.39 frames. 
], batch size: 39, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:32:19,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=508132.1666666667, ans=0.125 2024-06-22 02:32:20,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=508132.1666666667, ans=0.0 2024-06-22 02:32:21,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=508132.1666666667, ans=0.2 2024-06-22 02:32:24,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=508150.5, ans=0.05 2024-06-22 02:32:32,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=508168.8333333333, ans=10.0 2024-06-22 02:32:34,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=508168.8333333333, ans=0.125 2024-06-22 02:32:34,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=22.5 2024-06-22 02:32:37,264 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.251e+02 2.352e+02 2.537e+02 3.336e+02, threshold=4.704e+02, percent-clipped=0.0 2024-06-22 02:32:45,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=508205.5, ans=0.09899494936611666 2024-06-22 02:32:49,200 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.95 vs. limit=15.0 2024-06-22 02:32:51,947 INFO [train.py:1028] (1/2) Epoch 28, batch 4050, loss[loss=0.1805, simple_loss=0.2209, pruned_loss=0.07005, over 10867.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2277, pruned_loss=0.05763, over 2580237.89 frames. ], batch size: 304, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:32:53,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=508223.8333333333, ans=0.2 2024-06-22 02:33:00,865 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.56 vs. 
limit=10.0 2024-06-22 02:33:08,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=508260.5, ans=0.125 2024-06-22 02:33:11,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=508278.8333333333, ans=0.125 2024-06-22 02:33:11,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=508278.8333333333, ans=0.125 2024-06-22 02:33:11,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=508278.8333333333, ans=0.125 2024-06-22 02:33:13,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=508278.8333333333, ans=0.125 2024-06-22 02:33:21,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=508297.1666666667, ans=0.09899494936611666 2024-06-22 02:33:26,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=508297.1666666667, ans=0.125 2024-06-22 02:33:26,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=508297.1666666667, ans=0.125 2024-06-22 02:33:28,163 INFO [train.py:1028] (1/2) Epoch 28, batch 4100, loss[loss=0.1725, simple_loss=0.2163, pruned_loss=0.06433, over 13129.00 frames. ], tot_loss[loss=0.1719, simple_loss=0.228, pruned_loss=0.0579, over 2576904.25 frames. ], batch size: 103, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:33:28,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508315.5, ans=0.1 2024-06-22 02:33:28,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=508315.5, ans=0.125 2024-06-22 02:33:36,311 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.82 vs. limit=15.0 2024-06-22 02:33:39,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.05 vs. 
limit=22.5 2024-06-22 02:33:39,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=508333.8333333333, ans=0.2 2024-06-22 02:33:40,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=508333.8333333333, ans=0.2 2024-06-22 02:33:49,597 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.240e+02 2.367e+02 2.585e+02 3.602e+02, threshold=4.734e+02, percent-clipped=0.0 2024-06-22 02:33:50,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=508370.5, ans=0.0 2024-06-22 02:33:50,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=508370.5, ans=0.2 2024-06-22 02:33:52,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=508370.5, ans=0.0 2024-06-22 02:34:04,013 INFO [train.py:1028] (1/2) Epoch 28, batch 4150, loss[loss=0.1694, simple_loss=0.2235, pruned_loss=0.05763, over 13132.00 frames. ], tot_loss[loss=0.1716, simple_loss=0.2277, pruned_loss=0.05779, over 2575174.35 frames. ], batch size: 55, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:34:21,916 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:34:34,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=508480.5, ans=0.2 2024-06-22 02:34:36,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508498.8333333333, ans=0.1 2024-06-22 02:34:36,574 INFO [train.py:1028] (1/2) Epoch 28, batch 4200, loss[loss=0.155, simple_loss=0.2043, pruned_loss=0.05285, over 13051.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2271, pruned_loss=0.05775, over 2579072.60 frames. ], batch size: 102, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:34:54,421 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.163e+02 2.299e+02 2.410e+02 3.221e+02, threshold=4.599e+02, percent-clipped=0.0 2024-06-22 02:35:01,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=508553.8333333333, ans=0.0 2024-06-22 02:35:02,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=508572.1666666667, ans=0.125 2024-06-22 02:35:04,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=508572.1666666667, ans=0.125 2024-06-22 02:35:06,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=508572.1666666667, ans=0.09899494936611666 2024-06-22 02:35:08,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=508590.5, ans=0.125 2024-06-22 02:35:09,289 INFO [train.py:1028] (1/2) Epoch 28, batch 4250, loss[loss=0.1641, simple_loss=0.231, pruned_loss=0.04857, over 13350.00 frames. ], tot_loss[loss=0.171, simple_loss=0.227, pruned_loss=0.05749, over 2582199.42 frames. 
], batch size: 46, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:35:10,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=508590.5, ans=0.125 2024-06-22 02:35:28,715 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=508627.1666666667, ans=0.2 2024-06-22 02:35:48,039 INFO [train.py:1028] (1/2) Epoch 28, batch 4300, loss[loss=0.1629, simple_loss=0.2169, pruned_loss=0.05443, over 13189.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2267, pruned_loss=0.05739, over 2581514.19 frames. ], batch size: 59, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:36:03,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=508718.8333333333, ans=0.0 2024-06-22 02:36:05,689 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 2.244e+02 2.349e+02 2.533e+02 3.198e+02, threshold=4.698e+02, percent-clipped=0.0 2024-06-22 02:36:19,838 INFO [train.py:1028] (1/2) Epoch 28, batch 4350, loss[loss=0.1511, simple_loss=0.2165, pruned_loss=0.04287, over 13237.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2264, pruned_loss=0.05738, over 2585488.08 frames. ], batch size: 59, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:36:28,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=508792.1666666667, ans=0.0 2024-06-22 02:36:30,679 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.99 vs. limit=5.0 2024-06-22 02:36:31,935 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:36:36,526 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:36:47,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=508847.1666666667, ans=0.025 2024-06-22 02:36:52,561 INFO [train.py:1028] (1/2) Epoch 28, batch 4400, loss[loss=0.1805, simple_loss=0.2283, pruned_loss=0.06632, over 13204.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2264, pruned_loss=0.0573, over 2586522.95 frames. ], batch size: 83, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:36:55,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=508865.5, ans=0.0 2024-06-22 02:37:08,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=508902.1666666667, ans=0.1 2024-06-22 02:37:10,729 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.259e+02 2.411e+02 2.583e+02 3.312e+02, threshold=4.822e+02, percent-clipped=0.0 2024-06-22 02:37:12,563 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.74 vs. 
limit=15.0 2024-06-22 02:37:23,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=508938.8333333333, ans=0.1 2024-06-22 02:37:24,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=508938.8333333333, ans=0.2 2024-06-22 02:37:24,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=508938.8333333333, ans=0.125 2024-06-22 02:37:29,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=508957.1666666667, ans=0.125 2024-06-22 02:37:30,236 INFO [train.py:1028] (1/2) Epoch 28, batch 4450, loss[loss=0.1851, simple_loss=0.2407, pruned_loss=0.06478, over 12932.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2261, pruned_loss=0.05722, over 2580442.77 frames. ], batch size: 33, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:37:30,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=508957.1666666667, ans=0.0 2024-06-22 02:37:31,855 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.38 vs. limit=15.0 2024-06-22 02:37:48,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=508993.8333333333, ans=0.05 2024-06-22 02:37:50,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.74 vs. limit=15.0 2024-06-22 02:37:51,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=508993.8333333333, ans=0.0 2024-06-22 02:37:52,044 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.07 vs. limit=12.0 2024-06-22 02:37:57,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=509012.1666666667, ans=0.125 2024-06-22 02:37:58,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=509030.5, ans=0.2 2024-06-22 02:37:59,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=509030.5, ans=0.125 2024-06-22 02:38:05,348 INFO [train.py:1028] (1/2) Epoch 28, batch 4500, loss[loss=0.1476, simple_loss=0.2084, pruned_loss=0.04336, over 13247.00 frames. ], tot_loss[loss=0.1701, simple_loss=0.2256, pruned_loss=0.05724, over 2584961.92 frames. ], batch size: 89, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:38:16,481 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:38:23,540 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.192e+02 2.313e+02 2.537e+02 3.154e+02, threshold=4.626e+02, percent-clipped=0.0 2024-06-22 02:38:30,292 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=509103.8333333333, ans=0.125 2024-06-22 02:38:38,145 INFO [train.py:1028] (1/2) Epoch 28, batch 4550, loss[loss=0.1588, simple_loss=0.213, pruned_loss=0.05224, over 13287.00 frames. 
], tot_loss[loss=0.1695, simple_loss=0.2253, pruned_loss=0.05687, over 2588429.45 frames. ], batch size: 52, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:38:47,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=509158.8333333333, ans=0.2 2024-06-22 02:38:53,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=509177.1666666667, ans=0.125 2024-06-22 02:38:56,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=509177.1666666667, ans=0.04949747468305833 2024-06-22 02:39:04,578 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=15.0 2024-06-22 02:39:10,646 INFO [train.py:1028] (1/2) Epoch 28, batch 4600, loss[loss=0.2033, simple_loss=0.2455, pruned_loss=0.08058, over 12571.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2258, pruned_loss=0.05703, over 2584176.83 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:39:14,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=509232.1666666667, ans=0.2 2024-06-22 02:39:14,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.47 vs. limit=22.5 2024-06-22 02:39:14,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=509232.1666666667, ans=0.025 2024-06-22 02:39:32,087 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.200e+02 2.308e+02 2.530e+02 2.971e+02, threshold=4.617e+02, percent-clipped=0.0 2024-06-22 02:39:39,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=509287.1666666667, ans=0.125 2024-06-22 02:39:47,269 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.46 vs. limit=10.0 2024-06-22 02:39:49,379 INFO [train.py:1028] (1/2) Epoch 28, batch 4650, loss[loss=0.1654, simple_loss=0.2183, pruned_loss=0.05631, over 13112.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2251, pruned_loss=0.05677, over 2587004.00 frames. ], batch size: 132, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:39:55,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=509342.1666666667, ans=0.09899494936611666 2024-06-22 02:39:59,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=509342.1666666667, ans=0.0 2024-06-22 02:40:22,989 INFO [train.py:1028] (1/2) Epoch 28, batch 4700, loss[loss=0.1684, simple_loss=0.227, pruned_loss=0.05492, over 12445.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2246, pruned_loss=0.05644, over 2582441.64 frames. ], batch size: 25, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:40:28,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=509415.5, ans=0.125 2024-06-22 02:40:29,050 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.49 vs. 
limit=12.0 2024-06-22 02:40:32,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=509433.8333333333, ans=0.125 2024-06-22 02:40:32,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=509433.8333333333, ans=0.0 2024-06-22 02:40:33,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.92 vs. limit=15.0 2024-06-22 02:40:37,255 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2024-06-22 02:40:40,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=509452.1666666667, ans=0.125 2024-06-22 02:40:41,383 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.189e+02 2.305e+02 2.464e+02 3.540e+02, threshold=4.609e+02, percent-clipped=0.0 2024-06-22 02:40:42,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=509470.5, ans=0.2 2024-06-22 02:40:54,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=509488.8333333333, ans=15.0 2024-06-22 02:40:56,265 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.69 vs. limit=22.5 2024-06-22 02:40:56,363 INFO [train.py:1028] (1/2) Epoch 28, batch 4750, loss[loss=0.1794, simple_loss=0.2326, pruned_loss=0.06313, over 12547.00 frames. ], tot_loss[loss=0.1694, simple_loss=0.2248, pruned_loss=0.05702, over 2579856.73 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:41:01,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=509507.1666666667, ans=0.125 2024-06-22 02:41:06,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=509525.5, ans=0.0 2024-06-22 02:41:14,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=509543.8333333333, ans=0.125 2024-06-22 02:41:32,771 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=509580.5, ans=0.125 2024-06-22 02:41:34,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=509580.5, ans=0.125 2024-06-22 02:41:35,576 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.51 vs. limit=15.0 2024-06-22 02:41:35,843 INFO [train.py:1028] (1/2) Epoch 28, batch 4800, loss[loss=0.1557, simple_loss=0.2195, pruned_loss=0.04597, over 13240.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2242, pruned_loss=0.05657, over 2577494.07 frames. 
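The recurring WARNING [optim.py:487] lines print the distribution of recent gradient norms as five quantiles (min, Q1, median, Q3, max) plus the active clipping threshold. In every record above the threshold equals twice the median (e.g. 2 x 2.353e+02 = 4.706e+02), i.e. Clipping_scale=2.0 times the running median. A standalone sketch of that diagnostic, assuming a simple sliding window over recent steps (the real optimizer tracks this internally):

    import torch

    def clip_and_report(params, recent_norms, clip_scale=2.0, window=1000):
        # global gradient norm for this step (assumes grads are populated)
        total = torch.sqrt(sum((p.grad ** 2).sum()
                               for p in params if p.grad is not None))
        recent_norms.append(float(total))
        q = torch.quantile(torch.tensor(recent_norms[-window:]),
                           torch.tensor([0.00, 0.25, 0.50, 0.75, 1.00]))
        threshold = clip_scale * float(q[2])   # 2.0 x median, as logged
        if float(total) > threshold:           # percent-clipped counts these
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / float(total))
        return q.tolist(), threshold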
], batch size: 63, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:41:41,972 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509617.1666666667, ans=0.1 2024-06-22 02:41:44,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=509617.1666666667, ans=0.04949747468305833 2024-06-22 02:41:54,999 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.245e+02 2.353e+02 2.511e+02 3.167e+02, threshold=4.706e+02, percent-clipped=0.0 2024-06-22 02:42:00,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=509653.8333333333, ans=0.0 2024-06-22 02:42:08,783 INFO [train.py:1028] (1/2) Epoch 28, batch 4850, loss[loss=0.1643, simple_loss=0.2168, pruned_loss=0.05589, over 13222.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.2242, pruned_loss=0.05642, over 2575856.59 frames. ], batch size: 89, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:42:12,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2024-06-22 02:42:13,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=509690.5, ans=0.2 2024-06-22 02:42:21,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=509727.1666666667, ans=0.125 2024-06-22 02:42:25,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=509727.1666666667, ans=0.125 2024-06-22 02:42:29,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=509745.5, ans=0.125 2024-06-22 02:42:41,992 INFO [train.py:1028] (1/2) Epoch 28, batch 4900, loss[loss=0.1587, simple_loss=0.2175, pruned_loss=0.04993, over 13208.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.224, pruned_loss=0.05649, over 2576801.67 frames. ], batch size: 59, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:42:42,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=509782.1666666667, ans=0.125 2024-06-22 02:42:44,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=509782.1666666667, ans=0.05 2024-06-22 02:42:53,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=509800.5, ans=0.07 2024-06-22 02:43:00,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=509818.8333333333, ans=0.125 2024-06-22 02:43:00,993 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.177e+02 2.287e+02 2.460e+02 3.421e+02, threshold=4.575e+02, percent-clipped=0.0 2024-06-22 02:43:17,549 INFO [train.py:1028] (1/2) Epoch 28, batch 4950, loss[loss=0.1907, simple_loss=0.2315, pruned_loss=0.07498, over 11115.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2241, pruned_loss=0.05682, over 2570807.74 frames. 
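The Whitening records fire when some module's feature covariance drifts too far from "white" (all eigenvalues equal): "metric" measures the spread of the eigenvalue spectrum and "limit" is the scheduled bound it is checked against. One plausible metric of that kind, shown for intuition only and not claimed to be the exact formula in scaling.py: the ratio mean(l^2) / mean(l)^2 over covariance eigenvalues l, which is 1.0 for a perfectly flat spectrum and grows as it becomes uneven.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels); channels split into groups
        metrics = []
        for c in x.chunk(num_groups, dim=1):
            c = c - c.mean(dim=0, keepdim=True)
            cov = (c.T @ c) / c.shape[0]
            eigs = torch.linalg.eigvalsh(cov)
            metrics.append(float((eigs ** 2).mean() / eigs.mean() ** 2))
        return sum(metrics) / len(metrics)

    x = torch.randn(1000, 384) * torch.linspace(0.1, 3.0, 384)  # uneven spectrum
    print(whitening_metric(x))  # well above 1.0, like "metric=11.92 vs. limit=15.0"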
], batch size: 304, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:43:24,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2024-06-22 02:43:38,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=509910.5, ans=0.125 2024-06-22 02:43:44,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509928.8333333333, ans=0.1 2024-06-22 02:43:52,622 INFO [train.py:1028] (1/2) Epoch 28, batch 5000, loss[loss=0.1572, simple_loss=0.2077, pruned_loss=0.05339, over 13068.00 frames. ], tot_loss[loss=0.1692, simple_loss=0.2243, pruned_loss=0.05704, over 2574315.93 frames. ], batch size: 95, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:43:53,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=509965.5, ans=0.0 2024-06-22 02:43:54,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=509965.5, ans=0.125 2024-06-22 02:44:00,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=509983.8333333333, ans=0.0 2024-06-22 02:44:11,846 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.148e+02 2.250e+02 2.438e+02 3.117e+02, threshold=4.500e+02, percent-clipped=0.0 2024-06-22 02:44:26,235 INFO [train.py:1028] (1/2) Epoch 28, batch 5050, loss[loss=0.1698, simple_loss=0.2262, pruned_loss=0.05674, over 12813.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2244, pruned_loss=0.05708, over 2574177.48 frames. ], batch size: 36, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:44:26,669 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.75 vs. limit=6.0 2024-06-22 02:44:32,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=510075.5, ans=0.125 2024-06-22 02:44:45,235 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=510093.8333333333, ans=0.95 2024-06-22 02:44:48,200 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=510112.1666666667, ans=0.0 2024-06-22 02:45:00,244 INFO [train.py:1028] (1/2) Epoch 28, batch 5100, loss[loss=0.1688, simple_loss=0.2317, pruned_loss=0.05292, over 12974.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2242, pruned_loss=0.05718, over 2569975.72 frames. ], batch size: 39, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:45:19,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=510185.5, ans=0.035 2024-06-22 02:45:23,315 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.204e+02 2.349e+02 2.512e+02 3.432e+02, threshold=4.698e+02, percent-clipped=0.0 2024-06-22 02:45:31,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=510222.1666666667, ans=0.2 2024-06-22 02:45:39,165 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.52 vs. 
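The tot_loss[... over N frames] fields are not epoch averages: they are frame-weighted running aggregates that decay old batches, which is why N hovers around 2.57-2.58 million instead of growing without bound. A sketch assuming a per-batch decay of 0.995; with ~13k frames per batch that gives a steady-state denominator of 13000 / 0.005 = 2.6e6 frames, the scale seen in these records.

    class RunningLoss:
        """Decayed, frame-weighted running loss. decay=0.995 is an assumption
        chosen to reproduce the ~2.58e6-frame denominators in this log."""
        def __init__(self, decay: float = 0.995):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss: float, num_frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + loss * num_frames
            self.frames = self.decay * self.frames + num_frames

        @property
        def avg(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tracker = RunningLoss()
    for _ in range(5000):
        tracker.update(0.17, 13000.0)
    print(tracker.frames)  # ~2.6e6, matching the "over N frames" scale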
limit=6.0 2024-06-22 02:45:40,257 INFO [train.py:1028] (1/2) Epoch 28, batch 5150, loss[loss=0.1831, simple_loss=0.2257, pruned_loss=0.07027, over 13070.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.2247, pruned_loss=0.05745, over 2572974.13 frames. ], batch size: 132, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:45:40,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=510240.5, ans=0.1 2024-06-22 02:45:41,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=510240.5, ans=0.125 2024-06-22 02:45:43,232 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.49 vs. limit=15.0 2024-06-22 02:46:05,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=510313.8333333333, ans=0.0 2024-06-22 02:46:08,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=510313.8333333333, ans=0.0 2024-06-22 02:46:13,010 INFO [train.py:1028] (1/2) Epoch 28, batch 5200, loss[loss=0.1851, simple_loss=0.2414, pruned_loss=0.0644, over 13199.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.2247, pruned_loss=0.05741, over 2575718.73 frames. ], batch size: 95, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:46:15,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=510332.1666666667, ans=0.0 2024-06-22 02:46:28,513 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.57 vs. limit=15.0 2024-06-22 02:46:32,105 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.187e+02 2.335e+02 2.531e+02 4.037e+02, threshold=4.669e+02, percent-clipped=0.0 2024-06-22 02:46:36,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=510387.1666666667, ans=0.125 2024-06-22 02:46:41,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=510405.5, ans=0.1 2024-06-22 02:46:46,180 INFO [train.py:1028] (1/2) Epoch 28, batch 5250, loss[loss=0.1824, simple_loss=0.2342, pruned_loss=0.06526, over 13242.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.2248, pruned_loss=0.05733, over 2572768.93 frames. ], batch size: 52, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:46:55,891 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.71 vs. limit=6.0 2024-06-22 02:47:00,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=510460.5, ans=0.025 2024-06-22 02:47:05,364 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.86 vs. limit=15.0 2024-06-22 02:47:07,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=510478.8333333333, ans=0.0 2024-06-22 02:47:07,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.45 vs. 
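Many ScheduledFloat names in the surrounding records end in balancer*.min_positive, max_positive, min_abs, or max_abs (values like 0.05, 0.95, 0.02, 10.0). These are bounds on per-channel activation statistics: a balancer keeps, for example, the fraction of positive activations in each channel inside [min_positive, max_positive]. A sketch of the statistic being bounded; the corrective gradients the real module applies to enforce the bounds are omitted here.

    import torch

    def positive_fraction(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels) -> per-channel fraction of positives
        return (x > 0).float().mean(dim=0)

    x = torch.randn(1000, 8)
    frac = positive_fraction(x)
    violations = (frac < 0.05) | (frac > 0.95)  # bounds as in the logged 0.05/0.95
    print(frac.tolist(), violations.any().item())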
limit=15.0 2024-06-22 02:47:17,064 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=510497.1666666667, ans=0.05 2024-06-22 02:47:21,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=510497.1666666667, ans=0.125 2024-06-22 02:47:22,711 INFO [train.py:1028] (1/2) Epoch 28, batch 5300, loss[loss=0.1713, simple_loss=0.2248, pruned_loss=0.05892, over 13017.00 frames. ], tot_loss[loss=0.1701, simple_loss=0.2251, pruned_loss=0.05759, over 2569394.74 frames. ], batch size: 144, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:47:34,865 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=510533.8333333333, ans=0.2 2024-06-22 02:47:35,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=510533.8333333333, ans=0.1 2024-06-22 02:47:38,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=510533.8333333333, ans=0.125 2024-06-22 02:47:38,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=510552.1666666667, ans=0.025 2024-06-22 02:47:40,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510552.1666666667, ans=0.1 2024-06-22 02:47:45,341 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.211e+02 2.309e+02 2.467e+02 2.967e+02, threshold=4.619e+02, percent-clipped=0.0 2024-06-22 02:47:50,185 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2024-06-22 02:47:53,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=510588.8333333333, ans=0.125 2024-06-22 02:47:59,074 INFO [train.py:1028] (1/2) Epoch 28, batch 5350, loss[loss=0.1679, simple_loss=0.2304, pruned_loss=0.0527, over 12256.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2246, pruned_loss=0.05745, over 2575643.74 frames. ], batch size: 17, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:48:03,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=510607.1666666667, ans=0.125 2024-06-22 02:48:19,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=510662.1666666667, ans=0.035 2024-06-22 02:48:24,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=510680.5, ans=0.0 2024-06-22 02:48:25,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=510680.5, ans=0.125 2024-06-22 02:48:28,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.69 vs. limit=12.0 2024-06-22 02:48:31,215 INFO [train.py:1028] (1/2) Epoch 28, batch 5400, loss[loss=0.1736, simple_loss=0.2221, pruned_loss=0.0625, over 12210.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2244, pruned_loss=0.05746, over 2568413.25 frames. 
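The "batch size" field swings between extremes like 17 (batch 5350 just above, 12256 frames) and 304 (batch 4950, 11115 frames) because batches are packed to a fixed total duration rather than a fixed number of utterances. Assuming these frame counts are post-subsampling at roughly 25 frames per second, both extremes carry about 445-490 s of audio: many short cuts or a few long ones. A minimal sketch of duration-budget batching; the 550 s budget is an assumption consistent with the observed sizes, and the real DynamicBucketingSampler also shuffles within duration buckets.

    def make_batches(durations, max_duration=550.0):
        batch, total, out = [], 0.0, []
        for d in sorted(durations):   # grouping by length reduces padding
            if total + d > max_duration and batch:
                out.append(batch)
                batch, total = [], 0.0
            batch.append(d)
            total += d
        if batch:
            out.append(batch)
        return out

    # 300 short cuts (1.5 s) and 30 long ones (29 s):
    print([len(b) for b in make_batches([1.5] * 300 + [29.0] * 30)])
    # -> [303, 18, 9]: huge batches of short cuts, small batches of long ones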
], batch size: 240, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:48:32,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=510698.8333333333, ans=0.0 2024-06-22 02:48:38,965 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=510717.1666666667, ans=0.07 2024-06-22 02:48:43,640 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2024-06-22 02:48:44,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=510735.5, ans=0.125 2024-06-22 02:48:50,559 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.205e+02 2.308e+02 2.460e+02 3.091e+02, threshold=4.617e+02, percent-clipped=0.0 2024-06-22 02:48:50,858 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=510753.8333333333, ans=0.125 2024-06-22 02:49:04,610 INFO [train.py:1028] (1/2) Epoch 28, batch 5450, loss[loss=0.1548, simple_loss=0.2162, pruned_loss=0.04668, over 12474.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2246, pruned_loss=0.05741, over 2572633.94 frames. ], batch size: 25, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:49:10,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=510790.5, ans=0.2 2024-06-22 02:49:18,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=510808.8333333333, ans=0.1 2024-06-22 02:49:19,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=510808.8333333333, ans=0.2 2024-06-22 02:49:28,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=510827.1666666667, ans=0.0 2024-06-22 02:49:34,851 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0 2024-06-22 02:49:35,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=510845.5, ans=0.125 2024-06-22 02:49:38,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=510863.8333333333, ans=0.125 2024-06-22 02:49:40,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=510863.8333333333, ans=0.02 2024-06-22 02:49:44,181 INFO [train.py:1028] (1/2) Epoch 28, batch 5500, loss[loss=0.1825, simple_loss=0.2271, pruned_loss=0.06896, over 12291.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.2247, pruned_loss=0.05745, over 2565006.53 frames. ], batch size: 240, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:49:48,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=510882.1666666667, ans=0.2 2024-06-22 02:49:48,950 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. 
limit=22.5 2024-06-22 02:49:52,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0 2024-06-22 02:50:04,663 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.249e+02 2.374e+02 2.590e+02 3.505e+02, threshold=4.748e+02, percent-clipped=0.0 2024-06-22 02:50:12,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.28 vs. limit=15.0 2024-06-22 02:50:14,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=510955.5, ans=0.125 2024-06-22 02:50:15,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=510955.5, ans=0.025 2024-06-22 02:50:17,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=510973.8333333333, ans=0.0 2024-06-22 02:50:18,212 INFO [train.py:1028] (1/2) Epoch 28, batch 5550, loss[loss=0.1626, simple_loss=0.2292, pruned_loss=0.04803, over 13252.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.2242, pruned_loss=0.057, over 2568696.01 frames. ], batch size: 43, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:50:22,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=510973.8333333333, ans=0.125 2024-06-22 02:50:28,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=510992.1666666667, ans=0.025 2024-06-22 02:50:36,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=511010.5, ans=0.125 2024-06-22 02:50:54,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=511047.1666666667, ans=0.125 2024-06-22 02:50:59,817 INFO [train.py:1028] (1/2) Epoch 28, batch 5600, loss[loss=0.158, simple_loss=0.2108, pruned_loss=0.05261, over 13233.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2239, pruned_loss=0.05683, over 2570748.44 frames. ], batch size: 89, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:51:00,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=511065.5, ans=0.125 2024-06-22 02:51:01,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=511065.5, ans=0.125 2024-06-22 02:51:23,106 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.201e+02 2.313e+02 2.495e+02 3.105e+02, threshold=4.625e+02, percent-clipped=0.0 2024-06-22 02:51:24,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=511120.5, ans=0.125 2024-06-22 02:51:34,968 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=15.0 2024-06-22 02:51:38,753 INFO [train.py:1028] (1/2) Epoch 28, batch 5650, loss[loss=0.1619, simple_loss=0.2129, pruned_loss=0.05548, over 12466.00 frames. ], tot_loss[loss=0.169, simple_loss=0.2242, pruned_loss=0.05695, over 2575955.66 frames. 
], batch size: 202, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:51:45,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=511157.1666666667, ans=0.125 2024-06-22 02:51:46,480 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=12.0 2024-06-22 02:51:47,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=511157.1666666667, ans=0.0 2024-06-22 02:51:53,135 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.34 vs. limit=10.0 2024-06-22 02:52:01,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=511193.8333333333, ans=0.2 2024-06-22 02:52:10,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=511230.5, ans=0.0 2024-06-22 02:52:17,414 INFO [train.py:1028] (1/2) Epoch 28, batch 5700, loss[loss=0.1586, simple_loss=0.2147, pruned_loss=0.05126, over 13246.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.2241, pruned_loss=0.057, over 2578999.32 frames. ], batch size: 63, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:52:21,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=511248.8333333333, ans=0.125 2024-06-22 02:52:26,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=511267.1666666667, ans=0.0 2024-06-22 02:52:27,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=511267.1666666667, ans=0.035 2024-06-22 02:52:36,019 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.222e+02 2.355e+02 2.470e+02 3.464e+02, threshold=4.710e+02, percent-clipped=0.0 2024-06-22 02:52:39,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=511303.8333333333, ans=0.0 2024-06-22 02:52:41,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=511303.8333333333, ans=0.125 2024-06-22 02:52:43,880 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=30.61 vs. limit=22.5 2024-06-22 02:52:48,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=511322.1666666667, ans=0.125 2024-06-22 02:52:49,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=511340.5, ans=0.125 2024-06-22 02:52:50,062 INFO [train.py:1028] (1/2) Epoch 28, batch 5750, loss[loss=0.1865, simple_loss=0.2393, pruned_loss=0.0668, over 12793.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2244, pruned_loss=0.05706, over 2578844.10 frames. ], batch size: 176, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:52:58,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.36 vs. 
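The WithLoss records (loss-sum=0.000e+00 above) track an auxiliary penalty attached to an attention-weight tensor; a sum of zero means the penalty currently contributes nothing. One way such a hook can be structured; this wrapper and its collapse penalty are an illustration, not the actual scaling.py class.

    import torch
    import torch.nn as nn

    class WithAuxLoss(nn.Module):
        """Identity pass-through that records an auxiliary penalty."""
        def __init__(self, name: str):
            super().__init__()
            self.name = name
            self.last_loss = torch.tensor(0.0)

        def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
            # illustrative penalty: charge rows whose attention collapses
            # onto a single key; zero (as logged) when attention is diffuse
            peak = attn_weights.max(dim=-1).values
            self.last_loss = torch.clamp(peak - 0.99, min=0.0).sum()
            return attn_weights

    hook = WithAuxLoss("encoder.encoders.3.encoder.layers.3.self_attn_weights")
    w = hook(torch.softmax(torch.randn(4, 10), dim=-1))
    print(f"WithLoss: name={hook.name}, loss-sum={hook.last_loss:.3e}")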
limit=15.0 2024-06-22 02:53:04,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=511377.1666666667, ans=0.025 2024-06-22 02:53:04,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=511377.1666666667, ans=0.125 2024-06-22 02:53:07,300 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:53:21,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511413.8333333333, ans=0.1 2024-06-22 02:53:25,864 INFO [train.py:1028] (1/2) Epoch 28, batch 5800, loss[loss=0.2021, simple_loss=0.248, pruned_loss=0.07811, over 12774.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2261, pruned_loss=0.05798, over 2578576.07 frames. ], batch size: 177, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:53:42,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=511468.8333333333, ans=0.025 2024-06-22 02:53:48,159 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.308e+02 2.448e+02 2.639e+02 3.828e+02, threshold=4.896e+02, percent-clipped=0.0 2024-06-22 02:53:57,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2024-06-22 02:54:02,228 INFO [train.py:1028] (1/2) Epoch 28, batch 5850, loss[loss=0.182, simple_loss=0.2356, pruned_loss=0.06417, over 12505.00 frames. ], tot_loss[loss=0.1728, simple_loss=0.2279, pruned_loss=0.05881, over 2576643.80 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:54:15,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.59 vs. limit=6.0 2024-06-22 02:54:17,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=511560.5, ans=0.125 2024-06-22 02:54:26,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=511578.8333333333, ans=0.125 2024-06-22 02:54:29,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=511597.1666666667, ans=0.125 2024-06-22 02:54:34,877 INFO [train.py:1028] (1/2) Epoch 28, batch 5900, loss[loss=0.17, simple_loss=0.2229, pruned_loss=0.05853, over 13126.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2293, pruned_loss=0.05901, over 2576420.26 frames. 
], batch size: 121, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:54:41,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511633.8333333333, ans=0.1 2024-06-22 02:54:45,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511633.8333333333, ans=0.1 2024-06-22 02:54:50,801 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=511652.1666666667, ans=0.025 2024-06-22 02:54:53,802 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.207e+02 2.363e+02 2.599e+02 3.698e+02, threshold=4.726e+02, percent-clipped=0.0 2024-06-22 02:55:04,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=511688.8333333333, ans=0.2 2024-06-22 02:55:07,354 INFO [train.py:1028] (1/2) Epoch 28, batch 5950, loss[loss=0.168, simple_loss=0.224, pruned_loss=0.05602, over 13106.00 frames. ], tot_loss[loss=0.1747, simple_loss=0.2304, pruned_loss=0.0595, over 2580882.20 frames. ], batch size: 121, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:55:10,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=511707.1666666667, ans=0.0 2024-06-22 02:55:24,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=511743.8333333333, ans=0.125 2024-06-22 02:55:31,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511762.1666666667, ans=0.1 2024-06-22 02:55:35,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=511762.1666666667, ans=0.2 2024-06-22 02:55:48,341 INFO [train.py:1028] (1/2) Epoch 28, batch 6000, loss[loss=0.2038, simple_loss=0.2554, pruned_loss=0.0761, over 12143.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2315, pruned_loss=0.05985, over 2574071.82 frames. ], batch size: 240, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:55:48,342 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 02:55:53,479 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([0.7505, 4.6378, 5.0979, 4.7143], device='cuda:1') 2024-06-22 02:55:56,204 INFO [train.py:1060] (1/2) Epoch 28, validation: loss=0.193, simple_loss=0.2523, pruned_loss=0.06681, over 351949.00 frames. 2024-06-22 02:55:56,205 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 02:56:08,604 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.71 vs. 
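The "Computing validation loss ... validation: loss=0.193 ..." pair marks the periodic dev-set pass: training pauses, a diagnostic dumps one attention module's attn_weights_entropy tensor, the fixed dev set (351949 frames here) is scored without gradients, and the CUDA memory high-water mark is reported. A minimal sketch of such a pass; loss_fn below is a hypothetical helper returning (loss, num_frames) for a batch.

    import torch

    def compute_validation_loss(model, dev_loader, loss_fn, device):
        model.eval()
        tot, frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = loss_fn(model, batch, device)
                tot += float(loss) * num_frames
                frames += num_frames
        model.train()
        mb = torch.cuda.max_memory_allocated() // (1024 * 1024)  # needs CUDA
        print(f"Maximum memory allocated so far is {mb}MB")
        return tot / max(frames, 1.0)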
limit=15.0 2024-06-22 02:56:12,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=511835.5, ans=0.07 2024-06-22 02:56:16,038 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.308e+02 2.445e+02 2.736e+02 3.381e+02, threshold=4.889e+02, percent-clipped=0.0 2024-06-22 02:56:17,161 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511853.8333333333, ans=0.1 2024-06-22 02:56:25,194 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=511872.1666666667, ans=0.125 2024-06-22 02:56:29,434 INFO [train.py:1028] (1/2) Epoch 28, batch 6050, loss[loss=0.1582, simple_loss=0.2223, pruned_loss=0.04705, over 12927.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2332, pruned_loss=0.0602, over 2576863.42 frames. ], batch size: 39, lr: 2.06e-03, grad_scale: 16.0 2024-06-22 02:56:39,689 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0 2024-06-22 02:56:45,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=511927.1666666667, ans=0.125 2024-06-22 02:56:47,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=511927.1666666667, ans=0.125 2024-06-22 02:56:50,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=511945.5, ans=0.125 2024-06-22 02:56:56,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=511963.8333333333, ans=0.2 2024-06-22 02:57:01,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=511982.1666666667, ans=10.0 2024-06-22 02:57:02,481 INFO [train.py:1028] (1/2) Epoch 28, batch 6100, loss[loss=0.1551, simple_loss=0.2065, pruned_loss=0.05186, over 13105.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.234, pruned_loss=0.06048, over 2580228.88 frames. ], batch size: 121, lr: 2.06e-03, grad_scale: 8.0 2024-06-22 02:57:14,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=512000.5, ans=0.2 2024-06-22 02:57:16,063 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.34 vs. limit=22.5 2024-06-22 02:57:17,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=512018.8333333333, ans=0.2 2024-06-22 02:57:18,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=512018.8333333333, ans=0.015 2024-06-22 02:57:28,691 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.261e+02 2.383e+02 2.685e+02 4.094e+02, threshold=4.767e+02, percent-clipped=0.0 2024-06-22 02:57:41,580 INFO [train.py:1028] (1/2) Epoch 28, batch 6150, loss[loss=0.1806, simple_loss=0.2324, pruned_loss=0.0644, over 10887.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.236, pruned_loss=0.06116, over 2578349.49 frames. 
], batch size: 303, lr: 2.06e-03, grad_scale: 8.0 2024-06-22 02:57:50,027 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:57:51,092 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2024-06-22 02:57:54,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=512092.1666666667, ans=0.0 2024-06-22 02:57:55,147 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2024-06-22 02:58:00,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.27 vs. limit=10.0 2024-06-22 02:58:11,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=512147.1666666667, ans=0.125 2024-06-22 02:58:16,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=512147.1666666667, ans=10.0 2024-06-22 02:58:18,587 INFO [train.py:1028] (1/2) Epoch 28, batch 6200, loss[loss=0.214, simple_loss=0.2714, pruned_loss=0.0783, over 13273.00 frames. ], tot_loss[loss=0.1803, simple_loss=0.2374, pruned_loss=0.06165, over 2577054.05 frames. ], batch size: 89, lr: 2.06e-03, grad_scale: 8.0 2024-06-22 02:58:28,801 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.68 vs. limit=12.0 2024-06-22 02:58:36,331 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512202.1666666667, ans=0.1 2024-06-22 02:58:36,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=512202.1666666667, ans=0.125 2024-06-22 02:58:39,575 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.324e+02 2.521e+02 2.855e+02 4.564e+02, threshold=5.041e+02, percent-clipped=0.0 2024-06-22 02:58:41,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=512220.5, ans=0.1 2024-06-22 02:58:47,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=512238.8333333333, ans=0.1 2024-06-22 02:58:49,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=512238.8333333333, ans=0.07 2024-06-22 02:58:52,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=512257.1666666667, ans=6.0 2024-06-22 02:58:52,404 INFO [train.py:1028] (1/2) Epoch 28, batch 6250, loss[loss=0.1951, simple_loss=0.2464, pruned_loss=0.07186, over 13235.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2388, pruned_loss=0.06223, over 2569567.46 frames. 
], batch size: 83, lr: 2.06e-03, grad_scale: 8.0 2024-06-22 02:59:00,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=512275.5, ans=0.2 2024-06-22 02:59:19,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512330.5, ans=0.1 2024-06-22 02:59:23,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=512330.5, ans=0.025 2024-06-22 02:59:25,933 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.19 vs. limit=6.0 2024-06-22 02:59:28,070 INFO [train.py:1028] (1/2) Epoch 28, batch 6300, loss[loss=0.2036, simple_loss=0.2577, pruned_loss=0.07477, over 11352.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2405, pruned_loss=0.06268, over 2564156.39 frames. ], batch size: 16, lr: 2.06e-03, grad_scale: 8.0 2024-06-22 02:59:34,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.44 vs. limit=22.5 2024-06-22 02:59:36,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=512367.1666666667, ans=0.0 2024-06-22 02:59:51,420 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.395e+02 2.577e+02 2.834e+02 4.218e+02, threshold=5.155e+02, percent-clipped=0.0 2024-06-22 02:59:52,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512403.8333333333, ans=0.1 2024-06-22 02:59:54,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=512403.8333333333, ans=0.125 2024-06-22 03:00:03,723 INFO [train.py:1028] (1/2) Epoch 28, batch 6350, loss[loss=0.2113, simple_loss=0.2644, pruned_loss=0.07912, over 12601.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2418, pruned_loss=0.06297, over 2573293.62 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 8.0 2024-06-22 03:00:13,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2024-06-22 03:00:17,197 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.45 vs. limit=5.0 2024-06-22 03:00:27,987 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=512495.5, ans=0.125 2024-06-22 03:00:28,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=512495.5, ans=0.125 2024-06-22 03:00:33,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=512513.8333333333, ans=0.1 2024-06-22 03:00:37,049 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:00:38,846 INFO [train.py:1028] (1/2) Epoch 28, batch 6400, loss[loss=0.1712, simple_loss=0.2293, pruned_loss=0.05658, over 13261.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2439, pruned_loss=0.06394, over 2574664.43 frames. 
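The grad_scale field traces dynamic fp16 loss scaling: it sits at 64.0 through batch ~4700, halves to 32.0 at 4750 and again to 16.0 and 8.0 around batches 6050-6100 where overflows occur, then grows back to 16.0 and 32.0 in the records that follow. torch.cuda.amp.GradScaler implements exactly this halve-on-overflow, grow-after-a-stable-run policy; the constructor arguments below are illustrative, not values read from this run.

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=64.0, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000)

    def train_step(model, batch, optimizer, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(batch))
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # skipped internally if gradients overflowed
        scaler.update()          # halves the scale on overflow, grows it later
        return loss.detach()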
], batch size: 67, lr: 2.06e-03, grad_scale: 16.0 2024-06-22 03:00:47,996 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.82 vs. limit=12.0 2024-06-22 03:00:54,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=512568.8333333333, ans=0.0 2024-06-22 03:00:59,685 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.347e+02 2.471e+02 2.800e+02 4.021e+02, threshold=4.942e+02, percent-clipped=0.0 2024-06-22 03:01:10,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=512605.5, ans=0.1 2024-06-22 03:01:12,254 INFO [train.py:1028] (1/2) Epoch 28, batch 6450, loss[loss=0.22, simple_loss=0.2714, pruned_loss=0.08428, over 12518.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2453, pruned_loss=0.06436, over 2580600.90 frames. ], batch size: 202, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:01:13,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=512623.8333333333, ans=0.125 2024-06-22 03:01:17,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=512623.8333333333, ans=0.125 2024-06-22 03:01:19,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=512642.1666666667, ans=0.125 2024-06-22 03:01:35,287 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.73 vs. limit=15.0 2024-06-22 03:01:36,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=512678.8333333333, ans=0.1 2024-06-22 03:01:38,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=512678.8333333333, ans=0.125 2024-06-22 03:01:40,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512678.8333333333, ans=0.1 2024-06-22 03:01:50,652 INFO [train.py:1028] (1/2) Epoch 28, batch 6500, loss[loss=0.1943, simple_loss=0.244, pruned_loss=0.07231, over 10665.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2472, pruned_loss=0.06468, over 2584148.83 frames. ], batch size: 304, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:02:01,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.20 vs. 
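The lr field ticks from 2.06e-03 down to 2.05e-03 within this stretch of epoch 28, a drift of about half a percent. That pace is characteristic of an Eden-style schedule (the default in icefall's Zipformer recipes), which decays as an inverse quarter-power in both batch count and epoch; the constants below are illustrative defaults, since this run's actual values are not recoverable from the log alone.

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        batch_factor = ((batch / lr_batches) ** 2 + 1.0) ** -0.25
        epoch_factor = ((epoch / lr_epochs) ** 2 + 1.0) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # this late in training both factors are nearly flat:
    print(eden_lr(0.035, 510000, 28.0))   # ~1.5e-3
    print(eden_lr(0.035, 514000, 28.2))   # only ~0.7% lower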
limit=10.0 2024-06-22 03:02:06,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=512752.1666666667, ans=0.0 2024-06-22 03:02:12,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=512752.1666666667, ans=15.0 2024-06-22 03:02:14,783 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.332e+02 2.481e+02 2.683e+02 3.389e+02, threshold=4.963e+02, percent-clipped=0.0 2024-06-22 03:02:18,183 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=512770.5, ans=0.1 2024-06-22 03:02:22,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=512788.8333333333, ans=0.0 2024-06-22 03:02:22,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=512788.8333333333, ans=0.2 2024-06-22 03:02:24,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=15.0 2024-06-22 03:02:25,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=512788.8333333333, ans=0.125 2024-06-22 03:02:26,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=512807.1666666667, ans=0.125 2024-06-22 03:02:27,186 INFO [train.py:1028] (1/2) Epoch 28, batch 6550, loss[loss=0.1898, simple_loss=0.2456, pruned_loss=0.06701, over 12500.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.248, pruned_loss=0.06468, over 2587877.21 frames. ], batch size: 22, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:02:58,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=512880.5, ans=0.125 2024-06-22 03:03:00,781 INFO [train.py:1028] (1/2) Epoch 28, batch 6600, loss[loss=0.1768, simple_loss=0.2416, pruned_loss=0.05596, over 13251.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2481, pruned_loss=0.06473, over 2589927.30 frames. ], batch size: 72, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:03:15,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=512935.5, ans=0.125 2024-06-22 03:03:15,892 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=512935.5, ans=0.0 2024-06-22 03:03:19,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2024-06-22 03:03:22,214 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.448e+02 2.635e+02 2.897e+02 3.626e+02, threshold=5.270e+02, percent-clipped=0.0 2024-06-22 03:03:35,014 INFO [train.py:1028] (1/2) Epoch 28, batch 6650, loss[loss=0.199, simple_loss=0.2581, pruned_loss=0.06999, over 12988.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2499, pruned_loss=0.06533, over 2584408.76 frames. 
], batch size: 158, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:03:51,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=513027.1666666667, ans=0.0 2024-06-22 03:03:58,104 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2024-06-22 03:04:14,722 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=513063.8333333333, ans=0.125 2024-06-22 03:04:15,931 INFO [train.py:1028] (1/2) Epoch 28, batch 6700, loss[loss=0.2007, simple_loss=0.2592, pruned_loss=0.07104, over 12677.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2506, pruned_loss=0.06567, over 2583517.66 frames. ], batch size: 176, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:04:16,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=513082.1666666667, ans=0.0 2024-06-22 03:04:22,732 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=513100.5, ans=0.0 2024-06-22 03:04:32,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=513118.8333333333, ans=15.0 2024-06-22 03:04:35,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=513137.1666666667, ans=0.95 2024-06-22 03:04:36,380 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.416e+02 2.555e+02 2.742e+02 4.045e+02, threshold=5.109e+02, percent-clipped=0.0 2024-06-22 03:04:40,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=513137.1666666667, ans=0.0 2024-06-22 03:04:41,709 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=513155.5, ans=0.1 2024-06-22 03:04:42,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513155.5, ans=0.1 2024-06-22 03:04:48,797 INFO [train.py:1028] (1/2) Epoch 28, batch 6750, loss[loss=0.256, simple_loss=0.3028, pruned_loss=0.1046, over 12113.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2513, pruned_loss=0.06615, over 2578036.77 frames. ], batch size: 240, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:04:54,286 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.43 vs. 
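Every record carries a "(1/2)" prefix: ranks are zero-indexed, so this is rank 1's log from a two-process DistributedDataParallel run (consistent with the device='cuda:1' tensor in the validation dump earlier). Each rank consumes its own shard of the data and gradients are all-reduced every step, which is why the batch numbers advance in lockstep on both GPUs. A minimal setup sketch, assuming a torchrun-style launch that sets RANK and WORLD_SIZE in the environment:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup(model: torch.nn.Module) -> torch.nn.Module:
        dist.init_process_group(backend="nccl")   # reads RANK / WORLD_SIZE
        rank = dist.get_rank()
        torch.cuda.set_device(rank)
        # every backward pass all-reduces gradients across the ranks
        return DDP(model.cuda(), device_ids=[rank])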
limit=12.0 2024-06-22 03:04:57,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=513192.1666666667, ans=0.125 2024-06-22 03:05:13,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=513228.8333333333, ans=0.2 2024-06-22 03:05:18,812 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=513247.1666666667, ans=0.125 2024-06-22 03:05:20,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=513247.1666666667, ans=0.125 2024-06-22 03:05:20,931 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=513265.5, ans=0.0 2024-06-22 03:05:21,385 INFO [train.py:1028] (1/2) Epoch 28, batch 6800, loss[loss=0.1818, simple_loss=0.2407, pruned_loss=0.06143, over 13191.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2519, pruned_loss=0.06615, over 2579975.04 frames. ], batch size: 67, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:05:22,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=513265.5, ans=0.125 2024-06-22 03:05:41,603 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.376e+02 2.528e+02 2.856e+02 4.508e+02, threshold=5.057e+02, percent-clipped=0.0 2024-06-22 03:06:00,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.33 vs. limit=15.0 2024-06-22 03:06:02,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=6.0 2024-06-22 03:06:04,445 INFO [train.py:1028] (1/2) Epoch 28, batch 6850, loss[loss=0.2178, simple_loss=0.286, pruned_loss=0.07478, over 13217.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2524, pruned_loss=0.06617, over 2584010.39 frames. ], batch size: 63, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:06:17,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=513375.5, ans=0.125 2024-06-22 03:06:18,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2024-06-22 03:06:19,334 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=513375.5, ans=0.125 2024-06-22 03:06:19,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=513375.5, ans=0.125 2024-06-22 03:06:31,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=513412.1666666667, ans=0.1 2024-06-22 03:06:36,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2024-06-22 03:06:41,270 INFO [train.py:1028] (1/2) Epoch 28, batch 6900, loss[loss=0.1989, simple_loss=0.2629, pruned_loss=0.06743, over 13285.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2529, pruned_loss=0.066, over 2585968.35 frames. 
], batch size: 49, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:06:42,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=513448.8333333333, ans=0.125 2024-06-22 03:06:49,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=513467.1666666667, ans=0.125 2024-06-22 03:06:54,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=513485.5, ans=0.125 2024-06-22 03:07:02,071 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.506e+02 2.734e+02 3.126e+02 4.309e+02, threshold=5.468e+02, percent-clipped=0.0 2024-06-22 03:07:15,223 INFO [train.py:1028] (1/2) Epoch 28, batch 6950, loss[loss=0.1706, simple_loss=0.2284, pruned_loss=0.05641, over 11854.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.253, pruned_loss=0.06588, over 2580309.08 frames. ], batch size: 17, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:07:45,619 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=513613.8333333333, ans=0.025 2024-06-22 03:07:48,753 INFO [train.py:1028] (1/2) Epoch 28, batch 7000, loss[loss=0.1963, simple_loss=0.2547, pruned_loss=0.06893, over 12993.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2533, pruned_loss=0.06579, over 2577219.82 frames. ], batch size: 158, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:07:59,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=513650.5, ans=0.125 2024-06-22 03:08:03,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=513650.5, ans=0.125 2024-06-22 03:08:03,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513650.5, ans=0.125 2024-06-22 03:08:12,972 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.440e+02 2.563e+02 2.722e+02 3.382e+02, threshold=5.126e+02, percent-clipped=0.0 2024-06-22 03:08:25,790 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.06 vs. limit=12.0 2024-06-22 03:08:29,909 INFO [train.py:1028] (1/2) Epoch 28, batch 7050, loss[loss=0.2192, simple_loss=0.273, pruned_loss=0.08269, over 12769.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2549, pruned_loss=0.0665, over 2584755.94 frames. ], batch size: 177, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:08:32,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=513723.8333333333, ans=0.5 2024-06-22 03:08:32,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=513723.8333333333, ans=0.125 2024-06-22 03:08:37,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=513742.1666666667, ans=0.0 2024-06-22 03:08:40,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.37 vs. 
limit=15.0 2024-06-22 03:08:47,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=513760.5, ans=0.0 2024-06-22 03:08:56,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=513797.1666666667, ans=0.0 2024-06-22 03:08:56,585 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.14 vs. limit=15.0 2024-06-22 03:09:02,807 INFO [train.py:1028] (1/2) Epoch 28, batch 7100, loss[loss=0.2065, simple_loss=0.2746, pruned_loss=0.06919, over 13162.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2551, pruned_loss=0.06655, over 2576521.56 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:09:23,356 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.404e+02 2.579e+02 2.811e+02 3.907e+02, threshold=5.158e+02, percent-clipped=0.0 2024-06-22 03:09:36,357 INFO [train.py:1028] (1/2) Epoch 28, batch 7150, loss[loss=0.244, simple_loss=0.2926, pruned_loss=0.09772, over 12478.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2554, pruned_loss=0.06635, over 2575939.58 frames. ], batch size: 202, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:09:38,380 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=513907.1666666667, ans=0.125 2024-06-22 03:09:41,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=513907.1666666667, ans=0.0 2024-06-22 03:09:41,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=513907.1666666667, ans=0.025 2024-06-22 03:09:45,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.49 vs. limit=6.0 2024-06-22 03:09:47,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=513925.5, ans=0.0 2024-06-22 03:09:52,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.91 vs. limit=15.0 2024-06-22 03:10:04,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=513962.1666666667, ans=0.1 2024-06-22 03:10:07,258 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=513980.5, ans=0.2 2024-06-22 03:10:10,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=513980.5, ans=0.125 2024-06-22 03:10:12,678 INFO [train.py:1028] (1/2) Epoch 28, batch 7200, loss[loss=0.1991, simple_loss=0.261, pruned_loss=0.06866, over 13203.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2565, pruned_loss=0.06665, over 2580807.33 frames. 
], batch size: 112, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:10:15,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=513998.8333333333, ans=0.125 2024-06-22 03:10:24,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=514035.5, ans=0.0 2024-06-22 03:10:33,689 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2024-06-22 03:10:35,911 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.414e+02 2.571e+02 2.823e+02 4.177e+02, threshold=5.142e+02, percent-clipped=0.0 2024-06-22 03:10:46,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=514072.1666666667, ans=0.0 2024-06-22 03:10:48,654 INFO [train.py:1028] (1/2) Epoch 28, batch 7250, loss[loss=0.1715, simple_loss=0.2457, pruned_loss=0.04867, over 13052.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.257, pruned_loss=0.06666, over 2581437.22 frames. ], batch size: 36, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:10:55,817 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=514108.8333333333, ans=0.125 2024-06-22 03:10:56,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=514108.8333333333, ans=0.125 2024-06-22 03:11:00,253 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:11:09,573 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.68 vs. limit=10.0 2024-06-22 03:11:09,932 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=514145.5, ans=0.07 2024-06-22 03:11:21,308 INFO [train.py:1028] (1/2) Epoch 28, batch 7300, loss[loss=0.192, simple_loss=0.2578, pruned_loss=0.06315, over 13011.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2585, pruned_loss=0.06773, over 2581442.35 frames. 
], batch size: 36, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:11:23,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=514182.1666666667, ans=0.05 2024-06-22 03:11:29,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=514200.5, ans=0.2 2024-06-22 03:11:35,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=514218.8333333333, ans=0.0 2024-06-22 03:11:37,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=514218.8333333333, ans=0.1 2024-06-22 03:11:40,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=514237.1666666667, ans=0.0 2024-06-22 03:11:41,494 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.425e+02 2.624e+02 2.897e+02 4.391e+02, threshold=5.249e+02, percent-clipped=0.0 2024-06-22 03:11:45,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=514237.1666666667, ans=0.0 2024-06-22 03:11:48,735 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=514255.5, ans=0.0 2024-06-22 03:11:54,008 INFO [train.py:1028] (1/2) Epoch 28, batch 7350, loss[loss=0.2037, simple_loss=0.2632, pruned_loss=0.07207, over 13212.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2595, pruned_loss=0.0685, over 2581684.90 frames. ], batch size: 46, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:12:19,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=514328.8333333333, ans=0.125 2024-06-22 03:12:26,461 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2024-06-22 03:12:27,921 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2024-06-22 03:12:30,932 INFO [train.py:1028] (1/2) Epoch 28, batch 7400, loss[loss=0.2096, simple_loss=0.2783, pruned_loss=0.07041, over 13245.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2595, pruned_loss=0.06809, over 2586806.66 frames. ], batch size: 63, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:12:51,181 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=15.0 2024-06-22 03:12:51,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.28 vs. 
limit=15.0 2024-06-22 03:12:55,359 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.432e+02 2.586e+02 2.828e+02 3.873e+02, threshold=5.173e+02, percent-clipped=0.0 2024-06-22 03:12:55,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=514420.5, ans=0.125 2024-06-22 03:13:00,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=514438.8333333333, ans=0.125 2024-06-22 03:13:07,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=514438.8333333333, ans=0.125 2024-06-22 03:13:08,560 INFO [train.py:1028] (1/2) Epoch 28, batch 7450, loss[loss=0.1782, simple_loss=0.2418, pruned_loss=0.05732, over 12587.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2589, pruned_loss=0.06776, over 2579726.34 frames. ], batch size: 29, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:13:22,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=514493.8333333333, ans=0.025 2024-06-22 03:13:42,412 INFO [train.py:1028] (1/2) Epoch 28, batch 7500, loss[loss=0.2292, simple_loss=0.2812, pruned_loss=0.08864, over 10703.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2602, pruned_loss=0.06832, over 2577826.86 frames. ], batch size: 304, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:13:47,212 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=514548.8333333333, ans=0.125 2024-06-22 03:13:50,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=514567.1666666667, ans=0.0 2024-06-22 03:13:50,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=514567.1666666667, ans=0.125 2024-06-22 03:13:51,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=514567.1666666667, ans=0.0 2024-06-22 03:13:55,765 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=514585.5, ans=0.0 2024-06-22 03:13:57,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=514585.5, ans=0.05 2024-06-22 03:13:58,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=514585.5, ans=15.0 2024-06-22 03:13:58,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=514585.5, ans=0.0 2024-06-22 03:14:00,810 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=514585.5, ans=0.125 2024-06-22 03:14:02,524 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.451e+02 2.630e+02 2.857e+02 4.051e+02, threshold=5.260e+02, percent-clipped=0.0 2024-06-22 03:14:02,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=514603.8333333333, ans=0.125 2024-06-22 03:14:17,974 INFO [train.py:1028] (1/2) Epoch 28, batch 7550, loss[loss=0.1858, simple_loss=0.2446, pruned_loss=0.06351, over 12928.00 frames. 
], tot_loss[loss=0.1995, simple_loss=0.2608, pruned_loss=0.06909, over 2576443.20 frames. ], batch size: 158, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:14:26,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=514658.8333333333, ans=0.125 2024-06-22 03:14:40,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=514695.5, ans=0.2 2024-06-22 03:14:43,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=514695.5, ans=0.125 2024-06-22 03:14:43,425 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.44 vs. limit=12.0 2024-06-22 03:14:47,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=514713.8333333333, ans=0.2 2024-06-22 03:14:52,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514713.8333333333, ans=0.1 2024-06-22 03:14:53,832 INFO [train.py:1028] (1/2) Epoch 28, batch 7600, loss[loss=0.1908, simple_loss=0.2553, pruned_loss=0.06313, over 13149.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2617, pruned_loss=0.06959, over 2575302.31 frames. ], batch size: 83, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:14:58,405 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=514732.1666666667, ans=0.125 2024-06-22 03:14:59,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=514750.5, ans=0.025 2024-06-22 03:15:05,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=514750.5, ans=0.125 2024-06-22 03:15:10,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514768.8333333333, ans=0.1 2024-06-22 03:15:14,855 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.464e+02 2.688e+02 2.984e+02 3.842e+02, threshold=5.375e+02, percent-clipped=0.0 2024-06-22 03:15:21,111 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=514805.5, ans=10.0 2024-06-22 03:15:24,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=514805.5, ans=0.1 2024-06-22 03:15:24,312 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.51 vs. limit=10.0 2024-06-22 03:15:27,641 INFO [train.py:1028] (1/2) Epoch 28, batch 7650, loss[loss=0.193, simple_loss=0.2579, pruned_loss=0.06403, over 12875.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2617, pruned_loss=0.06915, over 2573072.82 frames. 
], batch size: 33, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:15:36,022 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=514842.1666666667, ans=0.125 2024-06-22 03:15:39,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=514842.1666666667, ans=0.0 2024-06-22 03:15:51,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514878.8333333333, ans=0.1 2024-06-22 03:15:54,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=514897.1666666667, ans=0.0 2024-06-22 03:15:58,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514897.1666666667, ans=0.1 2024-06-22 03:16:05,091 INFO [train.py:1028] (1/2) Epoch 28, batch 7700, loss[loss=0.2212, simple_loss=0.2925, pruned_loss=0.07497, over 13277.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2622, pruned_loss=0.06926, over 2570255.35 frames. ], batch size: 63, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:16:08,414 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=514915.5, ans=0.0 2024-06-22 03:16:12,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=8.0 2024-06-22 03:16:20,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514952.1666666667, ans=0.1 2024-06-22 03:16:25,044 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.510e+02 2.776e+02 2.982e+02 3.995e+02, threshold=5.552e+02, percent-clipped=0.0 2024-06-22 03:16:41,335 INFO [train.py:1028] (1/2) Epoch 28, batch 7750, loss[loss=0.1859, simple_loss=0.2514, pruned_loss=0.06015, over 13255.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2627, pruned_loss=0.06944, over 2574410.44 frames. ], batch size: 72, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:16:41,464 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=515007.1666666667, ans=0.95 2024-06-22 03:16:45,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=515007.1666666667, ans=0.2 2024-06-22 03:17:06,098 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:17:15,137 INFO [train.py:1028] (1/2) Epoch 28, batch 7800, loss[loss=0.2112, simple_loss=0.2682, pruned_loss=0.07709, over 13172.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2633, pruned_loss=0.06953, over 2578616.30 frames. 
], batch size: 95, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:17:15,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=515098.8333333333, ans=0.0 2024-06-22 03:17:21,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=515117.1666666667, ans=0.025 2024-06-22 03:17:25,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=515117.1666666667, ans=0.125 2024-06-22 03:17:36,159 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.475e+02 2.626e+02 2.878e+02 4.182e+02, threshold=5.252e+02, percent-clipped=0.0 2024-06-22 03:17:37,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=515153.8333333333, ans=0.2 2024-06-22 03:17:44,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=515172.1666666667, ans=0.125 2024-06-22 03:17:47,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=515172.1666666667, ans=0.1 2024-06-22 03:17:49,183 INFO [train.py:1028] (1/2) Epoch 28, batch 7850, loss[loss=0.1847, simple_loss=0.2487, pruned_loss=0.06033, over 11222.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2642, pruned_loss=0.06997, over 2571735.67 frames. ], batch size: 16, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:18:04,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=515208.8333333333, ans=0.0 2024-06-22 03:18:08,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=515227.1666666667, ans=0.125 2024-06-22 03:18:14,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=515245.5, ans=0.125 2024-06-22 03:18:21,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=515263.8333333333, ans=0.2 2024-06-22 03:18:28,700 INFO [train.py:1028] (1/2) Epoch 28, batch 7900, loss[loss=0.1949, simple_loss=0.2675, pruned_loss=0.06121, over 13149.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2642, pruned_loss=0.06988, over 2571157.06 frames. 
], batch size: 77, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:18:31,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=515282.1666666667, ans=0.0 2024-06-22 03:18:38,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=515300.5, ans=0.05 2024-06-22 03:18:38,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=515300.5, ans=0.125 2024-06-22 03:18:41,521 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=515318.8333333333, ans=0.2 2024-06-22 03:18:42,832 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=515318.8333333333, ans=0.0 2024-06-22 03:18:45,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=515318.8333333333, ans=0.0 2024-06-22 03:18:49,293 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.459e+02 2.575e+02 2.929e+02 4.217e+02, threshold=5.150e+02, percent-clipped=0.0 2024-06-22 03:18:56,343 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=515355.5, ans=0.0 2024-06-22 03:18:58,889 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.43 vs. limit=15.0 2024-06-22 03:19:01,855 INFO [train.py:1028] (1/2) Epoch 28, batch 7950, loss[loss=0.2223, simple_loss=0.2665, pruned_loss=0.08908, over 10903.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2647, pruned_loss=0.07008, over 2574710.76 frames. ], batch size: 304, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:19:03,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=515373.8333333333, ans=0.0 2024-06-22 03:19:12,084 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=515392.1666666667, ans=0.125 2024-06-22 03:19:15,791 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.45 vs. limit=22.5 2024-06-22 03:19:25,623 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.93 vs. limit=15.0 2024-06-22 03:19:34,995 INFO [train.py:1028] (1/2) Epoch 28, batch 8000, loss[loss=0.189, simple_loss=0.2658, pruned_loss=0.05604, over 12597.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2656, pruned_loss=0.07032, over 2571020.78 frames. ], batch size: 29, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:19:40,568 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2024-06-22 03:19:57,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.65 vs. 
limit=12.0 2024-06-22 03:19:58,478 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.417e+02 2.578e+02 2.776e+02 3.520e+02, threshold=5.156e+02, percent-clipped=0.0 2024-06-22 03:20:11,251 INFO [train.py:1028] (1/2) Epoch 28, batch 8050, loss[loss=0.1765, simple_loss=0.2415, pruned_loss=0.0557, over 13211.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2655, pruned_loss=0.07013, over 2571855.48 frames. ], batch size: 83, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:20:36,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=515612.1666666667, ans=0.125 2024-06-22 03:20:40,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=515630.5, ans=0.05 2024-06-22 03:20:41,026 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2024-06-22 03:20:47,688 INFO [train.py:1028] (1/2) Epoch 28, batch 8100, loss[loss=0.1981, simple_loss=0.2644, pruned_loss=0.06589, over 13205.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2663, pruned_loss=0.07062, over 2575548.46 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 64.0 2024-06-22 03:20:48,219 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.68 vs. limit=10.0 2024-06-22 03:20:53,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=515648.8333333333, ans=0.09899494936611666 2024-06-22 03:20:57,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=515667.1666666667, ans=0.2 2024-06-22 03:21:03,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=515685.5, ans=0.1 2024-06-22 03:21:08,251 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.373e+02 2.514e+02 2.726e+02 3.558e+02, threshold=5.028e+02, percent-clipped=0.0 2024-06-22 03:21:08,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.50 vs. limit=22.5 2024-06-22 03:21:15,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=515722.1666666667, ans=0.1 2024-06-22 03:21:17,076 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:21:21,081 INFO [train.py:1028] (1/2) Epoch 28, batch 8150, loss[loss=0.1872, simple_loss=0.2453, pruned_loss=0.06451, over 13103.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2661, pruned_loss=0.07006, over 2579769.96 frames. 
], batch size: 121, lr: 2.05e-03, grad_scale: 64.0 2024-06-22 03:21:26,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=515740.5, ans=22.5 2024-06-22 03:21:34,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515777.1666666667, ans=0.1 2024-06-22 03:21:43,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=515795.5, ans=0.125 2024-06-22 03:21:50,955 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.21 vs. limit=15.0 2024-06-22 03:21:53,959 INFO [train.py:1028] (1/2) Epoch 28, batch 8200, loss[loss=0.2149, simple_loss=0.2791, pruned_loss=0.07534, over 13148.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2657, pruned_loss=0.0697, over 2582970.66 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 64.0 2024-06-22 03:21:56,773 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=515832.1666666667, ans=0.0 2024-06-22 03:22:08,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=515850.5, ans=0.025 2024-06-22 03:22:10,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.79 vs. limit=15.0 2024-06-22 03:22:12,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515868.8333333333, ans=0.1 2024-06-22 03:22:16,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=515868.8333333333, ans=0.0 2024-06-22 03:22:18,218 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.538e+02 2.667e+02 2.950e+02 3.581e+02, threshold=5.334e+02, percent-clipped=0.0 2024-06-22 03:22:18,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=515887.1666666667, ans=0.0 2024-06-22 03:22:33,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=515905.5, ans=0.2 2024-06-22 03:22:34,358 INFO [train.py:1028] (1/2) Epoch 28, batch 8250, loss[loss=0.1915, simple_loss=0.2578, pruned_loss=0.06255, over 13214.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2655, pruned_loss=0.06972, over 2583640.98 frames. ], batch size: 52, lr: 2.05e-03, grad_scale: 64.0 2024-06-22 03:22:35,391 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.00 vs. 
limit=10.0 2024-06-22 03:22:44,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=515942.1666666667, ans=0.0 2024-06-22 03:22:59,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=515997.1666666667, ans=0.125 2024-06-22 03:23:02,472 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:23:03,911 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2024-06-22 03:23:05,657 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=516015.5, ans=0.025 2024-06-22 03:23:06,031 INFO [train.py:1028] (1/2) Epoch 28, batch 8300, loss[loss=0.207, simple_loss=0.2641, pruned_loss=0.07495, over 13024.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2649, pruned_loss=0.06938, over 2580504.26 frames. ], batch size: 102, lr: 2.05e-03, grad_scale: 64.0 2024-06-22 03:23:09,876 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=516015.5, ans=0.125 2024-06-22 03:23:11,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=516033.8333333333, ans=0.0 2024-06-22 03:23:20,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=516052.1666666667, ans=0.0 2024-06-22 03:23:26,198 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.524e+02 2.659e+02 2.850e+02 3.606e+02, threshold=5.319e+02, percent-clipped=0.0 2024-06-22 03:23:29,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=516070.5, ans=0.125 2024-06-22 03:23:38,881 INFO [train.py:1028] (1/2) Epoch 28, batch 8350, loss[loss=0.2278, simple_loss=0.2855, pruned_loss=0.08509, over 13182.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2647, pruned_loss=0.06913, over 2582555.73 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:23:48,792 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.55 vs. limit=6.0 2024-06-22 03:23:49,786 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516125.5, ans=0.1 2024-06-22 03:24:05,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516162.1666666667, ans=0.1 2024-06-22 03:24:05,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=516162.1666666667, ans=0.5 2024-06-22 03:24:06,543 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.600e-03 2024-06-22 03:24:15,683 INFO [train.py:1028] (1/2) Epoch 28, batch 8400, loss[loss=0.1836, simple_loss=0.2473, pruned_loss=0.05994, over 12930.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.265, pruned_loss=0.06978, over 2579009.28 frames. 
], batch size: 39, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:24:32,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=516235.5, ans=0.125 2024-06-22 03:24:40,133 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.546e+02 2.779e+02 3.098e+02 4.044e+02, threshold=5.557e+02, percent-clipped=0.0 2024-06-22 03:24:40,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=516253.8333333333, ans=0.1 2024-06-22 03:24:43,539 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:24:44,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=516253.8333333333, ans=0.2 2024-06-22 03:24:44,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=516253.8333333333, ans=0.125 2024-06-22 03:24:51,564 INFO [train.py:1028] (1/2) Epoch 28, batch 8450, loss[loss=0.2113, simple_loss=0.2701, pruned_loss=0.07624, over 13178.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2664, pruned_loss=0.0702, over 2580427.37 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:24:53,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=516290.5, ans=0.125 2024-06-22 03:24:56,699 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.51 vs. limit=6.0 2024-06-22 03:25:06,707 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=516327.1666666667, ans=0.05 2024-06-22 03:25:15,868 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=516345.5, ans=0.125 2024-06-22 03:25:17,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=516363.8333333333, ans=10.0 2024-06-22 03:25:24,252 INFO [train.py:1028] (1/2) Epoch 28, batch 8500, loss[loss=0.2105, simple_loss=0.2741, pruned_loss=0.07346, over 13013.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2677, pruned_loss=0.07073, over 2578808.08 frames. ], batch size: 30, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:25:31,753 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2024-06-22 03:25:34,116 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=516400.5, ans=0.04949747468305833 2024-06-22 03:25:40,433 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=516418.8333333333, ans=0.125 2024-06-22 03:25:41,382 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2024-06-22 03:25:44,085 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.86 vs. 
limit=22.5 2024-06-22 03:25:45,675 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.443e+02 2.667e+02 2.913e+02 4.137e+02, threshold=5.335e+02, percent-clipped=0.0 2024-06-22 03:25:50,261 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.72 vs. limit=15.0 2024-06-22 03:25:50,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=516455.5, ans=0.2 2024-06-22 03:25:57,652 INFO [train.py:1028] (1/2) Epoch 28, batch 8550, loss[loss=0.1941, simple_loss=0.2693, pruned_loss=0.0594, over 12725.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2671, pruned_loss=0.0702, over 2577228.47 frames. ], batch size: 22, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:26:03,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516473.8333333333, ans=0.1 2024-06-22 03:26:06,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=516473.8333333333, ans=0.0 2024-06-22 03:26:16,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=516510.5, ans=0.0 2024-06-22 03:26:19,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=516510.5, ans=0.125 2024-06-22 03:26:26,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=516547.1666666667, ans=0.125 2024-06-22 03:26:28,747 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=516547.1666666667, ans=0.125 2024-06-22 03:26:31,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2024-06-22 03:26:37,497 INFO [train.py:1028] (1/2) Epoch 28, batch 8600, loss[loss=0.2094, simple_loss=0.2726, pruned_loss=0.0731, over 13113.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2672, pruned_loss=0.07018, over 2573691.22 frames. ], batch size: 121, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:26:41,951 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.76 vs. limit=15.0 2024-06-22 03:26:54,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=516602.1666666667, ans=0.125 2024-06-22 03:26:59,202 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.447e+02 2.628e+02 2.794e+02 3.705e+02, threshold=5.257e+02, percent-clipped=0.0 2024-06-22 03:27:09,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=516638.8333333333, ans=0.95 2024-06-22 03:27:11,343 INFO [train.py:1028] (1/2) Epoch 28, batch 8650, loss[loss=0.2018, simple_loss=0.2543, pruned_loss=0.07462, over 13030.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2676, pruned_loss=0.07003, over 2576628.06 frames. 
], batch size: 102, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:27:15,283 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:27:17,901 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.32 vs. limit=12.0 2024-06-22 03:27:26,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=516693.8333333333, ans=0.125 2024-06-22 03:27:33,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=516712.1666666667, ans=0.0 2024-06-22 03:27:36,969 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.081e-03 2024-06-22 03:27:42,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=516730.5, ans=0.09899494936611666 2024-06-22 03:27:44,267 INFO [train.py:1028] (1/2) Epoch 28, batch 8700, loss[loss=0.2144, simple_loss=0.2864, pruned_loss=0.07119, over 13204.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2679, pruned_loss=0.07044, over 2573348.96 frames. ], batch size: 59, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:27:45,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=516748.8333333333, ans=0.125 2024-06-22 03:27:51,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=516767.1666666667, ans=0.0 2024-06-22 03:28:03,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=15.0 2024-06-22 03:28:08,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=516803.8333333333, ans=0.1 2024-06-22 03:28:09,965 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 2.413e+02 2.623e+02 2.842e+02 4.444e+02, threshold=5.246e+02, percent-clipped=0.0 2024-06-22 03:28:13,342 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:28:21,689 INFO [train.py:1028] (1/2) Epoch 28, batch 8750, loss[loss=0.2188, simple_loss=0.2751, pruned_loss=0.08125, over 13115.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2677, pruned_loss=0.07052, over 2568822.67 frames. ], batch size: 121, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:28:24,605 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=516840.5, ans=0.0 2024-06-22 03:28:27,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=516858.8333333333, ans=0.0 2024-06-22 03:28:56,591 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=516913.8333333333, ans=0.125 2024-06-22 03:28:57,626 INFO [train.py:1028] (1/2) Epoch 28, batch 8800, loss[loss=0.1943, simple_loss=0.2675, pruned_loss=0.06059, over 13224.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2685, pruned_loss=0.07071, over 2573529.37 frames. 
], batch size: 72, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:29:16,399 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.11 vs. limit=15.0 2024-06-22 03:29:19,578 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.598e+02 2.774e+02 3.033e+02 4.031e+02, threshold=5.548e+02, percent-clipped=0.0 2024-06-22 03:29:28,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=517005.5, ans=0.025 2024-06-22 03:29:31,718 INFO [train.py:1028] (1/2) Epoch 28, batch 8850, loss[loss=0.2266, simple_loss=0.2851, pruned_loss=0.08406, over 12509.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2685, pruned_loss=0.0711, over 2562367.88 frames. ], batch size: 202, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:29:41,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=517042.1666666667, ans=0.125 2024-06-22 03:29:47,696 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=517060.5, ans=0.0 2024-06-22 03:29:57,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=517078.8333333333, ans=0.125 2024-06-22 03:30:00,645 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=517078.8333333333, ans=0.0 2024-06-22 03:30:18,800 INFO [train.py:1028] (1/2) Epoch 28, batch 8900, loss[loss=0.2048, simple_loss=0.2683, pruned_loss=0.07068, over 12861.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2691, pruned_loss=0.07141, over 2560452.59 frames. ], batch size: 33, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:30:22,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=517115.5, ans=0.125 2024-06-22 03:30:23,526 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:30:28,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=517133.8333333333, ans=0.125 2024-06-22 03:30:45,168 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.543e+02 2.773e+02 3.005e+02 4.376e+02, threshold=5.546e+02, percent-clipped=0.0 2024-06-22 03:30:51,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-22 03:30:52,193 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=517188.8333333333, ans=0.125 2024-06-22 03:30:54,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. limit=6.0 2024-06-22 03:30:56,990 INFO [train.py:1028] (1/2) Epoch 28, batch 8950, loss[loss=0.2239, simple_loss=0.2832, pruned_loss=0.08226, over 12475.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.269, pruned_loss=0.0709, over 2560023.45 frames. 
], batch size: 202, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:30:59,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=517207.1666666667, ans=0.0 2024-06-22 03:31:06,040 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=517225.5, ans=0.0 2024-06-22 03:31:11,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=517243.8333333333, ans=0.025 2024-06-22 03:31:12,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=517243.8333333333, ans=0.0 2024-06-22 03:31:13,200 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:31:31,450 INFO [train.py:1028] (1/2) Epoch 28, batch 9000, loss[loss=0.1804, simple_loss=0.2478, pruned_loss=0.05646, over 13234.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2682, pruned_loss=0.0702, over 2566385.34 frames. ], batch size: 46, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:31:31,450 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 03:31:39,689 INFO [train.py:1060] (1/2) Epoch 28, validation: loss=0.194, simple_loss=0.2527, pruned_loss=0.06771, over 351949.00 frames. 2024-06-22 03:31:39,690 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 03:31:43,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=517298.8333333333, ans=0.0 2024-06-22 03:31:52,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=517317.1666666667, ans=0.025 2024-06-22 03:32:01,817 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.478e+02 2.643e+02 2.834e+02 3.341e+02, threshold=5.287e+02, percent-clipped=0.0 2024-06-22 03:32:05,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=517353.8333333333, ans=0.125 2024-06-22 03:32:07,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=517372.1666666667, ans=0.0 2024-06-22 03:32:07,208 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=517372.1666666667, ans=0.125 2024-06-22 03:32:07,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=517372.1666666667, ans=0.125 2024-06-22 03:32:09,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=517372.1666666667, ans=0.125 2024-06-22 03:32:10,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=517372.1666666667, ans=0.125 2024-06-22 03:32:13,467 INFO [train.py:1028] (1/2) Epoch 28, batch 9050, loss[loss=0.1898, simple_loss=0.2582, pruned_loss=0.06076, over 11075.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2687, pruned_loss=0.07047, over 2565618.56 frames. 
], batch size: 16, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:32:14,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=517390.5, ans=0.2 2024-06-22 03:32:17,939 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=517390.5, ans=0.125 2024-06-22 03:32:44,167 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:32:46,106 INFO [train.py:1028] (1/2) Epoch 28, batch 9100, loss[loss=0.2247, simple_loss=0.2889, pruned_loss=0.08024, over 13224.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2681, pruned_loss=0.06995, over 2568023.38 frames. ], batch size: 72, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:32:47,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=517482.1666666667, ans=0.125 2024-06-22 03:32:48,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=517482.1666666667, ans=0.125 2024-06-22 03:32:57,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=517500.5, ans=0.0 2024-06-22 03:32:58,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=517500.5, ans=0.125 2024-06-22 03:33:06,226 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517518.8333333333, ans=0.1 2024-06-22 03:33:06,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=517518.8333333333, ans=0.125 2024-06-22 03:33:11,885 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.492e+02 2.616e+02 2.811e+02 3.924e+02, threshold=5.232e+02, percent-clipped=0.0 2024-06-22 03:33:15,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=517537.1666666667, ans=0.125 2024-06-22 03:33:19,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=517555.5, ans=0.0 2024-06-22 03:33:22,724 INFO [train.py:1028] (1/2) Epoch 28, batch 9150, loss[loss=0.1976, simple_loss=0.263, pruned_loss=0.06612, over 13218.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2688, pruned_loss=0.07056, over 2569652.61 frames. ], batch size: 77, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:33:28,176 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2024-06-22 03:33:34,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.19 vs. limit=8.0 2024-06-22 03:33:57,787 INFO [train.py:1028] (1/2) Epoch 28, batch 9200, loss[loss=0.2088, simple_loss=0.2779, pruned_loss=0.06984, over 12957.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2687, pruned_loss=0.07006, over 2573743.00 frames. ], batch size: 36, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:34:06,958 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.27 vs. 
limit=10.0 2024-06-22 03:34:08,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=517683.8333333333, ans=0.125 2024-06-22 03:34:09,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=517683.8333333333, ans=0.5 2024-06-22 03:34:09,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=517683.8333333333, ans=0.125 2024-06-22 03:34:17,454 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2024-06-22 03:34:19,552 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 2.453e+02 2.562e+02 2.702e+02 3.598e+02, threshold=5.123e+02, percent-clipped=0.0 2024-06-22 03:34:26,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=517738.8333333333, ans=0.0 2024-06-22 03:34:27,420 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=517738.8333333333, ans=0.125 2024-06-22 03:34:28,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=517738.8333333333, ans=0.125 2024-06-22 03:34:30,517 INFO [train.py:1028] (1/2) Epoch 28, batch 9250, loss[loss=0.2129, simple_loss=0.2804, pruned_loss=0.07277, over 13248.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2682, pruned_loss=0.06977, over 2575897.76 frames. ], batch size: 67, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:34:33,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=517757.1666666667, ans=0.025 2024-06-22 03:34:33,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=517757.1666666667, ans=0.2 2024-06-22 03:34:34,021 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=517757.1666666667, ans=0.025 2024-06-22 03:34:38,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=517775.5, ans=0.125 2024-06-22 03:34:39,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=517775.5, ans=0.0 2024-06-22 03:34:41,370 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.60 vs. limit=12.0 2024-06-22 03:34:49,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=517812.1666666667, ans=0.125 2024-06-22 03:34:54,285 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=517812.1666666667, ans=0.125 2024-06-22 03:35:00,395 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2024-06-22 03:35:01,069 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.42 vs. 
limit=10.0 2024-06-22 03:35:02,486 INFO [train.py:1028] (1/2) Epoch 28, batch 9300, loss[loss=0.1916, simple_loss=0.2578, pruned_loss=0.06269, over 12920.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2686, pruned_loss=0.06968, over 2571424.65 frames. ], batch size: 39, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:35:20,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.55 vs. limit=15.0 2024-06-22 03:35:20,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.21 vs. limit=15.0 2024-06-22 03:35:20,881 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.34 vs. limit=15.0 2024-06-22 03:35:23,079 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 2.455e+02 2.627e+02 2.790e+02 3.283e+02, threshold=5.254e+02, percent-clipped=0.0 2024-06-22 03:35:33,670 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=517940.5, ans=0.125 2024-06-22 03:35:34,114 INFO [train.py:1028] (1/2) Epoch 28, batch 9350, loss[loss=0.1975, simple_loss=0.2675, pruned_loss=0.06379, over 12450.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2689, pruned_loss=0.0702, over 2567616.66 frames. ], batch size: 22, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:35:39,216 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.41 vs. limit=22.5 2024-06-22 03:35:43,219 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=517958.8333333333, ans=0.125 2024-06-22 03:35:47,787 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.37 vs. limit=15.0 2024-06-22 03:35:57,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=518013.8333333333, ans=0.0 2024-06-22 03:36:04,564 INFO [train.py:1028] (1/2) Epoch 28, batch 9400, loss[loss=0.1986, simple_loss=0.2676, pruned_loss=0.06476, over 13244.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2688, pruned_loss=0.07032, over 2567914.55 frames. 
], batch size: 52, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:36:04,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=518032.1666666667, ans=0.1 2024-06-22 03:36:18,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=518050.5, ans=0.0 2024-06-22 03:36:21,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=518068.8333333333, ans=0.0 2024-06-22 03:36:25,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=518068.8333333333, ans=0.125 2024-06-22 03:36:28,629 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.467e+02 2.619e+02 2.846e+02 3.317e+02, threshold=5.239e+02, percent-clipped=0.0 2024-06-22 03:36:32,892 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.18 vs. limit=10.0 2024-06-22 03:36:39,608 INFO [train.py:1028] (1/2) Epoch 28, batch 9450, loss[loss=0.2074, simple_loss=0.2704, pruned_loss=0.0722, over 12594.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2696, pruned_loss=0.07063, over 2567984.85 frames. ], batch size: 22, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:36:42,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=518123.8333333333, ans=0.125 2024-06-22 03:36:54,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=518160.5, ans=0.0 2024-06-22 03:36:59,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=518178.8333333333, ans=0.0 2024-06-22 03:37:00,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.39 vs. limit=12.0 2024-06-22 03:37:03,438 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=518197.1666666667, ans=0.04949747468305833 2024-06-22 03:37:10,047 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.90 vs. limit=10.0 2024-06-22 03:37:12,069 INFO [train.py:1028] (1/2) Epoch 28, batch 9500, loss[loss=0.2014, simple_loss=0.2638, pruned_loss=0.06949, over 13222.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2685, pruned_loss=0.07002, over 2576888.79 frames. ], batch size: 43, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:37:12,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=518215.5, ans=0.125 2024-06-22 03:37:19,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=518233.8333333333, ans=0.125 2024-06-22 03:37:21,347 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. 
limit=15.0 2024-06-22 03:37:30,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518270.5, ans=0.1 2024-06-22 03:37:32,784 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.479e+02 2.685e+02 2.929e+02 4.298e+02, threshold=5.369e+02, percent-clipped=0.0 2024-06-22 03:37:41,522 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=518288.8333333333, ans=0.05 2024-06-22 03:37:42,679 INFO [train.py:1028] (1/2) Epoch 28, batch 9550, loss[loss=0.1842, simple_loss=0.252, pruned_loss=0.05819, over 12909.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.268, pruned_loss=0.06988, over 2573366.30 frames. ], batch size: 39, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:37:45,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=518307.1666666667, ans=0.125 2024-06-22 03:37:46,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=518307.1666666667, ans=0.125 2024-06-22 03:38:00,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=518362.1666666667, ans=0.0 2024-06-22 03:38:02,309 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.24 vs. limit=10.0 2024-06-22 03:38:05,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2024-06-22 03:38:13,572 INFO [train.py:1028] (1/2) Epoch 28, batch 9600, loss[loss=0.224, simple_loss=0.2746, pruned_loss=0.08667, over 10490.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2675, pruned_loss=0.06972, over 2571777.16 frames. ], batch size: 304, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:38:16,099 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518398.8333333333, ans=0.1 2024-06-22 03:38:16,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=518398.8333333333, ans=0.0 2024-06-22 03:38:22,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=518417.1666666667, ans=0.125 2024-06-22 03:38:34,107 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 2.512e+02 2.670e+02 3.033e+02 4.551e+02, threshold=5.340e+02, percent-clipped=0.0 2024-06-22 03:38:43,915 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.88 vs. limit=12.0 2024-06-22 03:38:45,987 INFO [train.py:1028] (1/2) Epoch 28, batch 9650, loss[loss=0.2052, simple_loss=0.2574, pruned_loss=0.0765, over 13156.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2678, pruned_loss=0.0703, over 2561505.49 frames. 
], batch size: 132, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:38:58,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=518527.1666666667, ans=0.2 2024-06-22 03:39:04,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=518545.5, ans=0.0 2024-06-22 03:39:15,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=518563.8333333333, ans=0.125 2024-06-22 03:39:18,799 INFO [train.py:1028] (1/2) Epoch 28, batch 9700, loss[loss=0.1959, simple_loss=0.2538, pruned_loss=0.06895, over 13028.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2682, pruned_loss=0.07073, over 2555285.10 frames. ], batch size: 144, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:39:21,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=518582.1666666667, ans=0.125 2024-06-22 03:39:22,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=518582.1666666667, ans=0.0 2024-06-22 03:39:27,969 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=518600.5, ans=0.125 2024-06-22 03:39:30,234 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2024-06-22 03:39:39,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=518637.1666666667, ans=0.125 2024-06-22 03:39:39,441 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.535e+02 2.677e+02 2.927e+02 4.534e+02, threshold=5.353e+02, percent-clipped=0.0 2024-06-22 03:39:44,214 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.25 vs. limit=15.0 2024-06-22 03:39:46,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=518655.5, ans=0.025 2024-06-22 03:39:47,978 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518655.5, ans=0.1 2024-06-22 03:39:49,014 INFO [train.py:1028] (1/2) Epoch 28, batch 9750, loss[loss=0.1912, simple_loss=0.2563, pruned_loss=0.06306, over 13066.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2679, pruned_loss=0.07034, over 2552032.51 frames. 
], batch size: 132, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:39:49,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=518673.8333333333, ans=0.125 2024-06-22 03:39:54,192 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=518673.8333333333, ans=0.125 2024-06-22 03:40:02,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=518710.5, ans=0.0 2024-06-22 03:40:08,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=518728.8333333333, ans=0.2 2024-06-22 03:40:09,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.23 vs. limit=22.5 2024-06-22 03:40:19,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=518747.1666666667, ans=0.125 2024-06-22 03:40:20,109 INFO [train.py:1028] (1/2) Epoch 28, batch 9800, loss[loss=0.2047, simple_loss=0.2728, pruned_loss=0.06831, over 12890.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2671, pruned_loss=0.06978, over 2545742.03 frames. ], batch size: 39, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:40:25,723 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=518783.8333333333, ans=0.0 2024-06-22 03:40:30,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.68 vs. limit=15.0 2024-06-22 03:40:36,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=518802.1666666667, ans=0.0 2024-06-22 03:40:37,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=518802.1666666667, ans=0.025 2024-06-22 03:40:38,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=518802.1666666667, ans=0.0 2024-06-22 03:40:42,487 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.463e+02 2.598e+02 2.789e+02 3.504e+02, threshold=5.197e+02, percent-clipped=0.0 2024-06-22 03:40:42,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=518820.5, ans=0.125 2024-06-22 03:40:51,975 INFO [train.py:1028] (1/2) Epoch 28, batch 9850, loss[loss=0.1952, simple_loss=0.2627, pruned_loss=0.06385, over 13058.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2666, pruned_loss=0.06966, over 2539008.43 frames. ], batch size: 102, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:40:52,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518857.1666666667, ans=0.1 2024-06-22 03:40:58,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=518875.5, ans=0.125 2024-06-22 03:41:07,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=518893.8333333333, ans=0.0 2024-06-22 03:41:07,410 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.81 vs. 
limit=15.0 2024-06-22 03:41:08,095 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=518893.8333333333, ans=6.0 2024-06-22 03:41:08,129 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.93 vs. limit=15.0 2024-06-22 03:41:16,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=518930.5, ans=0.125 2024-06-22 03:41:22,835 INFO [train.py:1028] (1/2) Epoch 28, batch 9900, loss[loss=0.1837, simple_loss=0.25, pruned_loss=0.05874, over 12968.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2663, pruned_loss=0.0699, over 2531143.28 frames. ], batch size: 39, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:41:23,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518948.8333333333, ans=0.1 2024-06-22 03:41:38,844 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.57 vs. limit=10.0 2024-06-22 03:41:39,893 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=518985.5, ans=0.2 2024-06-22 03:41:42,282 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=519003.8333333333, ans=0.125 2024-06-22 03:41:45,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 2.493e+02 2.584e+02 2.775e+02 3.456e+02, threshold=5.167e+02, percent-clipped=0.0 2024-06-22 03:41:55,326 INFO [train.py:1028] (1/2) Epoch 28, batch 9950, loss[loss=0.2239, simple_loss=0.2934, pruned_loss=0.07722, over 13082.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2653, pruned_loss=0.07022, over 2523821.07 frames. ], batch size: 30, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:42:00,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519040.5, ans=0.1 2024-06-22 03:42:02,367 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.53 vs. limit=10.0 2024-06-22 03:42:11,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=519077.1666666667, ans=0.1 2024-06-22 03:42:18,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519095.5, ans=0.1 2024-06-22 03:42:20,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=519113.8333333333, ans=0.125 2024-06-22 03:42:20,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.19 vs. limit=22.5 2024-06-22 03:42:27,448 INFO [train.py:1028] (1/2) Epoch 28, batch 10000, loss[loss=0.1837, simple_loss=0.2519, pruned_loss=0.05777, over 12541.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2655, pruned_loss=0.07025, over 2486382.54 frames. 
], batch size: 22, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:42:27,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=519132.1666666667, ans=0.125 2024-06-22 03:42:39,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=519150.5, ans=0.2 2024-06-22 03:42:48,548 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.88 vs. limit=15.0 2024-06-22 03:42:49,590 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=519187.1666666667, ans=0.0 2024-06-22 03:42:49,953 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.442e+02 2.601e+02 2.839e+02 4.168e+02, threshold=5.203e+02, percent-clipped=0.0 2024-06-22 03:42:53,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=519205.5, ans=0.125 2024-06-22 03:42:55,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=519205.5, ans=0.1 2024-06-22 03:42:58,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.60 vs. limit=6.0 2024-06-22 03:42:59,665 INFO [train.py:1028] (1/2) Epoch 28, batch 10050, loss[loss=0.1917, simple_loss=0.2544, pruned_loss=0.06452, over 12501.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2662, pruned_loss=0.0713, over 2443718.25 frames. ], batch size: 22, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:43:01,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519223.8333333333, ans=0.1 2024-06-22 03:43:05,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=519242.1666666667, ans=0.2 2024-06-22 03:43:07,673 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=519242.1666666667, ans=0.2 2024-06-22 03:43:10,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=519242.1666666667, ans=0.125 2024-06-22 03:43:11,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=519260.5, ans=0.95 2024-06-22 03:43:11,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=519260.5, ans=0.125 2024-06-22 03:43:14,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=519260.5, ans=0.2 2024-06-22 03:43:16,767 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.81 vs. 
limit=15.0 2024-06-22 03:43:17,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=519278.8333333333, ans=0.0 2024-06-22 03:43:21,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=519278.8333333333, ans=0.2 2024-06-22 03:43:30,414 INFO [train.py:1028] (1/2) Epoch 28, batch 10100, loss[loss=0.1748, simple_loss=0.2349, pruned_loss=0.05738, over 11662.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2663, pruned_loss=0.07119, over 2427360.12 frames. ], batch size: 17, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:45:46,433 INFO [train.py:1028] (1/2) Epoch 29, batch 0, loss[loss=0.1636, simple_loss=0.2297, pruned_loss=0.04881, over 12989.00 frames. ], tot_loss[loss=0.1636, simple_loss=0.2297, pruned_loss=0.04881, over 12989.00 frames. ], batch size: 36, lr: 2.01e-03, grad_scale: 32.0 2024-06-22 03:45:46,434 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 03:45:53,883 INFO [train.py:1060] (1/2) Epoch 29, validation: loss=0.1942, simple_loss=0.2536, pruned_loss=0.06743, over 351949.00 frames. 2024-06-22 03:45:53,884 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 03:46:04,557 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=519365.0, ans=0.0 2024-06-22 03:46:08,847 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.442e+02 2.601e+02 2.796e+02 3.819e+02, threshold=5.201e+02, percent-clipped=0.0 2024-06-22 03:46:26,657 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:46:30,590 INFO [train.py:1028] (1/2) Epoch 29, batch 50, loss[loss=0.1596, simple_loss=0.2231, pruned_loss=0.04807, over 12622.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2455, pruned_loss=0.06356, over 575321.32 frames. ], batch size: 29, lr: 2.01e-03, grad_scale: 32.0 2024-06-22 03:46:36,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=519456.6666666667, ans=0.125 2024-06-22 03:46:37,938 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=519456.6666666667, ans=0.125 2024-06-22 03:46:42,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=519475.0, ans=0.04949747468305833 2024-06-22 03:46:44,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519475.0, ans=0.1 2024-06-22 03:46:55,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=519493.3333333333, ans=0.125 2024-06-22 03:46:58,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519511.6666666667, ans=0.1 2024-06-22 03:47:03,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=519511.6666666667, ans=0.125 2024-06-22 03:47:04,974 INFO [train.py:1028] (1/2) Epoch 29, batch 100, loss[loss=0.1785, simple_loss=0.2477, pruned_loss=0.05465, over 13243.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2459, pruned_loss=0.06319, over 1017584.02 frames. 
], batch size: 46, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:47:15,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.327e+02 2.406e+02 2.611e+02 3.499e+02, threshold=4.812e+02, percent-clipped=0.0 2024-06-22 03:47:19,175 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.05 vs. limit=15.0 2024-06-22 03:47:26,638 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=519585.0, ans=0.95 2024-06-22 03:47:30,894 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=519603.3333333333, ans=0.125 2024-06-22 03:47:36,225 INFO [train.py:1028] (1/2) Epoch 29, batch 150, loss[loss=0.1689, simple_loss=0.2312, pruned_loss=0.05329, over 13108.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.246, pruned_loss=0.06211, over 1365001.15 frames. ], batch size: 30, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:47:39,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.50 vs. limit=22.5 2024-06-22 03:47:43,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=519640.0, ans=0.0 2024-06-22 03:48:05,035 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.65 vs. limit=15.0 2024-06-22 03:48:07,166 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.64 vs. limit=22.5 2024-06-22 03:48:08,426 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.51 vs. limit=22.5 2024-06-22 03:48:10,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=519713.3333333333, ans=0.0 2024-06-22 03:48:11,111 INFO [train.py:1028] (1/2) Epoch 29, batch 200, loss[loss=0.1784, simple_loss=0.2399, pruned_loss=0.05845, over 12572.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2472, pruned_loss=0.06253, over 1634461.19 frames. 
], batch size: 202, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:48:21,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=519731.6666666667, ans=0.07 2024-06-22 03:48:22,495 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.329e+02 2.457e+02 2.674e+02 3.354e+02, threshold=4.915e+02, percent-clipped=0.0 2024-06-22 03:48:27,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519750.0, ans=0.1 2024-06-22 03:48:28,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=519750.0, ans=0.2 2024-06-22 03:48:38,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=519786.6666666667, ans=0.0 2024-06-22 03:48:41,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=519786.6666666667, ans=0.125 2024-06-22 03:48:42,717 INFO [train.py:1028] (1/2) Epoch 29, batch 250, loss[loss=0.1811, simple_loss=0.2262, pruned_loss=0.06799, over 13029.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2466, pruned_loss=0.06258, over 1846093.44 frames. ], batch size: 144, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:48:46,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=519805.0, ans=0.2 2024-06-22 03:48:46,748 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=519805.0, ans=0.125 2024-06-22 03:48:48,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=519805.0, ans=0.125 2024-06-22 03:48:51,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=519823.3333333333, ans=0.0 2024-06-22 03:48:56,486 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=519823.3333333333, ans=0.0 2024-06-22 03:49:11,145 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.87 vs. limit=22.5 2024-06-22 03:49:12,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=519878.3333333333, ans=0.0 2024-06-22 03:49:18,441 INFO [train.py:1028] (1/2) Epoch 29, batch 300, loss[loss=0.1832, simple_loss=0.2418, pruned_loss=0.06234, over 13190.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2469, pruned_loss=0.06264, over 2009562.95 frames. 
], batch size: 112, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:49:20,552 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=519896.6666666667, ans=0.0 2024-06-22 03:49:21,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=519896.6666666667, ans=0.2 2024-06-22 03:49:22,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=519896.6666666667, ans=0.125 2024-06-22 03:49:30,239 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.335e+02 2.446e+02 2.676e+02 3.689e+02, threshold=4.893e+02, percent-clipped=0.0 2024-06-22 03:49:43,185 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=519951.6666666667, ans=0.2 2024-06-22 03:49:44,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=519970.0, ans=0.125 2024-06-22 03:49:48,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=519970.0, ans=0.125 2024-06-22 03:49:50,863 INFO [train.py:1028] (1/2) Epoch 29, batch 350, loss[loss=0.1716, simple_loss=0.2293, pruned_loss=0.0569, over 12908.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2465, pruned_loss=0.06226, over 2139256.73 frames. ], batch size: 33, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:49:53,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519988.3333333333, ans=0.1 2024-06-22 03:49:54,848 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=519988.3333333333, ans=0.2 2024-06-22 03:49:55,561 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=519988.3333333333, ans=0.125 2024-06-22 03:50:18,955 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=520043.3333333333, ans=0.05 2024-06-22 03:50:22,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=520061.6666666667, ans=0.2 2024-06-22 03:50:23,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=520061.6666666667, ans=0.125 2024-06-22 03:50:25,632 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=520080.0, ans=0.125 2024-06-22 03:50:26,156 INFO [train.py:1028] (1/2) Epoch 29, batch 400, loss[loss=0.1847, simple_loss=0.2513, pruned_loss=0.05909, over 13262.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2468, pruned_loss=0.06248, over 2239561.29 frames. 
], batch size: 63, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:50:35,822 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:50:37,471 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.286e+02 2.426e+02 2.676e+02 3.372e+02, threshold=4.852e+02, percent-clipped=0.0 2024-06-22 03:50:37,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=520098.3333333333, ans=0.0 2024-06-22 03:51:00,639 INFO [train.py:1028] (1/2) Epoch 29, batch 450, loss[loss=0.1713, simple_loss=0.2365, pruned_loss=0.05298, over 13289.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2465, pruned_loss=0.06214, over 2314076.41 frames. ], batch size: 67, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:51:07,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=520190.0, ans=0.0 2024-06-22 03:51:09,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=520190.0, ans=0.125 2024-06-22 03:51:28,010 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=520245.0, ans=0.125 2024-06-22 03:51:31,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=520245.0, ans=0.125 2024-06-22 03:51:32,700 INFO [train.py:1028] (1/2) Epoch 29, batch 500, loss[loss=0.1892, simple_loss=0.2422, pruned_loss=0.06812, over 13117.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2466, pruned_loss=0.06207, over 2376894.04 frames. ], batch size: 121, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:51:44,383 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.281e+02 2.398e+02 2.609e+02 3.237e+02, threshold=4.796e+02, percent-clipped=0.0 2024-06-22 03:51:48,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=520300.0, ans=0.125 2024-06-22 03:51:48,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=520300.0, ans=0.125 2024-06-22 03:52:04,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=520336.6666666667, ans=0.1 2024-06-22 03:52:05,809 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=22.5 2024-06-22 03:52:07,349 INFO [train.py:1028] (1/2) Epoch 29, batch 550, loss[loss=0.1777, simple_loss=0.2314, pruned_loss=0.06204, over 12955.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2462, pruned_loss=0.06199, over 2421042.09 frames. ], batch size: 158, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:52:07,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=520355.0, ans=0.125 2024-06-22 03:52:39,899 INFO [train.py:1028] (1/2) Epoch 29, batch 600, loss[loss=0.1835, simple_loss=0.2391, pruned_loss=0.06395, over 13019.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2462, pruned_loss=0.0622, over 2458713.35 frames. 
], batch size: 144, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:52:41,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=520446.6666666667, ans=0.125 2024-06-22 03:52:47,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=520465.0, ans=0.0 2024-06-22 03:52:51,303 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=15.0 2024-06-22 03:52:52,169 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.356e+02 2.448e+02 2.616e+02 3.167e+02, threshold=4.896e+02, percent-clipped=0.0 2024-06-22 03:52:59,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=520501.6666666667, ans=0.0 2024-06-22 03:53:12,214 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2024-06-22 03:53:15,126 INFO [train.py:1028] (1/2) Epoch 29, batch 650, loss[loss=0.1895, simple_loss=0.2555, pruned_loss=0.06177, over 13185.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2466, pruned_loss=0.06166, over 2489962.92 frames. ], batch size: 59, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:53:19,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2024-06-22 03:53:32,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=520575.0, ans=0.07 2024-06-22 03:53:34,454 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=520593.3333333333, ans=0.09899494936611666 2024-06-22 03:53:47,488 INFO [train.py:1028] (1/2) Epoch 29, batch 700, loss[loss=0.205, simple_loss=0.2651, pruned_loss=0.07245, over 13352.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2464, pruned_loss=0.06182, over 2512625.10 frames. ], batch size: 46, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:53:56,655 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=520648.3333333333, ans=0.2 2024-06-22 03:53:59,755 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.301e+02 2.442e+02 2.617e+02 3.752e+02, threshold=4.885e+02, percent-clipped=0.0 2024-06-22 03:54:17,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520685.0, ans=0.1 2024-06-22 03:54:18,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2024-06-22 03:54:28,244 INFO [train.py:1028] (1/2) Epoch 29, batch 750, loss[loss=0.1701, simple_loss=0.2352, pruned_loss=0.05249, over 13271.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2464, pruned_loss=0.06146, over 2528987.42 frames. 
], batch size: 63, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:54:35,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=520740.0, ans=0.125 2024-06-22 03:54:42,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=520758.3333333333, ans=0.125 2024-06-22 03:54:43,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=520758.3333333333, ans=0.2 2024-06-22 03:54:51,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=520776.6666666667, ans=0.125 2024-06-22 03:54:54,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=520795.0, ans=0.0 2024-06-22 03:55:00,555 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.79 vs. limit=10.0 2024-06-22 03:55:00,744 INFO [train.py:1028] (1/2) Epoch 29, batch 800, loss[loss=0.1717, simple_loss=0.2332, pruned_loss=0.05507, over 12942.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2467, pruned_loss=0.06175, over 2542051.96 frames. ], batch size: 36, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:55:06,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=520831.6666666667, ans=0.125 2024-06-22 03:55:15,610 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.36 vs. limit=10.0 2024-06-22 03:55:16,439 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.350e+02 2.476e+02 2.735e+02 3.510e+02, threshold=4.951e+02, percent-clipped=0.0 2024-06-22 03:55:16,581 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=520850.0, ans=0.0 2024-06-22 03:55:19,622 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2024-06-22 03:55:36,961 INFO [train.py:1028] (1/2) Epoch 29, batch 850, loss[loss=0.1851, simple_loss=0.2454, pruned_loss=0.06238, over 13118.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2466, pruned_loss=0.06155, over 2552006.57 frames. ], batch size: 95, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:55:40,424 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.75 vs. limit=10.0 2024-06-22 03:55:41,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=520905.0, ans=0.1 2024-06-22 03:55:45,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=520923.3333333333, ans=0.0 2024-06-22 03:55:46,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=520923.3333333333, ans=0.125 2024-06-22 03:55:50,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.95 vs. 
limit=22.5 2024-06-22 03:56:01,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=520978.3333333333, ans=0.0 2024-06-22 03:56:11,817 INFO [train.py:1028] (1/2) Epoch 29, batch 900, loss[loss=0.1507, simple_loss=0.2152, pruned_loss=0.04314, over 13014.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.246, pruned_loss=0.06165, over 2556700.19 frames. ], batch size: 36, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:56:11,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=520996.6666666667, ans=0.04949747468305833 2024-06-22 03:56:23,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=521015.0, ans=0.0 2024-06-22 03:56:25,020 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.377e+02 2.562e+02 2.835e+02 3.581e+02, threshold=5.124e+02, percent-clipped=0.0 2024-06-22 03:56:36,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521051.6666666667, ans=0.1 2024-06-22 03:56:37,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521070.0, ans=0.1 2024-06-22 03:56:42,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=521070.0, ans=0.125 2024-06-22 03:56:44,647 INFO [train.py:1028] (1/2) Epoch 29, batch 950, loss[loss=0.1794, simple_loss=0.244, pruned_loss=0.05737, over 12883.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2463, pruned_loss=0.06175, over 2559250.86 frames. ], batch size: 39, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:56:48,710 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=521088.3333333333, ans=0.0 2024-06-22 03:57:00,702 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=521125.0, ans=0.125 2024-06-22 03:57:13,694 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=521161.6666666667, ans=0.0 2024-06-22 03:57:16,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=521161.6666666667, ans=0.125 2024-06-22 03:57:19,493 INFO [train.py:1028] (1/2) Epoch 29, batch 1000, loss[loss=0.1927, simple_loss=0.2568, pruned_loss=0.06426, over 13241.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2463, pruned_loss=0.06197, over 2561040.11 frames. 
], batch size: 49, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:57:19,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=521180.0, ans=0.125 2024-06-22 03:57:30,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=521198.3333333333, ans=0.09899494936611666 2024-06-22 03:57:31,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=521198.3333333333, ans=0.125 2024-06-22 03:57:31,554 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=521198.3333333333, ans=0.0 2024-06-22 03:57:32,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=521216.6666666667, ans=0.0 2024-06-22 03:57:32,725 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.347e+02 2.496e+02 2.727e+02 3.623e+02, threshold=4.992e+02, percent-clipped=0.0 2024-06-22 03:57:33,545 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521216.6666666667, ans=0.1 2024-06-22 03:57:37,481 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2024-06-22 03:57:37,884 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=521216.6666666667, ans=0.2 2024-06-22 03:57:40,003 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=521235.0, ans=0.0 2024-06-22 03:57:52,126 INFO [train.py:1028] (1/2) Epoch 29, batch 1050, loss[loss=0.1786, simple_loss=0.2463, pruned_loss=0.0555, over 13192.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2464, pruned_loss=0.06186, over 2563654.17 frames. ], batch size: 77, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:57:52,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.23 vs. limit=12.0 2024-06-22 03:57:54,123 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=521271.6666666667, ans=0.2 2024-06-22 03:57:54,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=521271.6666666667, ans=0.125 2024-06-22 03:57:55,473 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=521271.6666666667, ans=0.1 2024-06-22 03:58:16,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=521326.6666666667, ans=0.2 2024-06-22 03:58:21,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=521345.0, ans=0.125 2024-06-22 03:58:27,299 INFO [train.py:1028] (1/2) Epoch 29, batch 1100, loss[loss=0.1888, simple_loss=0.2516, pruned_loss=0.06297, over 13241.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2467, pruned_loss=0.06185, over 2569124.49 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:58:34,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.87 vs. 
limit=22.5 2024-06-22 03:58:35,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=521381.6666666667, ans=0.125 2024-06-22 03:58:40,237 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.277e+02 2.425e+02 2.541e+02 3.423e+02, threshold=4.850e+02, percent-clipped=0.0 2024-06-22 03:58:50,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=521418.3333333333, ans=0.025 2024-06-22 03:58:59,803 INFO [train.py:1028] (1/2) Epoch 29, batch 1150, loss[loss=0.1703, simple_loss=0.2303, pruned_loss=0.05514, over 13294.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.247, pruned_loss=0.06226, over 2570000.43 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:59:18,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=521491.6666666667, ans=0.05 2024-06-22 03:59:23,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=521510.0, ans=0.125 2024-06-22 03:59:26,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=521510.0, ans=0.2 2024-06-22 03:59:35,165 INFO [train.py:1028] (1/2) Epoch 29, batch 1200, loss[loss=0.18, simple_loss=0.2457, pruned_loss=0.05714, over 13183.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.247, pruned_loss=0.06237, over 2572212.86 frames. ], batch size: 77, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:59:47,593 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.416e+02 2.587e+02 2.851e+02 3.693e+02, threshold=5.175e+02, percent-clipped=0.0 2024-06-22 03:59:55,683 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521601.6666666667, ans=0.1 2024-06-22 04:00:09,340 INFO [train.py:1028] (1/2) Epoch 29, batch 1250, loss[loss=0.1847, simple_loss=0.2358, pruned_loss=0.06682, over 13173.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.247, pruned_loss=0.06234, over 2581292.57 frames. ], batch size: 112, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:00:11,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2024-06-22 04:00:14,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=521638.3333333333, ans=0.125 2024-06-22 04:00:21,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521656.6666666667, ans=0.1 2024-06-22 04:00:23,030 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2024-06-22 04:00:25,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=521675.0, ans=0.125 2024-06-22 04:00:30,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521693.3333333333, ans=0.1 2024-06-22 04:00:42,869 INFO [train.py:1028] (1/2) Epoch 29, batch 1300, loss[loss=0.1937, simple_loss=0.2479, pruned_loss=0.06973, over 12767.00 frames. 
], tot_loss[loss=0.1857, simple_loss=0.2468, pruned_loss=0.06224, over 2581921.19 frames. ], batch size: 176, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:00:46,334 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2024-06-22 04:00:48,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=521748.3333333333, ans=0.2 2024-06-22 04:00:56,017 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.307e+02 2.473e+02 2.693e+02 3.722e+02, threshold=4.947e+02, percent-clipped=0.0 2024-06-22 04:01:00,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=521766.6666666667, ans=0.0 2024-06-22 04:01:00,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=521766.6666666667, ans=0.025 2024-06-22 04:01:02,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=521785.0, ans=0.2 2024-06-22 04:01:19,519 INFO [train.py:1028] (1/2) Epoch 29, batch 1350, loss[loss=0.1688, simple_loss=0.232, pruned_loss=0.0528, over 13209.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2469, pruned_loss=0.06234, over 2583460.63 frames. ], batch size: 59, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:01:27,661 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=521840.0, ans=0.0 2024-06-22 04:01:31,842 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.98 vs. limit=15.0 2024-06-22 04:01:47,493 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=521895.0, ans=0.2 2024-06-22 04:01:53,130 INFO [train.py:1028] (1/2) Epoch 29, batch 1400, loss[loss=0.1945, simple_loss=0.2621, pruned_loss=0.06344, over 12471.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.247, pruned_loss=0.06235, over 2584682.05 frames. ], batch size: 25, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:02:05,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.76 vs. limit=10.0 2024-06-22 04:02:08,666 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.312e+02 2.451e+02 2.701e+02 3.676e+02, threshold=4.903e+02, percent-clipped=0.0 2024-06-22 04:02:09,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521950.0, ans=0.1 2024-06-22 04:02:16,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=521968.3333333333, ans=0.5 2024-06-22 04:02:17,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=521968.3333333333, ans=0.04949747468305833 2024-06-22 04:02:28,582 INFO [train.py:1028] (1/2) Epoch 29, batch 1450, loss[loss=0.1746, simple_loss=0.229, pruned_loss=0.0601, over 13111.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2473, pruned_loss=0.06249, over 2584800.38 frames. 
], batch size: 121, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:02:30,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=522005.0, ans=0.125 2024-06-22 04:02:37,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=522023.3333333333, ans=0.125 2024-06-22 04:02:38,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2024-06-22 04:02:39,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=522023.3333333333, ans=0.125 2024-06-22 04:02:50,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=522060.0, ans=0.125 2024-06-22 04:03:00,859 INFO [train.py:1028] (1/2) Epoch 29, batch 1500, loss[loss=0.1881, simple_loss=0.2559, pruned_loss=0.06013, over 13230.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2467, pruned_loss=0.06244, over 2587270.78 frames. ], batch size: 83, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:03:11,236 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2024-06-22 04:03:16,641 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.331e+02 2.514e+02 2.730e+02 3.846e+02, threshold=5.028e+02, percent-clipped=0.0 2024-06-22 04:03:16,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=522133.3333333333, ans=0.125 2024-06-22 04:03:18,602 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=522133.3333333333, ans=0.025 2024-06-22 04:03:23,860 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=522151.6666666667, ans=0.0 2024-06-22 04:03:24,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522151.6666666667, ans=0.1 2024-06-22 04:03:35,849 INFO [train.py:1028] (1/2) Epoch 29, batch 1550, loss[loss=0.1862, simple_loss=0.245, pruned_loss=0.06372, over 13065.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2469, pruned_loss=0.06243, over 2582529.17 frames. ], batch size: 102, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:03:36,277 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.77 vs. limit=10.0 2024-06-22 04:03:37,284 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=522188.3333333333, ans=0.125 2024-06-22 04:03:57,043 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=522225.0, ans=0.1 2024-06-22 04:04:03,390 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.85 vs. limit=15.0 2024-06-22 04:04:10,432 INFO [train.py:1028] (1/2) Epoch 29, batch 1600, loss[loss=0.167, simple_loss=0.2264, pruned_loss=0.05386, over 13150.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2475, pruned_loss=0.06262, over 2578184.37 frames. 
2024-06-22 04:04:11,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522280.0, ans=0.1
2024-06-22 04:04:17,011 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.33 vs. limit=15.0
2024-06-22 04:04:22,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=522316.6666666667, ans=0.125
2024-06-22 04:04:22,845 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.325e+02 2.452e+02 2.626e+02 3.155e+02, threshold=4.904e+02, percent-clipped=0.0
2024-06-22 04:04:23,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=522316.6666666667, ans=0.0
2024-06-22 04:04:36,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=522353.3333333333, ans=0.0
2024-06-22 04:04:38,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=522353.3333333333, ans=0.0
2024-06-22 04:04:42,038 INFO [train.py:1028] (1/2) Epoch 29, batch 1650, loss[loss=0.1826, simple_loss=0.2378, pruned_loss=0.06366, over 13150.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2476, pruned_loss=0.0628, over 2574850.99 frames. ], batch size: 95, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:04:45,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=522371.6666666667, ans=0.125
2024-06-22 04:04:53,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=522390.0, ans=0.125
2024-06-22 04:05:05,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=522426.6666666667, ans=0.125
2024-06-22 04:05:06,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=522426.6666666667, ans=0.125
2024-06-22 04:05:07,662 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.09 vs. limit=15.0
2024-06-22 04:05:13,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.79 vs. limit=12.0
2024-06-22 04:05:17,275 INFO [train.py:1028] (1/2) Epoch 29, batch 1700, loss[loss=0.1758, simple_loss=0.2401, pruned_loss=0.05572, over 12881.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2472, pruned_loss=0.06252, over 2580807.91 frames. ], batch size: 26, lr: 2.00e-03, grad_scale: 32.0
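Each ScheduledFloat record prints the current value (ans) of a hyperparameter that is scheduled against batch_count; by this point in the run most have reached their final constants (skip rates at 0.0, dropout at 0.1, balancer probs at 0.125, and so on). A piecewise-linear schedule is the natural reading of these records; a minimal stand-in (an illustrative sketch, not the scaling.py class):

    def scheduled_float(batch_count, points):
        # points: sorted (batch_count, value) breakpoints of a piecewise-linear schedule
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # e.g. a skip rate decaying to zero over the first 20k batches
    # (hypothetical breakpoints):
    scheduled_float(522316.67, [(0.0, 0.2), (20000.0, 0.0)])  # -> 0.0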
2024-06-22 04:05:17,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=522463.3333333333, ans=0.025
2024-06-22 04:05:21,239 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=522463.3333333333, ans=0.125
2024-06-22 04:05:23,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=522481.6666666667, ans=0.125
2024-06-22 04:05:25,154 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=522481.6666666667, ans=0.2
2024-06-22 04:05:25,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=522481.6666666667, ans=0.125
2024-06-22 04:05:30,138 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.349e+02 2.492e+02 2.692e+02 3.299e+02, threshold=4.984e+02, percent-clipped=0.0
2024-06-22 04:05:30,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=522500.0, ans=0.0
2024-06-22 04:05:31,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=522500.0, ans=0.2
2024-06-22 04:05:33,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522500.0, ans=0.1
2024-06-22 04:05:35,512 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=522500.0, ans=0.2
2024-06-22 04:05:49,853 INFO [train.py:1028] (1/2) Epoch 29, batch 1750, loss[loss=0.1845, simple_loss=0.2469, pruned_loss=0.06107, over 12545.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2473, pruned_loss=0.06253, over 2581618.65 frames. ], batch size: 22, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:05:51,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0
2024-06-22 04:05:59,925 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.81 vs. limit=22.5
2024-06-22 04:06:06,312 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=522591.6666666667, ans=0.0
2024-06-22 04:06:06,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=522591.6666666667, ans=0.125
2024-06-22 04:06:06,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=12.0
2024-06-22 04:06:08,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=522591.6666666667, ans=0.125
2024-06-22 04:06:09,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.13 vs. limit=15.0
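The Whitening lines compare a "whiteness" metric of a module's activations against a limit; per the module names, activations are nudged back toward whiteness when the metric exceeds its limit. The metric behaves like the ratio of the mean squared eigenvalue of the channel covariance to the squared mean eigenvalue: it is >= 1, equals 1 when the covariance is a multiple of the identity, and grows as a few directions dominate. One way to compute such a metric (my paraphrase of the idea; the scaling.py details, e.g. per-group handling, may differ):

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels)
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]          # channel covariance, (C, C)
        d = cov.shape[0]
        # d * tr(C^2) / tr(C)^2 == mean(eig^2) / mean(eig)^2 >= 1
        return d * (cov * cov).sum() / cov.diag().sum() ** 2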
2024-06-22 04:06:20,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=522628.3333333333, ans=15.0
2024-06-22 04:06:24,342 INFO [train.py:1028] (1/2) Epoch 29, batch 1800, loss[loss=0.1714, simple_loss=0.24, pruned_loss=0.05142, over 13205.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2475, pruned_loss=0.06246, over 2581666.56 frames. ], batch size: 67, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:06:25,479 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0
2024-06-22 04:06:29,318 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0
2024-06-22 04:06:35,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=522665.0, ans=0.0
2024-06-22 04:06:36,971 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.301e+02 2.462e+02 2.610e+02 3.115e+02, threshold=4.924e+02, percent-clipped=0.0
2024-06-22 04:06:38,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522683.3333333333, ans=0.1
2024-06-22 04:06:41,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=522683.3333333333, ans=0.125
2024-06-22 04:06:56,341 INFO [train.py:1028] (1/2) Epoch 29, batch 1850, loss[loss=0.1963, simple_loss=0.2572, pruned_loss=0.06774, over 13237.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2475, pruned_loss=0.06258, over 2583299.25 frames. ], batch size: 83, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:07:07,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=522756.6666666667, ans=0.0
2024-06-22 04:07:08,880 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0
2024-06-22 04:07:28,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=522811.6666666667, ans=0.125
2024-06-22 04:07:31,348 INFO [train.py:1028] (1/2) Epoch 29, batch 1900, loss[loss=0.1837, simple_loss=0.2443, pruned_loss=0.06151, over 13118.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2474, pruned_loss=0.06288, over 2586304.08 frames. ], batch size: 95, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:07:34,769 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=522830.0, ans=0.2
2024-06-22 04:07:38,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=522848.3333333333, ans=0.0
2024-06-22 04:07:41,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=522848.3333333333, ans=0.2
2024-06-22 04:07:44,740 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.375e+02 2.458e+02 2.563e+02 3.310e+02, threshold=4.916e+02, percent-clipped=0.0
2024-06-22 04:07:47,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=522866.6666666667, ans=0.0
2024-06-22 04:07:55,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522885.0, ans=0.1
2024-06-22 04:08:06,694 INFO [train.py:1028] (1/2) Epoch 29, batch 1950, loss[loss=0.1878, simple_loss=0.2517, pruned_loss=0.06196, over 13277.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2472, pruned_loss=0.06308, over 2592475.88 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:08:07,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=522921.6666666667, ans=0.2
2024-06-22 04:08:10,695 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=522921.6666666667, ans=0.2
2024-06-22 04:08:11,417 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.18 vs. limit=15.0
2024-06-22 04:08:18,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=522940.0, ans=0.125
2024-06-22 04:08:19,561 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0
2024-06-22 04:08:22,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=522958.3333333333, ans=0.125
2024-06-22 04:08:23,785 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=522958.3333333333, ans=0.125
2024-06-22 04:08:30,990 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=522976.6666666667, ans=0.025
2024-06-22 04:08:32,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=522995.0, ans=0.125
2024-06-22 04:08:32,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=522995.0, ans=0.0
2024-06-22 04:08:36,224 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=522995.0, ans=0.125
2024-06-22 04:08:38,845 INFO [train.py:1028] (1/2) Epoch 29, batch 2000, loss[loss=0.1841, simple_loss=0.257, pruned_loss=0.0556, over 12396.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2468, pruned_loss=0.06308, over 2587554.64 frames. ], batch size: 22, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:08:39,312 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0
2024-06-22 04:08:40,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=523013.3333333333, ans=0.1
2024-06-22 04:08:42,175 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=523013.3333333333, ans=0.125
2024-06-22 04:08:43,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=523013.3333333333, ans=0.125
2024-06-22 04:08:45,391 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=523031.6666666667, ans=0.0
2024-06-22 04:08:46,616 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=523031.6666666667, ans=0.0
2024-06-22 04:08:47,796 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=523031.6666666667, ans=0.125
2024-06-22 04:08:48,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=523031.6666666667, ans=0.125
2024-06-22 04:08:51,716 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.422e+02 2.535e+02 2.811e+02 3.422e+02, threshold=5.070e+02, percent-clipped=0.0
2024-06-22 04:08:55,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=523050.0, ans=0.1
2024-06-22 04:08:55,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=523050.0, ans=0.2
2024-06-22 04:08:56,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=523050.0, ans=0.2
2024-06-22 04:09:01,357 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=523068.3333333333, ans=0.0
2024-06-22 04:09:01,960 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=523068.3333333333, ans=0.2
2024-06-22 04:09:02,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=523068.3333333333, ans=0.0
2024-06-22 04:09:11,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=523086.6666666667, ans=0.125
2024-06-22 04:09:14,785 INFO [train.py:1028] (1/2) Epoch 29, batch 2050, loss[loss=0.1925, simple_loss=0.254, pruned_loss=0.06549, over 12597.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2475, pruned_loss=0.06337, over 2583406.04 frames. ], batch size: 29, lr: 2.00e-03, grad_scale: 32.0
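The bypass.scale_min / bypass_mid.scale_min entries above (all at ans=0.2 by now) schedule the lower clamp of learnable residual-bypass weights: a layer's output is an interpolation between its input and its transformed output, with the mixing weight kept in [scale_min, 1.0] so a layer can be largely, but never completely, bypassed. In sketch form (assumed shape of the idea, not the zipformer code):

    import torch
    import torch.nn as nn

    class Bypass(nn.Module):
        def __init__(self, num_channels: int, scale_min: float = 0.2):
            super().__init__()
            self.scale = nn.Parameter(torch.full((num_channels,), 0.5))
            self.scale_min = scale_min

        def forward(self, x: torch.Tensor, fx: torch.Tensor) -> torch.Tensor:
            # x: layer input, fx: layer output; clamp the learned mixing weight
            c = self.scale.clamp(self.scale_min, 1.0)
            return x + c * (fx - x)   # == (1 - c) * x + c * fx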
2024-06-22 04:09:14,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=523105.0, ans=0.0
2024-06-22 04:09:24,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=523123.3333333333, ans=0.125
2024-06-22 04:09:30,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=523141.6666666667, ans=0.125
2024-06-22 04:09:30,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=523141.6666666667, ans=0.2
2024-06-22 04:09:39,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=523178.3333333333, ans=0.0
2024-06-22 04:09:39,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=523178.3333333333, ans=0.0
2024-06-22 04:09:41,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=523178.3333333333, ans=0.1
2024-06-22 04:09:42,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=523178.3333333333, ans=0.125
2024-06-22 04:09:46,406 INFO [train.py:1028] (1/2) Epoch 29, batch 2100, loss[loss=0.1831, simple_loss=0.2493, pruned_loss=0.0584, over 13258.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2476, pruned_loss=0.06288, over 2586077.15 frames. ], batch size: 59, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:09:46,633 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523196.6666666667, ans=0.1
2024-06-22 04:09:50,537 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=523196.6666666667, ans=0.125
2024-06-22 04:10:02,412 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.355e+02 2.530e+02 2.727e+02 3.451e+02, threshold=5.060e+02, percent-clipped=0.0
2024-06-22 04:10:14,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523270.0, ans=0.1
2024-06-22 04:10:16,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=523270.0, ans=0.125
2024-06-22 04:10:18,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=523270.0, ans=0.0
2024-06-22 04:10:22,026 INFO [train.py:1028] (1/2) Epoch 29, batch 2150, loss[loss=0.1755, simple_loss=0.2438, pruned_loss=0.05366, over 13272.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2484, pruned_loss=0.06307, over 2589434.57 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:10:25,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=523288.3333333333, ans=0.125
2024-06-22 04:10:37,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=523325.0, ans=15.0
2024-06-22 04:10:48,714 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.62 vs. limit=15.0
2024-06-22 04:10:50,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=523361.6666666667, ans=0.0
2024-06-22 04:10:54,535 INFO [train.py:1028] (1/2) Epoch 29, batch 2200, loss[loss=0.1894, simple_loss=0.248, pruned_loss=0.06546, over 13236.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2483, pruned_loss=0.06302, over 2589231.49 frames. ], batch size: 83, lr: 2.00e-03, grad_scale: 16.0
2024-06-22 04:10:56,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=523380.0, ans=0.2
2024-06-22 04:11:02,832 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.63 vs. limit=15.0
2024-06-22 04:11:02,868 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.87 vs. limit=10.0
2024-06-22 04:11:08,344 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.342e+02 2.490e+02 2.648e+02 3.564e+02, threshold=4.979e+02, percent-clipped=0.0
2024-06-22 04:11:08,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=523416.6666666667, ans=0.0
2024-06-22 04:11:16,698 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.11 vs. limit=15.0
2024-06-22 04:11:21,577 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.69 vs. limit=15.0
2024-06-22 04:11:28,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=523453.3333333333, ans=0.125
2024-06-22 04:11:29,942 INFO [train.py:1028] (1/2) Epoch 29, batch 2250, loss[loss=0.1761, simple_loss=0.2415, pruned_loss=0.05537, over 13259.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2487, pruned_loss=0.06309, over 2588464.21 frames. ], batch size: 63, lr: 2.00e-03, grad_scale: 16.0
2024-06-22 04:11:36,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=523490.0, ans=0.125
2024-06-22 04:12:01,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=523545.0, ans=0.125
2024-06-22 04:12:05,814 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.78 vs. limit=15.0
2024-06-22 04:12:06,091 INFO [train.py:1028] (1/2) Epoch 29, batch 2300, loss[loss=0.206, simple_loss=0.2644, pruned_loss=0.07382, over 12964.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2487, pruned_loss=0.06281, over 2582789.62 frames. ], batch size: 33, lr: 2.00e-03, grad_scale: 16.0
2024-06-22 04:12:08,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=523563.3333333333, ans=0.0
2024-06-22 04:12:10,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.80 vs. limit=15.0
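Note grad_scale in the batch 2200 through 2300 records: it dropped from 32.0 to 16.0, and by batch 2400 below it is back at 32.0. That is the signature of dynamic fp16 loss scaling: the scale is halved when a step overflows to inf/nan and grown back after clean steps (the quick recovery here suggests a more aggressive growth policy than stock defaults). The stock PyTorch version of the mechanism, for reference:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=32.0, backoff_factor=0.5, growth_factor=2.0,
        growth_interval=2000)
    # per training step (schematic):
    #   with torch.cuda.amp.autocast():
    #       loss = compute_loss(model, batch)   # hypothetical helper
    #   scaler.scale(loss).backward()
    #   scaler.step(optimizer)  # skipped when grads contain inf/nan
    #   scaler.update()         # halves the scale on overflow, grows it later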
2024-06-22 04:12:20,319 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.336e+02 2.451e+02 2.626e+02 3.376e+02, threshold=4.901e+02, percent-clipped=0.0
2024-06-22 04:12:23,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=523600.0, ans=0.125
2024-06-22 04:12:26,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=523618.3333333333, ans=0.025
2024-06-22 04:12:28,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=523618.3333333333, ans=0.1
2024-06-22 04:12:30,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=523618.3333333333, ans=0.125
2024-06-22 04:12:31,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=523618.3333333333, ans=0.2
2024-06-22 04:12:39,134 INFO [train.py:1028] (1/2) Epoch 29, batch 2350, loss[loss=0.1891, simple_loss=0.2526, pruned_loss=0.06283, over 13216.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2486, pruned_loss=0.06296, over 2585834.29 frames. ], batch size: 67, lr: 2.00e-03, grad_scale: 16.0
2024-06-22 04:12:46,818 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0
2024-06-22 04:12:48,170 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.72 vs. limit=15.0
2024-06-22 04:12:59,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523710.0, ans=0.1
2024-06-22 04:13:14,996 INFO [train.py:1028] (1/2) Epoch 29, batch 2400, loss[loss=0.1886, simple_loss=0.247, pruned_loss=0.06514, over 13302.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2476, pruned_loss=0.0626, over 2588217.37 frames. ], batch size: 46, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:13:29,349 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.367e+02 2.501e+02 2.666e+02 3.627e+02, threshold=5.001e+02, percent-clipped=0.0
2024-06-22 04:13:39,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.75 vs. limit=15.0
2024-06-22 04:13:48,051 INFO [train.py:1028] (1/2) Epoch 29, batch 2450, loss[loss=0.1856, simple_loss=0.2432, pruned_loss=0.06397, over 13264.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2462, pruned_loss=0.06248, over 2585376.34 frames. ], batch size: 63, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:13:58,379 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.69 vs. limit=15.0
2024-06-22 04:14:03,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=523875.0, ans=0.2
2024-06-22 04:14:13,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=523893.3333333333, ans=0.125
2024-06-22 04:14:15,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=523893.3333333333, ans=0.0
2024-06-22 04:14:16,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=523911.6666666667, ans=0.0
2024-06-22 04:14:22,105 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=523911.6666666667, ans=0.0
2024-06-22 04:14:22,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=523911.6666666667, ans=0.125
2024-06-22 04:14:23,298 INFO [train.py:1028] (1/2) Epoch 29, batch 2500, loss[loss=0.1817, simple_loss=0.2363, pruned_loss=0.06356, over 13203.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2454, pruned_loss=0.06239, over 2588189.71 frames. ], batch size: 83, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:14:37,028 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.318e+02 2.443e+02 2.663e+02 3.793e+02, threshold=4.886e+02, percent-clipped=0.0
2024-06-22 04:14:38,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=523966.6666666667, ans=0.125
2024-06-22 04:14:41,610 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0
2024-06-22 04:14:45,858 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:14:51,828 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=524003.3333333333, ans=0.0
2024-06-22 04:14:55,378 INFO [train.py:1028] (1/2) Epoch 29, batch 2550, loss[loss=0.1822, simple_loss=0.2485, pruned_loss=0.05796, over 12698.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2445, pruned_loss=0.06217, over 2587011.22 frames. ], batch size: 22, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:15:12,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524058.3333333333, ans=0.1
2024-06-22 04:15:13,719 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=524058.3333333333, ans=0.125
2024-06-22 04:15:25,615 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:15:30,429 INFO [train.py:1028] (1/2) Epoch 29, batch 2600, loss[loss=0.1782, simple_loss=0.2418, pruned_loss=0.0573, over 13204.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2437, pruned_loss=0.06205, over 2587373.67 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 32.0
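The WithLoss lines report an auxiliary penalty attached to the attention weights; loss-sum=0.000e+00 says the penalty contributed nothing over this logging interval, i.e. the weights stayed inside whatever bounds it enforces. Mechanically, such a penalty can ride along the backward pass via an autograd function that is the identity in forward and injects an extra gradient in backward (a generic sketch with a toy quadratic penalty, not the scaling.py code):

    import torch

    class AttachPenalty(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, coeff: float):
            ctx.save_for_backward(x)
            ctx.coeff = coeff
            return x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                penalty = ctx.coeff * (xd ** 2).mean()  # toy penalty
                (g,) = torch.autograd.grad(penalty, xd)
            return grad_out + g, None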
2024-06-22 04:15:32,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=524113.3333333333, ans=0.125
2024-06-22 04:15:33,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=524113.3333333333, ans=0.0
2024-06-22 04:15:38,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=524131.6666666667, ans=0.125
2024-06-22 04:15:48,662 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.354e+02 2.512e+02 2.696e+02 3.404e+02, threshold=5.024e+02, percent-clipped=0.0
2024-06-22 04:15:48,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=524150.0, ans=0.1
2024-06-22 04:15:59,790 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=524168.3333333333, ans=0.0
2024-06-22 04:16:07,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.24 vs. limit=15.0
2024-06-22 04:16:07,404 INFO [train.py:1028] (1/2) Epoch 29, batch 2650, loss[loss=0.1659, simple_loss=0.2161, pruned_loss=0.05783, over 13049.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2426, pruned_loss=0.06167, over 2586541.58 frames. ], batch size: 144, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:16:15,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=524223.3333333333, ans=0.0
2024-06-22 04:16:33,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=524278.3333333333, ans=0.0
2024-06-22 04:16:36,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=524278.3333333333, ans=0.025
2024-06-22 04:16:39,628 INFO [train.py:1028] (1/2) Epoch 29, batch 2700, loss[loss=0.1828, simple_loss=0.2361, pruned_loss=0.06472, over 13247.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2411, pruned_loss=0.06126, over 2583899.76 frames. ], batch size: 89, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:16:49,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=524315.0, ans=0.2
2024-06-22 04:16:53,792 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.297e+02 2.443e+02 2.651e+02 3.296e+02, threshold=4.885e+02, percent-clipped=0.0
2024-06-22 04:16:56,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.41 vs. limit=6.0
2024-06-22 04:16:58,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=524333.3333333334, ans=0.2
2024-06-22 04:17:05,221 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=524351.6666666666, ans=0.0
2024-06-22 04:17:10,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=524370.0, ans=0.125
2024-06-22 04:17:14,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=524370.0, ans=0.0
2024-06-22 04:17:15,683 INFO [train.py:1028] (1/2) Epoch 29, batch 2750, loss[loss=0.1743, simple_loss=0.2348, pruned_loss=0.0569, over 13302.00 frames. ], tot_loss[loss=0.1802, simple_loss=0.2396, pruned_loss=0.06044, over 2581716.11 frames. ], batch size: 43, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:17:15,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=524388.3333333334, ans=0.0
2024-06-22 04:17:20,894 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=15.0
2024-06-22 04:17:30,967 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.24 vs. limit=10.0
2024-06-22 04:17:39,571 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=524443.3333333334, ans=0.125
2024-06-22 04:17:40,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=524443.3333333334, ans=0.025
2024-06-22 04:17:42,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=524443.3333333334, ans=0.0
2024-06-22 04:17:46,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524461.6666666666, ans=0.1
2024-06-22 04:17:51,922 INFO [train.py:1028] (1/2) Epoch 29, batch 2800, loss[loss=0.178, simple_loss=0.2296, pruned_loss=0.06323, over 10871.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2394, pruned_loss=0.06037, over 2579500.03 frames. ], batch size: 304, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:17:57,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=524480.0, ans=0.2
2024-06-22 04:17:59,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=524498.3333333334, ans=0.2
2024-06-22 04:18:03,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=524498.3333333334, ans=0.05
2024-06-22 04:18:05,411 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.332e+02 2.483e+02 2.805e+02 3.871e+02, threshold=4.966e+02, percent-clipped=0.0
2024-06-22 04:18:07,551 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=524516.6666666666, ans=0.0
2024-06-22 04:18:24,265 INFO [train.py:1028] (1/2) Epoch 29, batch 2850, loss[loss=0.179, simple_loss=0.2446, pruned_loss=0.05669, over 13282.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2389, pruned_loss=0.06028, over 2578543.60 frames. ], batch size: 49, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:18:28,559 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=524571.6666666666, ans=15.0
2024-06-22 04:18:38,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=524608.3333333334, ans=0.025
2024-06-22 04:18:38,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524608.3333333334, ans=0.1
2024-06-22 04:18:40,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=524608.3333333334, ans=0.125
2024-06-22 04:18:53,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524645.0, ans=0.1
2024-06-22 04:18:56,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.51 vs. limit=10.0
2024-06-22 04:18:59,212 INFO [train.py:1028] (1/2) Epoch 29, batch 2900, loss[loss=0.1862, simple_loss=0.2461, pruned_loss=0.06311, over 13145.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2374, pruned_loss=0.05993, over 2586302.64 frames. ], batch size: 55, lr: 2.00e-03, grad_scale: 32.0
2024-06-22 04:19:13,244 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.363e+02 2.581e+02 2.810e+02 3.242e+02, threshold=5.161e+02, percent-clipped=0.0
2024-06-22 04:19:17,881 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.61 vs. limit=22.5
2024-06-22 04:19:22,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.45 vs. limit=5.0
2024-06-22 04:19:35,656 INFO [train.py:1028] (1/2) Epoch 29, batch 2950, loss[loss=0.1654, simple_loss=0.2329, pruned_loss=0.04893, over 13274.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2376, pruned_loss=0.06006, over 2580992.15 frames. ], batch size: 43, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:19:36,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=524755.0, ans=0.0
2024-06-22 04:19:37,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=524755.0, ans=0.125
2024-06-22 04:19:39,446 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=12.0
2024-06-22 04:19:39,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=524755.0, ans=10.0
2024-06-22 04:19:49,001 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.87 vs. limit=22.5
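The learning rate ticks down from 2.00e-03 to 1.99e-03 at batch 2950 above. This is consistent with an Eden-style schedule that decays smoothly as a product of a step-dependent and an epoch-dependent factor, lr = base_lr * ((step^2 + lr_batches^2) / lr_batches^2)^-0.25 * ((epoch^2 + lr_epochs^2) / lr_epochs^2)^-0.25, so within one epoch it creeps down slowly rather than in steps. As a worked sanity check (the step count and constants below are assumptions for illustration):

    def eden_lr(base_lr, step, epoch, lr_batches=7500.0, lr_epochs=3.5):
        f_step = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        f_epoch = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * f_step * f_epoch

    # around epoch 29, with a few hundred thousand optimizer steps behind it,
    # base_lr=0.035 lands near the 2e-03 printed in these records:
    print(f"{eden_lr(0.035, 273_000, 29):.2e}")  # ~2.01e-03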
2024-06-22 04:19:53,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=524791.6666666666, ans=0.0
2024-06-22 04:19:57,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=524810.0, ans=0.125
2024-06-22 04:20:03,333 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0
2024-06-22 04:20:06,578 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.97 vs. limit=15.0
2024-06-22 04:20:07,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=524828.3333333334, ans=0.1
2024-06-22 04:20:08,935 INFO [train.py:1028] (1/2) Epoch 29, batch 3000, loss[loss=0.1816, simple_loss=0.2459, pruned_loss=0.05864, over 13217.00 frames. ], tot_loss[loss=0.1785, simple_loss=0.2371, pruned_loss=0.05998, over 2580460.11 frames. ], batch size: 59, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:20:08,936 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-22 04:20:16,977 INFO [train.py:1060] (1/2) Epoch 29, validation: loss=0.1934, simple_loss=0.2521, pruned_loss=0.06738, over 351949.00 frames.
2024-06-22 04:20:16,977 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-22 04:20:17,255 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=524846.6666666666, ans=0.0
2024-06-22 04:20:19,745 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=524846.6666666666, ans=0.025
2024-06-22 04:20:27,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=524865.0, ans=0.0
2024-06-22 04:20:30,759 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.302e+02 2.422e+02 2.566e+02 3.144e+02, threshold=4.843e+02, percent-clipped=0.0
2024-06-22 04:20:35,789 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0
2024-06-22 04:20:36,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524901.6666666666, ans=0.1
2024-06-22 04:20:39,398 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=524901.6666666666, ans=0.0
2024-06-22 04:20:51,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=524920.0, ans=0.025
2024-06-22 04:20:51,048 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524920.0, ans=0.1
2024-06-22 04:20:52,755 INFO [train.py:1028] (1/2) Epoch 29, batch 3050, loss[loss=0.1867, simple_loss=0.2443, pruned_loss=0.06459, over 13319.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2362, pruned_loss=0.06, over 2580475.58 frames. ], batch size: 46, lr: 1.99e-03, grad_scale: 32.0
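At batch 3000 the loop pauses for a periodic validation pass: the dev loss (0.1934 here) is computed over the same fixed 351949 frames every time, which keeps the numbers comparable across epochs, and the memory high-water mark is re-printed afterwards (17821MB). A typical shape for such a pass (schematic; compute_loss is a hypothetical stand-in for the real loss function):

    import torch

    def validate(model, dev_loader, device):
        model.eval()
        tot, frames = 0.0, 0
        with torch.no_grad():
            for batch in dev_loader:
                loss, n = compute_loss(model, batch, device)  # hypothetical helper
                tot += loss.item() * n
                frames += n
        model.train()
        return tot / frames  # frame-weighted, like the "validation: loss=" record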
2024-06-22 04:20:54,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=524938.3333333334, ans=0.0
2024-06-22 04:20:57,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524938.3333333334, ans=0.1
2024-06-22 04:21:13,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=524993.3333333334, ans=0.025
2024-06-22 04:21:18,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=525011.6666666666, ans=0.125
2024-06-22 04:21:20,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=525011.6666666666, ans=0.05
2024-06-22 04:21:25,174 INFO [train.py:1028] (1/2) Epoch 29, batch 3100, loss[loss=0.1629, simple_loss=0.2189, pruned_loss=0.05346, over 13039.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.235, pruned_loss=0.05936, over 2580378.44 frames. ], batch size: 144, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:21:34,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=525048.3333333334, ans=0.125
2024-06-22 04:21:41,984 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.305e+02 2.452e+02 2.604e+02 3.487e+02, threshold=4.903e+02, percent-clipped=0.0
2024-06-22 04:21:49,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=525085.0, ans=0.125
2024-06-22 04:21:53,900 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525103.3333333334, ans=0.1
2024-06-22 04:21:54,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=525103.3333333334, ans=0.125
2024-06-22 04:22:00,797 INFO [train.py:1028] (1/2) Epoch 29, batch 3150, loss[loss=0.1763, simple_loss=0.2293, pruned_loss=0.06162, over 12912.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2339, pruned_loss=0.05905, over 2582917.35 frames. ], batch size: 158, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:22:06,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=525121.6666666666, ans=0.125
2024-06-22 04:22:16,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525158.3333333334, ans=0.1
2024-06-22 04:22:21,238 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=525176.6666666666, ans=0.0
2024-06-22 04:22:23,555 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=525176.6666666666, ans=0.0
2024-06-22 04:22:28,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525195.0, ans=0.1
2024-06-22 04:22:32,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=525213.3333333334, ans=0.025
2024-06-22 04:22:33,272 INFO [train.py:1028] (1/2) Epoch 29, batch 3200, loss[loss=0.1584, simple_loss=0.2181, pruned_loss=0.04941, over 13102.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2331, pruned_loss=0.05856, over 2583429.05 frames. ], batch size: 55, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:22:35,421 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=525213.3333333334, ans=0.125
2024-06-22 04:22:41,091 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525231.6666666666, ans=0.1
2024-06-22 04:22:41,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=525231.6666666666, ans=0.0
2024-06-22 04:22:48,568 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=525250.0, ans=0.07
2024-06-22 04:22:48,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=525250.0, ans=0.1
2024-06-22 04:22:49,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.234e+02 2.367e+02 2.507e+02 2.902e+02, threshold=4.735e+02, percent-clipped=0.0
2024-06-22 04:22:53,108 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:23:02,681 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=525286.6666666666, ans=0.125
2024-06-22 04:23:03,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=525286.6666666666, ans=0.125
2024-06-22 04:23:04,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525286.6666666666, ans=0.1
2024-06-22 04:23:09,055 INFO [train.py:1028] (1/2) Epoch 29, batch 3250, loss[loss=0.1547, simple_loss=0.2129, pruned_loss=0.04821, over 13257.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.233, pruned_loss=0.05872, over 2587040.27 frames. ], batch size: 72, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:23:11,222 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=525305.0, ans=0.04949747468305833
2024-06-22 04:23:11,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=525305.0, ans=0.125
2024-06-22 04:23:19,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=525323.3333333334, ans=0.2
2024-06-22 04:23:21,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=525341.6666666666, ans=0.125
2024-06-22 04:23:22,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=525341.6666666666, ans=0.2
2024-06-22 04:23:35,262 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=525360.0, ans=0.0
2024-06-22 04:23:37,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=525360.0, ans=0.125
2024-06-22 04:23:45,880 INFO [train.py:1028] (1/2) Epoch 29, batch 3300, loss[loss=0.1773, simple_loss=0.235, pruned_loss=0.05978, over 12717.00 frames. ], tot_loss[loss=0.1745, simple_loss=0.2322, pruned_loss=0.05839, over 2583666.04 frames. ], batch size: 176, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:23:48,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=525396.6666666666, ans=0.125
2024-06-22 04:23:49,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=525396.6666666666, ans=0.125
2024-06-22 04:24:00,180 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 2.296e+02 2.474e+02 2.646e+02 3.545e+02, threshold=4.948e+02, percent-clipped=0.0
2024-06-22 04:24:02,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=525433.3333333334, ans=0.125
2024-06-22 04:24:06,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=525451.6666666666, ans=0.2
2024-06-22 04:24:07,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=525451.6666666666, ans=0.025
2024-06-22 04:24:14,199 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.47 vs. limit=22.5
2024-06-22 04:24:18,970 INFO [train.py:1028] (1/2) Epoch 29, batch 3350, loss[loss=0.1833, simple_loss=0.2403, pruned_loss=0.06309, over 12938.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2314, pruned_loss=0.05831, over 2578654.36 frames. ], batch size: 158, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:24:19,498 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0
2024-06-22 04:24:26,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=525506.6666666666, ans=0.0
2024-06-22 04:24:35,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=525525.0, ans=0.04949747468305833
2024-06-22 04:24:47,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=525543.3333333334, ans=0.125
2024-06-22 04:24:49,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=525561.6666666666, ans=0.125
2024-06-22 04:24:54,975 INFO [train.py:1028] (1/2) Epoch 29, batch 3400, loss[loss=0.1875, simple_loss=0.2531, pruned_loss=0.06096, over 12651.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2312, pruned_loss=0.05853, over 2575868.28 frames. ], batch size: 22, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:25:08,944 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.331e+02 2.488e+02 2.682e+02 3.920e+02, threshold=4.975e+02, percent-clipped=0.0
2024-06-22 04:25:24,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=525653.3333333334, ans=0.125
2024-06-22 04:25:30,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=525653.3333333334, ans=0.0
2024-06-22 04:25:31,623 INFO [train.py:1028] (1/2) Epoch 29, batch 3450, loss[loss=0.1843, simple_loss=0.2413, pruned_loss=0.06367, over 12767.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2311, pruned_loss=0.05849, over 2578134.46 frames. ], batch size: 176, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:25:39,123 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.94 vs. limit=6.0
2024-06-22 04:25:56,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=525726.6666666666, ans=0.025
2024-06-22 04:26:04,636 INFO [train.py:1028] (1/2) Epoch 29, batch 3500, loss[loss=0.1562, simple_loss=0.2137, pruned_loss=0.04934, over 12940.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2307, pruned_loss=0.05827, over 2576616.72 frames. ], batch size: 33, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:26:18,798 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.237e+02 2.354e+02 2.539e+02 3.070e+02, threshold=4.709e+02, percent-clipped=0.0
2024-06-22 04:26:32,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525836.6666666666, ans=0.1
2024-06-22 04:26:40,997 INFO [train.py:1028] (1/2) Epoch 29, batch 3550, loss[loss=0.1676, simple_loss=0.2182, pruned_loss=0.05853, over 13199.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2301, pruned_loss=0.05787, over 2577591.36 frames. ], batch size: 95, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:26:43,151 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=525855.0, ans=0.0
2024-06-22 04:26:59,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=525910.0, ans=0.125
2024-06-22 04:27:03,899 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0
2024-06-22 04:27:13,322 INFO [train.py:1028] (1/2) Epoch 29, batch 3600, loss[loss=0.1718, simple_loss=0.2335, pruned_loss=0.05507, over 13300.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2297, pruned_loss=0.05791, over 2580732.59 frames. ], batch size: 49, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:27:21,605 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0
2024-06-22 04:27:24,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=525965.0, ans=0.125
2024-06-22 04:27:24,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=525965.0, ans=0.0
2024-06-22 04:27:30,480 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.342e+02 2.497e+02 2.754e+02 4.411e+02, threshold=4.994e+02, percent-clipped=0.0
2024-06-22 04:27:39,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=526001.6666666666, ans=0.0
2024-06-22 04:27:43,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=526020.0, ans=0.0
2024-06-22 04:27:46,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526020.0, ans=0.1
2024-06-22 04:27:49,279 INFO [train.py:1028] (1/2) Epoch 29, batch 3650, loss[loss=0.1718, simple_loss=0.2248, pruned_loss=0.05946, over 13028.00 frames. ], tot_loss[loss=0.1723, simple_loss=0.2294, pruned_loss=0.05757, over 2578528.77 frames. ], batch size: 102, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:27:52,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=526038.3333333334, ans=0.09899494936611666
2024-06-22 04:28:00,760 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.66 vs. limit=22.5
2024-06-22 04:28:06,210 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=526075.0, ans=0.125
2024-06-22 04:28:13,055 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:28:14,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=526093.3333333334, ans=0.0
2024-06-22 04:28:22,104 INFO [train.py:1028] (1/2) Epoch 29, batch 3700, loss[loss=0.1559, simple_loss=0.2175, pruned_loss=0.04717, over 13245.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.229, pruned_loss=0.05765, over 2584393.39 frames. ], batch size: 72, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:28:24,389 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=526130.0, ans=0.0
2024-06-22 04:28:32,545 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.18 vs. limit=15.0
2024-06-22 04:28:36,041 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.281e+02 2.384e+02 2.618e+02 3.123e+02, threshold=4.768e+02, percent-clipped=0.0
2024-06-22 04:28:38,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=526166.6666666666, ans=0.0
2024-06-22 04:28:38,788 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=526166.6666666666, ans=0.0
2024-06-22 04:28:41,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=526185.0, ans=0.0
2024-06-22 04:28:57,536 INFO [train.py:1028] (1/2) Epoch 29, batch 3750, loss[loss=0.1889, simple_loss=0.2504, pruned_loss=0.06374, over 12609.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2286, pruned_loss=0.05721, over 2586497.29 frames. ], batch size: 22, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:29:00,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=526221.6666666666, ans=0.125
2024-06-22 04:29:11,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=526258.3333333334, ans=0.1
2024-06-22 04:29:12,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=526258.3333333334, ans=0.125
2024-06-22 04:29:12,542 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0
2024-06-22 04:29:16,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=526276.6666666666, ans=0.2
2024-06-22 04:29:17,078 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.12 vs. limit=15.0
2024-06-22 04:29:30,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=526295.0, ans=0.2
2024-06-22 04:29:32,856 INFO [train.py:1028] (1/2) Epoch 29, batch 3800, loss[loss=0.1836, simple_loss=0.2312, pruned_loss=0.06805, over 13240.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2286, pruned_loss=0.05718, over 2584290.06 frames. ], batch size: 83, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:29:34,562 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0
2024-06-22 04:29:35,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=526313.3333333334, ans=0.2
2024-06-22 04:29:36,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=526313.3333333334, ans=0.0
2024-06-22 04:29:37,781 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.98 vs. limit=6.0
2024-06-22 04:29:40,423 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.89 vs. limit=6.0
2024-06-22 04:29:46,327 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.276e+02 2.400e+02 2.630e+02 3.441e+02, threshold=4.801e+02, percent-clipped=0.0
2024-06-22 04:30:01,866 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.19 vs. limit=10.0
2024-06-22 04:30:02,695 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.30 vs. limit=22.5
2024-06-22 04:30:03,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=526386.6666666666, ans=0.125
2024-06-22 04:30:05,618 INFO [train.py:1028] (1/2) Epoch 29, batch 3850, loss[loss=0.1574, simple_loss=0.2112, pruned_loss=0.0518, over 13041.00 frames. ], tot_loss[loss=0.1702, simple_loss=0.2276, pruned_loss=0.05643, over 2583873.41 frames. ], batch size: 144, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:30:09,083 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=526405.0, ans=0.125
2024-06-22 04:30:10,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=526405.0, ans=0.125
2024-06-22 04:30:19,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=526441.6666666666, ans=0.0
2024-06-22 04:30:21,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526441.6666666666, ans=0.1
2024-06-22 04:30:22,534 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=526441.6666666666, ans=0.2
2024-06-22 04:30:25,038 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=526460.0, ans=0.025
2024-06-22 04:30:29,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=526460.0, ans=0.125
2024-06-22 04:30:36,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=526478.3333333334, ans=0.0
2024-06-22 04:30:38,097 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.12 vs. limit=15.0
2024-06-22 04:30:39,071 INFO [train.py:1028] (1/2) Epoch 29, batch 3900, loss[loss=0.1763, simple_loss=0.2266, pruned_loss=0.06296, over 13186.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2287, pruned_loss=0.05689, over 2587049.46 frames. ], batch size: 83, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:30:46,812 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:30:47,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=526515.0, ans=0.025
2024-06-22 04:30:48,862 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=526515.0, ans=0.0
2024-06-22 04:30:50,644 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=526515.0, ans=0.0
2024-06-22 04:30:56,314 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.223e+02 2.363e+02 2.548e+02 3.559e+02, threshold=4.726e+02, percent-clipped=0.0
2024-06-22 04:31:11,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=526570.0, ans=0.0
2024-06-22 04:31:13,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.16 vs. limit=22.5
2024-06-22 04:31:16,327 INFO [train.py:1028] (1/2) Epoch 29, batch 3950, loss[loss=0.1807, simple_loss=0.2243, pruned_loss=0.06854, over 13102.00 frames. ], tot_loss[loss=0.1702, simple_loss=0.2277, pruned_loss=0.0564, over 2588589.08 frames. ], batch size: 132, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:31:23,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0
2024-06-22 04:31:24,850 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=526606.6666666666, ans=0.125
2024-06-22 04:31:32,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=526625.0, ans=0.0
2024-06-22 04:31:33,882 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=526625.0, ans=0.125
2024-06-22 04:31:36,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526625.0, ans=0.1
2024-06-22 04:31:37,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=526625.0, ans=15.0
2024-06-22 04:31:38,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=526625.0, ans=0.125
2024-06-22 04:31:38,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=526625.0, ans=0.2
2024-06-22 04:31:42,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=526643.3333333334, ans=10.0
2024-06-22 04:31:51,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.18 vs. limit=15.0
2024-06-22 04:31:53,060 INFO [train.py:1028] (1/2) Epoch 29, batch 4000, loss[loss=0.1712, simple_loss=0.2311, pruned_loss=0.05571, over 12969.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2269, pruned_loss=0.05627, over 2584033.80 frames. ], batch size: 39, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:31:57,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=526680.0, ans=0.125
2024-06-22 04:32:06,885 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.311e+02 2.432e+02 2.634e+02 3.544e+02, threshold=4.864e+02, percent-clipped=0.0
2024-06-22 04:32:13,487 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. limit=6.0
2024-06-22 04:32:14,730 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.58 vs. limit=15.0
2024-06-22 04:32:14,748 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=15.0
2024-06-22 04:32:17,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=526735.0, ans=0.125
2024-06-22 04:32:23,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=526753.3333333334, ans=0.125
2024-06-22 04:32:24,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=526753.3333333334, ans=0.025
2024-06-22 04:32:25,237 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:32:26,350 INFO [train.py:1028] (1/2) Epoch 29, batch 4050, loss[loss=0.1831, simple_loss=0.2297, pruned_loss=0.06822, over 10947.00 frames. ], tot_loss[loss=0.1702, simple_loss=0.2272, pruned_loss=0.05661, over 2581562.91 frames. ], batch size: 303, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:32:32,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=526790.0, ans=0.125
2024-06-22 04:32:37,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5
2024-06-22 04:32:39,205 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.07 vs. limit=15.0
2024-06-22 04:32:41,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=526808.3333333334, ans=0.2
2024-06-22 04:32:45,396 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=526826.6666666666, ans=0.125
2024-06-22 04:32:57,451 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=526845.0, ans=0.0
2024-06-22 04:33:01,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=526845.0, ans=0.125
2024-06-22 04:33:03,533 INFO [train.py:1028] (1/2) Epoch 29, batch 4100, loss[loss=0.1725, simple_loss=0.2245, pruned_loss=0.06027, over 13179.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2273, pruned_loss=0.057, over 2577400.16 frames. ], batch size: 103, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:33:07,447 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0
2024-06-22 04:33:12,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=526881.6666666666, ans=0.0
2024-06-22 04:33:15,457 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=526881.6666666666, ans=0.125
2024-06-22 04:33:17,965 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.284e+02 2.397e+02 2.647e+02 3.168e+02, threshold=4.794e+02, percent-clipped=0.0
2024-06-22 04:33:20,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=526900.0, ans=0.2
2024-06-22 04:33:24,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=526918.3333333334, ans=0.5
2024-06-22 04:33:34,981 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=526936.6666666666, ans=0.0
2024-06-22 04:33:42,047 INFO [train.py:1028] (1/2) Epoch 29, batch 4150, loss[loss=0.17, simple_loss=0.2312, pruned_loss=0.05446, over 13175.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2271, pruned_loss=0.05671, over 2576001.57 frames. ], batch size: 55, lr: 1.99e-03, grad_scale: 32.0
2024-06-22 04:33:42,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=526955.0, ans=0.0
2024-06-22 04:33:44,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=526955.0, ans=0.09899494936611666
2024-06-22 04:33:50,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=526973.3333333334, ans=0.0
2024-06-22 04:34:15,432 INFO [train.py:1028] (1/2) Epoch 29, batch 4200, loss[loss=0.1746, simple_loss=0.2247, pruned_loss=0.06227, over 13220.00 frames. ], tot_loss[loss=0.1702, simple_loss=0.2269, pruned_loss=0.05678, over 2579274.02 frames. ], batch size: 103, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:34:21,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=527065.0, ans=0.025
2024-06-22 04:34:25,508 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527065.0, ans=0.1
2024-06-22 04:34:29,331 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.227e+02 2.373e+02 2.534e+02 3.280e+02, threshold=4.745e+02, percent-clipped=0.0
2024-06-22 04:34:48,846 INFO [train.py:1028] (1/2) Epoch 29, batch 4250, loss[loss=0.1575, simple_loss=0.2244, pruned_loss=0.04526, over 13386.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2262, pruned_loss=0.0562, over 2581593.13 frames. ], batch size: 46, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:35:01,910 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.24 vs. limit=15.0
2024-06-22 04:35:07,149 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.04 vs. limit=15.0
2024-06-22 04:35:12,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=527193.3333333334, ans=0.025
2024-06-22 04:35:24,618 INFO [train.py:1028] (1/2) Epoch 29, batch 4300, loss[loss=0.1803, simple_loss=0.2366, pruned_loss=0.06203, over 13168.00 frames. ], tot_loss[loss=0.169, simple_loss=0.2257, pruned_loss=0.05616, over 2582558.82 frames. ], batch size: 59, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:35:28,211 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.04 vs. limit=15.0
2024-06-22 04:35:32,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=527248.3333333334, ans=15.0
2024-06-22 04:35:34,212 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.81 vs. limit=15.0
2024-06-22 04:35:41,777 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.249e+02 2.335e+02 2.591e+02 4.091e+02, threshold=4.670e+02, percent-clipped=0.0
2024-06-22 04:35:49,211 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=527285.0, ans=0.125
2024-06-22 04:35:51,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.89 vs. limit=15.0
2024-06-22 04:36:00,364 INFO [train.py:1028] (1/2) Epoch 29, batch 4350, loss[loss=0.172, simple_loss=0.2262, pruned_loss=0.05897, over 13261.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2252, pruned_loss=0.05621, over 2586777.36 frames. ], batch size: 59, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:36:03,597 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.62 vs. limit=22.5
2024-06-22 04:36:23,507 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.77 vs. limit=15.0
2024-06-22 04:36:26,394 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=527395.0, ans=0.0
2024-06-22 04:36:27,509 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0
2024-06-22 04:36:33,622 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.31 vs. limit=22.5
2024-06-22 04:36:33,886 INFO [train.py:1028] (1/2) Epoch 29, batch 4400, loss[loss=0.1572, simple_loss=0.2111, pruned_loss=0.0516, over 13257.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2252, pruned_loss=0.05632, over 2587483.32 frames. ], batch size: 83, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:36:35,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=527413.3333333334, ans=0.125
2024-06-22 04:36:37,350 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0
2024-06-22 04:36:44,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=527431.6666666666, ans=0.125
2024-06-22 04:36:46,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=527450.0, ans=0.1
2024-06-22 04:36:47,256 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.303e+02 2.449e+02 2.649e+02 3.365e+02, threshold=4.898e+02, percent-clipped=0.0
2024-06-22 04:37:00,310 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527486.6666666666, ans=0.1
2024-06-22 04:37:11,781 INFO [train.py:1028] (1/2) Epoch 29, batch 4450, loss[loss=0.1813, simple_loss=0.2402, pruned_loss=0.0612, over 12819.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2255, pruned_loss=0.05656, over 2581705.60 frames. ], batch size: 33, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:37:14,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=527505.0, ans=0.125
2024-06-22 04:37:18,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=527523.3333333334, ans=0.0
2024-06-22 04:37:19,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=527523.3333333334, ans=0.125
2024-06-22 04:37:46,411 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0
2024-06-22 04:37:47,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=527596.6666666666, ans=0.2
2024-06-22 04:37:47,952 INFO [train.py:1028] (1/2) Epoch 29, batch 4500, loss[loss=0.1622, simple_loss=0.2183, pruned_loss=0.05305, over 13272.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.225, pruned_loss=0.05654, over 2586565.28 frames. ], batch size: 89, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:37:52,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=527596.6666666666, ans=0.125
2024-06-22 04:38:02,146 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.236e+02 2.334e+02 2.463e+02 3.396e+02, threshold=4.668e+02, percent-clipped=0.0
2024-06-22 04:38:06,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=527633.3333333334, ans=0.0
2024-06-22 04:38:17,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=527670.0, ans=0.2
2024-06-22 04:38:20,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=527670.0, ans=0.025
2024-06-22 04:38:21,765 INFO [train.py:1028] (1/2) Epoch 29, batch 4550, loss[loss=0.1545, simple_loss=0.2145, pruned_loss=0.04719, over 13240.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.2252, pruned_loss=0.05648, over 2589945.06 frames. ], batch size: 52, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:38:25,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=527688.3333333334, ans=0.1
2024-06-22 04:38:42,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.75 vs. limit=15.0
2024-06-22 04:38:43,768 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=527743.3333333334, ans=0.0
2024-06-22 04:38:46,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=527743.3333333334, ans=0.0
2024-06-22 04:38:47,897 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.49 vs. limit=22.5
2024-06-22 04:38:50,819 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=527761.6666666666, ans=0.125
2024-06-22 04:38:52,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0
2024-06-22 04:38:54,139 INFO [train.py:1028] (1/2) Epoch 29, batch 4600, loss[loss=0.1887, simple_loss=0.2382, pruned_loss=0.06958, over 12538.00 frames. ], tot_loss[loss=0.1696, simple_loss=0.2258, pruned_loss=0.05676, over 2584926.87 frames. ], batch size: 202, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:39:06,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=527798.3333333334, ans=0.125
2024-06-22 04:39:10,739 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.306e+02 2.444e+02 2.636e+02 3.242e+02, threshold=4.889e+02, percent-clipped=0.0
2024-06-22 04:39:20,675 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.16 vs. limit=6.0
2024-06-22 04:39:21,005 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=527835.0, ans=0.0
2024-06-22 04:39:29,265 INFO [train.py:1028] (1/2) Epoch 29, batch 4650, loss[loss=0.1745, simple_loss=0.2183, pruned_loss=0.06537, over 13106.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2247, pruned_loss=0.05634, over 2589003.63 frames. ], batch size: 132, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:39:40,240 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=527890.0, ans=10.0
2024-06-22 04:39:47,745 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0
2024-06-22 04:40:05,381 INFO [train.py:1028] (1/2) Epoch 29, batch 4700, loss[loss=0.1508, simple_loss=0.2085, pruned_loss=0.04652, over 12434.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.2244, pruned_loss=0.05627, over 2584291.00 frames. ], batch size: 25, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:40:24,277 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.254e+02 2.407e+02 2.567e+02 3.679e+02, threshold=4.813e+02, percent-clipped=0.0
2024-06-22 04:40:25,828 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.57 vs. limit=12.0
2024-06-22 04:40:41,378 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=528036.6666666666, ans=0.125
2024-06-22 04:40:43,272 INFO [train.py:1028] (1/2) Epoch 29, batch 4750, loss[loss=0.177, simple_loss=0.2238, pruned_loss=0.0651, over 12632.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2245, pruned_loss=0.05666, over 2580819.88 frames. ], batch size: 202, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:40:47,659 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.49 vs. limit=10.0
2024-06-22 04:40:51,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=528073.3333333334, ans=0.0
2024-06-22 04:40:59,410 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0
2024-06-22 04:41:05,122 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=528091.6666666666, ans=0.125
2024-06-22 04:41:06,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.90 vs. limit=6.0
2024-06-22 04:41:07,027 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:41:07,791 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=528110.0, ans=0.0
2024-06-22 04:41:13,601 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.43 vs. limit=15.0
2024-06-22 04:41:20,086 INFO [train.py:1028] (1/2) Epoch 29, batch 4800, loss[loss=0.1683, simple_loss=0.2247, pruned_loss=0.05594, over 13273.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.2246, pruned_loss=0.05681, over 2576499.12 frames. ], batch size: 63, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:41:24,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.58 vs. limit=10.0
2024-06-22 04:41:34,161 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.194e+02 2.371e+02 2.592e+02 3.538e+02, threshold=4.742e+02, percent-clipped=0.0
2024-06-22 04:41:50,794 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.55 vs. limit=15.0
2024-06-22 04:41:55,980 INFO [train.py:1028] (1/2) Epoch 29, batch 4850, loss[loss=0.1697, simple_loss=0.2235, pruned_loss=0.05793, over 13195.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2242, pruned_loss=0.05647, over 2573911.94 frames. ], batch size: 89, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:42:08,469 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:42:16,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0
2024-06-22 04:42:26,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.73 vs. limit=15.0
2024-06-22 04:42:27,827 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.68 vs. limit=10.0
2024-06-22 04:42:29,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.30 vs. limit=10.0
2024-06-22 04:42:29,380 INFO [train.py:1028] (1/2) Epoch 29, batch 4900, loss[loss=0.1549, simple_loss=0.2203, pruned_loss=0.04471, over 13148.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2239, pruned_loss=0.0562, over 2575229.17 frames. ], batch size: 59, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:42:30,180 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=528330.0, ans=0.125
2024-06-22 04:42:32,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=528330.0, ans=0.0
2024-06-22 04:42:36,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=528348.3333333334, ans=0.0
2024-06-22 04:42:39,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528348.3333333334, ans=0.1
2024-06-22 04:42:42,631 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:42:43,105 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 2.205e+02 2.336e+02 2.528e+02 3.228e+02, threshold=4.673e+02, percent-clipped=0.0
2024-06-22 04:42:44,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=528366.6666666666, ans=0.125
2024-06-22 04:42:50,501 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=528385.0, ans=0.0
2024-06-22 04:42:51,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=528385.0, ans=0.125
2024-06-22 04:42:53,977 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=528385.0, ans=0.0
2024-06-22 04:43:05,561 INFO [train.py:1028] (1/2) Epoch 29, batch 4950, loss[loss=0.1907, simple_loss=0.2324, pruned_loss=0.07449, over 11046.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2245, pruned_loss=0.05665, over 2569485.30 frames. ], batch size: 303, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:43:08,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=528421.6666666666, ans=0.09899494936611666
2024-06-22 04:43:13,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=528440.0, ans=0.025
2024-06-22 04:43:23,966 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=528458.3333333334, ans=0.0
2024-06-22 04:43:35,146 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=528495.0, ans=0.125
2024-06-22 04:43:35,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=528495.0, ans=0.125
2024-06-22 04:43:40,446 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=528495.0, ans=0.0
2024-06-22 04:43:41,607 INFO [train.py:1028] (1/2) Epoch 29, batch 5000, loss[loss=0.1753, simple_loss=0.2245, pruned_loss=0.06306, over 13177.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2246, pruned_loss=0.05654, over 2574265.61 frames. ], batch size: 95, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:43:48,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=528531.6666666666, ans=0.05
2024-06-22 04:43:51,993 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=528531.6666666666, ans=0.125
2024-06-22 04:43:55,882 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.198e+02 2.354e+02 2.504e+02 3.990e+02, threshold=4.708e+02, percent-clipped=0.0
2024-06-22 04:43:59,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=528550.0, ans=0.05
2024-06-22 04:44:00,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=528550.0, ans=0.125
2024-06-22 04:44:02,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=528568.3333333334, ans=0.125
2024-06-22 04:44:15,262 INFO [train.py:1028] (1/2) Epoch 29, batch 5050, loss[loss=0.1666, simple_loss=0.2282, pruned_loss=0.05255, over 12920.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2242, pruned_loss=0.05603, over 2572463.04 frames. ], batch size: 36, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:44:20,231 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=528605.0, ans=0.0
2024-06-22 04:44:20,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528605.0, ans=0.1
2024-06-22 04:44:22,493 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0
2024-06-22 04:44:36,678 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=528660.0, ans=0.125
2024-06-22 04:44:48,042 INFO [train.py:1028] (1/2) Epoch 29, batch 5100, loss[loss=0.1699, simple_loss=0.2345, pruned_loss=0.05267, over 12941.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2245, pruned_loss=0.05636, over 2569452.37 frames. ], batch size: 39, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:44:49,411 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=528696.6666666666, ans=0.0
2024-06-22 04:44:53,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=528696.6666666666, ans=0.125
2024-06-22 04:45:05,189 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.226e+02 2.357e+02 2.573e+02 3.326e+02, threshold=4.714e+02, percent-clipped=0.0
2024-06-22 04:45:10,751 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=528751.6666666666, ans=0.1
2024-06-22 04:45:24,008 INFO [train.py:1028] (1/2) Epoch 29, batch 5150, loss[loss=0.1645, simple_loss=0.2092, pruned_loss=0.0599, over 13084.00 frames. ], tot_loss[loss=0.1684, simple_loss=0.224, pruned_loss=0.05638, over 2571460.70 frames. ], batch size: 132, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:45:31,008 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=528788.3333333334, ans=0.125
2024-06-22 04:45:33,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=528788.3333333334, ans=0.04949747468305833
2024-06-22 04:45:38,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=528806.6666666666, ans=0.125
2024-06-22 04:45:41,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=528825.0, ans=0.2
2024-06-22 04:45:42,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0
2024-06-22 04:45:49,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=528843.3333333334, ans=0.125
2024-06-22 04:45:51,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.62 vs. limit=10.0
2024-06-22 04:45:59,055 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=528861.6666666666, ans=0.1
2024-06-22 04:46:00,746 INFO [train.py:1028] (1/2) Epoch 29, batch 5200, loss[loss=0.1617, simple_loss=0.2148, pruned_loss=0.0543, over 13154.00 frames. ], tot_loss[loss=0.1679, simple_loss=0.2236, pruned_loss=0.05608, over 2575569.29 frames. ], batch size: 95, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:46:00,992 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:46:05,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0
2024-06-22 04:46:08,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=528898.3333333334, ans=0.0
2024-06-22 04:46:12,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=528898.3333333334, ans=0.0
2024-06-22 04:46:14,587 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.229e+02 2.356e+02 2.469e+02 3.093e+02, threshold=4.712e+02, percent-clipped=0.0
2024-06-22 04:46:19,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=528916.6666666666, ans=0.0
2024-06-22 04:46:31,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=528953.3333333334, ans=0.0
2024-06-22 04:46:31,837 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=528953.3333333334, ans=0.125
2024-06-22 04:46:34,635 INFO [train.py:1028] (1/2) Epoch 29, batch 5250, loss[loss=0.1706, simple_loss=0.2268, pruned_loss=0.05719, over 13245.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.2239, pruned_loss=0.0565, over 2570534.63 frames. ], batch size: 52, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:46:46,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.36 vs. limit=22.5
2024-06-22 04:46:47,365 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0
2024-06-22 04:46:49,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=529008.3333333334, ans=0.0
2024-06-22 04:46:50,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=529008.3333333334, ans=0.0
2024-06-22 04:46:51,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=529008.3333333334, ans=0.0
2024-06-22 04:47:05,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529026.6666666666, ans=0.1
2024-06-22 04:47:13,438 INFO [train.py:1028] (1/2) Epoch 29, batch 5300, loss[loss=0.1614, simple_loss=0.2148, pruned_loss=0.05406, over 13060.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2234, pruned_loss=0.05613, over 2567503.49 frames. ], batch size: 144, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:47:16,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=529063.3333333334, ans=0.025
2024-06-22 04:47:21,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=529081.6666666666, ans=0.025
2024-06-22 04:47:27,571 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.263e+02 2.362e+02 2.545e+02 3.339e+02, threshold=4.725e+02, percent-clipped=0.0
2024-06-22 04:47:38,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=529118.3333333334, ans=0.125
2024-06-22 04:47:38,897 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0
2024-06-22 04:47:46,672 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=529136.6666666666, ans=0.125
2024-06-22 04:47:50,719 INFO [train.py:1028] (1/2) Epoch 29, batch 5350, loss[loss=0.1608, simple_loss=0.2281, pruned_loss=0.04681, over 11389.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2235, pruned_loss=0.05608, over 2574338.33 frames. ], batch size: 16, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:47:54,878 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=529155.0, ans=0.125
2024-06-22 04:47:55,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=529155.0, ans=0.125
2024-06-22 04:47:55,823 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0
2024-06-22 04:48:03,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=529191.6666666666, ans=0.125
2024-06-22 04:48:05,474 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0
2024-06-22 04:48:05,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=529191.6666666666, ans=0.125
2024-06-22 04:48:13,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=529210.0, ans=0.125
2024-06-22 04:48:14,461 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=529210.0, ans=0.125
2024-06-22 04:48:20,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=529228.3333333334, ans=10.0
2024-06-22 04:48:23,121 INFO [train.py:1028] (1/2) Epoch 29, batch 5400, loss[loss=0.1745, simple_loss=0.2209, pruned_loss=0.06407, over 12213.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2236, pruned_loss=0.05639, over 2566884.06 frames. ], batch size: 241, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:48:25,293 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=529246.6666666666, ans=0.2
2024-06-22 04:48:29,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=529265.0, ans=0.125
2024-06-22 04:48:33,757 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=529265.0, ans=0.0
2024-06-22 04:48:36,859 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.242e+02 2.378e+02 2.630e+02 3.499e+02, threshold=4.756e+02, percent-clipped=0.0
2024-06-22 04:48:43,326 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.61 vs. limit=15.0
2024-06-22 04:48:53,499 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=15.0
2024-06-22 04:48:59,377 INFO [train.py:1028] (1/2) Epoch 29, batch 5450, loss[loss=0.1658, simple_loss=0.2187, pruned_loss=0.05643, over 12329.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2237, pruned_loss=0.05623, over 2569707.75 frames. ], batch size: 25, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:49:00,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=529338.3333333334, ans=0.125
2024-06-22 04:49:03,013 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0
2024-06-22 04:49:11,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.19 vs. limit=10.0
2024-06-22 04:49:23,042 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 04:49:35,243 INFO [train.py:1028] (1/2) Epoch 29, batch 5500, loss[loss=0.188, simple_loss=0.2325, pruned_loss=0.07173, over 12254.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2239, pruned_loss=0.05631, over 2562967.31 frames. ], batch size: 240, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:49:35,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=529430.0, ans=0.125
2024-06-22 04:49:40,971 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=529448.3333333334, ans=0.0
2024-06-22 04:49:48,860 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.223e+02 2.348e+02 2.538e+02 2.951e+02, threshold=4.697e+02, percent-clipped=0.0
2024-06-22 04:49:58,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=529485.0, ans=0.0
2024-06-22 04:50:07,794 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=529521.6666666666, ans=0.0
2024-06-22 04:50:08,224 INFO [train.py:1028] (1/2) Epoch 29, batch 5550, loss[loss=0.162, simple_loss=0.227, pruned_loss=0.04847, over 13272.00 frames. ], tot_loss[loss=0.1675, simple_loss=0.2233, pruned_loss=0.0559, over 2567832.74 frames. ], batch size: 43, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:50:12,453 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0
2024-06-22 04:50:12,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=529521.6666666666, ans=0.125
2024-06-22 04:50:16,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=529540.0, ans=0.125
2024-06-22 04:50:16,824 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=529540.0, ans=0.0
2024-06-22 04:50:16,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=529540.0, ans=0.125
2024-06-22 04:50:24,367 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=529558.3333333334, ans=0.125
2024-06-22 04:50:28,206 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=529576.6666666666, ans=0.125
2024-06-22 04:50:35,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=529595.0, ans=0.2
2024-06-22 04:50:36,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529595.0, ans=0.1
2024-06-22 04:50:36,669 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.22 vs. limit=15.0
2024-06-22 04:50:39,801 INFO [train.py:1028] (1/2) Epoch 29, batch 5600, loss[loss=0.1676, simple_loss=0.2172, pruned_loss=0.05902, over 13251.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2226, pruned_loss=0.05557, over 2570217.81 frames. ], batch size: 89, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:50:48,937 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0
2024-06-22 04:50:53,388 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=529650.0, ans=0.0
2024-06-22 04:50:53,852 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.287e+02 2.446e+02 2.598e+02 3.545e+02, threshold=4.892e+02, percent-clipped=0.0
2024-06-22 04:51:13,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=529686.6666666666, ans=0.0
2024-06-22 04:51:15,547 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=529705.0, ans=0.0
2024-06-22 04:51:15,966 INFO [train.py:1028] (1/2) Epoch 29, batch 5650, loss[loss=0.1754, simple_loss=0.2291, pruned_loss=0.06087, over 12507.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2228, pruned_loss=0.05558, over 2575841.09 frames. ], batch size: 202, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:51:16,973 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0
2024-06-22 04:51:20,935 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.21 vs. limit=15.0
2024-06-22 04:51:52,306 INFO [train.py:1028] (1/2) Epoch 29, batch 5700, loss[loss=0.169, simple_loss=0.2336, pruned_loss=0.05223, over 13261.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2226, pruned_loss=0.05567, over 2579289.02 frames. ], batch size: 63, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:52:04,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=529833.3333333334, ans=0.1
2024-06-22 04:52:05,424 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.310e+02 2.416e+02 2.647e+02 3.297e+02, threshold=4.832e+02, percent-clipped=0.0
2024-06-22 04:52:12,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=529851.6666666666, ans=0.0
2024-06-22 04:52:19,041 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=529870.0, ans=0.125
2024-06-22 04:52:19,665 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=529870.0, ans=0.025
2024-06-22 04:52:24,560 INFO [train.py:1028] (1/2) Epoch 29, batch 5750, loss[loss=0.1992, simple_loss=0.2462, pruned_loss=0.07608, over 12739.00 frames. ], tot_loss[loss=0.1675, simple_loss=0.2231, pruned_loss=0.05595, over 2579442.33 frames. ], batch size: 176, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:52:26,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=529888.3333333334, ans=0.0
2024-06-22 04:52:35,716 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=529906.6666666666, ans=0.0
2024-06-22 04:52:49,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=529943.3333333334, ans=0.2
2024-06-22 04:52:52,571 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0
2024-06-22 04:52:58,346 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=529961.6666666666, ans=0.125
2024-06-22 04:53:00,602 INFO [train.py:1028] (1/2) Epoch 29, batch 5800, loss[loss=0.1842, simple_loss=0.2359, pruned_loss=0.06626, over 12826.00 frames. ], tot_loss[loss=0.1692, simple_loss=0.2248, pruned_loss=0.05682, over 2578185.03 frames. ], batch size: 176, lr: 1.99e-03, grad_scale: 64.0
2024-06-22 04:53:01,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=529980.0, ans=0.125
2024-06-22 04:53:13,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.65 vs. limit=22.5
2024-06-22 04:53:13,978 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.327e+02 2.489e+02 2.704e+02 3.501e+02, threshold=4.979e+02, percent-clipped=0.0
2024-06-22 04:53:17,329 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=530016.6666666666, ans=0.0
2024-06-22 04:53:34,995 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=530053.3333333334, ans=0.125
2024-06-22 04:53:36,063 INFO [train.py:1028] (1/2) Epoch 29, batch 5850, loss[loss=0.1982, simple_loss=0.2492, pruned_loss=0.07364, over 12504.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2267, pruned_loss=0.05761, over 2576534.75 frames. ], batch size: 202, lr: 1.98e-03, grad_scale: 64.0
2024-06-22 04:53:42,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.12 vs. limit=10.0
2024-06-22 04:53:45,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=530090.0, ans=0.125
2024-06-22 04:53:47,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530090.0, ans=0.125
2024-06-22 04:53:52,813 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=530108.3333333334, ans=0.125
2024-06-22 04:53:56,321 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.19 vs. limit=15.0
2024-06-22 04:54:01,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=530145.0, ans=0.125
2024-06-22 04:54:05,299 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0
2024-06-22 04:54:08,402 INFO [train.py:1028] (1/2) Epoch 29, batch 5900, loss[loss=0.1675, simple_loss=0.2256, pruned_loss=0.05471, over 13101.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2287, pruned_loss=0.05821, over 2576447.98 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 64.0
2024-06-22 04:54:22,666 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.352e+02 2.530e+02 2.765e+02 4.168e+02, threshold=5.060e+02, percent-clipped=0.0
2024-06-22 04:54:32,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=530218.3333333334, ans=0.0
2024-06-22 04:54:41,500 INFO [train.py:1028] (1/2) Epoch 29, batch 5950, loss[loss=0.1588, simple_loss=0.2187, pruned_loss=0.04942, over 13120.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.2302, pruned_loss=0.0586, over 2580656.15 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 64.0
2024-06-22 04:55:01,924 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=530291.6666666666, ans=0.125
2024-06-22 04:55:09,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=530310.0, ans=0.125
2024-06-22 04:55:14,042 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=530328.3333333334, ans=0.125
2024-06-22 04:55:15,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=530328.3333333334, ans=0.125
2024-06-22 04:55:17,135 INFO [train.py:1028] (1/2) Epoch 29, batch 6000, loss[loss=0.2017, simple_loss=0.2569, pruned_loss=0.07322, over 12254.00 frames. ], tot_loss[loss=0.1744, simple_loss=0.2311, pruned_loss=0.05886, over 2573419.83 frames. ], batch size: 241, lr: 1.98e-03, grad_scale: 64.0
2024-06-22 04:55:17,135 INFO [train.py:1051] (1/2) Computing validation loss
2024-06-22 04:55:22,592 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.5946, 5.0784, 5.3540, 5.4348], device='cuda:1')
2024-06-22 04:55:26,071 INFO [train.py:1060] (1/2) Epoch 29, validation: loss=0.1938, simple_loss=0.2522, pruned_loss=0.06764, over 351949.00 frames.
2024-06-22 04:55:26,072 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB
2024-06-22 04:55:28,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=530346.6666666666, ans=0.125
2024-06-22 04:55:29,498 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=530346.6666666666, ans=0.025
2024-06-22 04:55:37,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.48 vs. limit=15.0
2024-06-22 04:55:39,911 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.348e+02 2.499e+02 2.704e+02 3.423e+02, threshold=4.998e+02, percent-clipped=0.0
2024-06-22 04:55:45,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=530401.6666666666, ans=0.2
2024-06-22 04:55:48,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530401.6666666666, ans=0.1
2024-06-22 04:55:53,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=530420.0, ans=0.09899494936611666
2024-06-22 04:55:55,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=530420.0, ans=0.0
2024-06-22 04:55:55,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=530420.0, ans=0.125
2024-06-22 04:55:59,139 INFO [train.py:1028] (1/2) Epoch 29, batch 6050, loss[loss=0.1801, simple_loss=0.2412, pruned_loss=0.05952, over 12914.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2327, pruned_loss=0.05921, over 2576887.18 frames. ], batch size: 39, lr: 1.98e-03, grad_scale: 64.0
], batch size: 39, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:55:59,241 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=530438.3333333334, ans=0.125 2024-06-22 04:56:01,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=530438.3333333334, ans=0.1 2024-06-22 04:56:05,372 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=530456.6666666666, ans=0.025 2024-06-22 04:56:07,007 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.56 vs. limit=15.0 2024-06-22 04:56:07,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=530456.6666666666, ans=0.025 2024-06-22 04:56:14,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530475.0, ans=0.1 2024-06-22 04:56:15,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2024-06-22 04:56:22,495 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=530493.3333333334, ans=0.125 2024-06-22 04:56:29,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=530511.6666666666, ans=0.125 2024-06-22 04:56:32,276 INFO [train.py:1028] (1/2) Epoch 29, batch 6100, loss[loss=0.1773, simple_loss=0.2309, pruned_loss=0.06189, over 13090.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.2338, pruned_loss=0.05958, over 2579619.08 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:56:33,879 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=530530.0, ans=0.2 2024-06-22 04:56:34,487 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=530530.0, ans=0.125 2024-06-22 04:56:41,871 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.43 vs. 
limit=22.5 2024-06-22 04:56:46,561 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.404e+02 2.542e+02 2.787e+02 3.764e+02, threshold=5.084e+02, percent-clipped=0.0 2024-06-22 04:56:47,354 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:56:47,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=530566.6666666666, ans=0.025 2024-06-22 04:56:48,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=530566.6666666666, ans=0.125 2024-06-22 04:56:49,804 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=530566.6666666666, ans=0.07 2024-06-22 04:56:50,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=530566.6666666666, ans=0.025 2024-06-22 04:56:54,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=530585.0, ans=0.0 2024-06-22 04:57:02,706 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=530603.3333333334, ans=0.2 2024-06-22 04:57:04,416 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.44 vs. limit=15.0 2024-06-22 04:57:06,811 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=530603.3333333334, ans=0.125 2024-06-22 04:57:09,342 INFO [train.py:1028] (1/2) Epoch 29, batch 6150, loss[loss=0.1892, simple_loss=0.2358, pruned_loss=0.07132, over 10925.00 frames. ], tot_loss[loss=0.178, simple_loss=0.2356, pruned_loss=0.06022, over 2577446.05 frames. ], batch size: 304, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:57:20,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=530640.0, ans=0.05 2024-06-22 04:57:22,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=530640.0, ans=0.125 2024-06-22 04:57:27,940 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.23 vs. limit=22.5 2024-06-22 04:57:28,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-22 04:57:45,343 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2024-06-22 04:57:45,668 INFO [train.py:1028] (1/2) Epoch 29, batch 6200, loss[loss=0.2064, simple_loss=0.2659, pruned_loss=0.0734, over 13253.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2369, pruned_loss=0.06089, over 2574166.66 frames. 
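The scaling.py:214 ScheduledFloat entries that dominate this log report the current value ("ans=...") of a named hyper-parameter, such as a skip-rate or balancer probability, at the given batch_count. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between breakpoints (the breakpoints below are invented for illustration):

def scheduled_float(batch_count, schedule):
    # schedule: [(batch_count, value), ...] sorted by batch_count.
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if batch_count <= x1:
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
    return schedule[-1][1]

# A skip-rate ramping 0.2 -> 0.0 over the first 4000 batches and flat afterwards
# would print ans=0.0 at batch_count ~530585, as in the entries above:
print(scheduled_float(530585.0, [(0.0, 0.2), (4000.0, 0.0)]))   # -> 0.0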
], batch size: 89, lr: 1.98e-03, grad_scale: 128.0 2024-06-22 04:57:54,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=530731.6666666666, ans=0.125 2024-06-22 04:58:00,160 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.503e+02 2.658e+02 3.131e+02 4.462e+02, threshold=5.316e+02, percent-clipped=0.0 2024-06-22 04:58:00,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=530750.0, ans=0.0 2024-06-22 04:58:05,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530768.3333333334, ans=0.125 2024-06-22 04:58:07,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=530768.3333333334, ans=0.2 2024-06-22 04:58:19,055 INFO [train.py:1028] (1/2) Epoch 29, batch 6250, loss[loss=0.1879, simple_loss=0.2389, pruned_loss=0.06842, over 13199.00 frames. ], tot_loss[loss=0.1798, simple_loss=0.2376, pruned_loss=0.06102, over 2568262.76 frames. ], batch size: 83, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:58:28,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530823.3333333334, ans=0.1 2024-06-22 04:58:29,593 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.36 vs. limit=22.5 2024-06-22 04:58:43,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=530860.0, ans=0.125 2024-06-22 04:58:49,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=530878.3333333334, ans=0.125 2024-06-22 04:58:51,206 INFO [train.py:1028] (1/2) Epoch 29, batch 6300, loss[loss=0.1804, simple_loss=0.2491, pruned_loss=0.0559, over 10880.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2391, pruned_loss=0.06144, over 2564193.13 frames. ], batch size: 16, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 04:58:53,425 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=530896.6666666666, ans=0.1 2024-06-22 04:59:00,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=530915.0, ans=0.0 2024-06-22 04:59:05,465 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.84 vs. limit=22.5 2024-06-22 04:59:09,813 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.394e+02 2.578e+02 2.772e+02 3.631e+02, threshold=5.156e+02, percent-clipped=0.0 2024-06-22 04:59:18,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=530951.6666666666, ans=0.0 2024-06-22 04:59:20,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.73 vs. 
limit=10.0 2024-06-22 04:59:24,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=530970.0, ans=0.125 2024-06-22 04:59:24,644 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.66 vs. limit=22.5 2024-06-22 04:59:31,600 INFO [train.py:1028] (1/2) Epoch 29, batch 6350, loss[loss=0.1861, simple_loss=0.2412, pruned_loss=0.0655, over 12542.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2404, pruned_loss=0.06173, over 2572793.09 frames. ], batch size: 202, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 04:59:43,081 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.54 vs. limit=22.5 2024-06-22 04:59:43,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=531006.6666666666, ans=0.0 2024-06-22 05:00:04,783 INFO [train.py:1028] (1/2) Epoch 29, batch 6400, loss[loss=0.1778, simple_loss=0.2381, pruned_loss=0.05878, over 13196.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2427, pruned_loss=0.06256, over 2573834.74 frames. ], batch size: 67, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:00:07,640 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=531080.0, ans=0.125 2024-06-22 05:00:11,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=531098.3333333334, ans=0.0 2024-06-22 05:00:13,270 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.53 vs. limit=22.5 2024-06-22 05:00:21,153 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.390e+02 2.537e+02 2.751e+02 3.479e+02, threshold=5.074e+02, percent-clipped=0.0 2024-06-22 05:00:28,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=531135.0, ans=0.125 2024-06-22 05:00:30,290 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.46 vs. limit=15.0 2024-06-22 05:00:38,816 INFO [train.py:1028] (1/2) Epoch 29, batch 6450, loss[loss=0.2038, simple_loss=0.2589, pruned_loss=0.07439, over 12602.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2439, pruned_loss=0.06306, over 2580631.05 frames. ], batch size: 202, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:00:40,539 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=531171.6666666666, ans=0.0 2024-06-22 05:00:43,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=531171.6666666666, ans=0.125 2024-06-22 05:00:49,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=531190.0, ans=0.125 2024-06-22 05:01:18,561 INFO [train.py:1028] (1/2) Epoch 29, batch 6500, loss[loss=0.2213, simple_loss=0.2616, pruned_loss=0.09049, over 10872.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2456, pruned_loss=0.0636, over 2583180.73 frames. 
], batch size: 304, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:01:23,800 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.79 vs. limit=15.0 2024-06-22 05:01:24,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2024-06-22 05:01:27,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=531281.6666666666, ans=0.0 2024-06-22 05:01:37,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=531300.0, ans=0.0 2024-06-22 05:01:37,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=531300.0, ans=0.125 2024-06-22 05:01:39,789 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.392e+02 2.572e+02 2.825e+02 3.701e+02, threshold=5.145e+02, percent-clipped=0.0 2024-06-22 05:01:44,623 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=531318.3333333334, ans=0.125 2024-06-22 05:01:45,166 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:01:56,689 INFO [train.py:1028] (1/2) Epoch 29, batch 6550, loss[loss=0.1698, simple_loss=0.2367, pruned_loss=0.05143, over 12590.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.246, pruned_loss=0.06334, over 2586539.48 frames. ], batch size: 22, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:02:02,152 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.22 vs. limit=6.0 2024-06-22 05:02:04,654 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=531373.3333333334, ans=0.125 2024-06-22 05:02:10,380 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.63 vs. limit=10.0 2024-06-22 05:02:26,353 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=531428.3333333334, ans=0.2 2024-06-22 05:02:26,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=531428.3333333334, ans=0.1 2024-06-22 05:02:28,397 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.47 vs. limit=15.0 2024-06-22 05:02:29,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.47 vs. limit=15.0 2024-06-22 05:02:30,058 INFO [train.py:1028] (1/2) Epoch 29, batch 6600, loss[loss=0.1906, simple_loss=0.2587, pruned_loss=0.06127, over 13203.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2457, pruned_loss=0.06313, over 2589059.33 frames. 
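The Whitening entries from scaling.py:1023 compare a per-module statistic against a limit (6.0, 10.0, 15.0, 22.5 above). One standard way to measure how far features are from white is E[λ²]/E[λ]² over the eigenvalues λ of the feature covariance: exactly 1.0 for an isotropic covariance, larger as the spectrum spreads. The sketch below uses that statistic; whether it is the module's exact definition is an assumption.

import torch

def whitening_metric(x, num_groups=1):
    # x: (num_frames, num_channels); channels are split into num_groups groups,
    # matching the num_groups/num_channels fields in the log entries.
    num_frames, num_channels = x.shape
    per_group = num_channels // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * per_group:(g + 1) * per_group]
        cov = (xg.t() @ xg) / num_frames
        eig = torch.linalg.eigvalsh(cov)           # eigenvalues of the covariance
        metrics.append((eig ** 2).mean() / eig.mean().clamp_min(1e-20) ** 2)
    return torch.stack(metrics).mean()

print(whitening_metric(torch.randn(2000, 384)))    # near 1.0 for white noise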
], batch size: 72, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:02:30,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=531446.6666666666, ans=0.0 2024-06-22 05:02:32,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531446.6666666666, ans=0.1 2024-06-22 05:02:37,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=531465.0, ans=0.125 2024-06-22 05:02:39,617 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=531465.0, ans=0.0 2024-06-22 05:02:42,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=531465.0, ans=10.0 2024-06-22 05:02:43,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=531483.3333333334, ans=0.0 2024-06-22 05:02:43,976 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0 2024-06-22 05:02:46,127 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.123e+02 2.400e+02 2.578e+02 2.805e+02 3.906e+02, threshold=5.156e+02, percent-clipped=0.0 2024-06-22 05:02:47,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=531483.3333333334, ans=0.125 2024-06-22 05:02:57,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=531520.0, ans=0.0 2024-06-22 05:03:03,930 INFO [train.py:1028] (1/2) Epoch 29, batch 6650, loss[loss=0.2031, simple_loss=0.2571, pruned_loss=0.07457, over 12974.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2474, pruned_loss=0.06364, over 2584439.55 frames. ], batch size: 158, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:03:06,080 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=531538.3333333334, ans=0.125 2024-06-22 05:03:12,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2024-06-22 05:03:22,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=531575.0, ans=0.125 2024-06-22 05:03:29,676 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=531593.3333333334, ans=0.0 2024-06-22 05:03:31,020 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=531593.3333333334, ans=0.125 2024-06-22 05:03:33,443 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.61 vs. limit=15.0 2024-06-22 05:03:44,110 INFO [train.py:1028] (1/2) Epoch 29, batch 6700, loss[loss=0.2086, simple_loss=0.2714, pruned_loss=0.07296, over 12717.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2484, pruned_loss=0.06417, over 2584394.21 frames. 
], batch size: 177, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:03:55,908 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.77 vs. limit=15.0 2024-06-22 05:03:57,759 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=531666.6666666666, ans=0.2 2024-06-22 05:03:57,866 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=531666.6666666666, ans=0.0 2024-06-22 05:04:00,804 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.418e+02 2.540e+02 2.865e+02 4.425e+02, threshold=5.080e+02, percent-clipped=0.0 2024-06-22 05:04:17,793 INFO [train.py:1028] (1/2) Epoch 29, batch 6750, loss[loss=0.2492, simple_loss=0.2986, pruned_loss=0.09986, over 12178.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2492, pruned_loss=0.06447, over 2579863.88 frames. ], batch size: 240, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:04:29,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=531740.0, ans=0.125 2024-06-22 05:04:35,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.25 vs. limit=15.0 2024-06-22 05:04:41,526 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=531776.6666666666, ans=0.125 2024-06-22 05:04:41,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.93 vs. limit=22.5 2024-06-22 05:04:44,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531795.0, ans=0.1 2024-06-22 05:04:50,628 INFO [train.py:1028] (1/2) Epoch 29, batch 6800, loss[loss=0.1856, simple_loss=0.2515, pruned_loss=0.05991, over 13219.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2503, pruned_loss=0.06458, over 2582148.99 frames. 
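Reading the grad_scale field across the entries above: 128.0 at batch 6200, 64.0 at 6250, 32.0 at 6300, 16.0 from batch 6700, then back to 32.0 at batch 6800 just below. That is the signature of dynamic loss scaling in fp16 training: halve the scale when an overflow is detected, grow it again after a run of clean steps. A hedged sketch of the rule, in the spirit of torch.cuda.amp.GradScaler (the growth interval is an assumption):

class DynamicLossScale:
    def __init__(self, scale=128.0, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale /= 2.0          # 128.0 -> 64.0 -> 32.0 -> 16.0, as above
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= 2.0      # recovery, e.g. 16.0 -> 32.0
                self._good_steps = 0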
], batch size: 67, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:04:50,793 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=531813.3333333334, ans=0.0 2024-06-22 05:04:51,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=531813.3333333334, ans=0.125 2024-06-22 05:04:55,129 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=531813.3333333334, ans=0.0 2024-06-22 05:04:57,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=531831.6666666666, ans=0.125 2024-06-22 05:05:04,599 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531850.0, ans=0.1 2024-06-22 05:05:05,881 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531850.0, ans=0.1 2024-06-22 05:05:06,290 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.444e+02 2.607e+02 2.819e+02 4.067e+02, threshold=5.214e+02, percent-clipped=0.0 2024-06-22 05:05:10,586 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=531868.3333333334, ans=0.09899494936611666 2024-06-22 05:05:26,715 INFO [train.py:1028] (1/2) Epoch 29, batch 6850, loss[loss=0.1915, simple_loss=0.2601, pruned_loss=0.06145, over 13275.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.251, pruned_loss=0.06445, over 2586797.13 frames. ], batch size: 63, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:05:29,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=531905.0, ans=0.1 2024-06-22 05:05:30,029 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=531905.0, ans=0.1 2024-06-22 05:05:42,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=531941.6666666666, ans=0.1 2024-06-22 05:05:43,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531941.6666666666, ans=0.1 2024-06-22 05:05:44,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=531941.6666666666, ans=15.0 2024-06-22 05:05:51,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=531960.0, ans=0.0 2024-06-22 05:05:54,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=531960.0, ans=0.95 2024-06-22 05:05:58,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=531978.3333333334, ans=0.0 2024-06-22 05:06:00,670 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.19 vs. limit=15.0 2024-06-22 05:06:01,869 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=12.0 2024-06-22 05:06:02,835 INFO [train.py:1028] (1/2) Epoch 29, batch 6900, loss[loss=0.209, simple_loss=0.2681, pruned_loss=0.07492, over 13326.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2518, pruned_loss=0.06485, over 2587778.65 frames. ], batch size: 49, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:06:13,615 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.27 vs. limit=15.0 2024-06-22 05:06:13,989 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=532015.0, ans=0.125 2024-06-22 05:06:16,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=532033.3333333334, ans=0.07 2024-06-22 05:06:19,460 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.453e+02 2.631e+02 2.859e+02 3.828e+02, threshold=5.261e+02, percent-clipped=0.0 2024-06-22 05:06:19,671 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=532033.3333333334, ans=0.07 2024-06-22 05:06:26,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532051.6666666666, ans=0.1 2024-06-22 05:06:36,538 INFO [train.py:1028] (1/2) Epoch 29, batch 6950, loss[loss=0.1778, simple_loss=0.2407, pruned_loss=0.05745, over 12372.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2522, pruned_loss=0.06466, over 2582743.42 frames. ], batch size: 18, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:06:41,511 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.82 vs. limit=12.0 2024-06-22 05:06:55,211 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.815e+00 2024-06-22 05:06:59,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=532143.3333333334, ans=0.0 2024-06-22 05:07:01,805 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=532143.3333333334, ans=0.125 2024-06-22 05:07:03,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=532161.6666666666, ans=0.125 2024-06-22 05:07:07,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=532161.6666666666, ans=0.0 2024-06-22 05:07:08,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=532161.6666666666, ans=0.0 2024-06-22 05:07:09,753 INFO [train.py:1028] (1/2) Epoch 29, batch 7000, loss[loss=0.2228, simple_loss=0.2789, pruned_loss=0.08338, over 12956.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2524, pruned_loss=0.06461, over 2576982.78 frames. 
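The per-batch train.py:1028 entries report loss together with simple_loss and pruned_loss. The logged numbers are consistent with the usual pruned-transducer combination loss = 0.5 * simple_loss + pruned_loss; for the batch 7000 totals directly above, 0.5 * 0.2524 + 0.06461 = 0.19081, which prints as loss=0.1908. The 0.5 factor is inferred from the numbers, so treat the sketch below as a reconstruction rather than the recipe itself.

def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # Reconstructed from the logged values: loss = scale * simple + pruned.
    return simple_loss_scale * simple_loss + pruned_loss

# Epoch 29, batch 7000 totals from the entry above:
print(combined_loss(0.2524, 0.06461))   # -> 0.19081, logged as loss=0.1908

The same identity holds for the other tot_loss entries in this section, e.g. 0.5 * 0.2632 + 0.06861 ≈ 0.2002 at batch 8000.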
], batch size: 158, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:07:30,770 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=532216.6666666666, ans=10.0 2024-06-22 05:07:31,230 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.868e+02 2.467e+02 2.672e+02 2.949e+02 3.719e+02, threshold=5.343e+02, percent-clipped=0.0 2024-06-22 05:07:51,371 INFO [train.py:1028] (1/2) Epoch 29, batch 7050, loss[loss=0.2127, simple_loss=0.2699, pruned_loss=0.07774, over 12761.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2535, pruned_loss=0.06491, over 2584266.75 frames. ], batch size: 176, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:07:58,140 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.95 vs. limit=22.5 2024-06-22 05:08:05,361 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532308.3333333334, ans=0.1 2024-06-22 05:08:17,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=532345.0, ans=0.0 2024-06-22 05:08:23,849 INFO [train.py:1028] (1/2) Epoch 29, batch 7100, loss[loss=0.2065, simple_loss=0.2729, pruned_loss=0.07001, over 13207.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2539, pruned_loss=0.06511, over 2575693.58 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:08:27,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=532363.3333333334, ans=0.0 2024-06-22 05:08:38,550 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.05 vs. limit=15.0 2024-06-22 05:08:39,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=532400.0, ans=0.125 2024-06-22 05:08:40,060 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.586e+02 2.799e+02 3.039e+02 3.830e+02, threshold=5.597e+02, percent-clipped=0.0 2024-06-22 05:08:40,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=532400.0, ans=0.2 2024-06-22 05:08:43,766 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.59 vs. limit=15.0 2024-06-22 05:08:52,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=532436.6666666666, ans=0.2 2024-06-22 05:08:56,877 INFO [train.py:1028] (1/2) Epoch 29, batch 7150, loss[loss=0.2241, simple_loss=0.283, pruned_loss=0.08267, over 12491.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2548, pruned_loss=0.06517, over 2573804.82 frames. ], batch size: 202, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:08:57,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=532455.0, ans=0.0 2024-06-22 05:09:01,874 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.28 vs. 
limit=10.0 2024-06-22 05:09:04,100 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=532473.3333333334, ans=0.0 2024-06-22 05:09:07,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=532473.3333333334, ans=0.125 2024-06-22 05:09:09,375 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=532491.6666666666, ans=0.0 2024-06-22 05:09:20,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=532510.0, ans=0.125 2024-06-22 05:09:25,014 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532528.3333333334, ans=0.1 2024-06-22 05:09:25,893 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2024-06-22 05:09:27,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=532528.3333333334, ans=0.2 2024-06-22 05:09:29,817 INFO [train.py:1028] (1/2) Epoch 29, batch 7200, loss[loss=0.1947, simple_loss=0.2605, pruned_loss=0.06445, over 13176.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2554, pruned_loss=0.06538, over 2578827.27 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:09:33,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=532546.6666666666, ans=0.0 2024-06-22 05:09:45,763 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=532583.3333333334, ans=0.125 2024-06-22 05:09:48,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=532583.3333333334, ans=15.0 2024-06-22 05:09:49,540 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.498e+02 2.673e+02 3.022e+02 4.304e+02, threshold=5.345e+02, percent-clipped=0.0 2024-06-22 05:09:55,647 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=532601.6666666666, ans=0.0 2024-06-22 05:09:55,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=532601.6666666666, ans=0.125 2024-06-22 05:10:01,287 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=532620.0, ans=0.0 2024-06-22 05:10:06,618 INFO [train.py:1028] (1/2) Epoch 29, batch 7250, loss[loss=0.1778, simple_loss=0.2449, pruned_loss=0.05537, over 12993.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.256, pruned_loss=0.0656, over 2579495.74 frames. ], batch size: 36, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:10:10,945 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=532638.3333333334, ans=0.1 2024-06-22 05:10:16,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=532638.3333333334, ans=0.025 2024-06-22 05:10:35,746 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.25 vs. 
limit=22.5 2024-06-22 05:10:43,560 INFO [train.py:1028] (1/2) Epoch 29, batch 7300, loss[loss=0.1817, simple_loss=0.2455, pruned_loss=0.05897, over 12908.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.257, pruned_loss=0.06627, over 2580252.80 frames. ], batch size: 36, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:10:47,965 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.76 vs. limit=22.5 2024-06-22 05:10:53,662 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=532748.3333333334, ans=0.025 2024-06-22 05:10:54,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=532748.3333333334, ans=0.1 2024-06-22 05:10:59,887 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.426e+02 2.576e+02 2.812e+02 3.503e+02, threshold=5.152e+02, percent-clipped=0.0 2024-06-22 05:11:08,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=532785.0, ans=0.0 2024-06-22 05:11:14,455 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=532803.3333333334, ans=0.0 2024-06-22 05:11:16,241 INFO [train.py:1028] (1/2) Epoch 29, batch 7350, loss[loss=0.2138, simple_loss=0.2742, pruned_loss=0.07664, over 13265.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2576, pruned_loss=0.06657, over 2582770.51 frames. ], batch size: 46, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:11:21,729 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.41 vs. limit=15.0 2024-06-22 05:11:26,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=532840.0, ans=0.2 2024-06-22 05:11:28,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=532840.0, ans=0.025 2024-06-22 05:11:30,246 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=532858.3333333334, ans=0.125 2024-06-22 05:11:45,565 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532876.6666666666, ans=0.1 2024-06-22 05:11:48,239 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.79 vs. limit=22.5 2024-06-22 05:11:53,626 INFO [train.py:1028] (1/2) Epoch 29, batch 7400, loss[loss=0.2054, simple_loss=0.2752, pruned_loss=0.06775, over 13274.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2579, pruned_loss=0.06655, over 2587773.68 frames. 
], batch size: 63, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:11:54,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=532913.3333333334, ans=0.0 2024-06-22 05:11:58,553 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=532913.3333333334, ans=0.125 2024-06-22 05:11:59,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532931.6666666666, ans=0.1 2024-06-22 05:12:00,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=532931.6666666666, ans=15.0 2024-06-22 05:12:02,900 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:12:02,918 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=532931.6666666666, ans=0.0 2024-06-22 05:12:08,340 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=532950.0, ans=0.1 2024-06-22 05:12:10,499 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 2.485e+02 2.688e+02 2.937e+02 3.615e+02, threshold=5.377e+02, percent-clipped=0.0 2024-06-22 05:12:22,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=532968.3333333334, ans=0.125 2024-06-22 05:12:22,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=532968.3333333334, ans=0.0 2024-06-22 05:12:30,399 INFO [train.py:1028] (1/2) Epoch 29, batch 7450, loss[loss=0.2071, simple_loss=0.2788, pruned_loss=0.06772, over 12535.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.258, pruned_loss=0.06639, over 2581916.24 frames. ], batch size: 29, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:12:40,123 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.39 vs. limit=15.0 2024-06-22 05:12:43,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=533041.6666666666, ans=0.125 2024-06-22 05:12:51,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=533060.0, ans=0.0 2024-06-22 05:13:01,437 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=533078.3333333334, ans=0.5 2024-06-22 05:13:03,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=533096.6666666666, ans=0.125 2024-06-22 05:13:03,938 INFO [train.py:1028] (1/2) Epoch 29, batch 7500, loss[loss=0.205, simple_loss=0.2609, pruned_loss=0.07457, over 10386.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.259, pruned_loss=0.06704, over 2578248.30 frames. 
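The scaling.py:1119 WithLoss entries attach a loss-sum to particular attention-weight tensors; in this stretch it is mostly 0.000e+00, with one earlier entry at 1.815e+00. That pattern suggests an auxiliary penalty that is zero while the tensor stays within some bound and positive otherwise. A hedged sketch of one way to wire such a penalty (the penalty form, limit, and scale are all assumptions):

import torch

class WithAuxLoss(torch.nn.Module):
    # Returns the tensor unchanged plus an auxiliary loss the caller adds to
    # the training loss; last_loss_sum is what a log line would report.
    def __init__(self, limit=10.0, penalty_scale=1e-4):
        super().__init__()
        self.limit = limit
        self.penalty_scale = penalty_scale
        self.last_loss_sum = 0.0

    def forward(self, x):
        excess = (x.abs() - self.limit).clamp_min(0.0)   # zero while in range
        aux = self.penalty_scale * excess.sum()
        self.last_loss_sum = float(aux.detach())
        return x, aux

A module whose activations never leave the bound would keep reporting loss-sum=0.000e+00, exactly as most of these entries do.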
], batch size: 304, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:13:04,855 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533096.6666666666, ans=0.1 2024-06-22 05:13:04,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=533096.6666666666, ans=0.05 2024-06-22 05:13:07,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=533096.6666666666, ans=0.0 2024-06-22 05:13:15,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.18 vs. limit=15.0 2024-06-22 05:13:18,065 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=533133.3333333334, ans=0.125 2024-06-22 05:13:21,195 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 2.463e+02 2.636e+02 2.922e+02 4.248e+02, threshold=5.271e+02, percent-clipped=0.0 2024-06-22 05:13:21,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=533133.3333333334, ans=0.125 2024-06-22 05:13:25,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=533151.6666666666, ans=0.125 2024-06-22 05:13:27,637 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=533151.6666666666, ans=0.0 2024-06-22 05:13:31,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=533170.0, ans=0.125 2024-06-22 05:13:36,344 INFO [train.py:1028] (1/2) Epoch 29, batch 7550, loss[loss=0.2141, simple_loss=0.2685, pruned_loss=0.07989, over 12999.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2592, pruned_loss=0.06756, over 2577153.61 frames. ], batch size: 158, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:13:36,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=533188.3333333334, ans=0.95 2024-06-22 05:13:59,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.11 vs. limit=12.0 2024-06-22 05:14:01,620 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=533243.3333333334, ans=0.0 2024-06-22 05:14:08,737 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2024-06-22 05:14:15,402 INFO [train.py:1028] (1/2) Epoch 29, batch 7600, loss[loss=0.1859, simple_loss=0.2508, pruned_loss=0.0605, over 13219.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2595, pruned_loss=0.06746, over 2576887.70 frames. ], batch size: 83, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:14:24,377 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.47 vs. 
limit=22.5 2024-06-22 05:14:25,442 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=533298.3333333334, ans=0.125 2024-06-22 05:14:30,478 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.70 vs. limit=22.5 2024-06-22 05:14:30,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=533316.6666666666, ans=0.125 2024-06-22 05:14:32,414 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 2.439e+02 2.561e+02 2.764e+02 4.519e+02, threshold=5.123e+02, percent-clipped=0.0 2024-06-22 05:14:41,720 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.49 vs. limit=15.0 2024-06-22 05:14:47,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=533353.3333333334, ans=0.0 2024-06-22 05:14:49,153 INFO [train.py:1028] (1/2) Epoch 29, batch 7650, loss[loss=0.1773, simple_loss=0.2502, pruned_loss=0.05224, over 12859.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.26, pruned_loss=0.06772, over 2573138.18 frames. ], batch size: 33, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:14:51,059 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.15 vs. limit=15.0 2024-06-22 05:14:58,717 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.07 vs. limit=15.0 2024-06-22 05:14:59,369 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.50 vs. limit=15.0 2024-06-22 05:15:03,261 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:15:05,365 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:15:10,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=533426.6666666666, ans=0.125 2024-06-22 05:15:14,170 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=533426.6666666666, ans=0.0 2024-06-22 05:15:14,390 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.31 vs. limit=22.5 2024-06-22 05:15:14,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=533426.6666666666, ans=0.125 2024-06-22 05:15:17,272 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533445.0, ans=0.1 2024-06-22 05:15:22,195 INFO [train.py:1028] (1/2) Epoch 29, batch 7700, loss[loss=0.197, simple_loss=0.2721, pruned_loss=0.06095, over 13268.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2606, pruned_loss=0.06787, over 2568959.25 frames. 
], batch size: 63, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:15:22,587 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.03 vs. limit=22.5 2024-06-22 05:15:23,221 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.89 vs. limit=10.0 2024-06-22 05:15:27,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=533463.3333333334, ans=0.0 2024-06-22 05:15:33,541 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=533481.6666666666, ans=0.125 2024-06-22 05:15:34,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=533500.0, ans=0.0 2024-06-22 05:15:38,496 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.542e+02 2.663e+02 2.887e+02 4.323e+02, threshold=5.325e+02, percent-clipped=0.0 2024-06-22 05:15:46,423 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=533518.3333333334, ans=0.125 2024-06-22 05:15:57,710 INFO [train.py:1028] (1/2) Epoch 29, batch 7750, loss[loss=0.2012, simple_loss=0.2686, pruned_loss=0.06697, over 13215.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2614, pruned_loss=0.06815, over 2574103.91 frames. ], batch size: 72, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:16:10,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=533573.3333333334, ans=0.0 2024-06-22 05:16:14,196 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=533591.6666666666, ans=0.125 2024-06-22 05:16:14,564 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2024-06-22 05:16:20,037 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=15.0 2024-06-22 05:16:20,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=533610.0, ans=0.125 2024-06-22 05:16:33,866 INFO [train.py:1028] (1/2) Epoch 29, batch 7800, loss[loss=0.1987, simple_loss=0.2681, pruned_loss=0.06467, over 13163.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2617, pruned_loss=0.06786, over 2578672.08 frames. ], batch size: 95, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:16:39,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=533665.0, ans=0.0 2024-06-22 05:16:43,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=533665.0, ans=0.125 2024-06-22 05:16:51,298 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.465e+02 2.627e+02 2.820e+02 3.564e+02, threshold=5.255e+02, percent-clipped=0.0 2024-06-22 05:16:56,968 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. 
limit=15.0 2024-06-22 05:16:58,485 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.16 vs. limit=15.0 2024-06-22 05:17:07,424 INFO [train.py:1028] (1/2) Epoch 29, batch 7850, loss[loss=0.2001, simple_loss=0.2701, pruned_loss=0.06504, over 11638.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2621, pruned_loss=0.06814, over 2571717.57 frames. ], batch size: 16, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:17:07,511 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:17:07,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=533738.3333333334, ans=0.2 2024-06-22 05:17:09,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=533738.3333333334, ans=0.0 2024-06-22 05:17:10,192 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2024-06-22 05:17:11,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=533738.3333333334, ans=0.125 2024-06-22 05:17:18,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=533756.6666666666, ans=0.125 2024-06-22 05:17:21,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=533775.0, ans=0.2 2024-06-22 05:17:29,573 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:17:30,368 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=533793.3333333334, ans=0.0 2024-06-22 05:17:44,302 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.46 vs. limit=22.5 2024-06-22 05:17:44,583 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=533830.0, ans=0.125 2024-06-22 05:17:45,040 INFO [train.py:1028] (1/2) Epoch 29, batch 7900, loss[loss=0.2003, simple_loss=0.2733, pruned_loss=0.06366, over 13128.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2624, pruned_loss=0.06829, over 2571132.76 frames. 
], batch size: 77, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:17:46,001 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=533830.0, ans=0.025 2024-06-22 05:18:01,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=533866.6666666666, ans=0.125 2024-06-22 05:18:06,383 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.595e+02 2.814e+02 3.073e+02 4.192e+02, threshold=5.629e+02, percent-clipped=0.0 2024-06-22 05:18:11,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=533885.0, ans=0.07 2024-06-22 05:18:20,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=533903.3333333334, ans=0.07 2024-06-22 05:18:21,818 INFO [train.py:1028] (1/2) Epoch 29, batch 7950, loss[loss=0.1997, simple_loss=0.2481, pruned_loss=0.0756, over 10434.00 frames. ], tot_loss[loss=0.2, simple_loss=0.263, pruned_loss=0.06846, over 2574412.53 frames. ], batch size: 303, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:18:25,609 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.22 vs. limit=15.0 2024-06-22 05:18:27,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=533940.0, ans=0.125 2024-06-22 05:18:41,291 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=533976.6666666666, ans=0.125 2024-06-22 05:18:41,870 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533976.6666666666, ans=0.1 2024-06-22 05:18:42,364 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=533976.6666666666, ans=0.025 2024-06-22 05:18:48,946 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=533995.0, ans=0.0 2024-06-22 05:18:54,868 INFO [train.py:1028] (1/2) Epoch 29, batch 8000, loss[loss=0.1788, simple_loss=0.2547, pruned_loss=0.05141, over 12750.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2632, pruned_loss=0.06861, over 2572642.39 frames. ], batch size: 29, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:18:54,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=534013.3333333334, ans=0.125 2024-06-22 05:18:58,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=534013.3333333334, ans=0.0 2024-06-22 05:18:58,322 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=534013.3333333334, ans=0.125 2024-06-22 05:18:59,149 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.82 vs. 
limit=12.0 2024-06-22 05:19:01,321 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=534031.6666666666, ans=0.125 2024-06-22 05:19:06,957 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=534050.0, ans=0.0 2024-06-22 05:19:12,252 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 2.508e+02 2.719e+02 2.945e+02 3.763e+02, threshold=5.438e+02, percent-clipped=0.0 2024-06-22 05:19:19,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=534068.3333333334, ans=0.125 2024-06-22 05:19:23,406 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:19:23,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=534086.6666666666, ans=0.125 2024-06-22 05:19:23,723 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.62 vs. limit=15.0 2024-06-22 05:19:28,006 INFO [train.py:1028] (1/2) Epoch 29, batch 8050, loss[loss=0.188, simple_loss=0.2531, pruned_loss=0.06143, over 13169.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2637, pruned_loss=0.06851, over 2572245.10 frames. ], batch size: 83, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:19:36,755 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=534105.0, ans=0.0 2024-06-22 05:19:47,705 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=534141.6666666666, ans=0.125 2024-06-22 05:19:53,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=534141.6666666666, ans=0.125 2024-06-22 05:19:54,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=534160.0, ans=0.0 2024-06-22 05:20:07,187 INFO [train.py:1028] (1/2) Epoch 29, batch 8100, loss[loss=0.2001, simple_loss=0.2579, pruned_loss=0.07116, over 13146.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2643, pruned_loss=0.06876, over 2576801.57 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:20:08,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=534196.6666666666, ans=0.05 2024-06-22 05:20:25,134 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 2.465e+02 2.659e+02 2.840e+02 4.254e+02, threshold=5.318e+02, percent-clipped=0.0 2024-06-22 05:20:25,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=534233.3333333334, ans=0.0 2024-06-22 05:20:28,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=534251.6666666666, ans=0.2 2024-06-22 05:20:31,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=534251.6666666666, ans=0.0 2024-06-22 05:20:40,746 INFO [train.py:1028] (1/2) Epoch 29, batch 8150, loss[loss=0.1927, simple_loss=0.2536, pruned_loss=0.0659, over 13060.00 frames. 
], tot_loss[loss=0.2007, simple_loss=0.2643, pruned_loss=0.06856, over 2579981.17 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:20:42,217 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:20:52,087 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=534306.6666666666, ans=0.125 2024-06-22 05:21:12,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=534361.6666666666, ans=0.2 2024-06-22 05:21:14,079 INFO [train.py:1028] (1/2) Epoch 29, batch 8200, loss[loss=0.2118, simple_loss=0.2709, pruned_loss=0.07632, over 13172.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2646, pruned_loss=0.06886, over 2583160.66 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:21:18,361 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.30 vs. limit=10.0 2024-06-22 05:21:22,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=534398.3333333334, ans=0.125 2024-06-22 05:21:22,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=534398.3333333334, ans=0.125 2024-06-22 05:21:31,985 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.499e+02 2.686e+02 2.955e+02 4.578e+02, threshold=5.372e+02, percent-clipped=0.0 2024-06-22 05:21:35,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=534435.0, ans=0.0 2024-06-22 05:21:37,325 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=534435.0, ans=0.0 2024-06-22 05:21:44,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=534453.3333333334, ans=0.0 2024-06-22 05:21:47,827 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=534453.3333333334, ans=0.125 2024-06-22 05:21:48,230 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0 2024-06-22 05:21:48,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=534453.3333333334, ans=0.2 2024-06-22 05:21:50,478 INFO [train.py:1028] (1/2) Epoch 29, batch 8250, loss[loss=0.1919, simple_loss=0.2645, pruned_loss=0.05967, over 13260.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2642, pruned_loss=0.06876, over 2582761.32 frames. ], batch size: 52, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:21:55,949 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=534471.6666666666, ans=0.125 2024-06-22 05:21:57,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=534471.6666666666, ans=0.2 2024-06-22 05:21:58,038 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.90 vs. 
limit=15.0 2024-06-22 05:22:03,552 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.37 vs. limit=15.0 2024-06-22 05:22:16,127 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.88 vs. limit=22.5 2024-06-22 05:22:19,821 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=534545.0, ans=0.1 2024-06-22 05:22:25,704 INFO [train.py:1028] (1/2) Epoch 29, batch 8300, loss[loss=0.2004, simple_loss=0.2648, pruned_loss=0.06797, over 13021.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.264, pruned_loss=0.06856, over 2580242.57 frames. ], batch size: 102, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:22:31,685 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=534581.6666666666, ans=0.125 2024-06-22 05:22:33,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=534581.6666666666, ans=0.0 2024-06-22 05:22:43,449 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.500e+02 2.598e+02 2.782e+02 3.928e+02, threshold=5.197e+02, percent-clipped=0.0 2024-06-22 05:22:43,578 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=534600.0, ans=0.0 2024-06-22 05:22:48,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534618.3333333334, ans=0.0 2024-06-22 05:22:57,786 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.27 vs. limit=10.0 2024-06-22 05:22:58,680 INFO [train.py:1028] (1/2) Epoch 29, batch 8350, loss[loss=0.1972, simple_loss=0.2657, pruned_loss=0.06432, over 13132.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.264, pruned_loss=0.06831, over 2579838.45 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:22:59,138 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2024-06-22 05:23:00,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=534655.0, ans=0.0 2024-06-22 05:23:08,993 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.57 vs. limit=22.5 2024-06-22 05:23:14,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=534691.6666666666, ans=0.07 2024-06-22 05:23:27,720 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=534728.3333333334, ans=0.0 2024-06-22 05:23:31,708 INFO [train.py:1028] (1/2) Epoch 29, batch 8400, loss[loss=0.1821, simple_loss=0.2465, pruned_loss=0.05887, over 12866.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2639, pruned_loss=0.06862, over 2576739.25 frames. ], batch size: 39, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:23:42,560 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.74 vs. 
limit=22.5 2024-06-22 05:23:42,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=534765.0, ans=0.0 2024-06-22 05:23:42,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534765.0, ans=0.1 2024-06-22 05:23:47,351 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.68 vs. limit=15.0 2024-06-22 05:23:49,250 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.43 vs. limit=15.0 2024-06-22 05:23:52,674 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.527e+02 2.785e+02 2.997e+02 3.894e+02, threshold=5.571e+02, percent-clipped=0.0 2024-06-22 05:24:01,837 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.34 vs. limit=6.0 2024-06-22 05:24:09,595 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.27 vs. limit=22.5 2024-06-22 05:24:11,158 INFO [train.py:1028] (1/2) Epoch 29, batch 8450, loss[loss=0.186, simple_loss=0.2523, pruned_loss=0.05986, over 13178.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2645, pruned_loss=0.06886, over 2579369.59 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:24:20,488 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=534856.6666666666, ans=0.0 2024-06-22 05:24:21,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=534856.6666666666, ans=0.125 2024-06-22 05:24:26,627 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=534875.0, ans=0.1 2024-06-22 05:24:26,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=534875.0, ans=0.0 2024-06-22 05:24:36,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=534911.6666666666, ans=0.2 2024-06-22 05:24:43,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=534911.6666666666, ans=0.2 2024-06-22 05:24:43,389 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.70 vs. limit=15.0 2024-06-22 05:24:44,296 INFO [train.py:1028] (1/2) Epoch 29, batch 8500, loss[loss=0.2149, simple_loss=0.2915, pruned_loss=0.06916, over 12739.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2653, pruned_loss=0.06893, over 2577365.67 frames. 
], batch size: 29, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:25:02,827 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.520e+02 2.699e+02 2.936e+02 4.127e+02, threshold=5.399e+02, percent-clipped=0.0 2024-06-22 05:25:06,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534985.0, ans=0.0 2024-06-22 05:25:12,115 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=535003.3333333334, ans=0.1 2024-06-22 05:25:16,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=535003.3333333334, ans=0.125 2024-06-22 05:25:17,671 INFO [train.py:1028] (1/2) Epoch 29, batch 8550, loss[loss=0.1979, simple_loss=0.2673, pruned_loss=0.06427, over 12635.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2654, pruned_loss=0.06874, over 2575976.82 frames. ], batch size: 22, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:25:18,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=535021.6666666666, ans=0.0 2024-06-22 05:25:19,698 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=535021.6666666666, ans=0.125 2024-06-22 05:25:21,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=535021.6666666666, ans=0.5 2024-06-22 05:25:28,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=535040.0, ans=0.0 2024-06-22 05:25:39,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535076.6666666666, ans=0.1 2024-06-22 05:25:57,223 INFO [train.py:1028] (1/2) Epoch 29, batch 8600, loss[loss=0.1975, simple_loss=0.2541, pruned_loss=0.07051, over 13110.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2665, pruned_loss=0.06929, over 2572614.14 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:26:10,235 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2024-06-22 05:26:11,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=535150.0, ans=0.0 2024-06-22 05:26:12,131 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=535150.0, ans=0.0 2024-06-22 05:26:13,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=535150.0, ans=0.1 2024-06-22 05:26:15,199 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.516e+02 2.683e+02 3.020e+02 3.927e+02, threshold=5.365e+02, percent-clipped=0.0 2024-06-22 05:26:28,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=535186.6666666666, ans=0.2 2024-06-22 05:26:30,625 INFO [train.py:1028] (1/2) Epoch 29, batch 8650, loss[loss=0.2063, simple_loss=0.2702, pruned_loss=0.07121, over 13011.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2672, pruned_loss=0.06919, over 2577052.38 frames. 
], batch size: 102, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:26:33,544 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.37 vs. limit=22.5 2024-06-22 05:26:33,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=535205.0, ans=0.0 2024-06-22 05:26:40,253 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=535223.3333333334, ans=0.0 2024-06-22 05:26:43,110 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=535241.6666666666, ans=0.125 2024-06-22 05:26:58,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=535278.3333333334, ans=0.025 2024-06-22 05:27:03,402 INFO [train.py:1028] (1/2) Epoch 29, batch 8700, loss[loss=0.1836, simple_loss=0.2602, pruned_loss=0.05348, over 13178.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2675, pruned_loss=0.0699, over 2573587.94 frames. ], batch size: 59, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:27:04,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=535296.6666666666, ans=0.125 2024-06-22 05:27:12,136 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=535315.0, ans=0.0 2024-06-22 05:27:16,004 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535333.3333333334, ans=0.1 2024-06-22 05:27:24,704 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=535333.3333333334, ans=0.0 2024-06-22 05:27:24,857 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2024-06-22 05:27:26,361 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.566e+02 2.778e+02 3.014e+02 3.613e+02, threshold=5.555e+02, percent-clipped=0.0 2024-06-22 05:27:27,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=535333.3333333334, ans=0.125 2024-06-22 05:27:31,703 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=535351.6666666666, ans=0.1 2024-06-22 05:27:36,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=535370.0, ans=0.025 2024-06-22 05:27:44,358 INFO [train.py:1028] (1/2) Epoch 29, batch 8750, loss[loss=0.1979, simple_loss=0.2574, pruned_loss=0.06916, over 13126.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2671, pruned_loss=0.06981, over 2569057.97 frames. 
], batch size: 121, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:27:47,721 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=535388.3333333334, ans=0.04949747468305833 2024-06-22 05:27:48,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=535388.3333333334, ans=0.0 2024-06-22 05:27:49,202 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=535388.3333333334, ans=0.125 2024-06-22 05:28:00,444 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=535425.0, ans=0.0 2024-06-22 05:28:20,874 INFO [train.py:1028] (1/2) Epoch 29, batch 8800, loss[loss=0.1797, simple_loss=0.2507, pruned_loss=0.05435, over 13208.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2669, pruned_loss=0.06965, over 2573931.12 frames. ], batch size: 72, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:28:35,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=535516.6666666666, ans=0.125 2024-06-22 05:28:39,623 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.463e+02 2.620e+02 2.930e+02 4.131e+02, threshold=5.239e+02, percent-clipped=0.0 2024-06-22 05:28:51,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=535553.3333333334, ans=0.2 2024-06-22 05:28:54,365 INFO [train.py:1028] (1/2) Epoch 29, batch 8850, loss[loss=0.2035, simple_loss=0.2627, pruned_loss=0.07208, over 12536.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2671, pruned_loss=0.07, over 2563079.16 frames. ], batch size: 202, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:28:56,730 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=535571.6666666666, ans=0.0 2024-06-22 05:29:05,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=535590.0, ans=0.125 2024-06-22 05:29:26,709 INFO [scaling.py:1023] (1/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.59 vs. limit=5.0 2024-06-22 05:29:27,586 INFO [train.py:1028] (1/2) Epoch 29, batch 8900, loss[loss=0.1863, simple_loss=0.2548, pruned_loss=0.05894, over 12908.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2672, pruned_loss=0.07005, over 2560994.70 frames. ], batch size: 33, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:29:49,745 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.537e+02 2.681e+02 2.865e+02 4.790e+02, threshold=5.362e+02, percent-clipped=0.0 2024-06-22 05:29:57,236 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=535718.3333333334, ans=0.0 2024-06-22 05:30:01,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=535736.6666666666, ans=0.0 2024-06-22 05:30:05,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2024-06-22 05:30:08,153 INFO [train.py:1028] (1/2) Epoch 29, batch 8950, loss[loss=0.2176, simple_loss=0.2751, pruned_loss=0.08008, over 12533.00 frames. 
], tot_loss[loss=0.2036, simple_loss=0.2675, pruned_loss=0.06988, over 2562026.05 frames. ], batch size: 202, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:30:10,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=535755.0, ans=0.125 2024-06-22 05:30:28,100 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:30:42,064 INFO [train.py:1028] (1/2) Epoch 29, batch 9000, loss[loss=0.2115, simple_loss=0.2727, pruned_loss=0.07517, over 13254.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2679, pruned_loss=0.0698, over 2568410.40 frames. ], batch size: 46, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:30:42,065 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 05:30:47,070 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3415, 2.0239, 2.0099, 1.5089, 1.7563, 1.4007, 2.0948, 1.8677], device='cuda:1') 2024-06-22 05:30:50,157 INFO [train.py:1060] (1/2) Epoch 29, validation: loss=0.1947, simple_loss=0.2528, pruned_loss=0.06827, over 351949.00 frames. 2024-06-22 05:30:50,158 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 05:30:53,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=535846.6666666666, ans=0.1 2024-06-22 05:31:03,628 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=535883.3333333334, ans=0.125 2024-06-22 05:31:08,733 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.510e+02 2.660e+02 3.036e+02 4.689e+02, threshold=5.321e+02, percent-clipped=0.0 2024-06-22 05:31:09,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.43 vs. limit=15.0 2024-06-22 05:31:13,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=535901.6666666666, ans=0.125 2024-06-22 05:31:18,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=535920.0, ans=0.0 2024-06-22 05:31:19,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=535920.0, ans=0.0 2024-06-22 05:31:23,109 INFO [train.py:1028] (1/2) Epoch 29, batch 9050, loss[loss=0.1743, simple_loss=0.2346, pruned_loss=0.057, over 11201.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2682, pruned_loss=0.06978, over 2567829.49 frames. ], batch size: 16, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:31:24,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=535938.3333333334, ans=0.1 2024-06-22 05:31:27,276 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.67 vs. 
limit=15.0 2024-06-22 05:31:30,351 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=535956.6666666666, ans=0.0 2024-06-22 05:31:30,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=535956.6666666666, ans=0.1 2024-06-22 05:31:31,939 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2024-06-22 05:31:37,886 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=535975.0, ans=0.125 2024-06-22 05:31:38,610 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=535975.0, ans=0.09899494936611666 2024-06-22 05:31:40,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=535975.0, ans=0.95 2024-06-22 05:31:44,551 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=12.0 2024-06-22 05:31:46,443 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=535993.3333333334, ans=0.125 2024-06-22 05:31:52,144 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.39 vs. limit=15.0 2024-06-22 05:31:55,558 INFO [train.py:1028] (1/2) Epoch 29, batch 9100, loss[loss=0.2078, simple_loss=0.2742, pruned_loss=0.07076, over 13295.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2678, pruned_loss=0.06947, over 2569378.12 frames. ], batch size: 72, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:31:58,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=536030.0, ans=0.0 2024-06-22 05:32:02,075 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=536048.3333333334, ans=0.2 2024-06-22 05:32:02,639 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=536048.3333333334, ans=0.125 2024-06-22 05:32:09,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=536066.6666666666, ans=0.0 2024-06-22 05:32:13,594 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.491e+02 2.701e+02 2.937e+02 3.719e+02, threshold=5.402e+02, percent-clipped=0.0 2024-06-22 05:32:17,248 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=536085.0, ans=0.125 2024-06-22 05:32:20,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=536085.0, ans=0.025 2024-06-22 05:32:27,481 INFO [train.py:1028] (1/2) Epoch 29, batch 9150, loss[loss=0.1947, simple_loss=0.269, pruned_loss=0.0602, over 13201.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2677, pruned_loss=0.06928, over 2569908.66 frames. 
], batch size: 77, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:32:34,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=536140.0, ans=0.125 2024-06-22 05:32:35,587 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=536140.0, ans=0.125 2024-06-22 05:32:41,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=536158.3333333334, ans=0.0 2024-06-22 05:32:44,279 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=536158.3333333334, ans=0.0 2024-06-22 05:32:44,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=536158.3333333334, ans=0.0 2024-06-22 05:32:54,363 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.96 vs. limit=22.5 2024-06-22 05:33:05,975 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=536195.0, ans=0.0 2024-06-22 05:33:07,182 INFO [train.py:1028] (1/2) Epoch 29, batch 9200, loss[loss=0.2194, simple_loss=0.2881, pruned_loss=0.07534, over 12988.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2682, pruned_loss=0.06946, over 2573205.27 frames. ], batch size: 36, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:33:11,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.50 vs. limit=15.0 2024-06-22 05:33:14,513 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.773e+01 2024-06-22 05:33:17,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=536231.6666666666, ans=0.125 2024-06-22 05:33:20,982 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.76 vs. limit=22.5 2024-06-22 05:33:24,987 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.486e+02 2.628e+02 2.903e+02 3.926e+02, threshold=5.256e+02, percent-clipped=0.0 2024-06-22 05:33:25,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=536250.0, ans=0.0 2024-06-22 05:33:27,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.94 vs. limit=15.0 2024-06-22 05:33:34,840 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=536286.6666666666, ans=0.1 2024-06-22 05:33:36,225 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.84 vs. limit=22.5 2024-06-22 05:33:39,150 INFO [train.py:1028] (1/2) Epoch 29, batch 9250, loss[loss=0.1939, simple_loss=0.2683, pruned_loss=0.05977, over 13223.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2686, pruned_loss=0.06947, over 2575255.42 frames. 
], batch size: 67, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:33:41,481 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2024-06-22 05:33:54,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=536341.6666666666, ans=0.2 2024-06-22 05:33:57,768 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.01 vs. limit=22.5 2024-06-22 05:34:02,738 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=536360.0, ans=0.125 2024-06-22 05:34:03,738 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2024-06-22 05:34:08,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=536378.3333333334, ans=0.0 2024-06-22 05:34:11,242 INFO [train.py:1028] (1/2) Epoch 29, batch 9300, loss[loss=0.1837, simple_loss=0.2508, pruned_loss=0.05835, over 12936.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2684, pruned_loss=0.06946, over 2571289.78 frames. ], batch size: 39, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:34:13,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=536396.6666666666, ans=0.125 2024-06-22 05:34:18,435 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=536415.0, ans=0.125 2024-06-22 05:34:18,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=536415.0, ans=0.0 2024-06-22 05:34:26,929 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=536433.3333333334, ans=0.125 2024-06-22 05:34:28,188 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=536433.3333333334, ans=0.09899494936611666 2024-06-22 05:34:29,357 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.509e+02 2.642e+02 2.875e+02 3.721e+02, threshold=5.284e+02, percent-clipped=0.0 2024-06-22 05:34:43,321 INFO [train.py:1028] (1/2) Epoch 29, batch 9350, loss[loss=0.1999, simple_loss=0.2701, pruned_loss=0.0649, over 12465.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2692, pruned_loss=0.0698, over 2567337.21 frames. ], batch size: 22, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:34:48,539 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2024-06-22 05:35:06,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=536543.3333333334, ans=0.0 2024-06-22 05:35:14,282 INFO [train.py:1028] (1/2) Epoch 29, batch 9400, loss[loss=0.2003, simple_loss=0.2719, pruned_loss=0.06441, over 13243.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.269, pruned_loss=0.06983, over 2567251.29 frames. 
], batch size: 52, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:35:15,171 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2024-06-22 05:35:16,344 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=536580.0, ans=0.025 2024-06-22 05:35:20,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=536598.3333333334, ans=0.0 2024-06-22 05:35:23,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=536598.3333333334, ans=0.125 2024-06-22 05:35:31,331 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.519e+02 2.637e+02 2.775e+02 3.823e+02, threshold=5.273e+02, percent-clipped=0.0 2024-06-22 05:35:38,792 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=536653.3333333334, ans=0.0 2024-06-22 05:35:45,072 INFO [train.py:1028] (1/2) Epoch 29, batch 9450, loss[loss=0.2114, simple_loss=0.2764, pruned_loss=0.07318, over 13060.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2696, pruned_loss=0.07015, over 2567825.33 frames. ], batch size: 23, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:35:48,618 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=536671.6666666666, ans=0.125 2024-06-22 05:35:49,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=536671.6666666666, ans=0.0 2024-06-22 05:35:49,984 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.59 vs. limit=10.0 2024-06-22 05:35:58,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.28 vs. limit=22.5 2024-06-22 05:36:06,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=536726.6666666666, ans=0.0 2024-06-22 05:36:16,201 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.68 vs. limit=15.0 2024-06-22 05:36:20,132 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=536745.0, ans=0.125 2024-06-22 05:36:21,362 INFO [train.py:1028] (1/2) Epoch 29, batch 9500, loss[loss=0.2017, simple_loss=0.2695, pruned_loss=0.06697, over 13257.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2688, pruned_loss=0.06933, over 2577688.55 frames. ], batch size: 43, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:36:32,686 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=536781.6666666666, ans=0.2 2024-06-22 05:36:34,875 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.88 vs. 
limit=15.0 2024-06-22 05:36:35,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=536800.0, ans=0.125 2024-06-22 05:36:35,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=536800.0, ans=0.125 2024-06-22 05:36:38,952 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.482e+02 2.594e+02 2.841e+02 3.787e+02, threshold=5.189e+02, percent-clipped=0.0 2024-06-22 05:36:46,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=536836.6666666666, ans=0.0 2024-06-22 05:36:47,548 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=536836.6666666666, ans=0.125 2024-06-22 05:36:48,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=536836.6666666666, ans=0.125 2024-06-22 05:36:52,841 INFO [train.py:1028] (1/2) Epoch 29, batch 9550, loss[loss=0.1918, simple_loss=0.269, pruned_loss=0.05729, over 12911.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2688, pruned_loss=0.0694, over 2573796.14 frames. ], batch size: 39, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:36:56,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=536855.0, ans=0.2 2024-06-22 05:36:58,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=536855.0, ans=0.125 2024-06-22 05:37:00,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=536873.3333333334, ans=0.1 2024-06-22 05:37:19,624 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2024-06-22 05:37:22,528 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:37:23,615 INFO [train.py:1028] (1/2) Epoch 29, batch 9600, loss[loss=0.2024, simple_loss=0.2578, pruned_loss=0.07349, over 10606.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2685, pruned_loss=0.06931, over 2572678.65 frames. ], batch size: 304, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:37:28,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=536946.6666666666, ans=0.125 2024-06-22 05:37:31,259 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2024-06-22 05:37:40,678 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.632e+02 2.821e+02 3.152e+02 4.333e+02, threshold=5.642e+02, percent-clipped=0.0 2024-06-22 05:37:42,406 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.52 vs. 
limit=15.0 2024-06-22 05:37:45,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=537001.6666666666, ans=0.125 2024-06-22 05:37:49,585 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=537020.0, ans=0.125 2024-06-22 05:37:54,211 INFO [train.py:1028] (1/2) Epoch 29, batch 9650, loss[loss=0.2093, simple_loss=0.2645, pruned_loss=0.07709, over 13109.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2684, pruned_loss=0.07003, over 2562411.34 frames. ], batch size: 132, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:38:06,113 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2024-06-22 05:38:12,294 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2024-06-22 05:38:25,542 INFO [train.py:1028] (1/2) Epoch 29, batch 9700, loss[loss=0.2015, simple_loss=0.2654, pruned_loss=0.06874, over 13044.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2676, pruned_loss=0.06971, over 2554919.81 frames. ], batch size: 144, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:38:38,369 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=537148.3333333334, ans=0.1 2024-06-22 05:38:39,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=537148.3333333334, ans=0.0 2024-06-22 05:38:45,848 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.571e+02 2.807e+02 3.065e+02 4.294e+02, threshold=5.615e+02, percent-clipped=0.0 2024-06-22 05:39:00,063 INFO [train.py:1028] (1/2) Epoch 29, batch 9750, loss[loss=0.197, simple_loss=0.258, pruned_loss=0.068, over 13104.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2664, pruned_loss=0.06904, over 2550913.34 frames. ], batch size: 132, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:39:00,216 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=537221.6666666666, ans=0.125 2024-06-22 05:39:31,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=537313.3333333334, ans=0.2 2024-06-22 05:39:31,618 INFO [train.py:1028] (1/2) Epoch 29, batch 9800, loss[loss=0.1938, simple_loss=0.2605, pruned_loss=0.06353, over 12978.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2657, pruned_loss=0.0686, over 2544282.62 frames. 
], batch size: 39, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:39:44,027 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=537350.0, ans=0.1 2024-06-22 05:39:44,758 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=537350.0, ans=0.0 2024-06-22 05:39:45,298 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=537350.0, ans=0.125 2024-06-22 05:39:47,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=537350.0, ans=0.0 2024-06-22 05:39:48,760 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.531e+02 2.680e+02 2.870e+02 3.704e+02, threshold=5.360e+02, percent-clipped=0.0 2024-06-22 05:39:52,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=537368.3333333334, ans=0.125 2024-06-22 05:40:00,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=537386.6666666666, ans=0.125 2024-06-22 05:40:00,752 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=537386.6666666666, ans=0.125 2024-06-22 05:40:02,504 INFO [train.py:1028] (1/2) Epoch 29, batch 9850, loss[loss=0.2101, simple_loss=0.2684, pruned_loss=0.0759, over 12989.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2654, pruned_loss=0.06832, over 2537931.56 frames. ], batch size: 102, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:40:04,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=537405.0, ans=0.0 2024-06-22 05:40:11,869 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=537423.3333333334, ans=0.125 2024-06-22 05:40:15,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=537423.3333333334, ans=0.1 2024-06-22 05:40:33,656 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=537478.3333333334, ans=0.0 2024-06-22 05:40:35,328 INFO [train.py:1028] (1/2) Epoch 29, batch 9900, loss[loss=0.1763, simple_loss=0.2484, pruned_loss=0.0521, over 12862.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2641, pruned_loss=0.068, over 2529543.35 frames. ], batch size: 39, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:40:37,569 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=537496.6666666666, ans=0.0 2024-06-22 05:40:51,473 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.29 vs. 
limit=15.0 2024-06-22 05:40:53,038 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.480e+02 2.608e+02 2.805e+02 4.117e+02, threshold=5.216e+02, percent-clipped=0.0 2024-06-22 05:40:56,833 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=537551.6666666666, ans=0.125 2024-06-22 05:41:05,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=537570.0, ans=0.125 2024-06-22 05:41:07,173 INFO [train.py:1028] (1/2) Epoch 29, batch 9950, loss[loss=0.2274, simple_loss=0.2898, pruned_loss=0.08253, over 12699.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2637, pruned_loss=0.06851, over 2523268.64 frames. ], batch size: 29, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:41:07,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=537588.3333333334, ans=0.2 2024-06-22 05:41:11,247 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=537588.3333333334, ans=0.1 2024-06-22 05:41:14,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=537606.6666666666, ans=0.125 2024-06-22 05:41:28,691 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=537643.3333333334, ans=0.125 2024-06-22 05:41:29,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=537643.3333333334, ans=0.0 2024-06-22 05:41:29,800 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=537643.3333333334, ans=0.125 2024-06-22 05:41:31,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=537643.3333333334, ans=0.0 2024-06-22 05:41:33,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=537661.6666666666, ans=0.125 2024-06-22 05:41:40,652 INFO [train.py:1028] (1/2) Epoch 29, batch 10000, loss[loss=0.2019, simple_loss=0.2657, pruned_loss=0.06909, over 12842.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2639, pruned_loss=0.06857, over 2485779.50 frames. 
], batch size: 22, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:41:45,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=537680.0, ans=0.125 2024-06-22 05:41:45,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=537680.0, ans=0.0 2024-06-22 05:41:54,384 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=537716.6666666666, ans=0.0 2024-06-22 05:41:58,743 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.485e+02 2.655e+02 2.968e+02 3.749e+02, threshold=5.310e+02, percent-clipped=0.0 2024-06-22 05:42:03,332 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=537735.0, ans=0.125 2024-06-22 05:42:05,044 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=537735.0, ans=0.125 2024-06-22 05:42:10,743 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=537753.3333333334, ans=0.125 2024-06-22 05:42:11,794 INFO [train.py:1028] (1/2) Epoch 29, batch 10050, loss[loss=0.2037, simple_loss=0.2652, pruned_loss=0.07108, over 12567.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2638, pruned_loss=0.06914, over 2442225.94 frames. ], batch size: 22, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:42:14,770 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.17 vs. limit=10.0 2024-06-22 05:42:36,132 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=15.0 2024-06-22 05:42:42,293 INFO [train.py:1028] (1/2) Epoch 29, batch 10100, loss[loss=0.1559, simple_loss=0.2241, pruned_loss=0.04384, over 10842.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2629, pruned_loss=0.06825, over 2425752.96 frames. ], batch size: 16, lr: 1.97e-03, grad_scale: 16.0 2024-06-22 05:42:45,134 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=537863.3333333334, ans=0.2 2024-06-22 05:42:49,725 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.02 vs. limit=22.5 2024-06-22 05:44:49,205 INFO [train.py:1028] (1/2) Epoch 30, batch 0, loss[loss=0.1632, simple_loss=0.238, pruned_loss=0.04418, over 13011.00 frames. ], tot_loss[loss=0.1632, simple_loss=0.238, pruned_loss=0.04418, over 13011.00 frames. ], batch size: 36, lr: 1.94e-03, grad_scale: 32.0 2024-06-22 05:44:49,206 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 05:44:53,190 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([3.4974, 3.1036, 1.9854, 3.2078], device='cuda:1') 2024-06-22 05:44:56,330 INFO [train.py:1060] (1/2) Epoch 30, validation: loss=0.1949, simple_loss=0.2533, pruned_loss=0.06824, over 351949.00 frames. 2024-06-22 05:44:56,331 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 05:45:03,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.76 vs. 
limit=15.0
2024-06-22 05:45:05,330 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.422e+02 2.550e+02 2.747e+02 3.297e+02, threshold=5.101e+02, percent-clipped=0.0
2024-06-22 05:45:13,972 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.81 vs. limit=15.0
2024-06-22 05:45:15,839 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.30 vs. limit=22.5
2024-06-22 05:45:16,588 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=22.5
2024-06-22 05:45:18,873 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=537947.6666666666, ans=0.05
2024-06-22 05:45:28,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=537966.0, ans=0.125
2024-06-22 05:45:32,869 INFO [train.py:1028] (1/2) Epoch 30, batch 50, loss[loss=0.173, simple_loss=0.2367, pruned_loss=0.05465, over 12718.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.248, pruned_loss=0.06235, over 575049.40 frames. ], batch size: 29, lr: 1.94e-03, grad_scale: 16.0
2024-06-22 05:45:34,262 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 05:45:34,536 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.10 vs. limit=15.0
2024-06-22 05:45:40,403 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.44 vs. limit=10.0
2024-06-22 05:45:46,335 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=538021.0, ans=0.0
2024-06-22 05:45:46,342 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=538021.0, ans=0.0
2024-06-22 05:45:47,900 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0
2024-06-22 05:45:58,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538039.3333333334, ans=0.1
2024-06-22 05:46:07,035 INFO [train.py:1028] (1/2) Epoch 30, batch 100, loss[loss=0.2001, simple_loss=0.2687, pruned_loss=0.06575, over 13267.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.246, pruned_loss=0.06229, over 1018095.92 frames. ], batch size: 46, lr: 1.94e-03, grad_scale: 16.0
2024-06-22 05:46:07,929 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0
2024-06-22 05:46:11,362 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=538076.0, ans=0.125
2024-06-22 05:46:13,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0
2024-06-22 05:46:15,607 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.380e+02 2.505e+02 2.685e+02 3.497e+02, threshold=5.010e+02, percent-clipped=0.0
2024-06-22 05:46:23,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.62 vs. limit=15.0
2024-06-22 05:46:32,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538149.3333333334, ans=0.1
2024-06-22 05:46:38,694 INFO [train.py:1028] (1/2) Epoch 30, batch 150, loss[loss=0.1703, simple_loss=0.2323, pruned_loss=0.05415, over 12714.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2456, pruned_loss=0.06112, over 1366692.33 frames. ], batch size: 29, lr: 1.94e-03, grad_scale: 16.0
2024-06-22 05:46:40,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=538167.6666666666, ans=0.0
2024-06-22 05:46:52,712 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=538204.3333333334, ans=0.125
2024-06-22 05:46:57,177 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=538222.6666666666, ans=0.125
2024-06-22 05:47:03,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=538241.0, ans=0.125
2024-06-22 05:47:04,921 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=538241.0, ans=0.0
2024-06-22 05:47:08,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=538241.0, ans=0.125
2024-06-22 05:47:10,369 INFO [train.py:1028] (1/2) Epoch 30, batch 200, loss[loss=0.2109, simple_loss=0.2677, pruned_loss=0.07701, over 12517.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2466, pruned_loss=0.06188, over 1635813.33 frames. ], batch size: 202, lr: 1.94e-03, grad_scale: 16.0
2024-06-22 05:47:13,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=538259.3333333334, ans=0.2
2024-06-22 05:47:22,586 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.312e+02 2.412e+02 2.619e+02 3.031e+02, threshold=4.825e+02, percent-clipped=0.0
2024-06-22 05:47:25,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.23 vs. limit=22.5
2024-06-22 05:47:32,180 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 05:47:37,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=538314.3333333334, ans=0.2
2024-06-22 05:47:38,609 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=538332.6666666666, ans=0.125
2024-06-22 05:47:45,587 INFO [train.py:1028] (1/2) Epoch 30, batch 250, loss[loss=0.1769, simple_loss=0.2276, pruned_loss=0.06312, over 12995.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2457, pruned_loss=0.06162, over 1847059.28 frames. ], batch size: 144, lr: 1.94e-03, grad_scale: 16.0
2024-06-22 05:47:53,049 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=538369.3333333334, ans=0.125
2024-06-22 05:48:03,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538387.6666666666, ans=0.1
2024-06-22 05:48:06,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=538387.6666666666, ans=0.95
2024-06-22 05:48:08,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=538406.0, ans=0.025
2024-06-22 05:48:18,753 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 05:48:20,690 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=538442.6666666666, ans=0.125
2024-06-22 05:48:21,093 INFO [train.py:1028] (1/2) Epoch 30, batch 300, loss[loss=0.1775, simple_loss=0.2339, pruned_loss=0.06058, over 13148.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2464, pruned_loss=0.06193, over 2011657.41 frames. ], batch size: 112, lr: 1.94e-03, grad_scale: 16.0
2024-06-22 05:48:21,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538442.6666666666, ans=0.1
2024-06-22 05:48:24,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.72 vs. limit=12.0
2024-06-22 05:48:29,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=538461.0, ans=0.125
2024-06-22 05:48:30,329 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.350e+02 2.494e+02 2.660e+02 3.197e+02, threshold=4.989e+02, percent-clipped=0.0
2024-06-22 05:48:36,674 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=538479.3333333334, ans=0.125
2024-06-22 05:48:38,504 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=538479.3333333334, ans=0.04949747468305833
2024-06-22 05:48:40,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=538497.6666666666, ans=0.125
2024-06-22 05:48:42,182 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=538497.6666666666, ans=0.125
2024-06-22 05:48:48,691 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.44 vs. limit=10.0
2024-06-22 05:48:49,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=538516.0, ans=0.2
2024-06-22 05:48:52,967 INFO [train.py:1028] (1/2) Epoch 30, batch 350, loss[loss=0.1593, simple_loss=0.2256, pruned_loss=0.04652, over 12829.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.246, pruned_loss=0.06169, over 2140317.18 frames. ], batch size: 33, lr: 1.94e-03, grad_scale: 16.0
2024-06-22 05:48:55,261 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=538534.3333333334, ans=0.1
2024-06-22 05:49:00,806 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=538552.6666666666, ans=0.125
2024-06-22 05:49:04,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=538552.6666666666, ans=0.0
2024-06-22 05:49:05,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=538571.0, ans=0.125
2024-06-22 05:49:27,755 INFO [train.py:1028] (1/2) Epoch 30, batch 400, loss[loss=0.2005, simple_loss=0.2722, pruned_loss=0.06442, over 13271.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2455, pruned_loss=0.0613, over 2240115.74 frames. ], batch size: 63, lr: 1.94e-03, grad_scale: 32.0
2024-06-22 05:49:36,795 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.360e+02 2.507e+02 2.786e+02 3.534e+02, threshold=5.014e+02, percent-clipped=0.0
2024-06-22 05:49:39,363 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=538644.3333333334, ans=0.025
2024-06-22 05:49:54,608 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.77 vs. limit=15.0
2024-06-22 05:49:55,073 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538699.3333333334, ans=0.1
2024-06-22 05:49:56,386 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=538699.3333333334, ans=0.125
2024-06-22 05:49:56,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=538699.3333333334, ans=0.1
2024-06-22 05:50:01,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=538699.3333333334, ans=0.125
2024-06-22 05:50:02,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=538717.6666666666, ans=0.2
2024-06-22 05:50:03,259 INFO [train.py:1028] (1/2) Epoch 30, batch 450, loss[loss=0.1796, simple_loss=0.25, pruned_loss=0.05457, over 13239.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2454, pruned_loss=0.06125, over 2313896.47 frames. ], batch size: 67, lr: 1.94e-03, grad_scale: 32.0
2024-06-22 05:50:07,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=538717.6666666666, ans=0.2
2024-06-22 05:50:14,580 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=15.0
2024-06-22 05:50:25,109 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=538772.6666666666, ans=0.125
2024-06-22 05:50:34,430 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538791.0, ans=0.1
2024-06-22 05:50:35,434 INFO [train.py:1028] (1/2) Epoch 30, batch 500, loss[loss=0.1776, simple_loss=0.2389, pruned_loss=0.05814, over 13157.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2458, pruned_loss=0.06097, over 2376809.15 frames. ], batch size: 121, lr: 1.94e-03, grad_scale: 32.0
2024-06-22 05:50:36,417 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=538809.3333333334, ans=15.0
2024-06-22 05:50:38,841 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538809.3333333334, ans=0.0
2024-06-22 05:50:39,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=538809.3333333334, ans=0.125
2024-06-22 05:50:44,296 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.331e+02 2.433e+02 2.631e+02 3.376e+02, threshold=4.865e+02, percent-clipped=0.0
2024-06-22 05:50:58,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=538864.3333333334, ans=0.125
2024-06-22 05:51:07,091 INFO [train.py:1028] (1/2) Epoch 30, batch 550, loss[loss=0.1795, simple_loss=0.2353, pruned_loss=0.06188, over 12865.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2456, pruned_loss=0.06078, over 2421089.97 frames. ], batch size: 158, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:51:07,283 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 05:51:12,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.71 vs. limit=6.0
2024-06-22 05:51:16,429 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=538919.3333333334, ans=0.125
2024-06-22 05:51:21,549 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=538937.6666666666, ans=0.125
2024-06-22 05:51:22,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=538937.6666666666, ans=0.04949747468305833
2024-06-22 05:51:24,467 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=538937.6666666666, ans=0.125
2024-06-22 05:51:30,687 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-22 05:51:34,999 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=538974.3333333334, ans=0.0
2024-06-22 05:51:41,773 INFO [train.py:1028] (1/2) Epoch 30, batch 600, loss[loss=0.1893, simple_loss=0.2438, pruned_loss=0.06742, over 13076.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2456, pruned_loss=0.06065, over 2459518.48 frames. ], batch size: 144, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:51:44,534 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 05:51:45,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=538992.6666666666, ans=0.2
2024-06-22 05:51:48,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=539011.0, ans=0.125
2024-06-22 05:51:49,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=539011.0, ans=0.125
2024-06-22 05:51:50,752 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.354e+02 2.518e+02 2.774e+02 4.347e+02, threshold=5.035e+02, percent-clipped=0.0
2024-06-22 05:51:55,036 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=539029.3333333334, ans=0.04949747468305833
2024-06-22 05:52:09,933 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=539047.6666666666, ans=0.0
2024-06-22 05:52:13,440 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=539066.0, ans=0.125
2024-06-22 05:52:17,827 INFO [train.py:1028] (1/2) Epoch 30, batch 650, loss[loss=0.1868, simple_loss=0.2504, pruned_loss=0.06157, over 13188.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.2457, pruned_loss=0.06032, over 2489981.88 frames. ], batch size: 59, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:52:30,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=539121.0, ans=0.125
2024-06-22 05:52:44,566 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=539157.6666666666, ans=0.0
2024-06-22 05:52:46,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0
2024-06-22 05:52:49,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=539176.0, ans=0.5
2024-06-22 05:52:50,061 INFO [train.py:1028] (1/2) Epoch 30, batch 700, loss[loss=0.1794, simple_loss=0.2448, pruned_loss=0.05698, over 13276.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2452, pruned_loss=0.06045, over 2512156.91 frames. ], batch size: 46, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:52:50,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=539176.0, ans=0.0
2024-06-22 05:52:59,101 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.362e+02 2.508e+02 2.737e+02 3.151e+02, threshold=5.015e+02, percent-clipped=0.0
2024-06-22 05:53:03,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=539212.6666666666, ans=0.125
2024-06-22 05:53:05,536 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=539212.6666666666, ans=0.025
2024-06-22 05:53:09,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=539231.0, ans=0.125
2024-06-22 05:53:09,692 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539231.0, ans=0.1
2024-06-22 05:53:11,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=539231.0, ans=0.07
2024-06-22 05:53:15,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=539231.0, ans=0.0
2024-06-22 05:53:22,959 INFO [train.py:1028] (1/2) Epoch 30, batch 750, loss[loss=0.1729, simple_loss=0.2421, pruned_loss=0.05178, over 13260.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2452, pruned_loss=0.06037, over 2528055.79 frames. ], batch size: 63, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:53:27,004 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=15.0
2024-06-22 05:53:46,227 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=539322.6666666666, ans=0.1
2024-06-22 05:53:51,251 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0
2024-06-22 05:53:59,886 INFO [train.py:1028] (1/2) Epoch 30, batch 800, loss[loss=0.1815, simple_loss=0.2535, pruned_loss=0.0548, over 13021.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.245, pruned_loss=0.06042, over 2541722.16 frames. ], batch size: 36, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:54:09,524 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 2.337e+02 2.477e+02 2.640e+02 3.223e+02, threshold=4.954e+02, percent-clipped=0.0
2024-06-22 05:54:10,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=539377.6666666666, ans=0.125
2024-06-22 05:54:35,297 INFO [train.py:1028] (1/2) Epoch 30, batch 850, loss[loss=0.1705, simple_loss=0.2287, pruned_loss=0.0562, over 13102.00 frames. ], tot_loss[loss=0.1827, simple_loss=0.245, pruned_loss=0.06022, over 2551422.74 frames. ], batch size: 95, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:54:40,082 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=539451.0, ans=0.025
2024-06-22 05:54:41,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=539469.3333333334, ans=0.125
2024-06-22 05:54:42,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=539469.3333333334, ans=0.125
2024-06-22 05:54:44,543 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=539469.3333333334, ans=0.125
2024-06-22 05:54:49,190 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=539487.6666666666, ans=0.125
2024-06-22 05:54:57,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=539506.0, ans=0.125
2024-06-22 05:55:07,651 INFO [train.py:1028] (1/2) Epoch 30, batch 900, loss[loss=0.1823, simple_loss=0.247, pruned_loss=0.05884, over 12977.00 frames. ], tot_loss[loss=0.1826, simple_loss=0.2446, pruned_loss=0.06032, over 2556770.52 frames. ], batch size: 36, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:55:11,456 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=539542.6666666666, ans=0.0
2024-06-22 05:55:16,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=539561.0, ans=0.125
2024-06-22 05:55:17,900 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.408e+02 2.515e+02 2.772e+02 3.380e+02, threshold=5.030e+02, percent-clipped=0.0
2024-06-22 05:55:24,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=539579.3333333334, ans=0.1
2024-06-22 05:55:25,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=539579.3333333334, ans=10.0
2024-06-22 05:55:32,538 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=539597.6666666666, ans=0.025
2024-06-22 05:55:34,661 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.17 vs. limit=10.0
2024-06-22 05:55:37,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=539616.0, ans=0.125
2024-06-22 05:55:37,923 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=12.0
2024-06-22 05:55:44,905 INFO [train.py:1028] (1/2) Epoch 30, batch 950, loss[loss=0.1772, simple_loss=0.2474, pruned_loss=0.05345, over 12915.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2444, pruned_loss=0.06021, over 2560005.92 frames. ], batch size: 39, lr: 1.93e-03, grad_scale: 16.0
2024-06-22 05:55:48,942 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=539634.3333333334, ans=0.0
2024-06-22 05:55:50,516 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.19 vs. limit=10.0
2024-06-22 05:55:52,244 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=539652.6666666666, ans=0.125
2024-06-22 05:55:55,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=539652.6666666666, ans=0.125
2024-06-22 05:56:05,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0
2024-06-22 05:56:08,418 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=539689.3333333334, ans=0.5
2024-06-22 05:56:18,902 INFO [train.py:1028] (1/2) Epoch 30, batch 1000, loss[loss=0.1736, simple_loss=0.2392, pruned_loss=0.05396, over 13313.00 frames. ], tot_loss[loss=0.1826, simple_loss=0.2445, pruned_loss=0.06041, over 2561726.85 frames. ], batch size: 49, lr: 1.93e-03, grad_scale: 16.0
2024-06-22 05:56:29,072 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.364e+02 2.532e+02 2.803e+02 4.075e+02, threshold=5.064e+02, percent-clipped=0.0
2024-06-22 05:56:32,902 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=539762.6666666666, ans=0.125
2024-06-22 05:56:34,335 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0
2024-06-22 05:56:43,905 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.12 vs. limit=15.0
2024-06-22 05:56:50,730 INFO [train.py:1028] (1/2) Epoch 30, batch 1050, loss[loss=0.1674, simple_loss=0.2342, pruned_loss=0.05035, over 13168.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2447, pruned_loss=0.06043, over 2564910.93 frames. ], batch size: 77, lr: 1.93e-03, grad_scale: 16.0
2024-06-22 05:57:06,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=539854.3333333334, ans=0.0
2024-06-22 05:57:07,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=539854.3333333334, ans=0.0
2024-06-22 05:57:10,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=539872.6666666666, ans=0.025
2024-06-22 05:57:13,371 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=539872.6666666666, ans=0.0
2024-06-22 05:57:20,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=539891.0, ans=0.0
2024-06-22 05:57:22,772 INFO [train.py:1028] (1/2) Epoch 30, batch 1100, loss[loss=0.2045, simple_loss=0.2642, pruned_loss=0.07246, over 13246.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2458, pruned_loss=0.06089, over 2569945.44 frames. ], batch size: 52, lr: 1.93e-03, grad_scale: 16.0
2024-06-22 05:57:32,994 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 2.376e+02 2.467e+02 2.614e+02 3.978e+02, threshold=4.933e+02, percent-clipped=0.0
2024-06-22 05:57:33,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=539927.6666666666, ans=0.125
2024-06-22 05:57:34,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=539927.6666666666, ans=0.125
2024-06-22 05:57:44,383 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=539964.3333333334, ans=0.0
2024-06-22 05:57:55,165 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=539982.6666666666, ans=0.0
2024-06-22 05:57:57,550 INFO [train.py:1028] (1/2) Epoch 30, batch 1150, loss[loss=0.1963, simple_loss=0.2646, pruned_loss=0.06404, over 13292.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2459, pruned_loss=0.06085, over 2571333.63 frames. ], batch size: 52, lr: 1.93e-03, grad_scale: 16.0
2024-06-22 05:58:02,308 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=540001.0, ans=0.07
2024-06-22 05:58:12,408 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 05:58:13,989 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0
2024-06-22 05:58:32,574 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=540074.3333333334, ans=0.2
2024-06-22 05:58:33,169 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540074.3333333334, ans=0.1
2024-06-22 05:58:34,173 INFO [train.py:1028] (1/2) Epoch 30, batch 1200, loss[loss=0.1607, simple_loss=0.2244, pruned_loss=0.0485, over 13213.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2458, pruned_loss=0.061, over 2573885.62 frames. ], batch size: 77, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:58:39,073 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.47 vs. limit=15.0
2024-06-22 05:58:41,944 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=540111.0, ans=0.2
2024-06-22 05:58:43,415 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=540111.0, ans=0.2
2024-06-22 05:58:44,484 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.319e+02 2.510e+02 2.704e+02 3.574e+02, threshold=5.020e+02, percent-clipped=0.0
2024-06-22 05:58:48,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=540129.3333333334, ans=0.0
2024-06-22 05:58:58,540 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.60 vs. limit=15.0
2024-06-22 05:59:00,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=540166.0, ans=0.1
2024-06-22 05:59:05,732 INFO [train.py:1028] (1/2) Epoch 30, batch 1250, loss[loss=0.1753, simple_loss=0.2392, pruned_loss=0.05571, over 13128.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2452, pruned_loss=0.06048, over 2582805.23 frames. ], batch size: 112, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:59:08,717 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.99 vs. limit=12.0
2024-06-22 05:59:09,744 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=540184.3333333334, ans=0.125
2024-06-22 05:59:18,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540221.0, ans=0.1
2024-06-22 05:59:19,318 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=540221.0, ans=0.125
2024-06-22 05:59:28,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=540239.3333333334, ans=0.04949747468305833
2024-06-22 05:59:34,254 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=540257.6666666666, ans=0.2
2024-06-22 05:59:37,163 INFO [train.py:1028] (1/2) Epoch 30, batch 1300, loss[loss=0.1947, simple_loss=0.2515, pruned_loss=0.06889, over 12724.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2457, pruned_loss=0.06082, over 2583818.82 frames. ], batch size: 176, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 05:59:42,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=540276.0, ans=0.125
2024-06-22 05:59:47,266 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=540294.3333333334, ans=0.1
2024-06-22 05:59:47,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=540294.3333333334, ans=0.09899494936611666
2024-06-22 05:59:50,073 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.389e+02 2.492e+02 2.652e+02 3.442e+02, threshold=4.984e+02, percent-clipped=0.0
2024-06-22 05:59:54,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=540312.6666666666, ans=0.07
2024-06-22 05:59:58,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=540331.0, ans=0.125
2024-06-22 06:00:07,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=540349.3333333334, ans=0.0
2024-06-22 06:00:15,477 INFO [train.py:1028] (1/2) Epoch 30, batch 1350, loss[loss=0.1594, simple_loss=0.2323, pruned_loss=0.04321, over 13259.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2451, pruned_loss=0.06053, over 2584993.17 frames. ], batch size: 59, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:00:15,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=540367.6666666666, ans=0.125
2024-06-22 06:00:19,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=540367.6666666666, ans=0.0
2024-06-22 06:00:30,701 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.97 vs. limit=15.0
2024-06-22 06:00:37,007 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=540422.6666666666, ans=0.125
2024-06-22 06:00:37,711 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=540422.6666666666, ans=0.125
2024-06-22 06:00:49,472 INFO [train.py:1028] (1/2) Epoch 30, batch 1400, loss[loss=0.1776, simple_loss=0.2457, pruned_loss=0.05476, over 12358.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2459, pruned_loss=0.06114, over 2585659.03 frames. ], batch size: 25, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:00:49,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=540459.3333333334, ans=0.0
2024-06-22 06:00:58,517 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0
2024-06-22 06:01:00,082 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.386e+02 2.475e+02 2.724e+02 3.520e+02, threshold=4.950e+02, percent-clipped=0.0
2024-06-22 06:01:04,729 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=540496.0, ans=0.125
2024-06-22 06:01:05,956 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=540496.0, ans=0.125
2024-06-22 06:01:08,168 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=540496.0, ans=0.125
2024-06-22 06:01:22,591 INFO [train.py:1028] (1/2) Epoch 30, batch 1450, loss[loss=0.1816, simple_loss=0.2388, pruned_loss=0.06215, over 13130.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2458, pruned_loss=0.0611, over 2586347.66 frames. ], batch size: 121, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:01:24,830 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.57 vs. limit=10.0
2024-06-22 06:01:25,974 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=540551.0, ans=0.0
2024-06-22 06:01:45,301 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=540606.0, ans=0.025
2024-06-22 06:01:45,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540606.0, ans=0.1
2024-06-22 06:01:57,634 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=540642.6666666666, ans=0.125
2024-06-22 06:01:58,149 INFO [train.py:1028] (1/2) Epoch 30, batch 1500, loss[loss=0.201, simple_loss=0.2574, pruned_loss=0.07229, over 13205.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2457, pruned_loss=0.06144, over 2589250.49 frames. ], batch size: 83, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:01:58,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=540642.6666666666, ans=0.025
2024-06-22 06:01:58,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=540642.6666666666, ans=0.0
2024-06-22 06:02:04,230 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540661.0, ans=0.1
2024-06-22 06:02:08,556 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.410e+02 2.593e+02 2.839e+02 3.544e+02, threshold=5.186e+02, percent-clipped=0.0
2024-06-22 06:02:31,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=540716.0, ans=0.0
2024-06-22 06:02:34,409 INFO [train.py:1028] (1/2) Epoch 30, batch 1550, loss[loss=0.1864, simple_loss=0.2446, pruned_loss=0.0641, over 13051.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2464, pruned_loss=0.06171, over 2584050.13 frames. ], batch size: 102, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:02:46,343 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.14 vs. limit=15.0
2024-06-22 06:02:48,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=540771.0, ans=0.0
2024-06-22 06:02:52,871 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=540771.0, ans=10.0
2024-06-22 06:02:54,220 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=540789.3333333334, ans=0.2
2024-06-22 06:03:05,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=540807.6666666666, ans=0.04949747468305833
2024-06-22 06:03:07,061 INFO [train.py:1028] (1/2) Epoch 30, batch 1600, loss[loss=0.1777, simple_loss=0.2384, pruned_loss=0.05855, over 13159.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2465, pruned_loss=0.06176, over 2579134.38 frames. ], batch size: 77, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:03:09,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=540826.0, ans=0.04949747468305833
2024-06-22 06:03:16,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.07 vs. limit=12.0
2024-06-22 06:03:17,154 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.453e+02 2.672e+02 2.946e+02 4.286e+02, threshold=5.345e+02, percent-clipped=0.0
2024-06-22 06:03:19,712 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.86 vs. limit=22.5
2024-06-22 06:03:26,527 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=540881.0, ans=0.0
2024-06-22 06:03:40,252 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=540899.3333333334, ans=0.0
2024-06-22 06:03:42,030 INFO [train.py:1028] (1/2) Epoch 30, batch 1650, loss[loss=0.1913, simple_loss=0.2437, pruned_loss=0.06949, over 13167.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2464, pruned_loss=0.06192, over 2575201.88 frames. ], batch size: 95, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:03:43,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=540917.6666666666, ans=0.0
2024-06-22 06:03:45,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=540917.6666666666, ans=0.0
2024-06-22 06:03:50,870 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 06:03:53,485 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=540936.0, ans=0.125
2024-06-22 06:03:56,278 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=540954.3333333334, ans=0.2
2024-06-22 06:04:08,026 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540991.0, ans=0.1
2024-06-22 06:04:14,545 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 06:04:14,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=540991.0, ans=0.125
2024-06-22 06:04:17,537 INFO [train.py:1028] (1/2) Epoch 30, batch 1700, loss[loss=0.1762, simple_loss=0.2434, pruned_loss=0.05454, over 12635.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2464, pruned_loss=0.06167, over 2580813.15 frames. ], batch size: 25, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:04:22,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=541009.3333333334, ans=0.125
2024-06-22 06:04:27,497 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.418e+02 2.543e+02 2.829e+02 4.265e+02, threshold=5.087e+02, percent-clipped=0.0
2024-06-22 06:04:38,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=541064.3333333334, ans=0.125
2024-06-22 06:04:43,452 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=541082.6666666666, ans=0.025
2024-06-22 06:04:45,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=541082.6666666666, ans=0.125
2024-06-22 06:04:49,749 INFO [train.py:1028] (1/2) Epoch 30, batch 1750, loss[loss=0.1786, simple_loss=0.2443, pruned_loss=0.05647, over 12501.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2467, pruned_loss=0.06181, over 2581518.74 frames. ], batch size: 22, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:04:53,877 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=541101.0, ans=0.0
2024-06-22 06:04:57,471 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=541119.3333333334, ans=0.125
2024-06-22 06:04:57,948 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=541119.3333333334, ans=0.125
2024-06-22 06:05:07,489 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=541137.6666666666, ans=0.1
2024-06-22 06:05:17,601 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=541174.3333333334, ans=0.0
2024-06-22 06:05:18,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=541174.3333333334, ans=0.125
2024-06-22 06:05:22,604 INFO [train.py:1028] (1/2) Epoch 30, batch 1800, loss[loss=0.1761, simple_loss=0.2396, pruned_loss=0.05629, over 13207.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2468, pruned_loss=0.06188, over 2582466.01 frames. ], batch size: 67, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:05:26,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=541192.6666666666, ans=0.0
2024-06-22 06:05:28,270 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=541192.6666666666, ans=0.2
2024-06-22 06:05:31,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=541211.0, ans=0.2
2024-06-22 06:05:33,347 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.371e+02 2.539e+02 2.705e+02 3.332e+02, threshold=5.078e+02, percent-clipped=0.0
2024-06-22 06:05:42,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=541229.3333333334, ans=0.0
2024-06-22 06:05:43,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=541229.3333333334, ans=0.125
2024-06-22 06:05:43,633 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.66 vs. limit=15.0
2024-06-22 06:05:56,068 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=541266.0, ans=0.0
2024-06-22 06:05:58,469 INFO [train.py:1028] (1/2) Epoch 30, batch 1850, loss[loss=0.1793, simple_loss=0.2394, pruned_loss=0.05961, over 13194.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.247, pruned_loss=0.06165, over 2585031.11 frames. ], batch size: 83, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:06:23,078 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=541339.3333333334, ans=10.0
2024-06-22 06:06:29,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.25 vs. limit=15.0
2024-06-22 06:06:33,792 INFO [train.py:1028] (1/2) Epoch 30, batch 1900, loss[loss=0.1826, simple_loss=0.2447, pruned_loss=0.0603, over 13116.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2465, pruned_loss=0.06155, over 2587521.74 frames. ], batch size: 95, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:06:36,490 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=541376.0, ans=0.125
2024-06-22 06:06:44,665 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.388e+02 2.495e+02 2.679e+02 3.422e+02, threshold=4.990e+02, percent-clipped=0.0
2024-06-22 06:06:55,263 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=541431.0, ans=0.125
2024-06-22 06:07:01,914 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 06:07:06,949 INFO [train.py:1028] (1/2) Epoch 30, batch 1950, loss[loss=0.1723, simple_loss=0.2425, pruned_loss=0.05099, over 13268.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2456, pruned_loss=0.06159, over 2593630.99 frames. ], batch size: 52, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:07:11,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=541467.6666666666, ans=0.125
2024-06-22 06:07:17,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=541486.0, ans=0.0
2024-06-22 06:07:41,560 INFO [train.py:1028] (1/2) Epoch 30, batch 2000, loss[loss=0.2093, simple_loss=0.2658, pruned_loss=0.07639, over 12593.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2458, pruned_loss=0.06175, over 2589132.02 frames. ], batch size: 22, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:07:47,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=541559.3333333334, ans=0.0
2024-06-22 06:07:51,768 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.361e+02 2.521e+02 2.752e+02 3.459e+02, threshold=5.043e+02, percent-clipped=0.0
2024-06-22 06:08:13,072 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=541632.6666666666, ans=0.2
2024-06-22 06:08:16,592 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=541651.0, ans=0.125
2024-06-22 06:08:17,064 INFO [train.py:1028] (1/2) Epoch 30, batch 2050, loss[loss=0.1672, simple_loss=0.2372, pruned_loss=0.04857, over 12604.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2458, pruned_loss=0.0618, over 2583608.95 frames. ], batch size: 29, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:08:29,853 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=541687.6666666666, ans=0.125
2024-06-22 06:08:41,250 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=541706.0, ans=0.0
2024-06-22 06:08:48,149 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=541724.3333333334, ans=0.025
2024-06-22 06:08:50,131 INFO [train.py:1028] (1/2) Epoch 30, batch 2100, loss[loss=0.1977, simple_loss=0.2647, pruned_loss=0.06534, over 13185.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2465, pruned_loss=0.0616, over 2586031.51 frames. ], batch size: 59, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:08:58,498 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0
2024-06-22 06:09:00,611 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.370e+02 2.519e+02 2.733e+02 3.736e+02, threshold=5.037e+02, percent-clipped=0.0
2024-06-22 06:09:16,830 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=541816.0, ans=0.025
2024-06-22 06:09:21,475 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=541816.0, ans=0.125
2024-06-22 06:09:23,257 INFO [train.py:1028] (1/2) Epoch 30, batch 2150, loss[loss=0.1806, simple_loss=0.251, pruned_loss=0.05512, over 13280.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2457, pruned_loss=0.06105, over 2588708.30 frames. ], batch size: 52, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:09:24,467 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0
2024-06-22 06:09:28,051 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=541834.3333333334, ans=0.0
2024-06-22 06:09:34,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=541852.6666666666, ans=10.0
2024-06-22 06:09:34,937 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=541852.6666666666, ans=0.125
2024-06-22 06:09:35,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=541852.6666666666, ans=0.125
2024-06-22 06:09:35,873 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0
2024-06-22 06:09:37,598 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=541871.0, ans=0.0
2024-06-22 06:09:51,339 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=541889.3333333334, ans=0.125
2024-06-22 06:09:59,276 INFO [train.py:1028] (1/2) Epoch 30, batch 2200, loss[loss=0.1924, simple_loss=0.2461, pruned_loss=0.06935, over 13200.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2463, pruned_loss=0.06146, over 2588783.95 frames. ], batch size: 83, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:10:02,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=541926.0, ans=0.05
2024-06-22 06:10:09,395 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.335e+02 2.468e+02 2.613e+02 4.116e+02, threshold=4.937e+02, percent-clipped=0.0
2024-06-22 06:10:26,066 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=541981.0, ans=0.2
2024-06-22 06:10:34,769 INFO [train.py:1028] (1/2) Epoch 30, batch 2250, loss[loss=0.1765, simple_loss=0.2481, pruned_loss=0.05246, over 13258.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2468, pruned_loss=0.06181, over 2587069.32 frames. ], batch size: 63, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:10:34,930 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=542017.6666666666, ans=0.0
2024-06-22 06:10:45,612 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=542036.0, ans=0.0
2024-06-22 06:10:48,917 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=542054.3333333334, ans=0.0
2024-06-22 06:10:51,669 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=542054.3333333334, ans=0.0
2024-06-22 06:10:53,761 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=542072.6666666666, ans=0.0
2024-06-22 06:10:55,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0
2024-06-22 06:10:56,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=542072.6666666666, ans=0.2
2024-06-22 06:10:56,428 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=542072.6666666666, ans=0.125
2024-06-22 06:10:59,079 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=542072.6666666666, ans=0.125
2024-06-22 06:11:05,691 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 06:11:07,706 INFO [train.py:1028] (1/2) Epoch 30, batch 2300, loss[loss=0.1619, simple_loss=0.2282, pruned_loss=0.04777, over 12901.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2466, pruned_loss=0.06168, over 2581662.72 frames. ], batch size: 33, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:11:07,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=542109.3333333334, ans=0.0
2024-06-22 06:11:18,257 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.338e+02 2.548e+02 2.789e+02 4.157e+02, threshold=5.095e+02, percent-clipped=0.0
2024-06-22 06:11:29,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=542164.3333333334, ans=0.0
2024-06-22 06:11:42,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=542182.6666666666, ans=0.0
2024-06-22 06:11:43,596 INFO [train.py:1028] (1/2) Epoch 30, batch 2350, loss[loss=0.174, simple_loss=0.2366, pruned_loss=0.05575, over 13209.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2464, pruned_loss=0.0616, over 2585134.10 frames. ], batch size: 67, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:11:47,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542201.0, ans=0.1
2024-06-22 06:11:58,542 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=542237.6666666666, ans=0.125
2024-06-22 06:12:16,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=542274.3333333334, ans=0.125
2024-06-22 06:12:19,676 INFO [train.py:1028] (1/2) Epoch 30, batch 2400, loss[loss=0.1862, simple_loss=0.2464, pruned_loss=0.06302, over 13323.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2456, pruned_loss=0.06174, over 2586914.74 frames. ], batch size: 46, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:12:30,013 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.354e+02 2.481e+02 2.785e+02 3.991e+02, threshold=4.963e+02, percent-clipped=0.0
2024-06-22 06:12:31,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=542311.0, ans=0.0
2024-06-22 06:12:35,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=542329.3333333334, ans=0.09899494936611666
2024-06-22 06:12:36,173 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=542329.3333333334, ans=0.125
2024-06-22 06:12:36,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=542329.3333333334, ans=0.0
2024-06-22 06:12:40,834 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542347.6666666666, ans=0.1
2024-06-22 06:12:51,225 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=542366.0, ans=0.125
2024-06-22 06:12:52,288 INFO [train.py:1028] (1/2) Epoch 30, batch 2450, loss[loss=0.1747, simple_loss=0.2319, pruned_loss=0.05872, over 13280.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2454, pruned_loss=0.06191, over 2582736.94 frames. ], batch size: 63, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:13:05,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=542421.0, ans=0.2
2024-06-22 06:13:11,577 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=542439.3333333334, ans=0.0
2024-06-22 06:13:20,214 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=542457.6666666666, ans=0.2
2024-06-22 06:13:20,280 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=542457.6666666666, ans=0.0
2024-06-22 06:13:23,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=15.0
2024-06-22 06:13:24,809 INFO [train.py:1028] (1/2) Epoch 30, batch 2500, loss[loss=0.1972, simple_loss=0.2468, pruned_loss=0.0738, over 13226.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2451, pruned_loss=0.06203, over 2586056.69 frames. ], batch size: 83, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:13:25,203 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.12 vs. limit=15.0
2024-06-22 06:13:39,305 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=542494.3333333334, ans=0.125
2024-06-22 06:13:39,793 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.396e+02 2.550e+02 2.760e+02 4.922e+02, threshold=5.101e+02, percent-clipped=0.0
2024-06-22 06:13:40,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542494.3333333334, ans=0.1
2024-06-22 06:13:41,269 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=542494.3333333334, ans=0.0
2024-06-22 06:13:55,976 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=542549.3333333334, ans=0.125
2024-06-22 06:13:58,883 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=542549.3333333334, ans=0.05
2024-06-22 06:14:02,698 INFO [train.py:1028] (1/2) Epoch 30, batch 2550, loss[loss=0.1878, simple_loss=0.2588, pruned_loss=0.0584, over 12516.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2444, pruned_loss=0.06169, over 2586045.73 frames. ], batch size: 22, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:14:06,113 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=542567.6666666666, ans=0.2
2024-06-22 06:14:13,086 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542586.0, ans=0.1
2024-06-22 06:14:14,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=542586.0, ans=0.0
2024-06-22 06:14:21,519 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0
2024-06-22 06:14:23,816 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=542604.3333333334, ans=0.07
2024-06-22 06:14:25,207 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=542622.6666666666, ans=0.125
2024-06-22 06:14:26,819 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.53 vs. limit=22.5
2024-06-22 06:14:38,220 INFO [train.py:1028] (1/2) Epoch 30, batch 2600, loss[loss=0.1711, simple_loss=0.2359, pruned_loss=0.05313, over 13246.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.2434, pruned_loss=0.06145, over 2586461.94 frames. ], batch size: 52, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:14:52,328 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542677.6666666666, ans=0.1
2024-06-22 06:14:54,236 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.340e+02 2.502e+02 2.665e+02 3.339e+02, threshold=5.004e+02, percent-clipped=0.0
2024-06-22 06:14:56,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=542696.0, ans=0.125
2024-06-22 06:14:58,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=542696.0, ans=0.125
2024-06-22 06:14:58,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=542696.0, ans=0.125
2024-06-22 06:15:00,031 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542696.0, ans=0.1
2024-06-22 06:15:05,135 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=542714.3333333334, ans=0.0
2024-06-22 06:15:11,503 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.03 vs. limit=15.0
2024-06-22 06:15:14,356 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.83 vs. limit=15.0
2024-06-22 06:15:16,522 INFO [train.py:1028] (1/2) Epoch 30, batch 2650, loss[loss=0.1587, simple_loss=0.2077, pruned_loss=0.05489, over 13081.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2416, pruned_loss=0.06074, over 2587714.25 frames. ], batch size: 144, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:15:21,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542751.0, ans=0.1
2024-06-22 06:15:42,842 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=542806.0, ans=0.125
2024-06-22 06:15:50,094 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0
2024-06-22 06:15:54,470 INFO [train.py:1028] (1/2) Epoch 30, batch 2700, loss[loss=0.1748, simple_loss=0.2321, pruned_loss=0.05878, over 13219.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2405, pruned_loss=0.06058, over 2586420.56 frames. ], batch size: 89, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:16:05,087 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.341e+02 2.483e+02 2.699e+02 3.731e+02, threshold=4.966e+02, percent-clipped=0.0
2024-06-22 06:16:15,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=542879.3333333334, ans=0.125
2024-06-22 06:16:19,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=542897.6666666666, ans=0.07
2024-06-22 06:16:30,494 INFO [train.py:1028] (1/2) Epoch 30, batch 2750, loss[loss=0.1883, simple_loss=0.2406, pruned_loss=0.06798, over 13248.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2394, pruned_loss=0.05998, over 2583834.37 frames. ], batch size: 43, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:16:53,387 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542989.3333333334, ans=0.1
2024-06-22 06:17:01,470 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=543007.6666666666, ans=0.0
2024-06-22 06:17:03,205 INFO [train.py:1028] (1/2) Epoch 30, batch 2800, loss[loss=0.1817, simple_loss=0.2349, pruned_loss=0.06427, over 10949.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2388, pruned_loss=0.06004, over 2580702.31 frames. ], batch size: 304, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:17:12,582 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=543044.3333333334, ans=0.125
2024-06-22 06:17:13,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=543044.3333333334, ans=0.0
2024-06-22 06:17:13,644 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.319e+02 2.458e+02 2.654e+02 3.588e+02, threshold=4.915e+02, percent-clipped=0.0
2024-06-22 06:17:29,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=543081.0, ans=0.125
2024-06-22 06:17:34,497 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=543099.3333333334, ans=0.125
2024-06-22 06:17:39,211 INFO [train.py:1028] (1/2) Epoch 30, batch 2850, loss[loss=0.1829, simple_loss=0.2451, pruned_loss=0.06032, over 13305.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2381, pruned_loss=0.0598, over 2578667.85 frames. ], batch size: 49, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:17:43,316 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=543117.6666666666, ans=0.025
2024-06-22 06:17:44,465 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=543117.6666666666, ans=0.0
2024-06-22 06:17:47,366 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=543136.0, ans=0.0
2024-06-22 06:18:01,189 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=543172.6666666666, ans=0.0
2024-06-22 06:18:13,575 INFO [train.py:1028] (1/2) Epoch 30, batch 2900, loss[loss=0.1661, simple_loss=0.2322, pruned_loss=0.04998, over 13150.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2361, pruned_loss=0.05916, over 2587154.73 frames. ], batch size: 55, lr: 1.93e-03, grad_scale: 32.0
2024-06-22 06:18:17,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5
2024-06-22 06:18:20,299 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.27 vs.
limit=15.0 2024-06-22 06:18:24,492 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.317e+02 2.460e+02 2.689e+02 4.047e+02, threshold=4.921e+02, percent-clipped=0.0 2024-06-22 06:18:31,903 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=543246.0, ans=0.0 2024-06-22 06:18:42,642 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=543282.6666666666, ans=0.0 2024-06-22 06:18:46,934 INFO [train.py:1028] (1/2) Epoch 30, batch 2950, loss[loss=0.1561, simple_loss=0.2112, pruned_loss=0.05054, over 13233.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.2359, pruned_loss=0.05901, over 2581515.06 frames. ], batch size: 43, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:18:47,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=543301.0, ans=0.125 2024-06-22 06:18:48,769 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=15.0 2024-06-22 06:18:57,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=543319.3333333334, ans=0.5 2024-06-22 06:19:06,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=543356.0, ans=0.0 2024-06-22 06:19:10,115 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.73 vs. limit=15.0 2024-06-22 06:19:24,253 INFO [train.py:1028] (1/2) Epoch 30, batch 3000, loss[loss=0.1642, simple_loss=0.2219, pruned_loss=0.05321, over 13202.00 frames. ], tot_loss[loss=0.1766, simple_loss=0.2353, pruned_loss=0.05898, over 2579772.65 frames. ], batch size: 59, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:19:24,253 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 06:19:32,227 INFO [train.py:1060] (1/2) Epoch 30, validation: loss=0.194, simple_loss=0.252, pruned_loss=0.06799, over 351949.00 frames. 2024-06-22 06:19:32,228 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 06:19:43,404 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.334e+02 2.438e+02 2.640e+02 3.568e+02, threshold=4.875e+02, percent-clipped=0.0 2024-06-22 06:19:44,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=543411.0, ans=0.0 2024-06-22 06:20:09,597 INFO [train.py:1028] (1/2) Epoch 30, batch 3050, loss[loss=0.1861, simple_loss=0.2492, pruned_loss=0.06152, over 13273.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2345, pruned_loss=0.05902, over 2579468.63 frames. 
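
At batch 3000 above the trainer pauses, runs a full pass over the dev set (the same 351949.00 frames every time validation is logged), reports the validation loss, and then resumes the per-batch stream. Each batch line carries two tuples: loss[...] for the current batch and tot_loss[...] as a frame-weighted running average whose frame count hovers around 2.58M. One plausible way to maintain such an average is an exponentially decayed frame-weighted sum; this is an assumption about train.py's bookkeeping, not a copy of it, with the decay chosen so the steady-state window (about 13200 / (1 - 0.995) ≈ 2.6M frames) matches the logged frame counts.

def update_tot_loss(tot: dict, batch: dict, decay: float = 0.995) -> dict:
    """Frame-weighted running average of loss components (sketch).
    `batch` holds this batch's values, e.g.
    {"frames": 13202.0, "loss": 0.1642,
     "simple_loss": 0.2219, "pruned_loss": 0.05321}."""
    frames = batch["frames"]
    # Decay the accumulated sums, then add this batch's contribution.
    tot["frames"] = decay * tot.get("frames", 0.0) + frames
    for key in ("loss", "simple_loss", "pruned_loss"):
        tot[key] = decay * tot.get(key, 0.0) + frames * batch[key]
    # Reported tot_loss values are the frame-weighted averages.
    return {key: tot[key] / tot["frames"]
            for key in ("loss", "simple_loss", "pruned_loss")}
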
], batch size: 46, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:20:15,355 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=543502.6666666666, ans=0.0 2024-06-22 06:20:16,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=543502.6666666666, ans=0.125 2024-06-22 06:20:19,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=543502.6666666666, ans=0.125 2024-06-22 06:20:20,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=543502.6666666666, ans=0.125 2024-06-22 06:20:21,392 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.93 vs. limit=6.0 2024-06-22 06:20:25,697 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=543521.0, ans=0.1 2024-06-22 06:20:25,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=543521.0, ans=0.2 2024-06-22 06:20:37,327 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=543557.6666666666, ans=0.2 2024-06-22 06:20:41,559 INFO [train.py:1028] (1/2) Epoch 30, batch 3100, loss[loss=0.1623, simple_loss=0.2151, pruned_loss=0.05478, over 13009.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.234, pruned_loss=0.05844, over 2579404.72 frames. ], batch size: 144, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:20:47,439 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.13 vs. limit=12.0 2024-06-22 06:20:49,107 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=543594.3333333334, ans=0.09899494936611666 2024-06-22 06:20:52,100 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.306e+02 2.496e+02 2.751e+02 3.896e+02, threshold=4.992e+02, percent-clipped=0.0 2024-06-22 06:20:56,217 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2024-06-22 06:20:56,714 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=543612.6666666666, ans=0.125 2024-06-22 06:20:57,922 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=543612.6666666666, ans=0.125 2024-06-22 06:21:08,358 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=543649.3333333334, ans=0.1 2024-06-22 06:21:13,618 INFO [train.py:1028] (1/2) Epoch 30, batch 3150, loss[loss=0.1848, simple_loss=0.238, pruned_loss=0.06578, over 12930.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2328, pruned_loss=0.05792, over 2581512.86 frames. 
], batch size: 158, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:21:15,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=543667.6666666666, ans=15.0 2024-06-22 06:21:31,303 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.88 vs. limit=10.0 2024-06-22 06:21:43,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=543741.0, ans=0.1 2024-06-22 06:21:45,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=543741.0, ans=0.025 2024-06-22 06:21:49,668 INFO [train.py:1028] (1/2) Epoch 30, batch 3200, loss[loss=0.1693, simple_loss=0.2279, pruned_loss=0.05536, over 13177.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2321, pruned_loss=0.05758, over 2581829.32 frames. ], batch size: 55, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:21:59,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=543777.6666666666, ans=0.125 2024-06-22 06:22:00,251 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.301e+02 2.438e+02 2.621e+02 3.469e+02, threshold=4.876e+02, percent-clipped=0.0 2024-06-22 06:22:04,726 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=543777.6666666666, ans=0.125 2024-06-22 06:22:10,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=543796.0, ans=0.2 2024-06-22 06:22:19,491 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=543832.6666666666, ans=0.125 2024-06-22 06:22:20,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=543832.6666666666, ans=0.125 2024-06-22 06:22:25,105 INFO [train.py:1028] (1/2) Epoch 30, batch 3250, loss[loss=0.1682, simple_loss=0.2313, pruned_loss=0.05255, over 13241.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.232, pruned_loss=0.05768, over 2585468.03 frames. ], batch size: 72, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:22:36,150 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=543869.3333333334, ans=0.125 2024-06-22 06:22:43,550 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=543887.6666666666, ans=0.125 2024-06-22 06:22:50,251 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.473e+01 2024-06-22 06:22:51,459 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=543924.3333333334, ans=0.1 2024-06-22 06:22:53,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=543924.3333333334, ans=0.125 2024-06-22 06:22:58,591 INFO [train.py:1028] (1/2) Epoch 30, batch 3300, loss[loss=0.1938, simple_loss=0.2459, pruned_loss=0.07085, over 12782.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.2316, pruned_loss=0.05753, over 2583179.17 frames. 
], batch size: 176, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:22:58,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=543942.6666666666, ans=0.1 2024-06-22 06:22:59,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543942.6666666666, ans=0.1 2024-06-22 06:23:02,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=543942.6666666666, ans=0.0 2024-06-22 06:23:08,089 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=543961.0, ans=0.025 2024-06-22 06:23:09,237 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.367e+02 2.528e+02 2.813e+02 3.945e+02, threshold=5.056e+02, percent-clipped=0.0 2024-06-22 06:23:09,315 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=543961.0, ans=0.0 2024-06-22 06:23:19,747 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-22 06:23:26,319 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=543997.6666666666, ans=0.125 2024-06-22 06:23:28,404 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=544016.0, ans=0.125 2024-06-22 06:23:29,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544016.0, ans=0.1 2024-06-22 06:23:30,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=544016.0, ans=0.125 2024-06-22 06:23:30,333 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=544016.0, ans=0.125 2024-06-22 06:23:32,734 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544016.0, ans=0.1 2024-06-22 06:23:34,405 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.06 vs. limit=10.0 2024-06-22 06:23:34,625 INFO [train.py:1028] (1/2) Epoch 30, batch 3350, loss[loss=0.1853, simple_loss=0.2397, pruned_loss=0.06544, over 12902.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2312, pruned_loss=0.05748, over 2578034.41 frames. ], batch size: 158, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:23:35,514 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=544034.3333333334, ans=0.125 2024-06-22 06:23:38,291 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.67 vs. 
limit=15.0 2024-06-22 06:23:43,909 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=544052.6666666666, ans=0.0 2024-06-22 06:23:49,798 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=544071.0, ans=22.5 2024-06-22 06:24:01,012 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.99 vs. limit=15.0 2024-06-22 06:24:11,201 INFO [train.py:1028] (1/2) Epoch 30, batch 3400, loss[loss=0.1677, simple_loss=0.2276, pruned_loss=0.05392, over 12694.00 frames. ], tot_loss[loss=0.1735, simple_loss=0.2312, pruned_loss=0.05791, over 2576715.09 frames. ], batch size: 22, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:24:21,458 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.325e+02 2.497e+02 2.812e+02 4.183e+02, threshold=4.995e+02, percent-clipped=0.0 2024-06-22 06:24:26,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=544162.6666666666, ans=0.0 2024-06-22 06:24:43,778 INFO [train.py:1028] (1/2) Epoch 30, batch 3450, loss[loss=0.1791, simple_loss=0.2334, pruned_loss=0.06237, over 12804.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2309, pruned_loss=0.05777, over 2576807.80 frames. ], batch size: 176, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:24:44,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=544217.6666666666, ans=0.0 2024-06-22 06:24:44,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=544217.6666666666, ans=0.125 2024-06-22 06:24:45,457 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.91 vs. limit=22.5 2024-06-22 06:24:48,890 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544217.6666666666, ans=0.1 2024-06-22 06:24:54,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=544236.0, ans=0.125 2024-06-22 06:25:01,296 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544254.3333333334, ans=0.1 2024-06-22 06:25:04,682 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=544272.6666666666, ans=0.04949747468305833 2024-06-22 06:25:12,120 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=544291.0, ans=0.125 2024-06-22 06:25:18,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=544291.0, ans=0.0 2024-06-22 06:25:19,397 INFO [train.py:1028] (1/2) Epoch 30, batch 3500, loss[loss=0.1733, simple_loss=0.2327, pruned_loss=0.05695, over 12965.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2304, pruned_loss=0.05745, over 2575942.66 frames. 
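
The many "ScheduledFloat: name=..., batch_count=..., ans=..." lines record the current value (ans) of hyperparameters that are scheduled as a function of the global batch count: skip rates, dropout probabilities, balancer probabilities and similar regularizer knobs inside the encoder. The real class lives in icefall's scaling.py; the sketch below only illustrates the general idea, a clamped piecewise-linear schedule, and its class name and schedule points are invented for the example.

class ScheduledFloatSketch:
    """A float hyperparameter that is piecewise-linear in batch_count
    and clamped at both ends (illustrative sketch)."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs.
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)  # linear interpolation

# E.g. a skip rate decaying from 0.2 to 0.0 over the first 20k batches
# (made-up schedule points):
conv_skip_rate = ScheduledFloatSketch((0.0, 0.2), (20000.0, 0.0))
print(conv_skip_rate(542311.0))  # -> 0.0, like the ans=0.0 skip rates above
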
], batch size: 33, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:25:19,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=544309.3333333334, ans=0.0 2024-06-22 06:25:29,831 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.344e+02 2.476e+02 2.641e+02 3.205e+02, threshold=4.951e+02, percent-clipped=0.0 2024-06-22 06:25:32,013 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=544346.0, ans=0.0 2024-06-22 06:25:32,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.45 vs. limit=22.5 2024-06-22 06:25:33,260 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=544346.0, ans=0.2 2024-06-22 06:25:33,888 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=544346.0, ans=0.1 2024-06-22 06:25:38,544 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=544364.3333333334, ans=10.0 2024-06-22 06:25:49,529 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=544382.6666666666, ans=0.125 2024-06-22 06:25:52,099 INFO [train.py:1028] (1/2) Epoch 30, batch 3550, loss[loss=0.167, simple_loss=0.2262, pruned_loss=0.05391, over 13193.00 frames. ], tot_loss[loss=0.1721, simple_loss=0.2299, pruned_loss=0.05715, over 2577202.48 frames. ], batch size: 95, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:26:01,781 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=544419.3333333334, ans=0.0 2024-06-22 06:26:04,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=544419.3333333334, ans=0.5 2024-06-22 06:26:06,988 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=544419.3333333334, ans=0.0 2024-06-22 06:26:09,650 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=544437.6666666666, ans=0.0 2024-06-22 06:26:11,647 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.27 vs. limit=22.5 2024-06-22 06:26:20,822 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=544474.3333333334, ans=0.2 2024-06-22 06:26:27,405 INFO [train.py:1028] (1/2) Epoch 30, batch 3600, loss[loss=0.157, simple_loss=0.2229, pruned_loss=0.04557, over 13287.00 frames. ], tot_loss[loss=0.172, simple_loss=0.2298, pruned_loss=0.05712, over 2580984.73 frames. 
], batch size: 49, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:26:28,983 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=544492.6666666666, ans=0.09899494936611666 2024-06-22 06:26:38,256 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.274e+02 2.363e+02 2.491e+02 3.554e+02, threshold=4.727e+02, percent-clipped=0.0 2024-06-22 06:26:45,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=544529.3333333334, ans=0.125 2024-06-22 06:26:50,017 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=544547.6666666666, ans=0.0 2024-06-22 06:26:51,607 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.93 vs. limit=15.0 2024-06-22 06:26:57,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=544566.0, ans=0.125 2024-06-22 06:27:00,468 INFO [train.py:1028] (1/2) Epoch 30, batch 3650, loss[loss=0.1762, simple_loss=0.2323, pruned_loss=0.06004, over 13182.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.2294, pruned_loss=0.05693, over 2579147.02 frames. ], batch size: 103, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:27:03,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.25 vs. limit=15.0 2024-06-22 06:27:03,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=544584.3333333334, ans=0.2 2024-06-22 06:27:09,431 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=544602.6666666666, ans=0.2 2024-06-22 06:27:21,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=544621.0, ans=0.1 2024-06-22 06:27:22,796 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.20 vs. limit=22.5 2024-06-22 06:27:36,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=544676.0, ans=0.125 2024-06-22 06:27:37,209 INFO [train.py:1028] (1/2) Epoch 30, batch 3700, loss[loss=0.1466, simple_loss=0.2056, pruned_loss=0.04375, over 13231.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2281, pruned_loss=0.05644, over 2583558.49 frames. 
], batch size: 72, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:27:37,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=544676.0, ans=0.125 2024-06-22 06:27:42,625 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544676.0, ans=0.1 2024-06-22 06:27:47,809 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.265e+02 2.404e+02 2.569e+02 3.591e+02, threshold=4.808e+02, percent-clipped=0.0 2024-06-22 06:27:50,516 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=544712.6666666666, ans=0.2 2024-06-22 06:28:10,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=544749.3333333334, ans=0.125 2024-06-22 06:28:13,763 INFO [train.py:1028] (1/2) Epoch 30, batch 3750, loss[loss=0.1691, simple_loss=0.2293, pruned_loss=0.05442, over 12526.00 frames. ], tot_loss[loss=0.1701, simple_loss=0.2278, pruned_loss=0.05623, over 2585358.96 frames. ], batch size: 22, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:28:16,725 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=544767.6666666666, ans=0.0 2024-06-22 06:28:27,559 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2024-06-22 06:28:30,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=544804.3333333334, ans=0.125 2024-06-22 06:28:31,713 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=544804.3333333334, ans=0.125 2024-06-22 06:28:45,118 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.46 vs. limit=10.0 2024-06-22 06:28:46,207 INFO [train.py:1028] (1/2) Epoch 30, batch 3800, loss[loss=0.1551, simple_loss=0.2143, pruned_loss=0.04795, over 13198.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2276, pruned_loss=0.05603, over 2583831.57 frames. ], batch size: 83, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:28:57,480 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=544877.6666666666, ans=0.2 2024-06-22 06:28:57,884 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.338e+02 2.569e+02 2.719e+02 3.938e+02, threshold=5.137e+02, percent-clipped=0.0 2024-06-22 06:29:08,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=544914.3333333334, ans=0.125 2024-06-22 06:29:24,700 INFO [train.py:1028] (1/2) Epoch 30, batch 3850, loss[loss=0.1672, simple_loss=0.2206, pruned_loss=0.05686, over 13010.00 frames. ], tot_loss[loss=0.1692, simple_loss=0.2272, pruned_loss=0.05554, over 2583063.27 frames. ], batch size: 144, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:29:27,036 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.78 vs. 
limit=10.0 2024-06-22 06:29:29,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544951.0, ans=0.1 2024-06-22 06:29:54,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=545024.3333333334, ans=0.025 2024-06-22 06:29:56,400 INFO [train.py:1028] (1/2) Epoch 30, batch 3900, loss[loss=0.1715, simple_loss=0.2245, pruned_loss=0.05929, over 13200.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.2271, pruned_loss=0.05561, over 2585580.30 frames. ], batch size: 83, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:29:58,994 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=545042.6666666666, ans=0.04949747468305833 2024-06-22 06:29:59,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.14 vs. limit=22.5 2024-06-22 06:30:07,290 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.279e+02 2.443e+02 2.743e+02 3.515e+02, threshold=4.886e+02, percent-clipped=0.0 2024-06-22 06:30:19,181 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=545079.3333333334, ans=0.125 2024-06-22 06:30:23,697 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.98 vs. limit=22.5 2024-06-22 06:30:33,949 INFO [train.py:1028] (1/2) Epoch 30, batch 3950, loss[loss=0.1663, simple_loss=0.2171, pruned_loss=0.05771, over 13110.00 frames. ], tot_loss[loss=0.1683, simple_loss=0.2262, pruned_loss=0.05521, over 2587632.63 frames. ], batch size: 132, lr: 1.92e-03, grad_scale: 16.0 2024-06-22 06:30:34,818 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545134.3333333334, ans=0.1 2024-06-22 06:30:57,973 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=545189.3333333334, ans=0.125 2024-06-22 06:31:08,376 INFO [train.py:1028] (1/2) Epoch 30, batch 4000, loss[loss=0.1868, simple_loss=0.2434, pruned_loss=0.06514, over 12950.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2259, pruned_loss=0.05529, over 2582293.27 frames. 
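
grad_scale in the batch lines is the fp16 loss-scaling factor, and its movement through this section (64.0 from batch 2950, 32.0 by batch 3800, 16.0 at batch 3950, 32.0 again at batch 4000) is the usual dynamic-loss-scaling pattern: halve when a gradient overflows, grow back after a run of clean steps. Below is a minimal sketch of that policy, analogous to torch.cuda.amp.GradScaler rather than a copy of the trainer's actual logic; GROWTH_INTERVAL is a hypothetical number.

GROWTH_INTERVAL = 1000  # hypothetical: clean steps required before doubling

class LossScaleSketch:
    def __init__(self, scale: float = 32.0):
        self.scale = scale
        self.good_steps = 0

    def update(self, grads_are_finite: bool) -> bool:
        """Return True if the optimizer step should be taken."""
        if not grads_are_finite:
            self.scale /= 2.0      # e.g. 32 -> 16, as at batch 3950 above
            self.good_steps = 0
            return False           # skip the step that overflowed
        self.good_steps += 1
        if self.good_steps >= GROWTH_INTERVAL:
            self.scale *= 2.0      # e.g. 16 -> 32, as at batch 4000 above
            self.good_steps = 0
        return True
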
], batch size: 39, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:31:14,311 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=545226.0, ans=0.2 2024-06-22 06:31:18,905 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545244.3333333334, ans=0.1 2024-06-22 06:31:20,623 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.255e+02 2.431e+02 2.673e+02 3.583e+02, threshold=4.862e+02, percent-clipped=0.0 2024-06-22 06:31:24,925 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=545262.6666666666, ans=0.125 2024-06-22 06:31:36,072 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:31:38,782 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=545299.3333333334, ans=0.2 2024-06-22 06:31:42,180 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2024-06-22 06:31:45,006 INFO [train.py:1028] (1/2) Epoch 30, batch 4050, loss[loss=0.1857, simple_loss=0.2322, pruned_loss=0.06959, over 10971.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2259, pruned_loss=0.05564, over 2580330.99 frames. ], batch size: 304, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:31:47,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=545317.6666666666, ans=15.0 2024-06-22 06:31:50,765 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.37 vs. limit=22.5 2024-06-22 06:32:12,613 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:32:13,585 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.06 vs. limit=15.0 2024-06-22 06:32:21,333 INFO [train.py:1028] (1/2) Epoch 30, batch 4100, loss[loss=0.1761, simple_loss=0.227, pruned_loss=0.06258, over 13168.00 frames. ], tot_loss[loss=0.169, simple_loss=0.2262, pruned_loss=0.05596, over 2577293.72 frames. 
], batch size: 103, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:32:28,232 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=545427.6666666666, ans=0.125 2024-06-22 06:32:30,336 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=545427.6666666666, ans=0.07 2024-06-22 06:32:33,675 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.345e+02 2.523e+02 2.737e+02 3.715e+02, threshold=5.046e+02, percent-clipped=0.0 2024-06-22 06:32:39,138 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=545446.0, ans=0.125 2024-06-22 06:32:40,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=545446.0, ans=0.0 2024-06-22 06:32:48,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=545482.6666666666, ans=0.0 2024-06-22 06:32:54,928 INFO [train.py:1028] (1/2) Epoch 30, batch 4150, loss[loss=0.1623, simple_loss=0.2214, pruned_loss=0.05157, over 13103.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.2256, pruned_loss=0.05567, over 2576361.22 frames. ], batch size: 55, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:33:04,932 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:33:11,208 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=12.0 2024-06-22 06:33:30,622 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=545592.6666666666, ans=0.0 2024-06-22 06:33:31,184 INFO [train.py:1028] (1/2) Epoch 30, batch 4200, loss[loss=0.1775, simple_loss=0.2253, pruned_loss=0.06488, over 13075.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2251, pruned_loss=0.05555, over 2577907.44 frames. ], batch size: 103, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:33:31,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=545592.6666666666, ans=0.125 2024-06-22 06:33:32,950 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=545592.6666666666, ans=15.0 2024-06-22 06:33:35,540 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=545592.6666666666, ans=0.0 2024-06-22 06:33:37,914 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=545611.0, ans=0.125 2024-06-22 06:33:43,045 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.285e+02 2.390e+02 2.552e+02 3.545e+02, threshold=4.781e+02, percent-clipped=0.0 2024-06-22 06:33:48,218 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=545629.3333333334, ans=0.125 2024-06-22 06:33:50,646 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.76 vs. 
limit=15.0 2024-06-22 06:34:04,134 INFO [train.py:1028] (1/2) Epoch 30, batch 4250, loss[loss=0.1427, simple_loss=0.2043, pruned_loss=0.04051, over 13293.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2249, pruned_loss=0.05535, over 2580500.50 frames. ], batch size: 46, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:34:22,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2024-06-22 06:34:25,395 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=545721.0, ans=0.125 2024-06-22 06:34:28,917 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0 2024-06-22 06:34:30,256 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.59 vs. limit=22.5 2024-06-22 06:34:33,324 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=545757.6666666666, ans=0.125 2024-06-22 06:34:36,986 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=545757.6666666666, ans=0.0 2024-06-22 06:34:40,425 INFO [train.py:1028] (1/2) Epoch 30, batch 4300, loss[loss=0.1698, simple_loss=0.2316, pruned_loss=0.05397, over 13205.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.2245, pruned_loss=0.05507, over 2581285.72 frames. ], batch size: 59, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:34:51,648 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=545794.3333333334, ans=0.125 2024-06-22 06:34:51,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=545794.3333333334, ans=0.125 2024-06-22 06:34:52,113 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.278e+02 2.453e+02 2.629e+02 3.559e+02, threshold=4.906e+02, percent-clipped=0.0 2024-06-22 06:34:55,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545812.6666666666, ans=0.1 2024-06-22 06:35:01,267 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=15.0 2024-06-22 06:35:03,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=545831.0, ans=0.035 2024-06-22 06:35:13,433 INFO [train.py:1028] (1/2) Epoch 30, batch 4350, loss[loss=0.1514, simple_loss=0.2131, pruned_loss=0.04482, over 13179.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2238, pruned_loss=0.05487, over 2585768.59 frames. 
], batch size: 59, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:35:31,213 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=545904.3333333334, ans=0.125 2024-06-22 06:35:35,320 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=545904.3333333334, ans=0.2 2024-06-22 06:35:37,286 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=545922.6666666666, ans=0.125 2024-06-22 06:35:39,724 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=545922.6666666666, ans=0.0 2024-06-22 06:35:49,569 INFO [train.py:1028] (1/2) Epoch 30, batch 4400, loss[loss=0.168, simple_loss=0.2216, pruned_loss=0.05724, over 13304.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2238, pruned_loss=0.05514, over 2586546.75 frames. ], batch size: 83, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:36:01,578 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.305e+02 2.446e+02 2.686e+02 4.055e+02, threshold=4.893e+02, percent-clipped=0.0 2024-06-22 06:36:07,167 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=545996.0, ans=0.05 2024-06-22 06:36:08,655 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.09 vs. limit=22.5 2024-06-22 06:36:10,733 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.70 vs. limit=15.0 2024-06-22 06:36:13,177 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.80 vs. limit=10.0 2024-06-22 06:36:26,200 INFO [train.py:1028] (1/2) Epoch 30, batch 4450, loss[loss=0.1621, simple_loss=0.2254, pruned_loss=0.04937, over 12942.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.224, pruned_loss=0.05529, over 2581288.31 frames. ], batch size: 33, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:36:45,453 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:36:47,744 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.53 vs. limit=10.0 2024-06-22 06:36:51,652 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.07 vs. limit=22.5 2024-06-22 06:36:57,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=546124.3333333334, ans=0.0 2024-06-22 06:36:57,783 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=546142.6666666666, ans=0.125 2024-06-22 06:36:58,193 INFO [train.py:1028] (1/2) Epoch 30, batch 4500, loss[loss=0.163, simple_loss=0.22, pruned_loss=0.05298, over 13235.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2236, pruned_loss=0.05515, over 2585541.51 frames. 
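
The "Whitening: name=..., num_groups=G, num_channels=C, metric=M vs. limit=L" lines are diagnostics from the Whiten modules: an anisotropy statistic of the activations is compared against a (possibly scheduled) limit, and a whitening penalty is applied when the metric exceeds it. The exact statistic is defined in icefall's scaling.py; the function below is only a guess at that kind of measure, comparing the largest eigenvalue of the per-group feature covariance with the mean eigenvalue, so that near-white features score close to 1 and a few dominant directions score high.

import torch

def whitening_metric_sketch(x: torch.Tensor, num_groups: int = 1) -> float:
    """Guessed anisotropy metric over (num_frames, num_channels) features:
    max eigenvalue / mean eigenvalue of each group's covariance."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    metrics = []
    for g in range(num_groups):
        xg = x[:, g, :]
        xg = xg - xg.mean(dim=0)               # center the features
        cov = (xg.t() @ xg) / num_frames       # channel covariance
        eigs = torch.linalg.eigvalsh(cov)
        metrics.append((eigs.max() / eigs.mean().clamp(min=1e-20)).item())
    return max(metrics)

# Small for near-white random features (limited by sampling noise);
# a rank-deficient x would score near num_channels // num_groups.
print(whitening_metric_sketch(torch.randn(20000, 384)))
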
], batch size: 89, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:37:07,967 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=546161.0, ans=0.125 2024-06-22 06:37:09,867 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.301e+02 2.474e+02 2.689e+02 3.444e+02, threshold=4.949e+02, percent-clipped=0.0 2024-06-22 06:37:25,562 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=546216.0, ans=0.125 2024-06-22 06:37:33,872 INFO [train.py:1028] (1/2) Epoch 30, batch 4550, loss[loss=0.1307, simple_loss=0.1933, pruned_loss=0.0341, over 13285.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2235, pruned_loss=0.05515, over 2588456.96 frames. ], batch size: 52, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:37:46,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.85 vs. limit=15.0 2024-06-22 06:37:55,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=546289.3333333334, ans=0.125 2024-06-22 06:38:03,863 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=546307.6666666666, ans=0.95 2024-06-22 06:38:07,131 INFO [train.py:1028] (1/2) Epoch 30, batch 4600, loss[loss=0.1975, simple_loss=0.239, pruned_loss=0.07805, over 12529.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2238, pruned_loss=0.05495, over 2584161.99 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:38:23,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=546344.3333333334, ans=0.125 2024-06-22 06:38:24,705 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.296e+02 2.398e+02 2.618e+02 3.492e+02, threshold=4.795e+02, percent-clipped=0.0 2024-06-22 06:38:32,313 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=546381.0, ans=0.125 2024-06-22 06:38:33,826 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2024-06-22 06:38:45,458 INFO [train.py:1028] (1/2) Epoch 30, batch 4650, loss[loss=0.1779, simple_loss=0.234, pruned_loss=0.06089, over 13124.00 frames. ], tot_loss[loss=0.1663, simple_loss=0.2232, pruned_loss=0.05469, over 2588068.32 frames. ], batch size: 132, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:38:54,608 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=546436.0, ans=0.0 2024-06-22 06:38:58,167 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.35 vs. limit=22.5 2024-06-22 06:39:02,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=546454.3333333334, ans=0.0 2024-06-22 06:39:05,225 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. 
limit=12.0 2024-06-22 06:39:10,787 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=546491.0, ans=0.125 2024-06-22 06:39:14,128 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=546491.0, ans=0.0 2024-06-22 06:39:18,058 INFO [train.py:1028] (1/2) Epoch 30, batch 4700, loss[loss=0.1892, simple_loss=0.2541, pruned_loss=0.0622, over 12822.00 frames. ], tot_loss[loss=0.1662, simple_loss=0.2232, pruned_loss=0.05459, over 2584043.27 frames. ], batch size: 26, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:39:30,952 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=546527.6666666666, ans=0.125 2024-06-22 06:39:33,256 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.253e+02 2.363e+02 2.510e+02 3.512e+02, threshold=4.725e+02, percent-clipped=0.0 2024-06-22 06:39:35,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=546546.0, ans=0.5 2024-06-22 06:39:45,852 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=546564.3333333334, ans=0.125 2024-06-22 06:39:51,834 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2024-06-22 06:39:54,223 INFO [train.py:1028] (1/2) Epoch 30, batch 4750, loss[loss=0.1847, simple_loss=0.2337, pruned_loss=0.0678, over 12571.00 frames. ], tot_loss[loss=0.1662, simple_loss=0.223, pruned_loss=0.05474, over 2581388.43 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:40:02,968 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=546619.3333333334, ans=0.0 2024-06-22 06:40:05,741 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=546619.3333333334, ans=15.0 2024-06-22 06:40:24,410 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=546656.0, ans=0.125 2024-06-22 06:40:25,664 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=546674.3333333334, ans=0.125 2024-06-22 06:40:28,602 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2024-06-22 06:40:32,367 INFO [train.py:1028] (1/2) Epoch 30, batch 4800, loss[loss=0.1439, simple_loss=0.2041, pruned_loss=0.04183, over 13261.00 frames. ], tot_loss[loss=0.1665, simple_loss=0.2231, pruned_loss=0.05497, over 2577649.24 frames. ], batch size: 63, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:40:36,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=546692.6666666666, ans=0.0 2024-06-22 06:40:44,295 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.304e+02 2.431e+02 2.591e+02 3.257e+02, threshold=4.862e+02, percent-clipped=0.0 2024-06-22 06:40:45,280 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.52 vs. 
limit=15.0 2024-06-22 06:40:45,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=546729.3333333334, ans=0.125 2024-06-22 06:40:47,835 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=546729.3333333334, ans=0.05 2024-06-22 06:40:52,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=546747.6666666666, ans=0.125 2024-06-22 06:40:55,654 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2024-06-22 06:41:04,983 INFO [train.py:1028] (1/2) Epoch 30, batch 4850, loss[loss=0.1685, simple_loss=0.2186, pruned_loss=0.05915, over 13216.00 frames. ], tot_loss[loss=0.166, simple_loss=0.2227, pruned_loss=0.05468, over 2576494.97 frames. ], batch size: 89, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:41:05,203 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=546784.3333333334, ans=0.0 2024-06-22 06:41:33,163 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2024-06-22 06:41:33,924 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2024-06-22 06:41:40,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=22.5 2024-06-22 06:41:42,416 INFO [train.py:1028] (1/2) Epoch 30, batch 4900, loss[loss=0.1634, simple_loss=0.2307, pruned_loss=0.0481, over 13171.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.223, pruned_loss=0.0549, over 2576924.36 frames. ], batch size: 59, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:41:54,244 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.252e+02 2.436e+02 2.671e+02 3.763e+02, threshold=4.871e+02, percent-clipped=0.0 2024-06-22 06:41:58,035 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=546912.6666666666, ans=0.2 2024-06-22 06:42:03,288 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=546931.0, ans=0.125 2024-06-22 06:42:05,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=546931.0, ans=0.125 2024-06-22 06:42:08,982 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=546949.3333333334, ans=0.025 2024-06-22 06:42:11,124 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.52 vs. limit=22.5 2024-06-22 06:42:15,181 INFO [train.py:1028] (1/2) Epoch 30, batch 4950, loss[loss=0.1762, simple_loss=0.2205, pruned_loss=0.06592, over 10976.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2233, pruned_loss=0.0552, over 2570617.92 frames. 
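
The three per-batch numbers are not independent: throughout this section, loss = 0.5 * simple_loss + pruned_loss to the displayed precision (for the batch 4950 line above, 0.5 * 0.2205 + 0.06592 = 0.17617 ≈ 0.1762). This matches the usual weighting of the two terms of k2's pruned-transducer objective, in which simple_loss comes from a cheap trivial joiner and pruned_loss from the full joiner evaluated only inside the pruned lattice region. A one-line check, with the weight inferred from the logged values themselves:

def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    # Weight inferred from the logged values in this section.
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combine_losses(0.2205, 0.06592) - 0.1762) < 5e-4  # batch 4950 line
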
], batch size: 304, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:42:22,934 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=546967.6666666666, ans=0.125 2024-06-22 06:42:30,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.52 vs. limit=22.5 2024-06-22 06:42:35,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=547004.3333333334, ans=0.0 2024-06-22 06:42:39,474 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547022.6666666666, ans=0.1 2024-06-22 06:42:42,166 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=547022.6666666666, ans=0.125 2024-06-22 06:42:50,437 INFO [train.py:1028] (1/2) Epoch 30, batch 5000, loss[loss=0.1661, simple_loss=0.2236, pruned_loss=0.05435, over 13160.00 frames. ], tot_loss[loss=0.1663, simple_loss=0.2228, pruned_loss=0.05492, over 2575861.37 frames. ], batch size: 95, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:42:59,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=547077.6666666666, ans=0.2 2024-06-22 06:43:03,090 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.363e+02 2.524e+02 2.722e+02 3.510e+02, threshold=5.049e+02, percent-clipped=0.0 2024-06-22 06:43:07,390 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=547096.0, ans=0.2 2024-06-22 06:43:18,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=547132.6666666666, ans=0.0 2024-06-22 06:43:18,688 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:43:27,435 INFO [train.py:1028] (1/2) Epoch 30, batch 5050, loss[loss=0.1667, simple_loss=0.2229, pruned_loss=0.05521, over 12897.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2225, pruned_loss=0.05462, over 2572387.02 frames. ], batch size: 36, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:43:29,736 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=547151.0, ans=0.125 2024-06-22 06:43:30,533 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.73 vs. limit=22.5 2024-06-22 06:43:33,507 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=547169.3333333334, ans=0.125 2024-06-22 06:43:33,878 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.18 vs. 
limit=12.0 2024-06-22 06:43:34,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=547169.3333333334, ans=0.0 2024-06-22 06:43:34,867 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=547169.3333333334, ans=0.125 2024-06-22 06:43:35,348 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=547169.3333333334, ans=0.125 2024-06-22 06:43:38,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=547169.3333333334, ans=0.2 2024-06-22 06:43:45,846 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=547187.6666666666, ans=0.125 2024-06-22 06:43:46,422 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=547206.0, ans=0.0 2024-06-22 06:43:47,935 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=547206.0, ans=0.125 2024-06-22 06:43:48,266 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.96 vs. limit=15.0 2024-06-22 06:44:00,743 INFO [train.py:1028] (1/2) Epoch 30, batch 5100, loss[loss=0.1702, simple_loss=0.2355, pruned_loss=0.05246, over 12947.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2229, pruned_loss=0.05495, over 2569084.47 frames. ], batch size: 39, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:44:00,845 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=547242.6666666666, ans=0.02 2024-06-22 06:44:01,596 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=547242.6666666666, ans=0.125 2024-06-22 06:44:05,346 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2024-06-22 06:44:12,641 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.365e+02 2.515e+02 2.693e+02 3.527e+02, threshold=5.030e+02, percent-clipped=0.0 2024-06-22 06:44:26,580 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=547297.6666666666, ans=0.125 2024-06-22 06:44:32,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=547316.0, ans=0.0 2024-06-22 06:44:33,762 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2024-06-22 06:44:36,565 INFO [train.py:1028] (1/2) Epoch 30, batch 5150, loss[loss=0.1833, simple_loss=0.2283, pruned_loss=0.06917, over 13123.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2227, pruned_loss=0.05506, over 2571726.28 frames. ], batch size: 132, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:44:39,123 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.30 vs. 
limit=12.0 2024-06-22 06:44:59,323 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=547389.3333333334, ans=0.125 2024-06-22 06:45:04,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=547407.6666666666, ans=0.125 2024-06-22 06:45:08,937 INFO [train.py:1028] (1/2) Epoch 30, batch 5200, loss[loss=0.1718, simple_loss=0.2314, pruned_loss=0.05614, over 13143.00 frames. ], tot_loss[loss=0.1665, simple_loss=0.2229, pruned_loss=0.0551, over 2574496.73 frames. ], batch size: 95, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:45:13,245 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=547426.0, ans=0.95 2024-06-22 06:45:24,041 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.292e+02 2.447e+02 2.570e+02 3.700e+02, threshold=4.894e+02, percent-clipped=0.0 2024-06-22 06:45:28,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.05 vs. limit=15.0 2024-06-22 06:45:31,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=547462.6666666666, ans=0.125 2024-06-22 06:45:35,141 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=547481.0, ans=0.125 2024-06-22 06:45:39,347 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=547499.3333333334, ans=0.05 2024-06-22 06:45:40,439 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=547499.3333333334, ans=0.2 2024-06-22 06:45:45,757 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2024-06-22 06:45:45,951 INFO [train.py:1028] (1/2) Epoch 30, batch 5250, loss[loss=0.1696, simple_loss=0.2325, pruned_loss=0.05339, over 13249.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2231, pruned_loss=0.05524, over 2570384.80 frames. ], batch size: 52, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:45:49,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=547517.6666666666, ans=0.2 2024-06-22 06:45:59,107 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:45:59,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=547554.3333333334, ans=0.125 2024-06-22 06:46:06,959 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=547572.6666666666, ans=0.125 2024-06-22 06:46:12,927 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=547591.0, ans=0.0 2024-06-22 06:46:23,130 INFO [train.py:1028] (1/2) Epoch 30, batch 5300, loss[loss=0.1674, simple_loss=0.2238, pruned_loss=0.0555, over 13050.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2227, pruned_loss=0.0551, over 2566658.25 frames. 
], batch size: 144, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:46:24,803 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.49 vs. limit=15.0 2024-06-22 06:46:28,515 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=547609.3333333334, ans=0.0 2024-06-22 06:46:33,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=547627.6666666666, ans=0.2 2024-06-22 06:46:34,924 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.257e+02 2.353e+02 2.499e+02 2.882e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-22 06:46:37,152 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=547646.0, ans=0.0 2024-06-22 06:46:56,704 INFO [train.py:1028] (1/2) Epoch 30, batch 5350, loss[loss=0.1645, simple_loss=0.2243, pruned_loss=0.05233, over 11655.00 frames. ], tot_loss[loss=0.1661, simple_loss=0.2222, pruned_loss=0.05498, over 2573795.57 frames. ], batch size: 17, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:47:03,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=547719.3333333334, ans=0.125 2024-06-22 06:47:20,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=547756.0, ans=0.0 2024-06-22 06:47:24,074 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=15.0 2024-06-22 06:47:32,907 INFO [train.py:1028] (1/2) Epoch 30, batch 5400, loss[loss=0.187, simple_loss=0.234, pruned_loss=0.06995, over 12192.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2221, pruned_loss=0.0553, over 2566147.08 frames. ], batch size: 241, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:47:37,919 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=547792.6666666666, ans=0.125 2024-06-22 06:47:39,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=547811.0, ans=0.0 2024-06-22 06:47:42,015 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=547811.0, ans=0.0 2024-06-22 06:47:45,056 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.328e+02 2.500e+02 2.720e+02 3.221e+02, threshold=5.001e+02, percent-clipped=0.0 2024-06-22 06:47:52,499 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=547847.6666666666, ans=0.125 2024-06-22 06:47:55,172 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=547847.6666666666, ans=0.025 2024-06-22 06:47:56,010 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2024-06-22 06:48:06,482 INFO [train.py:1028] (1/2) Epoch 30, batch 5450, loss[loss=0.165, simple_loss=0.2165, pruned_loss=0.05675, over 12443.00 frames. ], tot_loss[loss=0.1666, simple_loss=0.2224, pruned_loss=0.05537, over 2570430.44 frames. 
], batch size: 25, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:48:22,062 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.36 vs. limit=15.0 2024-06-22 06:48:28,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=547921.0, ans=0.125 2024-06-22 06:48:31,565 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.38 vs. limit=15.0 2024-06-22 06:48:37,281 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=547957.6666666666, ans=0.125 2024-06-22 06:48:39,953 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547957.6666666666, ans=0.1 2024-06-22 06:48:43,315 INFO [train.py:1028] (1/2) Epoch 30, batch 5500, loss[loss=0.1908, simple_loss=0.2418, pruned_loss=0.06992, over 12146.00 frames. ], tot_loss[loss=0.1665, simple_loss=0.2223, pruned_loss=0.05532, over 2564342.97 frames. ], batch size: 240, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:48:48,603 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=547976.0, ans=0.125 2024-06-22 06:48:54,814 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.242e+02 2.394e+02 2.576e+02 3.164e+02, threshold=4.787e+02, percent-clipped=0.0 2024-06-22 06:49:03,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.39 vs. limit=6.0 2024-06-22 06:49:06,268 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=548031.0, ans=0.0 2024-06-22 06:49:20,240 INFO [train.py:1028] (1/2) Epoch 30, batch 5550, loss[loss=0.1734, simple_loss=0.225, pruned_loss=0.06093, over 13293.00 frames. ], tot_loss[loss=0.1655, simple_loss=0.2215, pruned_loss=0.05477, over 2567228.33 frames. ], batch size: 43, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:49:22,928 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=548067.6666666666, ans=0.125 2024-06-22 06:49:23,150 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. 
limit=10.0 2024-06-22 06:49:24,911 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=548067.6666666666, ans=0.015 2024-06-22 06:49:32,915 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=548104.3333333334, ans=0.125 2024-06-22 06:49:33,589 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=548104.3333333334, ans=0.0 2024-06-22 06:49:35,290 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:49:35,904 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=548104.3333333334, ans=0.0 2024-06-22 06:49:46,466 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.69 vs. limit=15.0 2024-06-22 06:49:50,178 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=548141.0, ans=0.125 2024-06-22 06:49:52,508 INFO [train.py:1028] (1/2) Epoch 30, batch 5600, loss[loss=0.1666, simple_loss=0.223, pruned_loss=0.05509, over 13229.00 frames. ], tot_loss[loss=0.1654, simple_loss=0.2214, pruned_loss=0.05476, over 2569920.96 frames. ], batch size: 89, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:49:53,275 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=548159.3333333334, ans=0.2 2024-06-22 06:49:58,385 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=548159.3333333334, ans=0.125 2024-06-22 06:50:05,081 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.245e+02 2.382e+02 2.569e+02 3.098e+02, threshold=4.763e+02, percent-clipped=0.0 2024-06-22 06:50:16,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=548196.0, ans=0.0 2024-06-22 06:50:18,903 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.77 vs. limit=22.5 2024-06-22 06:50:27,596 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2024-06-22 06:50:31,013 INFO [train.py:1028] (1/2) Epoch 30, batch 5650, loss[loss=0.1948, simple_loss=0.2425, pruned_loss=0.07356, over 12537.00 frames. ], tot_loss[loss=0.1649, simple_loss=0.2212, pruned_loss=0.05435, over 2575400.72 frames. 
], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:50:35,158 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=548251.0, ans=0.09899494936611666 2024-06-22 06:50:37,941 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=548269.3333333334, ans=0.04949747468305833 2024-06-22 06:50:40,002 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=548269.3333333334, ans=0.2 2024-06-22 06:51:01,174 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=548324.3333333334, ans=0.125 2024-06-22 06:51:04,495 INFO [train.py:1028] (1/2) Epoch 30, batch 5700, loss[loss=0.1532, simple_loss=0.2174, pruned_loss=0.04444, over 13241.00 frames. ], tot_loss[loss=0.165, simple_loss=0.2209, pruned_loss=0.05455, over 2579713.06 frames. ], batch size: 63, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:51:07,887 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=548342.6666666666, ans=0.025 2024-06-22 06:51:11,379 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=548361.0, ans=0.125 2024-06-22 06:51:16,333 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.294e+02 2.443e+02 2.624e+02 3.434e+02, threshold=4.885e+02, percent-clipped=0.0 2024-06-22 06:51:22,570 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=548379.3333333334, ans=0.125 2024-06-22 06:51:24,183 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.13 vs. limit=22.5 2024-06-22 06:51:34,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=548416.0, ans=0.0 2024-06-22 06:51:40,550 INFO [train.py:1028] (1/2) Epoch 30, batch 5750, loss[loss=0.1775, simple_loss=0.2272, pruned_loss=0.06393, over 12705.00 frames. ], tot_loss[loss=0.1658, simple_loss=0.2218, pruned_loss=0.05491, over 2580195.52 frames. ], batch size: 176, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:51:55,630 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:51:59,724 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.28 vs. limit=10.0 2024-06-22 06:52:16,348 INFO [train.py:1028] (1/2) Epoch 30, batch 5800, loss[loss=0.1874, simple_loss=0.2403, pruned_loss=0.06725, over 12734.00 frames. ], tot_loss[loss=0.1666, simple_loss=0.2226, pruned_loss=0.05528, over 2580001.30 frames. 
], batch size: 176, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:52:25,032 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=548544.3333333334, ans=0.125 2024-06-22 06:52:27,982 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.265e+02 2.415e+02 2.528e+02 3.231e+02, threshold=4.829e+02, percent-clipped=0.0 2024-06-22 06:52:30,875 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=548562.6666666666, ans=0.0 2024-06-22 06:52:37,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=15.0 2024-06-22 06:52:42,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=548599.3333333334, ans=0.025 2024-06-22 06:52:44,383 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:52:48,821 INFO [train.py:1028] (1/2) Epoch 30, batch 5850, loss[loss=0.1829, simple_loss=0.2401, pruned_loss=0.06284, over 12532.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2243, pruned_loss=0.05593, over 2578303.98 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:53:17,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=548672.6666666666, ans=0.1 2024-06-22 06:53:18,277 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.69 vs. limit=10.0 2024-06-22 06:53:24,830 INFO [train.py:1028] (1/2) Epoch 30, batch 5900, loss[loss=0.1618, simple_loss=0.2153, pruned_loss=0.05417, over 13116.00 frames. ], tot_loss[loss=0.17, simple_loss=0.2266, pruned_loss=0.05668, over 2578561.40 frames. ], batch size: 121, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:53:26,554 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.48 vs. limit=10.0 2024-06-22 06:53:28,947 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=548709.3333333334, ans=0.125 2024-06-22 06:53:29,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=548709.3333333334, ans=0.2 2024-06-22 06:53:30,409 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=548709.3333333334, ans=0.0 2024-06-22 06:53:36,828 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.319e+02 2.458e+02 2.746e+02 3.634e+02, threshold=4.916e+02, percent-clipped=0.0 2024-06-22 06:53:37,227 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2024-06-22 06:53:49,248 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.75 vs. 
limit=10.0 2024-06-22 06:53:51,595 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=548782.6666666666, ans=0.125 2024-06-22 06:53:58,332 INFO [train.py:1028] (1/2) Epoch 30, batch 5950, loss[loss=0.1696, simple_loss=0.2287, pruned_loss=0.05529, over 13043.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.228, pruned_loss=0.05727, over 2581377.54 frames. ], batch size: 121, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:54:01,195 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=548801.0, ans=0.125 2024-06-22 06:54:27,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=548874.3333333334, ans=0.125 2024-06-22 06:54:34,786 INFO [train.py:1028] (1/2) Epoch 30, batch 6000, loss[loss=0.2052, simple_loss=0.2506, pruned_loss=0.07986, over 12164.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.2286, pruned_loss=0.05742, over 2574448.57 frames. ], batch size: 240, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:54:34,787 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 06:54:40,668 INFO [zipformer.py:1858] (1/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([6.3728, 5.0770, 5.9392, 5.7135], device='cuda:1') 2024-06-22 06:54:42,442 INFO [train.py:1060] (1/2) Epoch 30, validation: loss=0.1955, simple_loss=0.253, pruned_loss=0.06898, over 351949.00 frames. 2024-06-22 06:54:42,485 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 06:54:54,434 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.382e+02 2.554e+02 2.764e+02 3.575e+02, threshold=5.107e+02, percent-clipped=0.0 2024-06-22 06:55:06,359 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=548947.6666666666, ans=0.125 2024-06-22 06:55:16,037 INFO [train.py:1028] (1/2) Epoch 30, batch 6050, loss[loss=0.1719, simple_loss=0.2368, pruned_loss=0.05348, over 12830.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.23, pruned_loss=0.05757, over 2577170.31 frames. ], batch size: 39, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:55:16,104 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=548984.3333333334, ans=0.125 2024-06-22 06:55:29,614 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=549002.6666666666, ans=0.125 2024-06-22 06:55:31,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.50 vs. limit=12.0 2024-06-22 06:55:40,500 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=549039.3333333334, ans=0.0 2024-06-22 06:55:51,877 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2024-06-22 06:55:52,229 INFO [train.py:1028] (1/2) Epoch 30, batch 6100, loss[loss=0.1594, simple_loss=0.2111, pruned_loss=0.05379, over 13187.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2319, pruned_loss=0.05819, over 2579374.20 frames. 
], batch size: 121, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:55:56,511 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=549076.0, ans=0.125 2024-06-22 06:56:04,331 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.373e+02 2.551e+02 2.819e+02 3.758e+02, threshold=5.101e+02, percent-clipped=0.0 2024-06-22 06:56:08,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=549112.6666666666, ans=0.1 2024-06-22 06:56:14,998 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=549131.0, ans=0.125 2024-06-22 06:56:15,600 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=549131.0, ans=0.1 2024-06-22 06:56:24,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=549149.3333333334, ans=0.0 2024-06-22 06:56:28,783 INFO [train.py:1028] (1/2) Epoch 30, batch 6150, loss[loss=0.1673, simple_loss=0.2236, pruned_loss=0.05552, over 10718.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2332, pruned_loss=0.05856, over 2577382.37 frames. ], batch size: 303, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:56:44,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=549204.3333333334, ans=0.0 2024-06-22 06:56:44,728 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.03 vs. limit=15.0 2024-06-22 06:56:49,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=549222.6666666666, ans=0.1 2024-06-22 06:56:52,994 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.33 vs. limit=22.5 2024-06-22 06:56:54,184 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=549222.6666666666, ans=0.125 2024-06-22 06:56:56,717 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=549241.0, ans=0.2 2024-06-22 06:56:56,777 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=549241.0, ans=0.125 2024-06-22 06:57:02,234 INFO [train.py:1028] (1/2) Epoch 30, batch 6200, loss[loss=0.1874, simple_loss=0.2514, pruned_loss=0.06168, over 13175.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2349, pruned_loss=0.05894, over 2575015.62 frames. 
], batch size: 89, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:57:05,028 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=549259.3333333334, ans=0.1 2024-06-22 06:57:09,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=549277.6666666666, ans=0.025 2024-06-22 06:57:17,564 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=549277.6666666666, ans=0.0 2024-06-22 06:57:18,099 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.467e+02 2.622e+02 2.815e+02 3.651e+02, threshold=5.245e+02, percent-clipped=0.0 2024-06-22 06:57:23,874 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=549296.0, ans=0.125 2024-06-22 06:57:39,104 INFO [train.py:1028] (1/2) Epoch 30, batch 6250, loss[loss=0.1793, simple_loss=0.2364, pruned_loss=0.06109, over 13250.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2362, pruned_loss=0.05917, over 2568648.35 frames. ], batch size: 83, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:57:48,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=549369.3333333334, ans=0.2 2024-06-22 06:57:52,814 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=549387.6666666666, ans=0.2 2024-06-22 06:57:52,854 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=549387.6666666666, ans=0.125 2024-06-22 06:58:05,448 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=549424.3333333334, ans=0.2 2024-06-22 06:58:08,572 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=549424.3333333334, ans=0.0 2024-06-22 06:58:12,476 INFO [train.py:1028] (1/2) Epoch 30, batch 6300, loss[loss=0.1724, simple_loss=0.2338, pruned_loss=0.05549, over 10924.00 frames. ], tot_loss[loss=0.178, simple_loss=0.237, pruned_loss=0.05953, over 2563788.14 frames. ], batch size: 16, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:58:28,999 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.420e+02 2.576e+02 2.847e+02 4.433e+02, threshold=5.151e+02, percent-clipped=0.0 2024-06-22 06:58:29,197 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=549479.3333333334, ans=0.025 2024-06-22 06:58:42,162 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=549516.0, ans=0.125 2024-06-22 06:58:44,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=549516.0, ans=0.0 2024-06-22 06:58:49,413 INFO [train.py:1028] (1/2) Epoch 30, batch 6350, loss[loss=0.2269, simple_loss=0.2804, pruned_loss=0.08669, over 12567.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2392, pruned_loss=0.06005, over 2573761.10 frames. 
], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:58:56,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=549552.6666666666, ans=0.125 2024-06-22 06:59:04,780 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=549571.0, ans=0.0 2024-06-22 06:59:10,700 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=549589.3333333334, ans=0.0 2024-06-22 06:59:25,824 INFO [train.py:1028] (1/2) Epoch 30, batch 6400, loss[loss=0.1641, simple_loss=0.2319, pruned_loss=0.04816, over 13190.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2408, pruned_loss=0.06061, over 2574112.40 frames. ], batch size: 67, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:59:38,076 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.430e+02 2.566e+02 2.845e+02 4.157e+02, threshold=5.132e+02, percent-clipped=0.0 2024-06-22 06:59:45,410 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2024-06-22 06:59:47,776 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=549681.0, ans=0.125 2024-06-22 06:59:52,295 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=549699.3333333334, ans=0.125 2024-06-22 06:59:57,761 INFO [train.py:1028] (1/2) Epoch 30, batch 6450, loss[loss=0.2006, simple_loss=0.2644, pruned_loss=0.06842, over 12616.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2419, pruned_loss=0.06102, over 2580589.63 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:00:07,127 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=549736.0, ans=0.1 2024-06-22 07:00:10,895 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=549754.3333333334, ans=0.125 2024-06-22 07:00:14,290 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=549754.3333333334, ans=0.035 2024-06-22 07:00:15,085 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=549754.3333333334, ans=0.125 2024-06-22 07:00:32,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=549791.0, ans=0.0 2024-06-22 07:00:34,354 INFO [train.py:1028] (1/2) Epoch 30, batch 6500, loss[loss=0.1843, simple_loss=0.2379, pruned_loss=0.06537, over 10703.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2438, pruned_loss=0.06155, over 2583885.05 frames. ], batch size: 304, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:00:40,861 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=549827.6666666666, ans=0.025 2024-06-22 07:00:47,372 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 2.503e+02 2.680e+02 3.120e+02 3.917e+02, threshold=5.361e+02, percent-clipped=0.0 2024-06-22 07:00:51,373 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.03 vs. 
limit=22.5 2024-06-22 07:00:53,137 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=549846.0, ans=0.0 2024-06-22 07:00:55,112 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=549864.3333333334, ans=0.95 2024-06-22 07:00:55,118 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=549864.3333333334, ans=0.0 2024-06-22 07:00:58,606 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=549864.3333333334, ans=22.5 2024-06-22 07:00:58,970 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=549864.3333333334, ans=0.125 2024-06-22 07:01:08,048 INFO [train.py:1028] (1/2) Epoch 30, batch 6550, loss[loss=0.1695, simple_loss=0.2316, pruned_loss=0.05366, over 12678.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2447, pruned_loss=0.06163, over 2588322.35 frames. ], batch size: 22, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:01:12,265 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:01:23,421 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0 2024-06-22 07:01:32,128 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.98 vs. limit=15.0 2024-06-22 07:01:35,424 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=549956.0, ans=0.0 2024-06-22 07:01:38,851 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=549974.3333333334, ans=0.125 2024-06-22 07:01:41,209 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=549974.3333333334, ans=0.2 2024-06-22 07:01:44,229 INFO [train.py:1028] (1/2) Epoch 30, batch 6600, loss[loss=0.1852, simple_loss=0.2439, pruned_loss=0.0633, over 13196.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2449, pruned_loss=0.06183, over 2589969.72 frames. ], batch size: 72, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:01:53,179 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=549992.6666666666, ans=0.0 2024-06-22 07:01:54,591 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2024-06-22 07:02:02,085 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.465e+02 2.619e+02 2.836e+02 4.476e+02, threshold=5.239e+02, percent-clipped=0.0 2024-06-22 07:02:08,898 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.27 vs. 
limit=15.0 2024-06-22 07:02:12,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=550047.6666666666, ans=0.125 2024-06-22 07:02:16,626 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=550066.0, ans=0.125 2024-06-22 07:02:16,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=550066.0, ans=0.0 2024-06-22 07:02:22,907 INFO [train.py:1028] (1/2) Epoch 30, batch 6650, loss[loss=0.2011, simple_loss=0.255, pruned_loss=0.0736, over 12876.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2465, pruned_loss=0.06245, over 2583804.77 frames. ], batch size: 158, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:02:24,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.48 vs. limit=22.5 2024-06-22 07:02:25,980 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=550084.3333333334, ans=0.04949747468305833 2024-06-22 07:02:26,086 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.40 vs. limit=15.0 2024-06-22 07:02:27,228 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=550084.3333333334, ans=0.125 2024-06-22 07:02:30,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=550102.6666666666, ans=0.125 2024-06-22 07:02:31,102 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=550102.6666666666, ans=0.125 2024-06-22 07:02:33,975 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0 2024-06-22 07:02:54,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=550157.6666666666, ans=0.125 2024-06-22 07:02:59,205 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550157.6666666666, ans=0.125 2024-06-22 07:03:00,992 INFO [train.py:1028] (1/2) Epoch 30, batch 6700, loss[loss=0.2289, simple_loss=0.2822, pruned_loss=0.08778, over 12740.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.247, pruned_loss=0.06267, over 2584001.20 frames. ], batch size: 176, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:03:06,646 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=550176.0, ans=0.125 2024-06-22 07:03:13,900 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.466e+02 2.723e+02 3.020e+02 4.955e+02, threshold=5.446e+02, percent-clipped=0.0 2024-06-22 07:03:14,215 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=550212.6666666666, ans=0.0 2024-06-22 07:03:16,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.31 vs. 
limit=10.0 2024-06-22 07:03:16,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2024-06-22 07:03:40,233 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=550267.6666666666, ans=0.0 2024-06-22 07:03:40,742 INFO [train.py:1028] (1/2) Epoch 30, batch 6750, loss[loss=0.2431, simple_loss=0.2872, pruned_loss=0.09947, over 12285.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2484, pruned_loss=0.06349, over 2577623.09 frames. ], batch size: 240, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:03:43,525 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=550267.6666666666, ans=0.1 2024-06-22 07:04:01,518 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=550322.6666666666, ans=0.125 2024-06-22 07:04:11,416 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=550341.0, ans=0.1 2024-06-22 07:04:13,858 INFO [train.py:1028] (1/2) Epoch 30, batch 6800, loss[loss=0.1939, simple_loss=0.2535, pruned_loss=0.06719, over 13168.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2497, pruned_loss=0.06385, over 2579239.85 frames. ], batch size: 67, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:04:17,072 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2024-06-22 07:04:19,283 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=22.5 2024-06-22 07:04:21,124 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=550377.6666666666, ans=0.0 2024-06-22 07:04:26,187 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.471e+02 2.558e+02 2.700e+02 3.304e+02, threshold=5.116e+02, percent-clipped=0.0 2024-06-22 07:04:39,611 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=550432.6666666666, ans=0.125 2024-06-22 07:04:40,458 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=550432.6666666666, ans=0.125 2024-06-22 07:04:50,242 INFO [train.py:1028] (1/2) Epoch 30, batch 6850, loss[loss=0.2075, simple_loss=0.2752, pruned_loss=0.06987, over 13283.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2498, pruned_loss=0.06364, over 2582755.98 frames. 
], batch size: 63, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:04:51,046 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=550451.0, ans=0.125 2024-06-22 07:04:55,856 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=550451.0, ans=0.125 2024-06-22 07:04:57,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=550469.3333333334, ans=0.0 2024-06-22 07:05:09,024 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=550487.6666666666, ans=0.125 2024-06-22 07:05:23,501 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.12 vs. limit=15.0 2024-06-22 07:05:23,748 INFO [train.py:1028] (1/2) Epoch 30, batch 6900, loss[loss=0.1902, simple_loss=0.254, pruned_loss=0.06319, over 13305.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.251, pruned_loss=0.0641, over 2584769.20 frames. ], batch size: 49, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:05:36,095 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.491e+02 2.750e+02 2.896e+02 4.282e+02, threshold=5.500e+02, percent-clipped=0.0 2024-06-22 07:05:46,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=550597.6666666666, ans=0.0 2024-06-22 07:05:59,388 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.61 vs. limit=6.0 2024-06-22 07:05:59,620 INFO [train.py:1028] (1/2) Epoch 30, batch 6950, loss[loss=0.1765, simple_loss=0.2394, pruned_loss=0.05674, over 11532.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2516, pruned_loss=0.06382, over 2579433.87 frames. ], batch size: 17, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:05:59,847 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=550634.3333333334, ans=0.0 2024-06-22 07:06:01,400 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2024-06-22 07:06:09,449 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=550652.6666666666, ans=0.04949747468305833 2024-06-22 07:06:21,434 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=550689.3333333334, ans=0.125 2024-06-22 07:06:24,949 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2024-06-22 07:06:29,354 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=550707.6666666666, ans=0.125 2024-06-22 07:06:31,899 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=550726.0, ans=0.0 2024-06-22 07:06:32,356 INFO [train.py:1028] (1/2) Epoch 30, batch 7000, loss[loss=0.2125, simple_loss=0.2747, pruned_loss=0.07513, over 12984.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2517, pruned_loss=0.06388, over 2576954.95 frames. 
], batch size: 158, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:06:39,679 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=550744.3333333334, ans=0.0 2024-06-22 07:06:45,297 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.506e+02 2.697e+02 2.954e+02 3.946e+02, threshold=5.393e+02, percent-clipped=0.0 2024-06-22 07:06:45,579 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=550762.6666666666, ans=0.125 2024-06-22 07:06:52,889 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=550762.6666666666, ans=0.0 2024-06-22 07:06:56,008 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2024-06-22 07:06:59,464 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.94 vs. limit=10.0 2024-06-22 07:07:07,070 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=550799.3333333334, ans=0.125 2024-06-22 07:07:07,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=550799.3333333334, ans=0.0 2024-06-22 07:07:09,636 INFO [train.py:1028] (1/2) Epoch 30, batch 7050, loss[loss=0.1913, simple_loss=0.2536, pruned_loss=0.06453, over 12722.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2526, pruned_loss=0.06407, over 2583904.94 frames. ], batch size: 176, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:07:24,094 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=550854.3333333334, ans=0.125 2024-06-22 07:07:37,652 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=550891.0, ans=0.2 2024-06-22 07:07:37,750 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2024-06-22 07:07:42,101 INFO [train.py:1028] (1/2) Epoch 30, batch 7100, loss[loss=0.1962, simple_loss=0.259, pruned_loss=0.06667, over 13182.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2531, pruned_loss=0.0644, over 2576591.29 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:07:50,370 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=550909.3333333334, ans=0.125 2024-06-22 07:07:51,767 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=550927.6666666666, ans=0.0 2024-06-22 07:07:58,181 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.459e+02 2.637e+02 2.810e+02 3.800e+02, threshold=5.273e+02, percent-clipped=0.0 2024-06-22 07:07:59,980 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.34 vs. limit=6.0 2024-06-22 07:08:03,222 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.15 vs. 
limit=15.0 2024-06-22 07:08:04,011 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.72 vs. limit=12.0 2024-06-22 07:08:08,639 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.29 vs. limit=10.0 2024-06-22 07:08:18,771 INFO [train.py:1028] (1/2) Epoch 30, batch 7150, loss[loss=0.2302, simple_loss=0.2826, pruned_loss=0.08895, over 12481.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2538, pruned_loss=0.06461, over 2574294.50 frames. ], batch size: 202, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:08:24,766 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=551019.3333333334, ans=0.125 2024-06-22 07:08:40,299 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=551056.0, ans=0.0 2024-06-22 07:08:46,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=551074.3333333334, ans=0.95 2024-06-22 07:08:50,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=551074.3333333334, ans=0.125 2024-06-22 07:08:51,576 INFO [train.py:1028] (1/2) Epoch 30, batch 7200, loss[loss=0.2128, simple_loss=0.275, pruned_loss=0.07534, over 13158.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2545, pruned_loss=0.06484, over 2579519.80 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:09:00,642 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2024-06-22 07:09:07,426 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.551e+02 2.712e+02 2.969e+02 3.972e+02, threshold=5.423e+02, percent-clipped=0.0 2024-06-22 07:09:07,643 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=551129.3333333334, ans=0.125 2024-06-22 07:09:09,666 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=551129.3333333334, ans=0.125 2024-06-22 07:09:12,451 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=12.0 2024-06-22 07:09:24,099 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:09:27,980 INFO [train.py:1028] (1/2) Epoch 30, batch 7250, loss[loss=0.1719, simple_loss=0.239, pruned_loss=0.05243, over 12932.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2547, pruned_loss=0.06479, over 2580723.16 frames. 
], batch size: 36, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:09:28,742 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=551184.3333333334, ans=0.125 2024-06-22 07:09:30,733 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=551184.3333333334, ans=0.125 2024-06-22 07:09:34,019 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=551202.6666666666, ans=0.125 2024-06-22 07:09:40,684 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=551221.0, ans=0.125 2024-06-22 07:09:54,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=551239.3333333334, ans=0.0 2024-06-22 07:09:55,294 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=551239.3333333334, ans=0.1 2024-06-22 07:10:00,908 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=551257.6666666666, ans=0.1 2024-06-22 07:10:04,369 INFO [train.py:1028] (1/2) Epoch 30, batch 7300, loss[loss=0.1848, simple_loss=0.2539, pruned_loss=0.05791, over 12915.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2556, pruned_loss=0.06526, over 2580418.81 frames. ], batch size: 36, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:10:04,597 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=551276.0, ans=0.0 2024-06-22 07:10:05,431 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.74 vs. limit=15.0 2024-06-22 07:10:09,037 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=551276.0, ans=0.0 2024-06-22 07:10:13,867 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=15.0 2024-06-22 07:10:14,727 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=551294.3333333334, ans=0.025 2024-06-22 07:10:16,535 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.517e+02 2.777e+02 3.038e+02 4.361e+02, threshold=5.554e+02, percent-clipped=0.0 2024-06-22 07:10:18,302 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=551312.6666666666, ans=0.1 2024-06-22 07:10:32,746 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=551349.3333333334, ans=0.2 2024-06-22 07:10:34,838 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=551349.3333333334, ans=0.125 2024-06-22 07:10:37,606 INFO [train.py:1028] (1/2) Epoch 30, batch 7350, loss[loss=0.2162, simple_loss=0.2774, pruned_loss=0.07754, over 13386.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2564, pruned_loss=0.06548, over 2582622.08 frames. 
], batch size: 46, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:10:43,653 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=551386.0, ans=0.05 2024-06-22 07:10:47,309 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.86 vs. limit=12.0 2024-06-22 07:10:57,445 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=551422.6666666666, ans=0.0 2024-06-22 07:11:00,096 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=551422.6666666666, ans=0.125 2024-06-22 07:11:00,701 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=551422.6666666666, ans=0.125 2024-06-22 07:11:05,958 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=551422.6666666666, ans=0.125 2024-06-22 07:11:13,418 INFO [train.py:1028] (1/2) Epoch 30, batch 7400, loss[loss=0.2002, simple_loss=0.2721, pruned_loss=0.06415, over 13272.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2564, pruned_loss=0.06517, over 2587260.01 frames. ], batch size: 63, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:11:26,052 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=551477.6666666666, ans=0.0 2024-06-22 07:11:26,416 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 2.476e+02 2.627e+02 2.896e+02 3.938e+02, threshold=5.254e+02, percent-clipped=0.0 2024-06-22 07:11:40,092 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=551532.6666666666, ans=0.125 2024-06-22 07:11:41,024 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2024-06-22 07:11:42,920 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=551532.6666666666, ans=0.125 2024-06-22 07:11:46,677 INFO [train.py:1028] (1/2) Epoch 30, batch 7450, loss[loss=0.167, simple_loss=0.2251, pruned_loss=0.0545, over 12753.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2561, pruned_loss=0.06503, over 2581008.35 frames. ], batch size: 29, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:11:47,505 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=551551.0, ans=0.2 2024-06-22 07:11:49,849 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.27 vs. 
limit=22.5 2024-06-22 07:11:51,593 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=551551.0, ans=0.125 2024-06-22 07:12:09,954 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=551606.0, ans=0.0 2024-06-22 07:12:10,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=551606.0, ans=0.125 2024-06-22 07:12:11,217 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=551606.0, ans=0.125 2024-06-22 07:12:15,413 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=551606.0, ans=0.125 2024-06-22 07:12:18,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=551624.3333333334, ans=0.0 2024-06-22 07:12:23,864 INFO [train.py:1028] (1/2) Epoch 30, batch 7500, loss[loss=0.2113, simple_loss=0.2664, pruned_loss=0.07805, over 10502.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2575, pruned_loss=0.06576, over 2578080.19 frames. ], batch size: 304, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:12:25,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=551642.6666666666, ans=0.125 2024-06-22 07:12:29,076 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=551642.6666666666, ans=0.125 2024-06-22 07:12:35,050 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=551661.0, ans=0.07 2024-06-22 07:12:36,205 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.502e+02 2.647e+02 2.944e+02 3.905e+02, threshold=5.293e+02, percent-clipped=0.0 2024-06-22 07:12:37,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=551679.3333333334, ans=0.125 2024-06-22 07:12:37,496 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=551679.3333333334, ans=0.0 2024-06-22 07:12:47,605 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.55 vs. limit=15.0 2024-06-22 07:12:49,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=551716.0, ans=0.2 2024-06-22 07:12:49,789 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=551716.0, ans=0.0 2024-06-22 07:12:54,807 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=551716.0, ans=0.025 2024-06-22 07:12:55,721 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.50 vs. limit=12.0 2024-06-22 07:12:55,978 INFO [train.py:1028] (1/2) Epoch 30, batch 7550, loss[loss=0.1925, simple_loss=0.2514, pruned_loss=0.06686, over 12931.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2579, pruned_loss=0.06609, over 2577215.11 frames. 
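], batch size: 158, lr: 1.91e-03, grad_scale: 32.0

Note on the optim.py:487 warnings: each one prints the quartiles (min/25%/median/75%/max) of recently observed gradient norms plus a clipping threshold. In the entries above the threshold tracks Clipping_scale times the median (for example 2.0 * 2.647e+02 is within rounding of the logged 5.293e+02), so under that assumption a sketch of the mechanism looks like:

    # Hedged sketch: clip the global grad norm against a threshold derived
    # from the median of recent norms, as the logged quartiles suggest.
    import torch

    def clip_by_recent_median(params, recent_norms, clipping_scale=2.0):
        q = torch.quantile(torch.tensor(recent_norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]          # 2 x median
        total = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        if total > threshold:                      # counted toward "percent-clipped"
            for p in params:
                p.grad.mul_(threshold / total)
        return q, threshold

percent-clipped=0.0 throughout this stretch says the threshold was never actually hit.
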
2024-06-22 07:13:17,025 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2024-06-22 07:13:32,455 INFO [train.py:1028] (1/2) Epoch 30, batch 7600, loss[loss=0.2163, simple_loss=0.2779, pruned_loss=0.07739, over 13156.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2589, pruned_loss=0.06643, over 2576754.39 frames. ], batch size: 83, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:13:45,517 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.528e+02 2.676e+02 2.983e+02 4.483e+02, threshold=5.353e+02, percent-clipped=0.0 2024-06-22 07:13:45,739 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=551862.6666666666, ans=0.125 2024-06-22 07:13:50,902 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.17 vs. limit=12.0 2024-06-22 07:14:07,979 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=551899.3333333334, ans=0.0 2024-06-22 07:14:11,142 INFO [train.py:1028] (1/2) Epoch 30, batch 7650, loss[loss=0.1825, simple_loss=0.2491, pruned_loss=0.05793, over 13038.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2587, pruned_loss=0.0662, over 2572705.72 frames. ], batch size: 33, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:14:19,427 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.74 vs. limit=15.0 2024-06-22 07:14:33,734 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2024-06-22 07:14:45,608 INFO [train.py:1028] (1/2) Epoch 30, batch 7700, loss[loss=0.187, simple_loss=0.2601, pruned_loss=0.05701, over 13231.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2595, pruned_loss=0.06645, over 2568290.85 frames. ], batch size: 63, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:14:51,288 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.14 vs. limit=15.0 2024-06-22 07:15:01,190 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 2.577e+02 2.720e+02 3.051e+02 4.313e+02, threshold=5.440e+02, percent-clipped=0.0 2024-06-22 07:15:08,978 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2024-06-22 07:15:14,330 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552064.3333333334, ans=0.1 2024-06-22 07:15:22,111 INFO [train.py:1028] (1/2) Epoch 30, batch 7750, loss[loss=0.1505, simple_loss=0.2173, pruned_loss=0.0419, over 13285.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.26, pruned_loss=0.06667, over 2572674.78 frames.
], batch size: 72, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:15:30,991 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=552119.3333333334, ans=0.125 2024-06-22 07:15:36,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=552137.6666666666, ans=0.0 2024-06-22 07:15:55,887 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2024-06-22 07:15:58,288 INFO [train.py:1028] (1/2) Epoch 30, batch 7800, loss[loss=0.2075, simple_loss=0.2676, pruned_loss=0.07371, over 13149.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2604, pruned_loss=0.06657, over 2577485.75 frames. ], batch size: 95, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:16:08,476 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=552211.0, ans=0.1 2024-06-22 07:16:10,402 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=552211.0, ans=0.0 2024-06-22 07:16:10,859 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.626e+02 2.735e+02 2.911e+02 3.898e+02, threshold=5.470e+02, percent-clipped=0.0 2024-06-22 07:16:31,980 INFO [train.py:1028] (1/2) Epoch 30, batch 7850, loss[loss=0.1735, simple_loss=0.2284, pruned_loss=0.05933, over 10992.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2607, pruned_loss=0.0669, over 2572322.11 frames. ], batch size: 16, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:16:38,901 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=552302.6666666666, ans=0.0 2024-06-22 07:16:43,338 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=552302.6666666666, ans=0.125 2024-06-22 07:16:43,370 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:16:46,447 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=552321.0, ans=0.125 2024-06-22 07:16:52,309 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=552321.0, ans=0.025 2024-06-22 07:16:52,923 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=552321.0, ans=0.125 2024-06-22 07:17:07,380 INFO [train.py:1028] (1/2) Epoch 30, batch 7900, loss[loss=0.1889, simple_loss=0.2543, pruned_loss=0.06176, over 13165.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2618, pruned_loss=0.06768, over 2572000.27 frames. 
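], batch size: 77, lr: 1.91e-03, grad_scale: 32.0

Note on the scaling.py:214 entries: each ScheduledFloat line reports the current value (ans) of a scheduled hyperparameter at the given batch_count. These schedules are piecewise-linear in batch_count, and by ~552k they all sit at their final constants (skip rates 0.0, balancer probs 0.125, scale_min 0.2, and so on). A minimal sketch of such a schedule, with illustrative breakpoints rather than the recipe's actual ones:

    import bisect

    # Hedged sketch of a ScheduledFloat-like value: linear interpolation
    # between (batch_count, value) breakpoints, constant outside them.
    class ScheduledValue:
        def __init__(self, *points):
            self.xs = [p[0] for p in points]
            self.ys = [p[1] for p in points]

        def __call__(self, batch_count):
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]        # late in training: the final constant
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    prob = ScheduledValue((0.0, 0.3), (20000.0, 0.125))  # illustrative breakpoints
    print(prob(552394.33))   # 0.125, as in the balancer prob entries above

The fractional batch_count values (steps in thirds) suggest the counter advances by a duration-scaled amount per batch rather than by exactly 1.
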
2024-06-22 07:17:19,779 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=552394.3333333334, ans=0.0 2024-06-22 07:17:20,264 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.574e+02 2.807e+02 3.027e+02 4.035e+02, threshold=5.615e+02, percent-clipped=0.0 2024-06-22 07:17:39,754 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=552449.3333333334, ans=0.125 2024-06-22 07:17:44,207 INFO [train.py:1028] (1/2) Epoch 30, batch 7950, loss[loss=0.1945, simple_loss=0.249, pruned_loss=0.07002, over 10641.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2619, pruned_loss=0.06753, over 2575037.54 frames. ], batch size: 304, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:17:45,749 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=552467.6666666666, ans=0.125 2024-06-22 07:17:48,836 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552467.6666666666, ans=0.1 2024-06-22 07:18:17,595 INFO [train.py:1028] (1/2) Epoch 30, batch 8000, loss[loss=0.1597, simple_loss=0.2271, pruned_loss=0.04617, over 12662.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.262, pruned_loss=0.06726, over 2571807.88 frames. ], batch size: 29, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:18:29,245 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.66 vs. limit=10.0 2024-06-22 07:18:29,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=552577.6666666666, ans=0.2 2024-06-22 07:18:30,012 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 2.538e+02 2.746e+02 2.956e+02 4.531e+02, threshold=5.491e+02, percent-clipped=0.0 2024-06-22 07:18:44,077 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=552632.6666666666, ans=0.125 2024-06-22 07:18:53,861 INFO [train.py:1028] (1/2) Epoch 30, batch 8050, loss[loss=0.1902, simple_loss=0.2555, pruned_loss=0.06247, over 13268.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2615, pruned_loss=0.06706, over 2572468.05 frames. ], batch size: 83, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:18:56,102 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.97 vs.
limit=15.0 2024-06-22 07:19:15,142 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552706.0, ans=0.1 2024-06-22 07:19:16,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=552706.0, ans=0.0 2024-06-22 07:19:18,271 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552706.0, ans=0.1 2024-06-22 07:19:21,615 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=552724.3333333334, ans=0.0 2024-06-22 07:19:23,985 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552724.3333333334, ans=0.1 2024-06-22 07:19:25,593 INFO [train.py:1028] (1/2) Epoch 30, batch 8100, loss[loss=0.1975, simple_loss=0.2564, pruned_loss=0.06932, over 13194.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2626, pruned_loss=0.06752, over 2576547.51 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:19:30,854 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:19:37,682 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.489e+02 2.604e+02 2.730e+02 3.763e+02, threshold=5.208e+02, percent-clipped=0.0 2024-06-22 07:19:41,829 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=552779.3333333334, ans=0.0 2024-06-22 07:19:46,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=552779.3333333334, ans=0.5 2024-06-22 07:20:01,181 INFO [train.py:1028] (1/2) Epoch 30, batch 8150, loss[loss=0.1915, simple_loss=0.2528, pruned_loss=0.06514, over 13072.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2628, pruned_loss=0.06701, over 2579968.32 frames. ], batch size: 121, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:20:11,704 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:20:11,839 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=552852.6666666666, ans=0.125 2024-06-22 07:20:32,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=552907.6666666666, ans=0.025 2024-06-22 07:20:33,351 INFO [train.py:1028] (1/2) Epoch 30, batch 8200, loss[loss=0.2195, simple_loss=0.2793, pruned_loss=0.07985, over 13144.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2632, pruned_loss=0.06708, over 2583106.66 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:20:46,181 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.538e+02 2.710e+02 2.864e+02 3.723e+02, threshold=5.420e+02, percent-clipped=0.0 2024-06-22 07:20:47,921 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.27 vs. 
limit=6.0 2024-06-22 07:20:51,259 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=552962.6666666666, ans=22.5 2024-06-22 07:20:57,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=552981.0, ans=0.125 2024-06-22 07:21:03,047 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=552999.3333333334, ans=0.0 2024-06-22 07:21:03,636 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=552999.3333333334, ans=0.0 2024-06-22 07:21:09,845 INFO [train.py:1028] (1/2) Epoch 30, batch 8250, loss[loss=0.1762, simple_loss=0.2502, pruned_loss=0.05112, over 13299.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2635, pruned_loss=0.06739, over 2583430.06 frames. ], batch size: 52, lr: 1.91e-03, grad_scale: 64.0 2024-06-22 07:21:22,872 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=553054.3333333334, ans=0.125 2024-06-22 07:21:25,618 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2024-06-22 07:21:30,082 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=15.0 2024-06-22 07:21:35,349 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=553091.0, ans=0.2 2024-06-22 07:21:40,693 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.87 vs. limit=15.0 2024-06-22 07:21:45,120 INFO [train.py:1028] (1/2) Epoch 30, batch 8300, loss[loss=0.2093, simple_loss=0.2688, pruned_loss=0.07489, over 13121.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.263, pruned_loss=0.06719, over 2580600.36 frames. ], batch size: 103, lr: 1.91e-03, grad_scale: 64.0 2024-06-22 07:21:49,533 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=553109.3333333334, ans=0.125 2024-06-22 07:21:55,033 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2024-06-22 07:21:57,312 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.590e+02 2.754e+02 3.013e+02 4.165e+02, threshold=5.508e+02, percent-clipped=0.0 2024-06-22 07:21:58,108 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=553146.0, ans=0.125 2024-06-22 07:21:59,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=553146.0, ans=0.0 2024-06-22 07:22:00,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=553146.0, ans=0.125 2024-06-22 07:22:06,089 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.07 vs. 
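limit=15.0

Note on the scaling.py:1023 entries: each Whitening line compares a module's current whitening metric against its limit; the metric measures how far the covariance of the module's output channels is from white (isotropic), and a corrective gradient is presumably applied only when the limit is exceeded. The limit itself can be scheduled (see the whitening_limit ScheduledFloat entry above, ans=22.5). One plausible metric of this kind, purely illustrative and not the exact scaling.py formula:

    import torch

    # Hedged sketch: anisotropy of the channel covariance, 1.0 when all
    # eigenvalues are equal (perfectly white) and growing as they spread.
    def whitening_metric(x, num_groups=1):
        metrics = []
        for g in x.chunk(num_groups, dim=1):      # per-group, as in the log
            g = g - g.mean(dim=0)
            cov = (g.t() @ g) / g.shape[0]
            eigs = torch.linalg.eigvalsh(cov)
            metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
        return torch.stack(metrics).mean()

    x = torch.randn(2000, 384)                    # (frames, channels)
    print(whitening_metric(x))                    # ~1.0 for random features
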
2024-06-22 07:22:06,519 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=553164.3333333334, ans=0.125 2024-06-22 07:22:11,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=553182.6666666666, ans=0.125 2024-06-22 07:22:13,264 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=553182.6666666666, ans=0.1 2024-06-22 07:22:13,826 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=553182.6666666666, ans=0.0 2024-06-22 07:22:15,025 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=553182.6666666666, ans=0.035 2024-06-22 07:22:16,483 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=553182.6666666666, ans=0.125 2024-06-22 07:22:17,588 INFO [train.py:1028] (1/2) Epoch 30, batch 8350, loss[loss=0.191, simple_loss=0.2537, pruned_loss=0.06416, over 13203.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2633, pruned_loss=0.06715, over 2581768.81 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:22:17,906 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.20 vs. limit=10.0 2024-06-22 07:22:19,126 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=553201.0, ans=0.125 2024-06-22 07:22:22,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=553201.0, ans=0.125 2024-06-22 07:22:23,000 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:22:24,432 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=553219.3333333334, ans=0.1 2024-06-22 07:22:26,300 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=553219.3333333334, ans=0.2 2024-06-22 07:22:36,567 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=553256.0, ans=0.0 2024-06-22 07:22:42,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=553256.0, ans=0.0 2024-06-22 07:22:44,365 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=553274.3333333334, ans=0.025 2024-06-22 07:22:50,409 INFO [train.py:1028] (1/2) Epoch 30, batch 8400, loss[loss=0.1862, simple_loss=0.2575, pruned_loss=0.0575, over 12921.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2621, pruned_loss=0.06672, over 2579453.22 frames.
], batch size: 39, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:23:06,550 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.483e+02 2.610e+02 2.783e+02 3.423e+02, threshold=5.220e+02, percent-clipped=0.0 2024-06-22 07:23:11,060 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553329.3333333334, ans=0.1 2024-06-22 07:23:22,371 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2024-06-22 07:23:25,768 INFO [train.py:1028] (1/2) Epoch 30, batch 8450, loss[loss=0.2068, simple_loss=0.2771, pruned_loss=0.06823, over 13167.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2631, pruned_loss=0.06712, over 2581241.29 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:23:35,374 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=553402.6666666666, ans=0.2 2024-06-22 07:23:50,556 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=553439.3333333334, ans=0.2 2024-06-22 07:24:02,258 INFO [train.py:1028] (1/2) Epoch 30, batch 8500, loss[loss=0.1808, simple_loss=0.2463, pruned_loss=0.05766, over 12490.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2641, pruned_loss=0.06739, over 2579451.98 frames. ], batch size: 29, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:24:09,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=553494.3333333334, ans=15.0 2024-06-22 07:24:15,388 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.506e+02 2.648e+02 2.914e+02 3.831e+02, threshold=5.296e+02, percent-clipped=0.0 2024-06-22 07:24:35,324 INFO [train.py:1028] (1/2) Epoch 30, batch 8550, loss[loss=0.1934, simple_loss=0.2551, pruned_loss=0.0658, over 12599.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2637, pruned_loss=0.06708, over 2577004.35 frames. ], batch size: 22, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:24:36,643 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0 2024-06-22 07:24:39,613 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=553567.6666666666, ans=0.0 2024-06-22 07:24:41,056 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=553567.6666666666, ans=0.025 2024-06-22 07:24:50,530 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=553604.3333333334, ans=10.0 2024-06-22 07:25:00,204 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.13 vs. 
limit=15.0 2024-06-22 07:25:02,680 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553622.6666666666, ans=0.1 2024-06-22 07:25:04,708 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=553622.6666666666, ans=0.125 2024-06-22 07:25:12,585 INFO [train.py:1028] (1/2) Epoch 30, batch 8600, loss[loss=0.2117, simple_loss=0.2693, pruned_loss=0.07707, over 13116.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2643, pruned_loss=0.06731, over 2573478.25 frames. ], batch size: 121, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:25:13,373 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=553659.3333333334, ans=0.07 2024-06-22 07:25:18,376 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=553659.3333333334, ans=0.2 2024-06-22 07:25:26,127 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.516e+02 2.631e+02 2.881e+02 4.266e+02, threshold=5.261e+02, percent-clipped=0.0 2024-06-22 07:25:32,585 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2024-06-22 07:25:36,341 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=553714.3333333334, ans=0.125 2024-06-22 07:25:46,460 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=553732.6666666666, ans=0.0 2024-06-22 07:25:49,678 INFO [train.py:1028] (1/2) Epoch 30, batch 8650, loss[loss=0.1876, simple_loss=0.2513, pruned_loss=0.06195, over 13155.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.264, pruned_loss=0.06683, over 2575076.35 frames. ], batch size: 103, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:25:51,635 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=553751.0, ans=0.0 2024-06-22 07:26:03,926 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=553787.6666666666, ans=0.1 2024-06-22 07:26:05,145 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=553787.6666666666, ans=0.2 2024-06-22 07:26:07,930 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.67 vs. limit=12.0 2024-06-22 07:26:22,097 INFO [train.py:1028] (1/2) Epoch 30, batch 8700, loss[loss=0.2041, simple_loss=0.2796, pruned_loss=0.06436, over 13129.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2649, pruned_loss=0.06728, over 2572050.24 frames. ], batch size: 59, lr: 1.91e-03, grad_scale: 16.0 2024-06-22 07:26:26,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=553842.6666666666, ans=0.125 2024-06-22 07:26:30,187 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.67 vs. 
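limit=22.5

Note on the grad_scale field: with fp16 training it reports the dynamic loss scale, which is why it moves in powers of two across these entries (32.0 at batch 8200, 64.0 at 8250, down to 16.0 by 8700): a dynamic scaler grows the scale at a fixed interval and halves it when scaled gradients overflow. A minimal sketch using PyTorch's stock GradScaler (icefall has its own wrapper, so treat this as an analogy; compute_loss is a hypothetical helper):

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000)

    def training_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = compute_loss(model, batch)   # hypothetical helper
        scaler.scale(loss).backward()           # gradients carry grad_scale
        scaler.step(optimizer)                  # skipped on inf/nan gradients
        scaler.update()                         # grow, or halve after overflow
        return loss.detach(), scaler.get_scale()
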
2024-06-22 07:26:36,439 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 2.613e+02 2.762e+02 2.977e+02 3.624e+02, threshold=5.524e+02, percent-clipped=0.0 2024-06-22 07:26:53,857 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=553916.0, ans=0.0 2024-06-22 07:26:59,233 INFO [train.py:1028] (1/2) Epoch 30, batch 8750, loss[loss=0.2038, simple_loss=0.2618, pruned_loss=0.07295, over 13073.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2646, pruned_loss=0.06744, over 2569228.30 frames. ], batch size: 121, lr: 1.91e-03, grad_scale: 16.0 2024-06-22 07:27:04,081 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=553934.3333333334, ans=0.09899494936611666 2024-06-22 07:27:23,403 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=553989.3333333334, ans=0.0 2024-06-22 07:27:24,936 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=553989.3333333334, ans=0.0 2024-06-22 07:27:26,406 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=554007.6666666666, ans=0.125 2024-06-22 07:27:30,910 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=554007.6666666666, ans=0.125 2024-06-22 07:27:32,614 INFO [train.py:1028] (1/2) Epoch 30, batch 8800, loss[loss=0.1929, simple_loss=0.263, pruned_loss=0.06141, over 13263.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2653, pruned_loss=0.06783, over 2574193.60 frames. ], batch size: 72, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:27:39,506 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=554026.0, ans=0.125 2024-06-22 07:27:39,776 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.15 vs. limit=22.5 2024-06-22 07:27:45,062 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=554044.3333333334, ans=0.125 2024-06-22 07:27:50,330 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.547e+02 2.772e+02 2.929e+02 3.634e+02, threshold=5.544e+02, percent-clipped=0.0 2024-06-22 07:27:53,256 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=554062.6666666666, ans=0.0 2024-06-22 07:27:54,381 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=554062.6666666666, ans=0.035 2024-06-22 07:28:07,392 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=554099.3333333334, ans=0.125 2024-06-22 07:28:09,306 INFO [train.py:1028] (1/2) Epoch 30, batch 8850, loss[loss=0.2225, simple_loss=0.2833, pruned_loss=0.08082, over 12547.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2658, pruned_loss=0.06828, over 2564739.18 frames.
], batch size: 202, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:28:13,629 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554117.6666666666, ans=0.1 2024-06-22 07:28:15,901 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.78 vs. limit=12.0 2024-06-22 07:28:30,482 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=554172.6666666666, ans=0.0 2024-06-22 07:28:33,668 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=554172.6666666666, ans=0.0 2024-06-22 07:28:45,568 INFO [train.py:1028] (1/2) Epoch 30, batch 8900, loss[loss=0.1918, simple_loss=0.2549, pruned_loss=0.06441, over 13029.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2664, pruned_loss=0.06855, over 2563031.83 frames. ], batch size: 33, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:28:50,884 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:28:56,466 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=554227.6666666666, ans=0.04949747468305833 2024-06-22 07:28:57,257 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=554227.6666666666, ans=0.125 2024-06-22 07:28:59,027 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 2.603e+02 2.749e+02 2.970e+02 3.767e+02, threshold=5.498e+02, percent-clipped=0.0 2024-06-22 07:29:01,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=554246.0, ans=0.125 2024-06-22 07:29:08,756 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=554264.3333333334, ans=0.125 2024-06-22 07:29:13,809 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=554282.6666666666, ans=0.0 2024-06-22 07:29:16,604 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=554282.6666666666, ans=0.125 2024-06-22 07:29:17,709 INFO [train.py:1028] (1/2) Epoch 30, batch 8950, loss[loss=0.2137, simple_loss=0.273, pruned_loss=0.07721, over 12487.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.266, pruned_loss=0.0683, over 2563073.54 frames. ], batch size: 202, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:29:20,818 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2024-06-22 07:29:22,015 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. 
limit=15.0 2024-06-22 07:29:25,913 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554319.3333333334, ans=0.1 2024-06-22 07:29:26,009 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554319.3333333334, ans=0.1 2024-06-22 07:29:27,906 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=554319.3333333334, ans=0.125 2024-06-22 07:29:31,659 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=554337.6666666666, ans=0.125 2024-06-22 07:29:36,912 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=554337.6666666666, ans=0.125 2024-06-22 07:29:43,699 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=554356.0, ans=0.125 2024-06-22 07:29:49,802 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=554374.3333333334, ans=0.0 2024-06-22 07:29:53,961 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=554392.6666666666, ans=0.2 2024-06-22 07:29:54,351 INFO [train.py:1028] (1/2) Epoch 30, batch 9000, loss[loss=0.1957, simple_loss=0.2644, pruned_loss=0.06347, over 13239.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2668, pruned_loss=0.06842, over 2568047.52 frames. ], batch size: 46, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:29:54,351 INFO [train.py:1051] (1/2) Computing validation loss 2024-06-22 07:30:02,008 INFO [train.py:1060] (1/2) Epoch 30, validation: loss=0.1959, simple_loss=0.2531, pruned_loss=0.06928, over 351949.00 frames. 2024-06-22 07:30:02,008 INFO [train.py:1061] (1/2) Maximum memory allocated so far is 17821MB 2024-06-22 07:30:05,520 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=554392.6666666666, ans=0.125 2024-06-22 07:30:16,283 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.569e+02 2.721e+02 2.913e+02 3.344e+02, threshold=5.442e+02, percent-clipped=0.0 2024-06-22 07:30:20,641 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=554429.3333333334, ans=0.0 2024-06-22 07:30:35,689 INFO [train.py:1028] (1/2) Epoch 30, batch 9050, loss[loss=0.1976, simple_loss=0.2595, pruned_loss=0.06779, over 11446.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2674, pruned_loss=0.06867, over 2566145.84 frames. 
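], batch size: 16, lr: 1.91e-03, grad_scale: 32.0

Note on the batch-9000 validation block above: training pauses, the dev set is scored once, and the result is reported the same way tot_loss is, as a frame-weighted average ("over 351949.00 frames"). A minimal sketch of that bookkeeping (compute_loss is a hypothetical helper returning the batch loss and its frame count):

    import torch

    def validate(model, dev_loader, compute_loss):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = compute_loss(model, batch)
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        return tot_loss / tot_frames    # e.g. 0.1959 over 351949 frames

The memory line printed right after validation (17821MB) is the running peak allocation on this GPU.
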
2024-06-22 07:30:52,912 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:30:53,528 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=554521.0, ans=0.0 2024-06-22 07:30:54,265 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=554521.0, ans=0.0 2024-06-22 07:30:58,510 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=554539.3333333334, ans=0.025 2024-06-22 07:30:58,523 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=554539.3333333334, ans=0.125 2024-06-22 07:31:01,345 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=554539.3333333334, ans=0.0 2024-06-22 07:31:04,607 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554557.6666666666, ans=0.1 2024-06-22 07:31:09,894 INFO [train.py:1028] (1/2) Epoch 30, batch 9100, loss[loss=0.2053, simple_loss=0.2695, pruned_loss=0.07057, over 13076.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2666, pruned_loss=0.06811, over 2567091.06 frames. ], batch size: 71, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:31:12,054 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=554576.0, ans=0.0 2024-06-22 07:31:12,088 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=554576.0, ans=0.125 2024-06-22 07:31:17,251 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=554594.3333333334, ans=0.125 2024-06-22 07:31:24,206 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.515e+02 2.625e+02 2.856e+02 3.797e+02, threshold=5.251e+02, percent-clipped=0.0 2024-06-22 07:31:34,204 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=554631.0, ans=0.125 2024-06-22 07:31:43,864 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=554649.3333333334, ans=0.2 2024-06-22 07:31:47,435 INFO [train.py:1028] (1/2) Epoch 30, batch 9150, loss[loss=0.1795, simple_loss=0.2516, pruned_loss=0.05375, over 13145.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.267, pruned_loss=0.06817, over 2568021.42 frames. ], batch size: 77, lr: 1.91e-03, grad_scale: 16.0 2024-06-22 07:31:56,653 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.61 vs. limit=22.5 2024-06-22 07:31:59,584 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=554704.3333333334, ans=0.0 2024-06-22 07:32:05,558 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554704.3333333334, ans=0.1 2024-06-22 07:32:11,087 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.62 vs.
limit=15.0 2024-06-22 07:32:19,623 INFO [train.py:1028] (1/2) Epoch 30, batch 9200, loss[loss=0.2121, simple_loss=0.2738, pruned_loss=0.07519, over 12831.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2669, pruned_loss=0.06783, over 2571547.06 frames. ], batch size: 36, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:32:20,314 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=554759.3333333334, ans=0.125 2024-06-22 07:32:29,462 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=554777.6666666666, ans=0.0 2024-06-22 07:32:33,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 2.492e+02 2.653e+02 2.807e+02 3.689e+02, threshold=5.307e+02, percent-clipped=0.0 2024-06-22 07:32:40,775 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=554814.3333333334, ans=0.125 2024-06-22 07:32:42,164 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=554814.3333333334, ans=0.0 2024-06-22 07:32:48,331 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2024-06-22 07:32:48,472 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=554832.6666666666, ans=0.125 2024-06-22 07:32:51,790 INFO [train.py:1028] (1/2) Epoch 30, batch 9250, loss[loss=0.1851, simple_loss=0.2537, pruned_loss=0.05822, over 13178.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2666, pruned_loss=0.06787, over 2573046.55 frames. ], batch size: 67, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:32:54,034 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=554851.0, ans=0.125 2024-06-22 07:32:55,880 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554851.0, ans=0.1 2024-06-22 07:33:03,526 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.15 vs. limit=15.0 2024-06-22 07:33:05,822 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.76 vs. 
limit=15.0 2024-06-22 07:33:07,621 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=554887.6666666666, ans=0.125 2024-06-22 07:33:10,103 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=554887.6666666666, ans=0.125 2024-06-22 07:33:12,689 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:33:15,993 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:33:24,630 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=554924.3333333334, ans=0.125 2024-06-22 07:33:26,011 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=554924.3333333334, ans=0.0 2024-06-22 07:33:27,093 INFO [train.py:1028] (1/2) Epoch 30, batch 9300, loss[loss=0.1907, simple_loss=0.2562, pruned_loss=0.06267, over 12976.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2668, pruned_loss=0.06779, over 2571286.82 frames. ], batch size: 39, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:33:36,896 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=554961.0, ans=0.125 2024-06-22 07:33:40,906 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.517e+02 2.650e+02 2.815e+02 3.642e+02, threshold=5.301e+02, percent-clipped=0.0 2024-06-22 07:33:47,198 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=554997.6666666666, ans=0.125 2024-06-22 07:33:48,494 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=554997.6666666666, ans=0.125 2024-06-22 07:33:50,477 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=554997.6666666666, ans=0.2 2024-06-22 07:33:53,811 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.80 vs. limit=22.5 2024-06-22 07:33:58,510 INFO [train.py:1028] (1/2) Epoch 30, batch 9350, loss[loss=0.2078, simple_loss=0.2773, pruned_loss=0.06912, over 12551.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2673, pruned_loss=0.0681, over 2568114.45 frames. ], batch size: 22, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:34:01,509 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=555034.3333333334, ans=0.07 2024-06-22 07:34:04,517 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=555052.6666666666, ans=0.125 2024-06-22 07:34:04,603 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:34:07,345 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.94 vs. 
limit=15.0 2024-06-22 07:34:18,299 INFO [scaling.py:1119] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:34:27,479 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=555107.6666666666, ans=0.025 2024-06-22 07:34:29,130 INFO [train.py:1028] (1/2) Epoch 30, batch 9400, loss[loss=0.1984, simple_loss=0.2743, pruned_loss=0.06124, over 13296.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2676, pruned_loss=0.06821, over 2567357.05 frames. ], batch size: 52, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:34:30,400 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=555126.0, ans=0.035 2024-06-22 07:34:34,140 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=555126.0, ans=0.125 2024-06-22 07:34:42,587 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.573e+02 2.770e+02 2.922e+02 3.661e+02, threshold=5.539e+02, percent-clipped=0.0 2024-06-22 07:34:49,687 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=555181.0, ans=0.95 2024-06-22 07:34:57,156 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=555199.3333333334, ans=0.125 2024-06-22 07:34:57,157 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=555199.3333333334, ans=0.0 2024-06-22 07:35:00,142 INFO [train.py:1028] (1/2) Epoch 30, batch 9450, loss[loss=0.2123, simple_loss=0.2835, pruned_loss=0.07058, over 12568.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2688, pruned_loss=0.06881, over 2567831.94 frames. ], batch size: 22, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:35:02,324 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.60 vs. limit=15.0 2024-06-22 07:35:10,283 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=555236.0, ans=0.0 2024-06-22 07:35:12,130 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=555236.0, ans=0.125 2024-06-22 07:35:28,133 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.91 vs. limit=22.5 2024-06-22 07:35:29,101 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=555291.0, ans=0.04949747468305833 2024-06-22 07:35:31,573 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=555291.0, ans=0.2 2024-06-22 07:35:33,270 INFO [train.py:1028] (1/2) Epoch 30, batch 9500, loss[loss=0.2114, simple_loss=0.2789, pruned_loss=0.07195, over 13218.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2685, pruned_loss=0.06828, over 2576817.11 frames. 
], batch size: 43, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:35:46,673 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.528e+02 2.691e+02 2.937e+02 3.844e+02, threshold=5.381e+02, percent-clipped=0.0
2024-06-22 07:35:51,784 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=555364.3333333334, ans=0.125
2024-06-22 07:36:01,685 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.49 vs. limit=15.0
2024-06-22 07:36:03,782 INFO [train.py:1028] (1/2) Epoch 30, batch 9550, loss[loss=0.1866, simple_loss=0.2575, pruned_loss=0.05785, over 12925.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2676, pruned_loss=0.06784, over 2571344.24 frames. ], batch size: 39, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:36:31,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=555474.3333333334, ans=0.0
2024-06-22 07:36:31,531 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=555474.3333333334, ans=0.0
2024-06-22 07:36:34,427 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=555474.3333333334, ans=0.025
2024-06-22 07:36:36,594 INFO [train.py:1028] (1/2) Epoch 30, batch 9600, loss[loss=0.22, simple_loss=0.2733, pruned_loss=0.08336, over 10436.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2669, pruned_loss=0.06765, over 2569586.67 frames. ], batch size: 303, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:36:43,234 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=555511.0, ans=0.07
2024-06-22 07:36:50,253 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.470e+02 2.696e+02 2.979e+02 4.468e+02, threshold=5.393e+02, percent-clipped=0.0
2024-06-22 07:37:03,492 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=555566.0, ans=0.0
2024-06-22 07:37:07,479 INFO [train.py:1028] (1/2) Epoch 30, batch 9650, loss[loss=0.1928, simple_loss=0.2537, pruned_loss=0.06592, over 13067.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.266, pruned_loss=0.06787, over 2560615.09 frames. ], batch size: 132, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:37:21,693 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=555621.0, ans=0.0
2024-06-22 07:37:22,199 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=555621.0, ans=0.125
2024-06-22 07:37:29,006 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555639.3333333334, ans=0.1
2024-06-22 07:37:29,588 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=555639.3333333334, ans=0.2
2024-06-22 07:37:39,583 INFO [train.py:1028] (1/2) Epoch 30, batch 9700, loss[loss=0.1979, simple_loss=0.2542, pruned_loss=0.07082, over 13003.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2652, pruned_loss=0.06788, over 2554883.12 frames. ], batch size: 144, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:37:42,731 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=555676.0, ans=0.0
2024-06-22 07:37:45,750 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=555694.3333333334, ans=0.125
2024-06-22 07:37:46,229 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=555694.3333333334, ans=0.125
2024-06-22 07:37:50,816 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=22.5
2024-06-22 07:37:52,806 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.561e+02 2.813e+02 3.120e+02 3.842e+02, threshold=5.627e+02, percent-clipped=0.0
2024-06-22 07:38:09,852 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.80 vs. limit=15.0
2024-06-22 07:38:11,912 INFO [train.py:1028] (1/2) Epoch 30, batch 9750, loss[loss=0.1906, simple_loss=0.2427, pruned_loss=0.06921, over 13043.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2642, pruned_loss=0.06718, over 2551083.58 frames. ], batch size: 132, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:38:12,201 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=555767.6666666666, ans=0.0
2024-06-22 07:38:13,838 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0
2024-06-22 07:38:14,718 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=555767.6666666666, ans=0.0
2024-06-22 07:38:16,659 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.28 vs. limit=15.0
2024-06-22 07:38:18,277 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=555786.0, ans=0.0
2024-06-22 07:38:28,171 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=555804.3333333334, ans=0.2
2024-06-22 07:38:31,514 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=22.5
2024-06-22 07:38:32,023 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=555822.6666666666, ans=0.125
2024-06-22 07:38:34,303 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=555822.6666666666, ans=0.125
2024-06-22 07:38:38,224 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.82 vs. limit=6.0
2024-06-22 07:38:42,722 INFO [train.py:1028] (1/2) Epoch 30, batch 9800, loss[loss=0.1849, simple_loss=0.2497, pruned_loss=0.06004, over 12918.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.264, pruned_loss=0.06687, over 2543380.50 frames. ], batch size: 39, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:38:51,399 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=555877.6666666666, ans=0.2
2024-06-22 07:38:53,274 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=555877.6666666666, ans=0.0
2024-06-22 07:38:56,297 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 2.554e+02 2.711e+02 3.023e+02 3.889e+02, threshold=5.423e+02, percent-clipped=0.0
2024-06-22 07:39:00,240 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.53 vs. limit=15.0
2024-06-22 07:39:04,469 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=555914.3333333334, ans=0.125
2024-06-22 07:39:13,615 INFO [train.py:1028] (1/2) Epoch 30, batch 9850, loss[loss=0.2046, simple_loss=0.2678, pruned_loss=0.07073, over 13011.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2637, pruned_loss=0.06657, over 2537702.79 frames. ], batch size: 102, lr: 1.91e-03, grad_scale: 16.0
2024-06-22 07:39:13,667 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=555951.0, ans=0.125
2024-06-22 07:39:26,126 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.81 vs. limit=10.0
2024-06-22 07:39:29,276 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=555987.6666666666, ans=0.2
2024-06-22 07:39:47,041 INFO [train.py:1028] (1/2) Epoch 30, batch 9900, loss[loss=0.1816, simple_loss=0.2447, pruned_loss=0.0593, over 12888.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2633, pruned_loss=0.06675, over 2530364.39 frames. ], batch size: 39, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:39:57,306 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=556061.0, ans=0.025
2024-06-22 07:40:01,482 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.552e+02 2.665e+02 2.824e+02 3.864e+02, threshold=5.329e+02, percent-clipped=0.0
2024-06-22 07:40:04,594 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=556079.3333333334, ans=0.125
2024-06-22 07:40:13,824 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0
2024-06-22 07:40:16,289 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=556116.0, ans=0.125
2024-06-22 07:40:18,071 INFO [train.py:1028] (1/2) Epoch 30, batch 9950, loss[loss=0.2147, simple_loss=0.2845, pruned_loss=0.07248, over 12680.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2627, pruned_loss=0.06718, over 2524025.56 frames. ], batch size: 29, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:40:18,740 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=556134.3333333334, ans=0.1
2024-06-22 07:40:31,057 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=556171.0, ans=0.035
2024-06-22 07:40:34,273 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=556171.0, ans=0.0
2024-06-22 07:40:38,053 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=556189.3333333334, ans=0.125
2024-06-22 07:40:43,223 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=556189.3333333334, ans=0.125
2024-06-22 07:40:44,053 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0
2024-06-22 07:40:46,667 INFO [scaling.py:1023] (1/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.26 vs. limit=22.5
2024-06-22 07:40:51,341 INFO [train.py:1028] (1/2) Epoch 30, batch 10000, loss[loss=0.185, simple_loss=0.2568, pruned_loss=0.05662, over 12711.00 frames. ], tot_loss[loss=0.199, simple_loss=0.263, pruned_loss=0.06753, over 2486118.68 frames. ], batch size: 22, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:40:58,147 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=556244.3333333334, ans=0.0
2024-06-22 07:41:02,675 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=556244.3333333334, ans=0.125
2024-06-22 07:41:06,730 WARNING [optim.py:487] (1/2) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.608e+02 2.797e+02 3.027e+02 3.766e+02, threshold=5.594e+02, percent-clipped=0.0
2024-06-22 07:41:23,427 INFO [train.py:1028] (1/2) Epoch 30, batch 10050, loss[loss=0.2204, simple_loss=0.2875, pruned_loss=0.07668, over 12469.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2635, pruned_loss=0.0683, over 2444260.47 frames. ], batch size: 22, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:41:30,125 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=556336.0, ans=0.125
2024-06-22 07:41:34,377 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=556336.0, ans=0.125
2024-06-22 07:41:49,532 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=556391.0, ans=0.125
2024-06-22 07:41:53,889 INFO [train.py:1028] (1/2) Epoch 30, batch 10100, loss[loss=0.2083, simple_loss=0.2629, pruned_loss=0.07683, over 11099.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.263, pruned_loss=0.06782, over 2422185.63 frames. ], batch size: 16, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:41:58,502 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=556409.3333333334, ans=0.2
2024-06-22 07:42:00,450 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=556427.6666666666, ans=0.2
2024-06-22 07:42:02,350 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=556427.6666666666, ans=0.0
2024-06-22 07:42:02,352 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=556427.6666666666, ans=0.1
2024-06-22 07:42:02,408 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=556427.6666666666, ans=0.0
2024-06-22 07:42:07,475 INFO [train.py:1282] (1/2) Done!